stories filed under: "hashes"

Content Moderation Case Study: Using Hashes And Scanning To Stop Cloud Storage From Being Used For Infringement (2014)

from the cloud-storage-scanning dept

Wed, Jan 20th 2021 3:53pm — Copia Institute

Summary: Since the rise of the internet, the recording industry has been particularly concerned about how the internet can and will be used to share infringing content. Over time, the focus of that concern has shifted as the technology (as well as copyright laws) have shifted. In the early 2000s, most of the concern was around file sharing applications, services and sites, such as Napster, Limewire, and The Pirate Bay. However, after 2010, much of the emphasis switched to so-called “cyberlockers.”

Unlike file sharing apps, that involved person-to-person sharing directly from their own computers via intermediary technologies, a cyberlocker was more of a hard drive on the internet. The issue was that some would store large quantities of music files, and then make them available for unlicensed downloading.

While some cyberlockers were built directly around this use-case, at the same time, cloud storage companies were trying to build legitimate businesses, allowing consumers and businesses to store their own files in the cloud, rather than on their own hard drive. However, technologically, there is little to distinguish a cloud storage service from a cyberlocker, and as the entertainment industry became more vocal about the issue, some services started to change their policies.

Dropbox is one of the most well-known cloud storage companies. Wishing to avoid facing comparisons to cyberlockers built off of the sharing of infringing works, the company put in place a system to make it more difficult to use the service for sharing works in an infringing manner, while still allowing the service to be useful for storing personal files.

Specifically, if Dropbox received a DMCA takedown notice for a specific file, the company would create a hash (a computer generated identifier that would be the same for all identical files), and then if you shared any file from your Dropbox to someone else (such as by creating a shareable link), Dropbox would create a hash and check it against the database of hashes of files that had previously received DMCA takedown notices.

This got some attention in 2014 when a user on Twitter highlighted that he had been blocked from sharing a file because of this, raising concerns that Dropbox was looking at everyone’s files.

Dropbox quickly clarified that it is not scanning every file, nor was it looking at everyone’s files. Rather it was using an automated process to check files that were being shared and see if they matched files that had previously been subject to a DMCA takedown notice:

“There have been some questions around how we handle copyright notices. We sometimes receive DMCA notices to remove links on copyright grounds. When we receive these, we process them according to the law and disable the identified link. We have an automated system that then prevents other users from sharing the identical material using another Dropbox link. This is done by comparing file hashes. We don’t look at the files in your private folders and are committed to keeping your stuff safe.”

Decisions to be made by Dropbox:

How proactive does the company need to be to remain on the compliant side of copyright law?
Will blocking sharing of files that might be shared for non-infringing purposes, make the service less useful to users?
What steps are necessary to avoid being accused of supporting infringement by traditional copyright industries?

Questions and policy implications to consider:

There may be legitimate, non-infringing reasons to share a file that in other contexts may be infringing.
Is it appropriate for a company to block that possibility?
What measures could be put in place to allow for those possibilities?
The recording and movie industries have a history of being aggressive litigants against technologies used for infringement. What level of response is appropriate for new startups and technology companies?
Will there be limitations on innovation to services like cloud storage imposed by the need to avoid angering certain industries?

Resolution: Dropbox has continued to use a similar setup, and for the most part has avoided being compared to traditional cyberlockers. Since 2014, the issue of DMCA takedowns leading to future blocking of files has not received all that much attention either. There have been a few articles and forum discussions about how it works, with some users looking for workarounds, but for the most part this technological setup appears to have prevented Dropbox from being considered a cyberlocker-style site for infringing file sharing.

Originally published on the Trust & Safety Foundation website.

Filed Under: cloud storage, copyright, cyberlockers, dmca, hashes, private storage, takedowns
Companies: dropbox

8 Comments

Content Moderation Knowledge Sharing Shouldn't Be A Backdoor To Cross-Platform Censorship

Content Moderation

from the too-big-of-a-problem-to-tackle-alone dept

Fri, Aug 21st 2020 12:00pm — Emma Llanso

Ten thousand moderators at YouTube. Fifteen thousand moderators at Facebook. Billions of users, millions of decisions a day. These are the kinds of numbers that dominate most discussions of content moderation today. But we should also be talking about 10, 5, or even 1: the numbers of moderators at sites like Automattic (Wordpress), Pinterest, Medium, and JustPasteIt—sites that host millions of user-generated posts but have far fewer resources than the social media giants.

There are a plethora of smaller services on the web that host videos, images, blogs, discussion fora, product reviews, comments sections, and private file storage. And they face many of the same difficult decisions about the user-generated content (UGC) they host, be it removing child sexual abuse material (CSAM), fighting terrorist abuse of their services, addressing hate speech and harassment, or responding to allegations of copyright infringement. While they may not see the same scale of abuse that Facebook or YouTube does, they also have vastly smaller teams. Even Twitter, often spoken of in the same breath as a “social media giant,” has an order of magnitude fewer moderators at around 1,500.

One response to this resource disparity has been to focus on knowledge and technology sharing across different sites. Smaller sites, the theory goes, can benefit from the lessons learned (and the R&D dollars spent) by the biggest companies as they’ve tried to tackle the practical challenges of content moderation. These challenges include both responding to illegal material and enforcing content policies that govern lawful-but-awful (and mere lawful-but-off-topic) posts.

Some of the earliest efforts at cross-platform information-sharing tackled spam and malware such as the Mail Abuse Prevention System (MAPS) — which maintains blacklists of IP addresses associated with sending spam. Employees at different companies have also informally shared information about emerging trends and threats, and the recently launched Trust & Safety Professional Association is intended to provide people working in content moderation with access to “best practices” and “knowledge sharing” across the field.

There have also been organized efforts to share specific technical approaches to blocking content across different services, namely, hash-matching tools that enable an operator to compare uploaded files to a pre-existing list of content. Microsoft, for example, made its PhotoDNA tool freely available to other sites to use in detecting previously reported images of CSAM. Facebook adopted the tool in May 2011, and by 2016 it was being used by over 50 companies.

Hash-sharing also sits at the center of the Global Internet Forum to Counter Terrorism (GIFCT), an industry-led initiative that includes knowledge-sharing and capacity-building across the industry as one of its 4 main goals. GIFCT works with Tech Against Terrorism, a public-private partnership launched by the UN Counter-Terrrorism Executive Directorate, to “shar[e] best practices and tools between the GIFCT companies and small tech companies and startups.” Thirteen companies (including GIFCT founding companies Facebook, Google, Microsoft, and Twitter) now participate in the hash-sharing consortium.

There are many potential upsides to sharing tools, techniques, and information about threats across different sites. Content moderation is still a relatively new field, and it requires content hosts to consider an enormous range of issues, from the unimaginably atrocious to the benignly absurd. Smaller sites face resource constraints in the number of staff they can devote to moderation, and thus in the range of language fluency, subject matter expertise, and cultural backgrounds that they can apply to the task. They may not have access to — or the resources to develop — technology that can facilitate moderation.

When people who work in moderation share their best practices, and especially their failures, it can help small moderation teams avoid pitfalls and prevent abuse on their sites. And cross-site information-sharing is likely essential to combating cross-site abuse. As scholar evelyn douek discusses (with a strong note of caution) in her Content Cartels paper, there’s currently a focus among major services in sharing information about “coordinated inauthentic behavior” and election interference.

There are also potential downsides to sites coordinating their approaches to content moderation. If sites are sharing their practices for defining prohibited content, it risks creating a de facto standard of acceptable speech across the Internet. This undermines site operators’ ability to set the specific content standards that best enable their communities to thrive — one of the key ways that the Internet can support people’s freedom of expression. And company-to-company technology transfer can give smaller players a leg up, but if that technology comes with a specific definition of “acceptable speech” baked in, it can end up homogenizing the speech available online.

Cross-site knowledge-sharing could also suppress the diversity of approaches to content moderation, especially if knowledge-sharing is viewed as a one-way street, from giant companies to small ones. Smaller services can and do experiment with different ways of grappling with UGC that don’t necessarily rely on a centralized content moderation team, such as Reddit’s moderation powers for subreddits, Wikipedia’s extensive community-run moderation system, or Periscope’s use of “juries” of users to help moderate comments on live video streams. And differences in the business model and core functionality of a site can significantly affect the kind of moderation that actually works for them.

There’s also the risk that policymakers will take nascent “industry best practices” and convert them into new legal mandates. That risk is especially high in the current legislative environment, as policymakers on both sides of the Atlantic are actively debating all sorts of revisions and additions to intermediary liability frameworks.

Early versions of the EU’s Terrorist Content Regulation, for example, would have required intermediaries to adopt “proactive measures” to detect and remove terrorist propaganda, and pointed to the GIFCT’s hash database as an example of what that could look like (CDT joined a coalition of 16 human rights organizations recently in highlighting a number of concerns about the structure of GIFCT and the opacity of the hash database). And the EARN-IT Act in the US is aimed at effectively requiring intermediaries to use tools like PhotoDNA—and not to implement end-to-end encryption.

Potential policymaker overreach is not a reason for content moderators to stop talking to and learning from each other. But it does mean that knowledge-sharing initiatives, especially formalized ones like the GIFCT, need to be attuned to the risks of cross-site censorship and eliminating diversity among online fora. These initiatives should proceed with a clear articulation of what they are able to accomplish (useful exchange of problem-solving strategies, issue-spotting, and instructive failures) and also what they aren’t (creating one standard for prohibited — much less illegal— speech that can be operationalized across the entire Internet).

Crucially, this information exchange needs to be a two-way street. The resource constraints faced by smaller platforms can also lead to innovative ways to tackle abuse and specific techniques that work well for specific communities and use-cases. Different approaches should be explored and examined for their merit, not viewed with suspicion as a deviation from the “standard” way of moderating. Any recommendations and best practices should be flexible enough to be incorporated into different services’ unique approaches to content moderation, rather than act as a forcing function to standardize towards one top-down, centralized model. As much as there is to be gained from sharing knowledge, insights, and technology across different services, there’s no-one-size-fits-all approach to content moderation.

Emma Llansó is the Director of CDT’s Free Expression Project, which works to promote law and policy that support Internet users’ free expression rights in the United States and around the world. Emma also serves on the Board of the Global Network Initiative, a multistakeholder organization that works to advance individuals’ privacy and free expression rights in the ICT sector around the world. She is also a member of the multistakeholder Freedom Online Coalition Advisory Network, which provides advice to FOC member governments aimed at advancing human rights online.

Filed Under: best practices, censorship, content moderation, cross-platform, gifct, hashes, knowledge sharing, maps

6 Comments

Follow Techdirt

Essential Reading

The Techdirt Greenhouse

Read the latest posts:

read all »

Techdirt Deals

Report this ad | Hide Techdirt ads

Techdirt Insider Discord

The latest chatter on the Techdirt Insider Discord channel...

Older Stuff

Thursday
13:33	Former Employees Say Mossad Members Dropped By NSO Officers To Run Off-The-Books Phone Hacks (2)
12:01	No, Creating An NFT Of The Video Of A Horrific Shooting Will Not Get It Removed From The Internet (18)
10:49	San Francisco Cops Are Running Rape Victims' DNA Through Criminal Databases Because What Even The Fuck (18)
10:44	Daily Deal: The Complete 2022 Java Coder Bundle (0)
09:31	As Expected, Trump's Social Network Is Rapidly Banning Users It Doesn't Like, Without Telling Them Why (44)
06:30	Comcast Continues To Bleed Olympics Viewers After Years Of Bumbling (19)
Wednesday
20:42	Apple Finally Defeats Dumb Diverse Emoji Lawsuit One Year Later (6)
15:39	Clearview Pitch Deck Says It's Aiming For A 100 Billion Image Database, Restarting Sales To The Private Sector (10)
13:41	Peloton Outage Prevents Customers From Using $2,500 Exercise Bikes (16)
12:09	The GOP Knows That The Dem's Antitrust Efforts Have A Content Moderation Trojan Horse; Why Don't The Dems? (16)
10:51	Hertz Ordered To Tell Court How Many Thousands Of Renters It Falsely Accuses Of Theft Every Year (24)
09:21	Even As Trump Relies On Section 230 For Truth Social, He's Claiming In Lawsuits That It's Unconstitutional (34)
06:16	Medical, Home Alarm Industries Warn Of Major Outages As AT&T Shuts Down 3G Network (25)
Tuesday
20:37	Video Game History Foundation: Nintendo Actions 'Actively Destructive To Video Game History' (29)
15:35	Massachusetts Court Says No Expectation Of Privacy In Social Media Posts Unwittingly Shared With An Undercover Cop (17)
13:30	Techdirt Podcast Episode 312: Regulating The Internet (2)
12:03	US Copyright Office Gets It Right (Again): AI-Generated Works Do Not Get A Copyright Monopoly (60)
10:42	LA Sheriff Threatens To 'Subject' City Council To 'Defamation Law' If They Won't Stop Calling His Deputies 'Gang Members' (20)
10:37	Daily Deal: codeSpark Academy Sibling Bundle (0)
09:25	Trump's Truth Social Bakes Section 230 Directly Into Its Terms, So Apparently Trump Now Likes Section 230 (128)
06:22	15 Years Late, The FCC Cracks Down On Broadband Apartment Monopolies (31)
Sunday
12:05	Funniest/Most Insightful Comments Of The Week At Techdirt (11)
Saturday
12:00	This Week In Techdirt History: February 13th - 19th (1)
Friday
19:39	Letter From High-Ranking FBI Lawyer Tells Prosecutors How To Avoid Court Scrutiny Of Firearms Analysis Junk Science (25)
15:52	Nintendo Is Beginning To Look Like The Disney Of The Video Game Industry (44)
13:49	Seattle Public Radio Station Manages To Partially Brick Area Mazdas Using Nothing More Than Some Image Files (44)
12:13	Thankfully, Jay Inslee's Unconstitutional Bill To Criminalize Political Speech Dies In The Washington Senate (8)
10:52	How Our Convoluted Copyright Regime Explains Why Spotify Chose Joe Rogan Over Neil Young (136)
10:47	Daily Deal: The Complete Blocs Website Builder Bundle (0)
09:33	Arizona Prosecutor Who Brought Bogus Gang Charges Against Protesters Files Ridiculous Defamation Suit Against Her Boss (12)

Content Moderation Case Study: Using Hashes And Scanning To Stop Cloud Storage From Being Used For Infringement (2014)

from the cloud-storage-scanning dept

Content Moderation Knowledge Sharing Shouldn't Be A Backdoor To Cross-Platform Censorship

from the too-big-of-a-problem-to-tackle-alone dept

The Techdirt Greenhouse

Thursday

Wednesday

Tuesday

Sunday

Saturday

Friday

More

Tools & Services

Company

Contact

More

from the cloud-storage-scanning dept

from the too-big-of-a-problem-to-tackle-alone dept

Techdirt Daily Newsletter

The Techdirt Greenhouse

Tools & Services

Company

Contact

More