Google Drive's Autodetector For Copyright Infringement Is Locking Up Nearly Empty Files
from the whoopsie dept
We've talked at length about the issues surrounding automated copyright infringement "bots" and how often those bots get the primary question they're tasked with wrong. Examples of this are legion: Viacom's bot took down a Star Trek panel discussion, all kinds of bots disrupted the DNC's livestream of its convention, and one music distributor's bot fired off DMCA notices to, well, everyone. Google itself has reported that nearly 100% of the DMCA notices it gets are just bot-generated buckshot.
But Google isn't the savior here either. The company also uses automated systems for detecting copyright infringement and, at least in the case of Google Drive, those automated systems occasionally suck out loud at their job.
This week, Dr. Emily Dolson, an assistant professor at Michigan State University, reported seeing some odd behavior when using Google Drive. One of the files in Dolson's Google Drive, 'output04.txt', was nearly empty, with nothing other than the digit '1' inside it.
But according to Google, this file violated the company's "Copyright Infringement policy" and was hence flagged. And what's worse is, the warning sent to the professor ended with "A review cannot be requeste for this restriction."
If your bot thinks a single digit is somehow copyright infringement, then your bot is a bad bot and should be taken behind the woodshed and humanely sent to bot-heaven where it can run and frolic with all the other bots. Now, to be fair, there is an open question in this case as to whether the filepath names that were chosen somehow were what was getting flagged. And, sure, maybe that happened. But it doesn't really change the point: a bot thought a file that contained a single integer was copyright infringement.
That being said, other Drive users have reproduced this, calling the filepath theory into question.
Dr. Chris Jefferson, Ph.D., an AI and mathematics researcher at the University of St Andrews, was also able to reproduce the issue when uploading multiple computer-generated files to Drive. Jefferson generated over 2,000 files, each containing just a number between -1000 and 1000.
The files containing the numbers 173, 174, 186, 266, 285, 302, 336, 451, 500, and 833 were shortly flagged by Google Drive for copyright infringement.
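For anyone curious what a test set like that looks like, here is a rough sketch of how one might be generated. To be clear, this is not Jefferson's actual script: the function name, the file naming scheme, and the choice to write each number with no trailing newline are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def generate_number_files(target_dir, low=-1000, high=1000):
    """Write one text file per integer in [low, high], each holding only that number."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in range(low, high + 1):
        # Hypothetical naming scheme; the names used in the real test are not known here.
        path = target / f"number_{n}.txt"
        # Assumption: the file holds just the number, with no trailing newline.
        path.write_text(str(n))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Write into a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        files = generate_number_files(tmp)
        print(len(files), "files written")
```

Covering every integer from -1000 to 1000 inclusive yields 2,001 files, which lines up with the "over 2,000" figure. Whether uploading such files still triggers any flags is a separate question that this sketch does not touch.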
Again, this sucks. For what it's worth, Google has finally responded and, despite the notices indicating there was no way to dispute the bot's findings, has been sharing out links to do exactly that. But that isn't really the point. This is base-level stuff here: having a system that operates this poorly means you have a system that never should have been in production to begin with. Particularly, frankly, when that system is operating as personal file storage for many, many people.
Filed Under: automated filters, censorship, copyright, dmca, filters, google drive, hard drives, upload filters
Companies: google
Reader Comments
And yet, if they axe this system, you better believe copyright maximalists will cry foul and say Google is trying to let piracy run rampant. It’s a can’t-win situation for Google—one that, due to its own general disinterest in properly standing up for users against false copyright/DMCA claims, it has only made worse for itself over the years.
[ link to this | view in chronology ]
Re:
Google has stood up for its users against false DMCA claims before. Here's proof.
It's unclear as to whether this anecdote is a diamond in the rough or a drop in the ocean, though.
[ link to this | view in chronology ]
This has been said repeatedly: there is no satisfying the copyright maximalists, so don't even try. It only encourages them. Every inch you give them is another mile they will be asking for next.
[ link to this | view in chronology ]
Re:
Correction: Its own general disinterest in paying humans to look into things rather than simply relying on AI to do everything.
Now, before anyone responds and tells me that it's impossible for humans to check everything, I'm not suggesting that. What I am suggesting is that Google should be willing to pay a staff to look into situations like these when the AI screws up. Instead, they've automated the process and getting a human involved requires you to be able to focus some negative media attention on the company.
I can fully understand using AI to help police its sites/services, but if you're going to make products that are used by the general public, you also need to invest in actual staff who are going to make sure that said AI is working properly and not screwing over your users.
How long would your local supermarket last if it was all automated, regularly screwed up, and all your disputes were rejected by a computer?
[ link to this | view in chronology ]
Like I said, “properly standing up for users against false copyright/DMCA claims”.
[ link to this | view in chronology ]
That particular filter only applies if the person tries to share a file, and only stops them sharing it. Sucks though if you are trying to hand in coursework.
[ link to this | view in chronology ]
This is bloody high art, and I'm going to get my output04.txt t-shirt from the lobby straight away.
[ link to this | view in chronology ]
Great, someone copyrighted the number 1.
(Along with 173, 174, 186, 266, 285, 302, 336, 451, 500, and 833 apparently.)
Next, I imagine math students during tests...
"What is cos(0)?"
"I can't answer this question for fear of violating someone's copyright."
Seriously, bots can be useful, but - at least until they are somewhat reliable and able to understand context - they should never be used as anything more than an alerting tool. Currently, they are definitely not good enough for automated take-downs.
Obviously, if they can mess up such obvious cases, you can only imagine that they will also strike less obvious but still perfectly legal files.
[ link to this | view in chronology ]
This will never happen. Bots can’t understand context because contexts can change based on a number of variables. They’re good for broad-based “sledgehammer” moderation efforts, but they’ll never be able to handle the kind of narrower “scalpel” moderation that requires looking at context.
[ link to this | view in chronology ]
Re:
stares in 3 minutes of silence & white noise
[ link to this | view in chronology ]
I would be very much interested to know who claims to hold the copyright on 266, 500, and 1.
[ link to this | view in chronology ]
Re:
Don't expect the copyright office to help you with that. Hell, they're lucky if they can find their own asses with both hands, a flashlight, and a multi-color map.
But my guess is that no one holds any such copyrights. That in fact, Google's own bots were trained with numerous test case scenarios, and those were never removed from the "live" database.
[ link to this | view in chronology ]
Re: copyright on 266, 500, and 1.
Well, 266 is a Pantone colour that is remarkably close to a certain litigious chocolate manufacturer's "holy IP".
Pantone 500 could be the "salmon" of doubt.
Pantone 1 is a "shade of grey"
And selectively, 173 is 160 + 13, or AD in hexadecimal, which when converted to ASCII is LF and CR - or what you get when you press the ENTER key. 186 is an Intel processor suffix; 285 is an Intel processor with one of the pins missing; 302 is the cubic inch size of some V8 engines; 451 is associated with burning paper; and 833 is LEET for bee.
[ link to this | view in chronology ]
Re:
Don't worry. I'm sure Tero Pulkinnen will start making that claim.
[ link to this | view in chronology ]
Some musings
You can't copyright a number. Can. Not. Do. (But...)
Sometimes, you don't have to. Remember DeCSS? It wasn't a copyright issue (which might, for instance, have protection through 17 usc 512 or section 230). No, it was a DMCA charge.
Of course, sometimes the Streisand Effect comes into play and you get a t-shirt or a song.
[ link to this | view in chronology ]
Re: Some musings
You may not be able to copyright a number, but you can scare people into thinking you can.
Years ago, I used to sell digital models on TurboSquid. One of my models had a description that said "includes a table model with 1,747 polygons". Their system flagged it and said I couldn't use "747" because it was a Boeing copyright. Yes, a human would see that I'm selling a furniture model, not an airplane, but their automated system was set up to flag and avoid possible complaints from companies like Boeing.
This rule also applied to numbers like 350 (BMW), 356 (Porsche), 250 (Ferrari), and so on.
I once uploaded an F-14 aircraft model and my description included a little history about how the aircraft served on the USS Enterprise aircraft carrier. Their system flagged it and said "Enterprise" was copyrighted by CBS/Viacom.
(Yet there are plenty of Star Trek models for sale at TurboSquid, so it's not like this flagging is stopping anyone from selling Star Trek models.)
Can someone copyright the word "Enterprise" in every single usage? Probably not, but if a company can scare people into believing they own it, then that's good enough.
[ link to this | view in chronology ]
As the signature for one of my email accounts says: "Artificial intelligence can never overcome natural stupidity."
[ link to this | view in chronology ]
slaps the editor, hands out a d
"A review cannot be requeste for this restriction."
This is the most wrong portion of this.
I mean, random numbers triggering the bot is bad, but the fact there is no redress or recourse to challenge it is horrifying.
Broken tools & broken systems aside, removing the ability to challenge the "findings" when you spot what you think is an error is wrong on so many levels.
But then the entire system is lopsided in believing anyone who can claim they hold a copyright would never ever fib (despite the huge pile of cases where scammers are making bank) & that it is too onerous or impossible to challenge the claims when even a child could see it's not infringing.
[ link to this | view in chronology ]
Re:
I've told this story before, but a couple years ago, I got an email telling me that I'd been banned from posting comments on YouTube. It claimed that I had violated the community standards against spam/advertising. I hadn't posted anything that could be considered either. Strangely, my channel page showed that I had no strikes for anything.
I disputed the ban and a day later received an email saying that they had looked into it and decided that the ban was appropriate. There was no mention of what I supposedly posted and no further options.
I posted on the help forum, someone said that they would mention it to a moderator, but made no promises. About a week later, I got an email that after further review, the ban had been lifted. No explanation of what triggered it in the first place, no explicit admission that they screwed up.
All I can think of is that the night before, I discovered a new channel, watched several of the videos and commented on them. All were unique, on-topic, completely non-controversial comments. Still, maybe their AI is so dumb that it considers too many comments in too short a time to be spam, without even taking the contents into consideration. And by that, I mean checking to see if I had posted the same comment on multiple videos.
I am convinced that my dispute of the ban was simply rejected by the AI without any actual review. I think they only offer a dispute option so that they can claim that users can dispute problems. I don't think filing such disputes will ever actually do anything.
[ link to this | view in chronology ]
Copyrighting numbers
You can't copyright a number? What is a CD, but a large number inscribed on a plastic disk?
And what about Cage's 4'33". Is that copyrightable?
I remember a mathematician arguing that all numbers are interesting. I wonder if that applies to copyrightability?
[ link to this | view in chronology ]
Copyrighting numbers
I don't know. It just doesn't feel right. I can't come up with a general argument so I'll resort to a thought experiment.
Let's assume that numbers are copyrightable. Some of them belong to the public domain because the ancient Sumerians published them. I'm also sure that there are a ton of orphan works, numbers for which no one knows the author. Okay. I don't see anything particularly strange yet. Do you?
Now suppose there's a copyright registration system. Assume that people somehow have digital computers identical to the one I used to post this. Let's also assume that people know how to write software and that the coders use C. How do you register a number? Do you have to literally write it out in base 10? Can you write it in English words? What happens if someone else tries to register the number in Greek? What happens if someone registers something that is at its core a number, but the person has no idea what that number actually is? For example, what happens if I try to register an image file? Do I also get the copyright for the binary representation of that file? What if I have no idea what the binary representation is? What if I have no idea what the base ten representation is? Why bother copyrighting numbers ever again if I can instead just copyright the images?
I got nothing useful out of that.
A chemist argues that all chemicals are interesting.
A chef argues that all foods are interesting.
A physicist argues that all reference frames are interesting.
A theologian argues that all entities in religious texts are interesting.
A musician argues that all sound waves are interesting.
I got nothing useful out of that too. Maybe you got something useful out of it though.
[ link to this | view in chronology ]
Even if the path names contained the names of copyrighted works, path names themselves don't qualify as copyright infringement. As such, any bot SHOULD have been designed to only flag and analyze FILES with infringing-sounding names. And even then, if the contents don't match, there's no infringement.
I'm not a programmer, although I do write Windows Batch scripts for my own personal use and whenever I write one to handle any task that might involve unknown conditions, I try to think of everything that could possibly go wrong and account for it. I don't always succeed, but I like to think I at least cover the most obvious possible problems.
[ link to this | view in chronology ]
TD kind of missed the point on this article. Why can't you have a document with copyrighted material that you don't own? It might make sense if you share that file publicly, but even then it's just dumb. Drive is a file utility and there are a million reasons why you might have such a file. Like, say, a file of song lyrics that you're going to sing with some friends... a file of inspiration images for some project... a pdf of a book you bought...
[ link to this | view in chronology ]
[Off topic]
I'm starting to suspect that the average person thinks copyright is "I made this. Don't copy me." They're wrong. Copyright is about more than copying. It's about control, politics, money, and the power of pathos. I'm starting to think that high schools / secondary schools should teach all students about the history of copyright. All of it. No music-industry-funded cherry-picking. Starting from the printing press censorship drama all the way until Winnie the Pooh's ascendance. I suspect that it won't go well. At some point some student is bound to wonder why copyright lasts so long and why copyright proponents have no idea what "accountability" means.
Ethics aside, sometimes I really wish I could see an accurate simulation of humans. I want to see how a society with our technology behaves when every person has no idea what copyright is and has never been taught about anything remotely related to the concept of copyright.
[ link to this | view in chronology ]
Re: [Off topic]
That's not off topic.
[ link to this | view in chronology ]
Re: [Off topic]
Interesting you should say that. I have come to believe, in part because of evidence (note I said "evidence", not "proof"), that reincarnation is actually a thing. But one of the open questions among those who believe in the possibility of reincarnation is whether we are able to pick where we reincarnate, or even if we have the option to not reincarnate at all. One thing I have thought about is that if we do have a choice, I would not want to be reborn in the United States again, but in particular I kind of hope I never have to live another life on planet Earth again. Earth is a lovely planet and there are some wonderful things here, but there are also some very terrible things here, and some of the worst kinds of people, but to my mind one of the absolute worst things about planet Earth is that the ridiculous concept of "intellectual property" (in all its various forms) ever became a thing. It is like so many other things on this planet, where someone possibly had good intentions (or at least not purely evil intentions) but had NO idea of the monster they were creating (though they might have if they had really thought it through).
When people come up with something they think is a good idea, they ought to ask themselves, how could evil people misuse this idea? Because it seems that is invariably what happens sooner or later. Many of the ideas put into the U.S. Constitution and Bill of Rights were good ideas at the time, until people started misusing them to gain power and/or make a profit, or just to make life miserable for other people.
So if I absolutely have to be reborn somewhere, I really hope I can pick a planet where the concept of intellectual property has either never crossed the mind of anyone, or if it did, it was promptly shut down and ridiculed as the evil thing it is. That's not the only thing I'd like to see, of course (no religions that exist primarily to make people subservient to other people would be a big one) but it's very high on my list of things I'd like to never see again.
[ link to this | view in chronology ]
Value for money
GoogleDrive. You get what you pay for.
I mean, yeah it sucks, but if you needed a good service you'd be prepared to front up some cash, right?
[ link to this | view in chronology ]
Right up there with YouTube striking recordings of white noise and whatnot. Utterly unsurprising.
Can't we just outlaw stupid DMCA bots already?
[ link to this | view in chronology ]
worth vs wealth
Google, Facebook, Amazon, Microsoft, etc. are all making money off of our data. Can I have some of that money, since it is my data? Why can't we all come up? Google is making more money than they know what to do with (e.g. their multitude of failed/abandoned products and projects); Amazon's going to the moon (eventually); on and on...
Since my data is worth soooo much, share the wealth.
[ link to this | view in chronology ]
As the owner of the number 173, I demand you remove this article immediately.
[ link to this | view in chronology ]
1
[ link to this | view in chronology ]
00011100
It’s not the file name as TorrentFreak has shown.
It’s the file content and type.
My educated guess is dummy files in an image somewhere that was copyrighted creating recorded hashes.
We’ve all seen dummy files before. And the dummies that copyright them.
[ link to this | view in chronology ]