stories filed under: "text and data mining"

Why Carl Malamud's Latest Brilliant Project, To Mine The World's Research Papers, Is Based In India

from the sci-hub-to-the-rescue-again dept

Fri, Jul 19th 2019 1:33pm — Glyn Moody

Carl Malamud is one of Techdirt's heroes. We've been writing about his campaign to liberate US government documents and information for over ten years now. The journal Nature has a report on a new project of his, which is in quite a different field: academic knowledge. The idea will be familiar to readers of this site: to carry out text and data mining (TDM) on millions of academic articles, in order to discover new knowledge. It's a proven technique with huge potential to produce important discoveries. That raises the obvious question: if large-scale TDM of academic papers is so powerful, why hasn't it been done before? The answer, as is so often the case, is that copyright gets in the way. Academic publishers use it to control and impede how researchers can help humanity:

[Malamud's] unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis. Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control -- and often limit -- the speed and scope of such projects, which typically confine themselves to abstracts, not full text.

Malamud's project gets around the limitations imposed by copyright and publishers thanks to two unique features. First, Malamud "had come into possession (he won't say how) of eight hard drives containing millions of journal articles from Sci-Hub". Drawing on Sci-Hub's huge holdings means his project doesn't need to go begging to publishers in order to obtain full texts to be mined. Secondly, Malamud is basing his project in India:

Over the past year, Malamud has -- without asking publishers -- teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi.

India was chosen because of an important court battle that concluded two years ago. As Techdirt reported then, it is legal in India to make photocopies of copyright material in an educational context. Malamud's contention is that this allows him to mine academic material in India without the permission of publishers. But he also believes that his TDM project would be legal in the US:

The data mining, he says, is non-consumptive: a technical term meaning that researchers don't read or display large portions of the works they are analysing. "You cannot punch in a DOI [article identifier] and pull out the article," he says. Malamud argues that it is legally permissible to do such mining on copyrighted content in countries such as the United States. In 2015, for instance, a US court cleared Google Books of copyright infringement charges after it did something similar to the JNU depot: scanning thousands of copyrighted books without buying the rights to do so, and displaying snippets from these books as part of its search service, but not allowing them to be downloaded or read in their entirety by a human.

The fact that TDM is "non-consumptive" means that the unhelpful attitude of academic publishers is even more unjustified than usual. They lose nothing from the analytical process, which is merely extracting knowledge. But out of a sense of entitlement, publishers still demand to be paid for unrestricted computer access to texts that have already been licensed by academic institutions anyway. That selfish and obstructive attitude to TDM may be about to backfire spectacularly. The Nature article notes:

No one will be allowed to read or download work from the repository, because that would breach publishers' copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world's scientific literature to pull out insights without actually reading the text.

The thing is, if anyone were by any chance interested in reading the full text, there's an obvious place to turn to. After all, the mining is carried out using papers held by Sci-Hub, so…

Follow me @glynmoody on Twitter, Diaspora, or Mastodon.

Filed Under: academic papers, carl malamud, copyright, india, journals, research, science, tdm, text and data mining
Companies: sci-hub


Proposed Update To Singapore's Copyright Laws Surprisingly Sensible

from the EU-should-look-and-learn dept

Tue, Jan 22nd 2019 8:28pm — Glyn Moody

Techdirt writes plenty about copyright in the US and EU, and any changes to the respective legislative landscapes. But it's important to remember that many other countries around the world are also trying to deal with the tension between copyright's basic aim to prevent copying, and the Internet's underlying technology that facilitates it. Recently, we covered the copyright reform process in South Africa, where some surprisingly good things have been happening. Now it seems that Singapore may bring in a number of positive changes to its copyright legislation. One of the reasons for that is the very thorough consultative process that was undertaken, explained here by Singapore's Ministry of Law:

The proposed changes are made, following an extensive three-year review and two rounds of public consultations conducted from August to November 2016 and May to June 2017 respectively. Three public Town Halls and ten engagement sessions with various stakeholder groups, including consumer, industry and trade associations, businesses, intellectual property practitioners and academics were held. Close to 100 formal submissions and more than 280 online feedback forms were received.

The full 70-page report (pdf) spells out the questions asked during that review, the answers received, and the government's proposals. The Ministry of Law's press release lists some of the main changes it wants to make. One of the most welcome is a new exception for text and data mining (TDM) for the purpose of analysis:

Today, people who use automated techniques to analyse text, data and other content to generate insights risk infringing copyright as they typically require large scale copying of works without permission. It is proposed that a new exception be established to allow copying of copyrighted materials for the purpose of data analysis, where the user has lawful access to the materials that are copied. This will promote applications of data analytics and big data across a gamut of industries, unlocking new business opportunities, speeding up processes, and reducing costs for all.

Importantly, Singapore's proposed new TDM exception applies to everyone -- including big businesses. That's unlike the corresponding Article 3 in the EU's awful Copyright Directive, currently working its way through the legislative process, which imposes an unnecessary restriction that more or less guarantees the European Union will be a backwater in this fast-growing area. An obvious but wise move by Singapore is the proposal for an enhanced copyright exception for educational purposes:

Non-profit schools and their students will be able to use online resources that are accessible without payment, for instruction purposes. This will be in addition to their existing exceptions which generally cover only copying of a portion of a work. The enhancement will facilitate instruction and make it easier for teachers and students to use online materials in classes. For example, teachers and students will be able to use various audio-visual materials (e.g. videos, pictures) found online for their classroom lessons and project presentations. They will also be able to share those materials, or lessons and project presentations which have included those materials, on student learning portals for other schools to view. Online resources that require payment will not be covered by this exception.

Another suggested exception is for non-profit galleries, libraries, archives, and museums (GLAMs) to make copies for exhibition purposes. Also useful for GLAMs is a new limit on the protection given to unpublished works. This will stand at life plus 70 years for literary and artistic works, just as for published versions. The GLAM exceptions will be protected from contract override, as will the text and data mining exception. That's important, because it means that copyright owners cannot nullify the new exceptions by insisting organizations sign contracts that waive them. Individual creators receive new rights too:

the report proposes that creators be given a new right to be attributed as the creator of their work, regardless of whether they still own or have sold the copyright. For example, anyone using a work publicly, such as posting it on the internet, will have to acknowledge the creator of the work. This will accord creators due recognition and allow them to build their reputation over time. Currently, they do not need to be attributed as the creator of their work when others use it.

This is essentially a moral right alongside the usual economic ones. As the Wikipedia page on the subject explains, the degree to which moral rights exist for creators of copyright works varies enormously around the world. In France, for example, moral rights are perpetual and inalienable, whereas in the US they are less to the fore. Singapore's Ministry of Law also proposes that where rights have not been explicitly signed away in a contract, they remain with the creator. Although that will prevent naive creators being tricked out of their rights, it won't apply to work created by employees: there, it's employers who will continue to retain rights. As for enforcing copyright, there is the following:

the report proposes that new enforcement measures be made available to copyright owners to deter retailers and service providers from profiting off providing access to content from unauthorised sources, such as through the sale of set-top boxes that enable access to content from unauthorised sources, also commonly known as grey boxes or illicit streaming devices. The measures, which are absent today, will make clear that acts such as the import and sale of such devices are prohibited.

This is clearly aimed at Kodi boxes, which are currently one of the main targets of the entertainment industry. To its credit, the Ministry of Law's proposal does include important additional requirements for the measures to apply:

the product can be used to access audio-visual content from an unauthorised source and additionally must be:

designed or made primarily for providing access to such content

advertised as providing access to such content, or

sold as providing access to such content, where the retailer sells a generic device with the understanding that "add-on" services such as the provision of website links, instructions or installation of subscription services will subsequently be provided

At least that makes a clear distinction between basic Kodi boxes, and those specifically built and sold with a view to providing unauthorized access to materials. That understanding of the difference is of a piece with the rest of the legislation, which is unusually intelligent. Other governments could learn from that, and from the overall thrust of the proposals to move Singapore's copyright law towards a fair use system similar to that of the US -- something that is fiercely resisted elsewhere.

Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+

Filed Under: copyright, moral rights, singapore, tdm, text and data mining, user rights


EU Announces Absolutely Ridiculous Copyright Proposal That Will Chill Innovation, Harm Creativity

from the how-hard-did-they-work-to-make-it-this-bad? dept

Wed, Sep 14th 2016 12:53pm — Mike Masnick

This is not a surprise given the earlier leaks of what the EU Commission was cooking up for a copyright reform package, but the end result is here and it's a complete disaster for everyone. And I do mean everyone. Some will argue that it's a gift to Hollywood and legacy copyright interests -- and there's an argument that that's the case. But the reality is that this proposal is so bad that it will end up doing massive harm to everyone. It will clearly harm independent creators and the innovative platforms that they rely on. And, because those platforms have become so important to even the legacy entertainment industry, it will harm them too. And, worst of all, it will harm the public greatly. It's difficult to see how this proposal will benefit anyone, other than maybe some lawyers.

Not surprisingly, the EU Commission is playing up the fact that this package does knock down some geoblocking in setting up more of a "single market" for digital content, but after Hollywood started freaking out about it, that proposal got watered down so much that plenty of content will still be geo-blocked. And there's so much other stuff in here that's just really, really bad. As expected, it includes a ridiculous ancillary copyright scheme, which should really just be called the "Google tax" for linking to copyright-covered content.

The proposal does away with the liability limitations for platforms, effectively requiring any tech platform that allows user-generated/user-uploaded content to build or license their very own ContentID system. This is ridiculous. If the idea was to punish Google, this will do the opposite. Basically no startup will be able to afford this, and it will just lock in platforms like YouTube as the only option for content creators wishing to upload video. Protecting intermediary liability has been shown, time and time again, to enable new innovation and also to enable greater creativity and free speech -- and the EU Commission basically just tossed it in the garbage because some Hollywood interests think (incorrectly) that internet companies "abuse" the protections.

The EU Commission barely hides the fact that they're doing this to try to protect legacy industries while punishing innovative ones:

The Copyright Directive aims to reinforce the position of right holders to negotiate and be remunerated for the online exploitation of their content on video-sharing platforms such as YouTube or Dailymotion. Such platforms will have an obligation to deploy effective means such as technology to automatically detect songs or audiovisual works which right holders have identified and agreed with the platforms either to authorise or remove.

Newspapers, magazines and other press publications have benefited from the shift from print to digital and online services like social media and news aggregators. It has led to broader audiences, but it has also impacted advertising revenue and made the licensing and enforcement of the rights in these publications increasingly difficult. The Commission proposes to introduce a new related right for publishers, similar to the right that already exists under EU law for film producers, record (phonogram) producers and other players in the creative industries like broadcasters.

The new right recognises the important role press publishers play in investing in and creating quality journalistic content, which is essential for citizens' access to knowledge in our democratic societies. As they will be legally recognised as right holders for the very first time they will be in a better position when they negotiate the use of their content with online services using or enabling access to it, and better able to fight piracy. This approach will give all players a clear legal framework when licensing content for digital uses, and help the development of innovative business models for the benefit of consumers.

The proposal also includes a new "exception" for text and data mining -- which sounds like it could be a good thing, but even that was designed to "protect" legacy publishers, and it will seriously harm smaller innovators and researchers. The exception is limited to those engaging in scientific research, meaning that any other kind of research that involves data mining is at risk in the EU. Basically, the EU just gave away that entire important and growing innovative industry. Almost all of the major work in AI and machine learning these days involves data mining, and the EU just told all those companies to go find a new home.

Just looking around at various European-based organizations, they're pretty much agreed that this is a complete disaster. Here's Communia, saying "this is not how you fix copyright."

Today’s proposal buries the hope for a more modern, technologically neutral and flexible copyright framework that the Commission had hinted at in its initial plans for the Digital Single Market. The proposal largely ignores crucial changes to copyright that would have benefitted consumers, users, educators, startups, and cultural heritage institutions. It also abandons the idea of a digital single market that allows all Europeans the same rights to access knowledge and culture. Finally, it completely ignores the importance of protecting and expanding the public domain.

And here's EDRi, saying that the proposal "fails at every level."

The European Commission has proposed a Copyright Directive that could not conceivably be worse. The text that was launched today includes a proposal to potentially filter all uploads to the Internet in Europe. The draft text would destroy users’ rights and legal certainty for European hosting companies. The new Directive’s proposal for a new 20-year “ancillary” copyright for “news” outlets repeats painful mistakes made in Germany and Spain, which hurt publishers and Internet users alike.

“We need a copyright reform to make Europe fit for the 21st century. We now have a proposal that is poison for European’s free speech, poison for European business and poison for creativity”, said Joe McNamee, Executive Director of European Digital Rights. “It could not conceivably be worse.”

And here's EU Parliament member Marietje Schaake, noting how wrong this approach is:

It lacks ambition and instead reads like a defence of old business models. We need a real copyright revolution instead. Publishers might have legitimate concerns about their decreasing revenues, but a retrograde reform of copyright law is not the solution.

So the EU Commission has taken the exact wrong approach. It's one that's almost entirely about looking backwards and "protecting" old ways of doing business, rather than looking forward, and looking at what benefits the public, creators and innovators the most. If this proposal actually gets traction, it will be a complete disaster for the EU innovative community. Hopefully, Europeans speak out, vocally, about what a complete disaster this would be.

Filed Under: ancillary copyright, contentid, copyright, eu, filtering, intermediary liability, link tax, text and data mining

