Why Carl Malamud's Latest Brilliant Project, To Mine The World's Research Papers, Is Based In India
from the sci-hub-to-the-rescue-again dept
Carl Malamud is one of Techdirt's heroes. We've been writing about his campaign to liberate US government documents and information for over ten years now. The journal Nature has a report on a new project of his, which is in quite a different field: academic knowledge. The idea will be familiar to readers of this site: to carry out text and data mining (TDM) on millions of academic articles, in order to discover new knowledge. It's a proven technique with huge potential to produce important discoveries. That raises the obvious question: if large-scale TDM of academic papers is so powerful, why hasn't it been done before? The answer, as is so often the case, is that copyright gets in the way. Academic publishers use it to control and impede how researchers can help humanity:
[Malamud's] unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis. Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control -- and often limit -- the speed and scope of such projects, which typically confine themselves to abstracts, not full text.
Malamud's project gets around the limitations imposed by copyright and publishers thanks to two unique features. First, Malamud "had come into possession (he won't say how) of eight hard drives containing millions of journal articles from Sci-Hub". Drawing on Sci-Hub's huge holdings means his project doesn't need to go begging to publishers in order to obtain full texts to be mined. Secondly, Malamud is basing his project in India:
Over the past year, Malamud has -- without asking publishers -- teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi.
India was chosen because of an important court battle that concluded two years ago. As Techdirt reported then, it is legal in India to make photocopies of copyright material in an educational context. Malamud's contention is that this allows him to mine academic material in India without the permission of publishers. But he also believes that his TDM project would be legal in the US:
The data mining, he says, is non-consumptive: a technical term meaning that researchers don't read or display large portions of the works they are analysing. "You cannot punch in a DOI [article identifier] and pull out the article," he says. Malamud argues that it is legally permissible to do such mining on copyrighted content in countries such as the United States. In 2015, for instance, a US court cleared Google Books of copyright infringement charges after it did something similar to the JNU depot: scanning thousands of copyrighted books without buying the rights to do so, and displaying snippets from these books as part of its search service, but not allowing them to be downloaded or read in their entirety by a human.
The fact that TDM is "non-consumptive" means that the unhelpful attitude of academic publishers is even more unjustified than usual. They lose nothing from the analytical process, which merely extracts knowledge. But out of a sense of entitlement, publishers still demand to be paid for unrestricted computer access to texts that academic institutions have already licensed. That selfish and obstructive attitude to TDM may be about to backfire spectacularly. As the Nature article notes:
No one will be allowed to read or download work from the repository, because that would breach publishers' copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world's scientific literature to pull out insights without actually reading the text.
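To make "non-consumptive" concrete: the idea is that software computes aggregate statistics over the corpus, and only those statistics ever come out, never the articles themselves. Below is a minimal Python sketch of that kind of analysis, counting how often pairs of terms co-occur across documents. The vocabulary, directory layout, and function names are illustrative assumptions, not anything from Malamud's actual system.

import os
import re
from collections import Counter
from itertools import combinations

# Hypothetical controlled vocabulary of entities of interest.
TERMS = {"brca1", "tp53", "aspirin", "metformin"}

def mine_corpus(corpus_dir: str) -> Counter:
    """Count how often pairs of vocabulary terms co-occur in a document."""
    cooccurrence = Counter()
    for name in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
            # Tokenize, keep only vocabulary hits, then discard the text.
            found = TERMS & set(re.findall(r"[a-z0-9]+", f.read().lower()))
        for pair in combinations(sorted(found), 2):
            cooccurrence[pair] += 1
    return cooccurrence  # aggregate counts only; no article text escapes

if __name__ == "__main__":
    for pair, n in mine_corpus("articles/").most_common(10):
        print(pair, n)

Run against a directory of plain-text articles, this prints the ten most frequent term pairs. A researcher learns, say, that a drug and a gene keep appearing together, without ever being able to read or retrieve a paper — which is the distinction Malamud is relying on.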
The thing is, if anyone were by any chance interested in reading the full text, there's an obvious place to turn. After all, the mining is carried out using papers held by Sci-Hub, so…
Follow me @glynmoody on Twitter, Diaspora, or Mastodon.
Filed Under: academic papers, carl malamud, copyright, india, journals, research, science, tdm, text and data mining
Companies: sci-hub
Reader Comments
Re:
The journals will still argue that the only reason piracy can enable this is that the journals created the publishing infrastructure in the first place and published the articles. They'd add that the quality of the contents of these papers is entirely down to their own hard work as middlemen/brokers.
They'd further argue that Sci-Hub is costing them revenue, preventing them from doing wonderful things to promote the progress of science and the useful arts.
Of course, they'd argue that the lack of unicorns is due to piracy if they felt that could protect their publishing racket.
Meanwhile, Sci-Hub would have next to nothing if these journals didn't exist, so they do have a point.
arXiv and other pre-print repositories, on the other hand, WOULD exist anyway. And they'd probably be richer centers of knowledge if the likes of Elsevier didn't exist.
Interestingly, Elsevier no longer bills themselves as a journal publisher:
"Elsevier is a Dutch information and analytics company...."
Re: Re:
And if the journals didn't exist in the first place, you wouldn't need Sci-Hub. Instead you would have some other repository to deal with, and depending on how that repository was managed, you might still end up with a Sci-Hub option.
The journals were very important in the beginning, but now, with the internet, they are not as necessary. It's just a matter of time until they adjust their business operations or die.
Re: Re:
Meanwhile, Sci-Hub would have next to nothing if these journals didn't exist, so they do have a point.
That's like saying doctors would have next to nothing if infectious disease didn't exist.
How the hell is the existence of access-restrictive assholes/infectious disease supposed to be the better scenario?
Re: Re:
More a case of Sci-Hub would not exist if the journals did not price gouge those who produce the papers in the journals, and make co-operation in any field horribly expensive for researchers who play by the rules.
Researchers need to get published in the journals to get peer review, and they need to get published to get promoted.
People's career paths are based on the research they publish in certain journals. Libraries have to pay to subscribe to those journals so their students and professors can keep up with advances in research and in science, and not all libraries can afford to pay for all the scientific journals.
We need to move to an open, free publishing platform. Most scientific research is funded by the taxpayer, so set up a website like GitHub, e.g. openscience.org. ALL research funded by the government or the taxpayer must be published there, and any scientist or professor can register for free to publish research papers there, whether they are based in America, Canada, or Europe.
At the moment the taxpayer pays for the research, and then publicly funded universities have to pay for it again.
Re:
Abolish copyright.
"democratizing access to all scientific literature", from Nature
It seems that Carl Malamud is taking on the challenge that cost Aaron Swartz his life.
Good luck to him.
Re:
Sure, if you accept "piracy" to mean "things copyright holders object to". But if you believe copyright is holding us back, it's unfair to call Malamud a pirate.