
THE PLAN TO MINE THE WORLD’S RESEARCH PAPERS

A data store in India could open up vast swathes of science for easy computerized analysis.

BY PRIYANKA PULLA

[Photo: Carl Malamud in front of the data store of 73 million articles that he plans to let researchers text mine. Credit: Smita Sharma for Nature.]

Carl Malamud is on a crusade to liberate information locked up behind paywalls, and his campaigns have scored many victories. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Sometimes, he has won those arguments in court. Now, the 60-year-old American technologist is turning his sights on a new objective: freeing paywalled scientific literature. And he thinks he has a legal way to do it.

Over the past year, Malamud has, without asking publishers, teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi. “This is not every journal article ever written, but it’s a lot,” Malamud says. It’s comparable to the size of the core collection in the Web of Science database, for instance. Malamud and his JNU collaborator, bioinformatician Andrew Lynn, call their facility the JNU data depot.

No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.

The unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis. Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control, and often limit, the speed and scope of such projects, which typically confine themselves to abstracts, not full text. Researchers in India, the United States and the United Kingdom are already making plans to use the JNU store instead. Malamud and Lynn have held workshops at Indian government laboratories and universities to explain the idea. “We bring in professors and explain what we are doing. They get all excited and they say, ‘Oh gosh, this is wonderful’,” says Malamud.

But the depot’s legal status isn’t yet clear. Malamud, who contacted several intellectual-property (IP) lawyers before starting work on the depot, hopes to avoid a lawsuit. “Our position is that what we are doing is perfectly legal,” he says. For the moment, he is proceeding with caution: the JNU data depot is air-gapped, meaning that no one can access it from the Internet. Users have to physically visit the facility, and only researchers who want to mine for non-commercial purposes are currently allowed in. Malamud says his team does plan to allow remote access in the future. “The hope is to do this slowly and deliberately. We are not throwing this open right away,” he says.

THE POWER OF DATA MINING

The JNU data store could sweep aside barriers that still deter scientists from using software to analyse research, says Max Häussler, a bioinformatics researcher at the University of California, Santa Cruz (UCSC). “Text mining of academic papers is close to impossible right now,” he says, even for someone like him who already has institutional access to paywalled articles.

Since 2009, Häussler and his colleagues have been building the online UCSC Genome Browser, which links DNA sequences in the human genome to parts of research papers that mention the same sequences. To do that, the researchers have contacted more than 40 publishers to ask permission to use software to rifle through research to find mentions of DNA. But 15 publishers have not responded or have denied permission. Häussler is unsure whether he can legally mine papers without permission, so he isn’t trying. In the past, he has found his access blocked by publishers who have spotted his software crawling over their sites. “I spend 90% of my time just contacting publishers or writing software to download papers,” says Häussler.

Some countries have changed their laws to affirm that researchers on non-commercial projects don’t need a copyright-holder’s permission to mine whatever they can legally access. The United Kingdom passed such a law in 2014, and the European Union voted through a similar provision this year. That doesn’t help academics in poor nations who don’t have legal access to papers. And even in the United Kingdom, publishers can legally place ‘reasonable’ restrictions on the process, such as channelling scientists through publisher-specific interfaces and limiting the speed of electronic searching or bulk downloading to protect servers from overload. Such limits are a big problem, says John McNaught, deputy director of the National Centre for Text Mining at the University of Manchester, UK. “A limit of, say, one article every five seconds, which sounds fast for a human, is painfully slow for a machine. It would take a year to download around six million articles, and five years to download all published articles concerning just biomedicine,” he says.

Wealthy pharmaceutical firms often pay extra to negotiate special text-mining access because their work has a commercial purpose, says McNaught. In some cases, publishers allow these firms to download papers in bulk, thus avoiding rate limits, according to a researcher at a pharmaceutical firm who did not want to be identified because they were not authorized to talk to the media. University academics, however, frequently restrict themselves to mining article abstracts from databases such as PubMed. That provides some information, but full texts are much more useful. In 2018, a team led by computational biologist Søren Brunak at the Technical University of Denmark in Lyngby showed that full-text searches throw up many more gene–disease links than do searches of abstracts (D. Westergaard et al. PLoS Comput. Biol. 14, e1005962; 2018).

Scientists must also overcome technical barriers when mining articles. It is hard to extract text from the various layouts that publishers use, something that the JNU team is struggling with right now. Tools to convert PDFs to plain text don’t always distinguish clearly between paragraphs, footnotes and images, for instance. Once the JNU team has done it, however, others will be saved the effort. The team is close to completing the first round of extraction from the corpus of 73 million papers, Malamud says, although they will need to check for errors, so he expects the database won’t be ready until the end of the year.

A WORLD OF POSSIBILITIES

Early enthusiasts are already gearing up to use the JNU depot. One is Gitanjali Yadav, a computational biologist at Delhi’s National Institute of Plant Genome Research (NIPGR) and a lecturer at the University of Cambridge, UK. In 2006, Yadav led an effort at NIPGR to build a database of chemicals secreted by plants. Called EssOilDB, this database is today scoured by groups from drug developers to perfumeries looking for leads. Yadav thinks that “Carl’s compendium”, as she calls it, could give her database a leg-up.

To make EssOilDB, Yadav’s team had to trawl PubMed and Google Scholar for relevant papers, extract data from full texts where they could, and manually visit libraries to copy out tables from rare journals for the rest. The depot could fast-forward this work, says Yadav, whose team is currently writing the queries they will use to extract the data.

Srinivasan Ramachandran, a bioinformatics researcher at Delhi’s Institute of Genomics and Integrative Biology, is also excited by Malamud’s plan. His team runs a database of genes linked to type 2 diabetes; they’ve been crawling PubMed abstracts to find papers. Now he hopes the depot could widen his mining net.

And at the Massachusetts Institute of Technology (MIT) in Cambridge, a team called the Knowledge Futures Group says it wants to mine the depot to map how academic publishing has evolved over time. The group hopes to forecast emerging areas of research and identify alternatives to conventional metrics for measuring research impact, says team member James Weis, a doctoral student at MIT Media Lab.

A CAREER UNLOCKING COPYRIGHT

Malamud only recently had the idea of extending his activism to academic publishing. The founder of a non-profit corporation called Public Resource, based in Sebastopol, California, Malamud has focused on buying up government-owned legal works and publishing them. These include, for instance, the state of Georgia’s annotated legal code, European toy-safety standards and more than 19,000 Indian standards for everything from buildings and pesticides to surgical equipment. Because these documents are often a source of revenue for government agencies, some of them have sued Malamud, who has argued back that documents which have the force of the law cannot be locked behind copyright. In the Georgia case, a US appeals court cleared him of infringement charges in 2018, but the state appealed, and the case is with the US Supreme Court. Meanwhile, a German court ruled in 2017 that the publication of toy standards by Public Resource, including a standard on baby dummies (pacifiers), was illegal.

But Malamud has enjoyed victories, too. In 2013, he filed a lawsuit in a US federal court asking the Internal Revenue Service (IRS) to publish the forms it collected from tax-exempt non-profit organizations: data that could help to hold these organizations to account. Here, the court ruled in Malamud’s favour, prompting the IRS to release the financial information of thousands of non-profit organizations in a machine-readable format.

In early 2017, aided by the Arcadia Fund, a London-based charity that promotes open access, Malamud turned his attention to research articles. Under US law, works by US federal government employees cannot be copyrighted, and Public Resource says it has found hundreds of thousands of academic articles that are US government works and seem to defy this rule. Malamud has called for such articles to be freed from copyright assertions, but it’s not clear whether that would hold up in court. He has posted his preliminary results online, but has put further campaigning on hold, because the project prompted him to take on a wider mission: democratizing access to all scientific literature.

OPPORTUNITY IN INDIA

A trigger for this mission came from a landmark Delhi High Court judgment in 2016. The case revolved around Rameshwari Photocopy Services, a shop on the campus of the University of Delhi. For years, the business had been preparing course packs for students by photocopying pages from expensive textbooks. With prices ranging between 500 and 19,000 rupees (US$7–277), these textbooks were out of reach for many students.

In 2012, Oxford University Press, Cambridge University Press and Taylor and Francis filed a lawsuit against the university, demanding that it buy a licence to reproduce a portion of each text. But the Delhi High Court dismissed the suit. In its judgment, the court cited section 52 of India’s 1957 Copyright Act, which allows the reproduction of copyrighted works for education. Another provision in the same section allows reproduction for research purposes.

[Photo: Rameshwari Photocopy Services in New Delhi was taken to court for copying parts of textbooks, and won. Credit: Sajjad Hussain/AFP/Getty.]

Malamud has a long association with India: he first travelled there as a tourist in the 1980s, and he wrote one of his first books, on database design, on a houseboat in Srinagar. And around the same time that he heard about the Rameshwari judgment, he had come into possession (he won’t say how) of eight hard drives containing millions of journal articles from Sci-Hub, the pirate website that distributes paywalled papers for anyone to read. Sci-Hub itself has lost two lawsuits against publishers in US courts over its copyright infringements, but despite those judgments, some of its domains are still working today.

Malamud began to wonder whether he could legally use the Sci-Hub drives to benefit Indian students. In a 2018 book about his work called Code Swaraj, co-authored with Indian tech entrepreneur Sam Pitroda, Malamud writes that he imagined showing up on Indian campuses in the equivalent of an American taco truck, ready to serve the articles up to those who wanted them.

Ultimately, he zeroed in on the idea of the JNU text-mining depot instead. (Malamud has also helped to set up another mining facility with 250 terabytes of data at the Indian Institute of Technology Delhi, which isn’t in use yet.) But he is cagey about where the depot’s articles come from. Asked directly whether some of the text-mining depot’s articles come from Sci-Hub, he said he wouldn’t comment, and named only sources that provide free-to-download versions of papers (such as PubMed Central and the ‘Unpaywall’ tool). But he does say that he does not have contracts with publishers to access the journals in the depot.

IS IT LEGAL?

Malamud says that where he got the articles from shouldn’t matter anyway. The data mining, he says, is non-consumptive: a technical term meaning that researchers don’t read or display large portions of the works they are analysing. “You cannot punch in a DOI [article identifier] and pull out the article,” he says. Malamud argues that it is legally permissible to do such mining on copyrighted content in countries such as the United States. In 2015, for instance, a US court cleared Google Books of copyright-infringement charges after it did something similar to the JNU depot: scanning thousands of copyrighted books without buying the rights to do so, and displaying snippets from these books as part of its search service, but not allowing them to be downloaded or read in their entirety by a human.

The case was a test of non-consumptive data mining, says Joseph Gratz, an IP lawyer at the law firm Durie Tangri in San Francisco, California, who represented Google in the case and has previously represented Public Resource. Even though Google was displaying snippets, the court ruled that the text was too limited to amount to infringement. Google was scanning authorized copies of books (from libraries in many cases), even though it did not ask permission. Copyright holders might argue that if Sci-Hub or other unauthorized sources supplied the JNU depot, the situation would be different from the Google Books case, Gratz says. But a case involving unauthorized sources has never been argued in American courts, making it hard to predict the outcome. “There are good reasons why the source shouldn’t matter, but there may be arguments that it should,” says Gratz.

The question of the facility’s legality in the United States might not even be relevant, because international researchers would be getting results from a depot that sits in India, even if they are accessing it remotely. So Indian law is likely to apply to the question of whether it is legal to create the corpus, says Michael W. Carroll, a professor at the American University’s Washington College of Law in Washington DC.

Here, India’s copyright laws might help Malamud, which is another reason why the facility is in New Delhi. The research exemption in section 52 means that the JNU data depot’s actions would be considered fair under Indian law, argues Arul George Scaria, an assistant professor at Delhi’s National Law University.

Not everyone agrees with this interpretation, however. Section 52 allows researchers to photocopy a journal article for personal use, but doesn’t necessarily allow the blanket reproduction of journals as the JNU depot has done, says T. Prashant Reddy, a legal researcher at the Vidhi Centre for Legal Policy in New Delhi. That entire articles aren’t shared with users does help, but the mass reproduction of text used to create the database puts the facility in “a legal grey zone”, Reddy says.

RISKY BUSINESS

When Nature contacted 15 publishers about the JNU data depot, the six who responded said that this was the first time they had heard of the project, and that they couldn’t comment on its legality without further information. But all six (Elsevier, BMJ, the American Chemical Society, Springer Nature, the American Association for the Advancement of Science and the US National Academy of Sciences) stated that researchers looking to mine their papers needed their authorization. (Springer Nature publishes this journal; Nature’s news team is editorially independent of its publisher.)

Malamud acknowledges that there is some risk in what he is doing. But he argues that it is “morally crucial” to do it, especially in India. Indian universities and government labs spend heavily on journal subscriptions, he says, and still don’t have all the publications they need. Data released by Sci-Hub indicate that Indians are among the world’s biggest users of its website, suggesting that university licences don’t go far enough. Although open-access movements in Europe and the United States are valuable, India needs to lead the way in liberating access to scientific knowledge, Malamud says. “I don’t think we can wait for Europe and the United States to solve that problem because the need is so pressing here.”

Priyanka Pulla is a freelance journalist based in Bengaluru, India.

318 | NATURE | VOL 571 | 18 JULY 2019 | CLARIFIED 19 JULY 2019 ©2019 Springer Nature Limited. All rights reserved.

CLARIFICATION The News Feature ‘The plan to mine the world’s research papers’ (Nature 571, 316–318; 2019) used the term ‘fair use’ inappropriately — the term isn’t relevant under Indian law.
