Seven Years Ago, CERN Gave Open Access A Huge Boost; Now It's Doing The Same For Open Data

from the tim-berners-lee-would-be-proud dept

Mon, Jan 4th 2021 8:22pm — Glyn Moody

Techdirt readers will be very familiar with CERN, the European Organization for Nuclear Research (the acronym comes from the original French name: Conseil Européen pour la Recherche Nucléaire). It's best known for two things: being the birthplace of the World Wide Web, and being home to the Large Hadron Collider (LHC), the world's largest and most powerful particle accelerator. Over 12,000 scientists of 110 nationalities, from institutes in more than 70 countries, work at CERN. Between them, they produce a huge quantity of scientific papers. That made CERN's decision in 2013 to release nearly all of its published articles as open access one of the most important milestones in the field of academic publishing. Since 2014, CERN has published 40,000 open access articles.

But as Techdirt has noted, open access is just the start. As well as the final reports on academic work, what is also needed is the underlying data. Making that data freely available allows others to check the analysis, and to use it for further investigation -- for example, by combining it with data from elsewhere. The push for open data has been underway for a while, and has just received a big boost from CERN:

The four main LHC collaborations (ALICE, ATLAS, CMS and LHCb) have unanimously endorsed a new open data policy for scientific experiments at the Large Hadron Collider (LHC), which was presented to the CERN Council today. The policy commits to publicly releasing so-called level 3 scientific data, the type required to make scientific studies, collected by the LHC experiments. Data will start to be released approximately five years after collection, and the aim is for the full dataset to be publicly available by the close of the experiment concerned. The policy addresses the growing movement of open science, which aims to make scientific research more reproducible, accessible, and collaborative.

The level 3 data released can contribute to scientific research in particle physics, as well as research in the field of scientific computing, for example to improve reconstruction or analysis methods based on machine learning techniques, an approach that requires rich data sets for training and validation.

CERN's open data portal already contains 2 petabytes of data -- a figure that is likely to rise rapidly, since LHC experiments typically generate massive quantities of data. However, the raw data will not in general be released. The open data policy document (pdf) explains why:

This is due to the complexity of the data, metadata and software, the required knowledge of the detector itself and the methods of reconstruction, the extensive computing resources necessary and the access issues for the enormous volume of data stored in archival media. It should be noted that, for these reasons, general direct access to the raw data is not even available to individuals within the collaboration, and that instead the production of reconstructed data (i.e. Level-3 data) is performed centrally. Access to representative subsets of raw data -- useful for example for studies in the machine learning domain and beyond -- can be released together with Level-3 formats, at the discretion of each experiment.

There will also be Level 2 data, "provided in simplified, portable and self-contained formats suitable for educational and public understanding purposes". CERN says that it may create "lightweight" environments to allow such data to be explored more easily. Virtual computing environments for the Level 3 data will be made available to aid the re-use of this primary research material. Although the data is being released using a Creative Commons CC0 waiver, acknowledgements of the data's origin are required, and any new publications that result must be clearly distinguishable from those written by the original CERN teams.

As with the move to open access in 2013, the new open data policy is unlikely to have much of a direct impact on people outside the high energy physics community. But it does represent an extremely strong and important signal that CERN believes open data must and will become the norm.

Follow me @glynmoody on Twitter, Diaspora, or Mastodon.

Filed Under: experiments, knowledge, open access, open data, science, sharing
Companies: cern

Reader Comments

Anonymous Coward, 4 Jan 2021 @ 9:07pm

This is a disaster! How will scientists be motivated to collect data if their great-grandchildren can't cash in on the copyrights? How will they pay for their supercolliders, supercomputers, and vacation homes? How can they keep individuals from inferior races from doing science also?

And just imagine, some of the data might be chanted by a rapper without attribution. Or used to remote-control a John Deere tractor.

Stand up and stop the madness! Send your anti-proton to CERN now!
Christenson, 5 Jan 2021 @ 6:26pm

More! More!
I saw two issues:
a) It's reasonable for CERN to not want random people with no qualifications implying they are associated with them, just as Techdirt wouldn't want just anyone implying they do work for Techdirt -- but it should be framed as a Trademark issue over confusion, not "must attribute this data".
b) Releasing a reasonable quantity of samples of the basic data from the sensors should be required. This allows important independent checks of the data reduction algorithms to happen. Anyone else remember an ozone hole that was made invisible by certain satellite data reduction algorithms assuming what was seen was a sensor problem?

Given the huge volume of raw data, CERN would be really smart to collocate and possibly allow guests to run their own data reduction at the time the data is taken.