Harvard Opens Up Its Massive Caselaw Access Project

from the good-to-see dept

Wed, Oct 31st 2018 1:36pm — Mike Masnick

Almost exactly three years ago, we wrote about the launch of an ambitious project by Harvard Law School to scan all federal and state court cases and get them online (for free) in a machine readable format (not just PDFs!), with open APIs for anyone to use. And, earlier this week, case.law officially launched, with 6.4 million cases, some going back as far as 1658. There are still some limitations -- some placed on the project by its funding partner, Ravel, which was acquired by LexisNexis last year (though, the structure of the deal will mean some of these restrictions will likely decrease over time).

Also, the focus right now is really on providing this setup as a tool for others to build on, rather than as a straight-up interface for anyone to use. As it stands, you can access the data either via the site's API or through bulk downloads. Unfortunately, bulk downloads are part of what's limited by the Ravel/LexisNexis deal. They are available for cases in Illinois and Arkansas, but only because both of those states already make their cases available online. Still, even with the Ravel/LexisNexis limitation, individual users can download up to 500 cases per day.

The real question is what will others build with the API. The site has launched with four sample applications that are all pretty cool.

H2O is a tool that law professors can use to easily create casebooks for students in various areas of law. Anything published on H2O gets a Creative Commons license and can then be shared widely. I wonder if professors like Eric Goldman, who offers an Internet Law Casebook, or James Grimmelmann, who has a different Internet Law Casebook, will eventually port them over to a platform like H2O.
A wordcloud app that currently shows the "most used words" in California cases in various years. Here, for example, are the word clouds in California cases from 1871... and 2012. See if you can tell which one's which.

Caselaw Limericks that appears to randomly generate what it believes is a rhyming limerick from the case law. Here's what I got:

Her son Julius is a confirmed thief.
He did not turn over a new leaf.
The vessel, not.
the parking lot.
Respondent concedes this in its brief.

The quality overall is... a bit mixed. But it's fun.

And, finally, in time for Halloween, Witchcraft in Law, which totals up cases that cite "witchcraft" by state.

Hopefully this inspires a lot more on the development side as well.


Filed Under: caselaw, caselaw access project, legal data, public info, public records, transparency
Companies: harvard, lexisnexis, ravel

Reader Comments

Anonymous Coward, 31 Oct 2018 @ 2:05pm

"Lowering The Bar"...
... might have a field day with those last two apps!
Lawrence D’Oliveiro, 31 Oct 2018 @ 2:28pm

Non-Free
From the H2O Terms of Service:
- 4a: You may use H2O only for personal, educational, or other types of noncommercial uses.
and
- 6b: By submitting your User Content to the H2O Services, you also agree to allow H2O to license your Content under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License
Thad (profile), 31 Oct 2018 @ 2:42pm

Re: Non-Free
And Techdirt's covered the legal complexities of the "noncommercial" CC licenses at length.
Anonymous Coward, 31 Oct 2018 @ 2:51pm

Re: Non-Free

6b: By submitting your User Content to the H2O Services, you also agree to allow H2O to license your Content under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License

That only means that people other than the copyright holder are not allowed to use the data in a book if they obtain it under that license. The contributor is free to license or sell their own works as part of a commercial enterprise. Similarly, anybody with a commercial enterprise in mind is free to approach the copyright holder to obtain a license that permits commercial use; they just have to live with and compete with the Creative Commons version.
Thad (profile), 31 Oct 2018 @ 3:22pm

Re: Re: Non-Free
Yes, but the "noncommercial" CC licenses are considered non-free, by the definition of free culture licenses. See the NonCommercial interpretation page on the CC Wiki:

NC licenses do not qualify as “open licenses” under the Open Definition, and works licensed under an NC license are not considered Free Cultural Works. This may be important if you want others to further distribute your work on Wikipedia, Wikimedia Commons, or other platforms requiring a license that meets the Open Definition or the Definition of Free Cultural Works.

Anonymous Coward, 31 Oct 2018 @ 3:29pm

Re: Re: Re: Non-Free

This may be important if you want others to further distribute your work on Wikipedia, Wikimedia Commons, or other platforms requiring a license that meets the Open Definition or the Definition of Free Cultural Works.

If that is what the person wants, they distribute the work via one of those platforms under a suitable free license, or if they have already done so, they can submit to the project under an NC license.

Nothing in the rules stops a copyright owner distributing a work under several licenses.
Lorenzo St. Dubois, 31 Oct 2018 @ 3:44pm

Huh?
What's so great about coleslaw?
Thad (profile), 31 Oct 2018 @ 3:47pm

Re: Re: Re: Re: Non-Free
I...don't see anyone claiming otherwise?
Anonymous Anonymous Coward (profile), 31 Oct 2018 @ 3:52pm

Re: Huh?
Obviously you have never had it mixed a la minute (which means there is less than a minute between adding the dressing to the cabbage and whatever other veggies you like in your cole slaw, mixing it and serving it). Crisp vs marinated and soggy.

Thing is, eventually all of whatever Harvard harvests will be re-harvested and then become freely available and unencumbered. Though the fancy apps might not be included. Seems to me some folks were doing this with Pacer, or some other unreasonably encumbered system.

500 downloads per day would only need the cooperation of 13,000 people for one day to capture the entire database. There certainly could be permutations of people and days. To think that anyone might be able to control this public information beyond the download (incorporating it in the apps is different) would be incredible.
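As a rough check on that arithmetic, here is a minimal sketch. The 6.4 million case count and the 500-per-day cap are the figures reported in the post; the function name is made up for illustration.

```python
TOTAL_CASES = 6_400_000  # case count reported in the post
DAILY_LIMIT = 500        # per-user daily download cap reported in the post


def person_days_needed(total_cases: int, daily_limit: int) -> int:
    """Number of (person, day) download slots needed to fetch every case once."""
    # Integer ceiling division, so a partial final batch still counts as a slot.
    return -(-total_cases // daily_limit)


slots = person_days_needed(TOTAL_CASES, DAILY_LIMIT)
print(slots)  # 12800
```

That works out to 12,800 person-days, which is close to the 13,000 quoted above. The same total could be spread over fewer people and more days, for example 128 people for 100 days.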
keithzg (profile), 31 Oct 2018 @ 4:16pm

Re-publishing and archiving
In terms of PACER, I think the main effort has been https://free.law/recap/ ?

And yeah, I was thinking, someone should definitely coordinate this. I'd certainly run a CRON job on one of my systems to pull down another 500 downloads per day, orchestrated to avoid duplication of effort by some central server like how bitcoin mining pools work.
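A coordinator of that kind might look something like the sketch below. Everything here is hypothetical: it assumes cases can be addressed by a dense integer index, which may not match how the real service identifies them, and it does not make any network calls. It only shows how a central server could hand out non-overlapping batches of 500 so that no two volunteers fetch the same case.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

DAILY_LIMIT = 500  # per-user daily cap mentioned in the post


@dataclass
class Coordinator:
    """Hands out non-overlapping batches of case indexes (hypothetical scheme)."""
    total_cases: int
    batch_size: int = DAILY_LIMIT
    next_start: int = 0
    assignments: Dict[str, List[range]] = field(default_factory=dict)

    def claim_batch(self, worker_id: str) -> Optional[range]:
        """Return the next unclaimed index range, or None when nothing is left."""
        if self.next_start >= self.total_cases:
            return None
        start = self.next_start
        stop = min(start + self.batch_size, self.total_cases)
        self.next_start = stop
        batch = range(start, stop)
        # Remember who took which batch so failed downloads could be reassigned.
        self.assignments.setdefault(worker_id, []).append(batch)
        return batch


coordinator = Coordinator(total_cases=1200)
first = coordinator.claim_batch("alice")
second = coordinator.claim_batch("bob")
print(first, second)
```

Each volunteer's daily job would ask for one batch, download only those cases, and report back. A real version would also need to handle batches that were claimed but never completed.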
keithzg (profile), 31 Oct 2018 @ 4:17pm

Limericks
Yeah, it's definitely hit-and-miss, but with some minor reformatting there are some gems. Example:

"Threes and fours, mostly rejects;
He questioned all of the suspects.
A particular bank,
A cylindrical tank,
Affirmed in all other respects."
Anonymous Anonymous Coward (profile), 31 Oct 2018 @ 4:30pm

Re: Re-publishing and archiving
Share the code. Here included.
Anonymous Coward, 31 Oct 2018 @ 4:35pm

But how will censorship be properly implemented?
Court records contain much information that, at least judging by Wikipedia administrator standards, is unfit for public view. The identities of rape and sexual assault accusers would be the most obvious examples of forbidden knowledge, even when, as in the case of Julian Assange, names that have been repeatedly printed in newspapers all over the world are meticulously scrubbed off Wikipedia the instant they (re)appear. To a slightly lesser extent, the same goes for "ordinary" personal information about most anyone (i.e., "doxing") obtained through publicly available online sources, including court records.

Despite lacking any actual political power, this online "encyclopedia" could be viewed as a kind of democracy in action, a way to determine what kind of information could be considered fit for public consumption and what is not. And judging by Wikipedia's standards, much of the information in court records is considered private and thus must not be seen by the public (even when it can already be easily found on the internet).

So the question is, will these court documents get reviewed and scrubbed of personally identifying (and potentially embarrassing) information, and perhaps even corrected to meet modern day standards of social etiquette (like using the "correct" pronouns), or will this be a kind of massive data leak sure to upset everyone from traditional privacy advocates to modern social-justice activists?
Anonymous Anonymous Coward (profile), 31 Oct 2018 @ 4:45pm

Re: But how will censorship be properly implemented?
There is a difference between what a private corporation, Wikipedia, publishes, and what are public documents, no matter how hard to find. Wikipedia could be sued many, many times, and whether there is a case or not they would have to defend themselves, even if the information was considered public.

Public documents, on the other hand, whoever posted them, are not actually actionable, though there are some in the EU who might differ with that.

Maybe if Wikipedia opened a set of public documents pages and then linked to that it might preserve some of the legal angst that would come their way if they didn't. Then again, maybe not.

I will be looking forward to hearing here on Techdirt about the lawsuits from folks in the EU against Harvard for the publication of these documents, even though those lawsuits should go nowhere.
Anonymous Anonymous Coward (profile), 31 Oct 2018 @ 4:50pm

Re: "Lowering The Bar"...
Lowering the Bar's subtle humor deserves a link.
cecil, 1 Nov 2018 @ 8:43am

machine readable
PDFs are machine readable. If you mean text, simply say text.
Anonymous Coward, 1 Nov 2018 @ 8:51am

Re: machine readable
Most scanned PDFs are images, and not machine readable until an OCR operation is performed.
Christenson, 1 Nov 2018 @ 9:50am

Next App: Which cops perjured themselves??
That is, there are any number of weasel words to say that a cop's testimony is questionable. Seems to me a new app could mine the corpus for the names of all the cops, then look for when a judge determined that the cop's testimony was "not credible", and spit out the name of the cop and citations.
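A very naive first pass at that idea could be sketched as follows. The phrase list, the title list, and the sample text are all invented for illustration, and the same-sentence heuristic would produce false positives (for instance when a sentence names an officer but the credibility finding is about someone else), so treat it as a starting point and not a working classifier.

```python
import re

# Hypothetical phrase list; real opinions use many more formulations.
CREDIBILITY_PHRASES = re.compile(r"not credible|lacked credibility", re.IGNORECASE)
# Hypothetical title list; captures a single capitalized surname after the title.
OFFICER_NAME = re.compile(r"\b(?:Officer|Detective|Sergeant|Deputy|Trooper)\s+([A-Z][a-z]+)")


def flag_officers(opinion_text: str) -> list:
    """Return surnames of officers named in the same sentence as a credibility phrase."""
    flagged = set()
    # Crude sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", opinion_text):
        if CREDIBILITY_PHRASES.search(sentence):
            flagged.update(OFFICER_NAME.findall(sentence))
    return sorted(flagged)


sample = (
    "The court heard from Officer Smith and Detective Jones. "
    "We find the testimony of Officer Smith not credible. "
    "Detective Jones testified consistently."
)
print(flag_officers(sample))  # ['Smith']
```

Pairing each hit with the case citation would give the name-plus-citations output described above, though any result would need human review before being published.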