Abstract: Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on the quality of their outputs. The rapid development and adoption of LLMs of varying quality have brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242 billion tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods, as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read, and use.
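To give a concrete sense of how a release at this scale might be consumed programmatically, the minimal sketch below streams volumes and filters them on a metadata field using the Hugging Face datasets library. The repository identifier and the field names ("language", "title") are illustrative assumptions, not the published schema.

```python
# Minimal sketch of consuming a large public-domain text release via
# Hugging Face datasets. The repo id and field names are assumptions.
from datasets import load_dataset

# Stream rather than download: the full release is ~242 billion tokens.
volumes = load_dataset(
    "institutional/institutional-books-1.0",  # hypothetical repo id
    split="train",
    streaming=True,
)

# Filter on an assumed per-volume language field from the metadata.
english_volumes = volumes.filter(lambda v: v["language"] == "eng")

# Inspect a few records without materializing the whole collection.
for volume in english_volumes.take(3):
    print(volume["title"])
```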
Abstract: Actuarial risk assessments might be unduly perceived as a neutral way to counteract implicit bias and increase the fairness of decisions made at almost every juncture of the criminal justice system, from pretrial release to sentencing, parole and probation. In recent times, these assessments have come under increased scrutiny, as critics claim that the statistical techniques underlying them might reproduce existing patterns of discrimination and historical biases that are reflected in the data. Much of this debate is centered around competing notions of fairness and predictive accuracy, resting on the contested use of variables that act as "proxies" for characteristics legally protected against discrimination, such as race and gender. We argue that a core ethical debate surrounding the use of regression in risk assessments is not simply one of bias or accuracy. Rather, it is one of purpose. If machine learning is operationalized merely in the service of predicting individual future crime, then it becomes difficult to break cycles of criminalization that are driven by the iatrogenic effects of the criminal justice system itself. We posit that machine learning should not be used for prediction, but rather to surface covariates that are fed into a causal model for understanding the social, structural and psychological drivers of crime. We propose shifting the application of machine learning and causal inference away from predicting risk scores and toward risk mitigation.
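As an illustrative sketch of the proposed division of labor, and not the authors' implementation, the snippet below uses a sparse learner to surface candidate covariates on synthetic data and then feeds them into a simple regression adjustment that stands in for a fuller causal model. The data-generating process, variable names, and choice of estimators are all assumptions made for the example.

```python
# Illustrative sketch (not the authors' method): surface covariates with
# a sparse model, then hand them to a simple causal adjustment.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: X holds social/structural covariates, t a
# hypothetical intervention (e.g., a diversion program), y an outcome.
n, p = 2000, 50
X = rng.normal(size=(n, p))
t = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
y = 2.0 * t + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Step 1: use machine learning to surface covariates, rather than
# treating its predictions as individual risk scores.
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(Xs, y)
surfaced = np.flatnonzero(lasso.coef_)

# Step 2: feed the surfaced covariates into an adjustment model to
# estimate the effect of the intervention on the outcome.
Z = np.column_stack([np.ones(n), t, Xs[:, surfaced]])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print("surfaced covariates:", surfaced)
print("estimated intervention effect:", beta[1])
```

The point of the sketch is the reorientation it encodes: the learner's coefficients are used only to nominate drivers worth modeling, while the quantity reported is the estimated effect of an intervention, not a prediction about an individual.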