
Institutional Data Initiative

at Harvard Law School Library

The Institutional Books Corpus
AS OF JUNE 12, 2025

983K books

386M pages

242B tokens (o200k_base)

254 languages

Institutional Books 1.0 is our first release of public domain books. This set was originally digitized through Harvard Library’s participation in the Google Books project.

As we continue to build our corpus of vetted source material, we invite libraries and other knowledge institutions to join us in our mission.


Institutional Books is a practice in exploration.

To better understand the corpus and its potential impact, we analyzed the dataset’s coverage across time, topic, and language.
Language Coverage

We conducted text-level language detection on the OCR-extracted text and identified 379 unique languages.

eng (English): 43%
deu (German): 17%
fra (French): 14%
ita (Italian): 4%
lat (Latin): 3%
spa (Spanish): 2%
rus (Russian): 2%
ell (Greek): 1.5%
nld (Dutch): 1.2%
heb (Hebrew): 0.9%

The results of our analysis confirm that this collection focuses mainly on Western European languages, particularly English, while offering varying levels of coverage for a long tail of languages.

The table above displays the top 10 languages by total number of detected o200k_base tokens; for each of these languages, we detected more than 1B tokens.
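For readers who want to reproduce this kind of analysis, the sketch below counts o200k_base tokens per detected language across a volume's pages. It assumes the plain-text pages are already in hand and uses tiktoken and langdetect as stand-in libraries; note that langdetect returns two-letter ISO 639-1 codes rather than the three-letter codes shown above, and the tooling behind our published analysis may differ.

```python
# Hypothetical sketch: per-page language detection and o200k_base token
# counting for one volume. tiktoken and langdetect are stand-ins for
# whatever tooling is preferred.
from collections import Counter

import tiktoken                 # pip install tiktoken
from langdetect import detect   # pip install langdetect

enc = tiktoken.get_encoding("o200k_base")

def analyze_pages(pages: list[str]) -> tuple[Counter, int]:
    """Return (tokens per detected language, total token count)."""
    tokens_by_lang: Counter = Counter()
    total_tokens = 0
    for text in pages:
        if not text.strip():
            continue
        n_tokens = len(enc.encode(text))
        total_tokens += n_tokens
        try:
            lang = detect(text)  # ISO 639-1 code, e.g. "en", "de"
        except Exception:
            lang = "und"         # undetermined / too little signal
        tokens_by_lang[lang] += n_tokens
    return tokens_by_lang, total_tokens
```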

Temporal Coverage

To get a sense of the collection’s temporal coverage, we analyzed each volume’s bibliographical metadata.

Of the 67% of books with a precise publication date, the majority were published in the 19th and 20th centuries.

[Dot area graph: volumes by publication year, 1700–2000; y-axis from 50K to 150K volumes]
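As a rough illustration of the metadata work involved, the sketch below extracts a four-digit publication year from a free-text date field and tallies volumes per year. The field name and date formats are hypothetical; the published analysis worked from the volumes' actual bibliographic records.

```python
# Hypothetical sketch: extracting a precise publication year from a
# bibliographic date string and counting volumes per year.
import re
from collections import Counter

YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")  # years 1500-2099

def extract_year(date_field: str | None) -> int | None:
    """Return a four-digit year if one can be found, else None."""
    if not date_field:
        return None
    match = YEAR_RE.search(date_field)
    return int(match.group(1)) if match else None

# Example usage with made-up records:
records = [{"date": "London, 1867."}, {"date": "c1921"}, {"date": "n.d."}]
volumes_per_year = Counter(
    year for year in (extract_year(r["date"]) for r in records) if year
)
print(volumes_per_year)  # Counter({1867: 1, 1921: 1})
```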
Topic Classification

We conducted a series of experiments to classify volumes using the first level of the Library of Congress Classification Outline.

Language and Literature: 24%
Law: 13%
Philosophy, Psychology, Religion: 12%
Science: 11%
Social Science: 5%
Agriculture: 4%
Auxiliary Sciences of History: 3%
Medicine: 3%
History of the Americas: 3%
Political Science: 3%

In our analysis, we found a concentration of volumes on the topics of Language and Literature; Law; Philosophy, Psychology, Religion; and Science. The table above displays the top ten topics out of twenty topics in total.
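Where a volume already carries a Library of Congress call number, its first-level class can be read off the leading letter, as in the partial mapping sketched below (shown for illustration only); volumes without usable call numbers require a different approach, such as model-based classification.

```python
# Sketch: mapping the leading letter of a Library of Congress call number
# to its first-level class. Only a subset of the top-level classes is shown.
LCC_FIRST_LEVEL = {
    "B": "Philosophy, Psychology, Religion",
    "C": "Auxiliary Sciences of History",
    "E": "History of the Americas",
    "F": "History of the Americas",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
}

def first_level_class(call_number: str) -> str | None:
    """Return the first-level LCC class for a call number, if known."""
    letter = call_number.strip()[:1].upper()
    return LCC_FIRST_LEVEL.get(letter)

print(first_level_class("PQ2042 .A3 1863"))  # Language and Literature
```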


Institutional Books is a practice in refinement.

OPEN SOURCED JUNE 12, 2025
Our Pipeline
Our pipeline comprises a series of experiments aimed at retrieving, analyzing, and refining the source material in order to make the resulting dataset easier to filter, read, and use, for humans and machines alike.

Retrieval


Analysis


Refinement

Retrieval of Source Material
Retrieving over one million books stored on Google Books’ servers required writing a custom retrieval pipeline, which we intend to release as open-source software following further refinement. For each volume, we sought to retrieve a .tar.gz archive containing scan images, OCR data, and bibliographic and processing-related metadata.
[Extraction diagram]
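While the retrieval pipeline itself is not yet released, the per-volume archives can be inspected with standard tooling. The sketch below lists and unpacks one such archive; the file name and internal layout shown are hypothetical.

```python
# Sketch: inspecting and unpacking one retrieved volume archive with the
# standard library. The archive name and member layout are hypothetical;
# each archive is expected to hold scan images, OCR data, and metadata.
import tarfile
from pathlib import Path

archive_path = Path("volume_12345.tar.gz")    # hypothetical file name
output_dir = Path("volumes/12345")

with tarfile.open(archive_path, "r:gz") as tar:
    for member in tar.getmembers():
        print(member.name, member.size)        # scans, OCR files, metadata
    # "data" extraction filter requires Python 3.12 or a backported release
    tar.extractall(output_dir, filter="data")
```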
Analysis of Source Material
To enable effective use of the dataset, we analyzed the temporal, language, and topic coverage of the collection. The image below demonstrates how text-level language detection identified both French and Latin in a book previously cataloged as Latin.
[Analysis diagram]
Refinement of OCR-Extracted Text
While the quality of the OCR-extracted text is satisfactory at the character or word level, we observed semantic and positional decontextualization introduced by exporting OCR data as plain text.
As a first step towards improving the usability of the OCR-extracted text, we developed a post-processing pipeline that reassembles the text using the detected type of each line as a signal.
[Refinement example]
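The line-type taxonomy and reassembly rules we used are detailed in the report; the sketch below only illustrates the general idea under simplified, hypothetical line types: dropping page furniture, rejoining hyphenated line breaks, and merging consecutive body lines into paragraphs.

```python
# Simplified sketch of line-type-driven reassembly. The line types and
# rules here are illustrative, not those of the actual pipeline.
def reassemble(lines: list[tuple[str, str]]) -> str:
    """lines: (line_type, text) pairs in reading order."""
    paragraphs: list[str] = []
    current: list[str] = []
    for line_type, text in lines:
        if line_type in {"running_header", "footer", "page_number"}:
            continue                              # drop repeated page furniture
        if line_type == "blank":
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        text = text.strip()
        if current and current[-1].endswith("-"):
            current[-1] = current[-1][:-1] + text  # rejoin hyphenated words
        else:
            current.append(text)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```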
Learn more about the pipeline in our report.
Read the report

Institutional Books is a practice in community.

With the release of this dataset, we seek to establish a community-led process to grow, improve, and use data in ways that strengthen the knowledge ecosystem and the underlying data itself. We envision an institutional commons, supported by community and collaborative research, which incorporates improvements from the AI and research communities for collective benefit. We welcome collaboration from researchers, model makers, and technologists in the following research areas:

Evals & Benchmarks

We see opportunities for Institutional Books to improve model outputs along the axes of long context, multilingual capabilities, and more.

We welcome model makers and AI labs interested in co-developing benchmarks and evaluating the impact of Institutional Books on their models.

Data Refinement and OCR

Our goal is for Institutional Books to best represent the original source material. We invite continued refinement of the OCR-extracted plain text as well as initiatives to re-OCR the dataset and export it as structured text. We believe this process holds potential for developing better OCR pipelines for library use.


Institutional Books is a practice in stewardship.

IDI partners with libraries to surface collections for the public interest.

The Institutional Data Initiative is built on the belief that libraries have the expertise and the data, in the form of their collections, to influence AI’s trajectory towards the public interest. As AI is poised to change how people access knowledge, this is a powerful point of leverage that libraries can use to assert their leadership and engender collaboration in the development of beneficial AI. 

We work with libraries to release structured and refined collections around which AI development and research can transparently unfold. Our goal is to expand the diversity of information, languages, and cultures represented in current models while making information more accessible for the patrons libraries serve.

Martha Whitehead

Harvard University, University Librarian

"As stewards of the public domain and curators of diverse, trustworthy collections, we have the foundational materials needed to train inclusive AI systems.

Through initiatives like IDI, we aim to partner in shaping the ethical use of those materials in emerging systems, to ensure they reflect the breadth and depth of human knowledge for the benefit of all."
