983K books
386M pages
242B tokens
o200k_base
254 languages
As we continue to build our corpus of vetted source material, we invite libraries and other knowledge institutions to join us in our mission.
Institutional Books is
a practice in exploration.
We conducted text-level language detection on the OCR-extracted text, identifying 379 unique languages.
The results of our analysis confirm that this collection focuses mainly on Western European languages, particularly English, while offering varying levels of coverage for a long tail of languages.
The table above displays the top 10 languages by total number of detected o200k_base tokens; each of these languages accounts for more than 1B o200k_base tokens in the collection.
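The language detection described above can be illustrated with a deliberately simplified sketch. The actual detector used for the dataset is not specified on this page; a production pipeline would use a trained model, whereas this toy version just scores text against small stopword sets (the word lists below are illustrative, not exhaustive):

```python
# Toy language detector: scores text by overlap with tiny stopword sets.
# Illustrative only -- a real pipeline would use a trained language-ID model.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
    "fr": {"le", "la", "et", "les", "des", "est"},
}

def detect_language(text: str) -> str:
    words = set(text.lower().split())
    # Pick the language whose stopword set overlaps the text the most.
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(detect_language("the history of the law and its origins"))  # -> en
```

Run at the page level rather than the volume level, an approach like this is what makes it possible to surface the long tail of languages mentioned above.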
To get a sense of the collection’s temporal coverage, we analyzed each volume’s bibliographical metadata.
Of the 67% of books with a precise publication date, the majority were published in the 19th and 20th centuries.
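The temporal analysis above amounts to bucketing publication years by century while skipping volumes without a precise date. A minimal sketch (the field names and record format are assumptions, not the dataset's actual metadata schema):

```python
from collections import Counter

def century_of(year: int) -> int:
    # 1850 -> 19th century, 1901 -> 20th century.
    return (year - 1) // 100 + 1

def temporal_coverage(years):
    # Tally volumes per century; None marks a volume without a precise date.
    return Counter(century_of(y) for y in years if y is not None)

counts = temporal_coverage([1850, 1923, 1999, None, 1776])
print(counts)  # Counter({20: 2, 19: 1, 18: 1})
```

Applied to the full bibliographic metadata, this kind of tally is what shows the concentration of volumes in the 19th and 20th centuries.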
We conducted a series of experiments to classify volumes using the first level of the Library of Congress Classification Outline.
In our analysis, we found a concentration of volumes on the topics of Language and Literature; Law; Philosophy, Psychology, Religion; and Science. The table above displays the top ten of the twenty total topics.
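At the first level, the Library of Congress Classification is keyed by the leading letter of a call number. The class letters below are the real LCC top-level classes for the four concentrated topics named above, but note that the classification experiments described here likely predicted classes from volume content or metadata rather than reading call numbers; this sketch only illustrates the first-level mapping itself:

```python
# Map the leading letter of an LCC call number to its top-level class.
# Subset of the Library of Congress Classification Outline.
LCC_CLASSES = {
    "B": "Philosophy, Psychology, Religion",
    "K": "Law",
    "P": "Language and Literature",
    "Q": "Science",
}

def top_level_class(call_number: str) -> str:
    # Only the leading letter determines the first-level class.
    return LCC_CLASSES.get(call_number[0].upper(), "Unknown")

print(top_level_class("PR4034 .P7"))  # -> Language and Literature
```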
Institutional Books is
a practice in refinement.
Retrieval
Analysis
Refinement
Institutional Books is
a practice in community.
Evals & Benchmarks
We see opportunities for Institutional Books to improve model outputs along the axes of long context, multilingual capabilities, and more.
We welcome model makers and AI labs interested in co-developing benchmarks and evaluating the impact of Institutional Books on their models.
Data Refinement and OCR
Our goal is for Institutional Books to best represent the original source material. We invite continued refinement of the OCR-extracted plain text as well as initiatives to re-OCR the dataset and export it as structured text. We believe this process holds potential for developing better OCR pipelines for library use.
Institutional Books is
a practice in stewardship.
The Institutional Data Initiative is built on the belief that libraries have the expertise and the data, in the form of their collections, to influence AI’s trajectory towards the public interest. As AI is poised to change how people access knowledge, this is a powerful point of leverage that libraries can use to assert their leadership and engender collaboration in the development of beneficial AI.
We work with libraries to release structured and refined collections around which AI development and research can transparently unfold. Our goal is to expand the diversity of information, languages, and cultures represented in current models while making information more accessible for the patrons libraries serve.
Martha Whitehead
Harvard University, University Librarian
"As stewards of the public domain and curators of diverse, trustworthy collections, we have the foundational materials needed to train inclusive AI systems.
Through initiatives like IDI, we aim to partner in shaping the ethical use of those materials in emerging systems, to ensure they reflect the breadth and depth of human knowledge for the benefit of all."