Harvard and Google to Release 1 Million Public Domain Books for AI Training

Harvard University, in collaboration with Google, announces the release of a dataset containing approximately 1 million public domain books for AI model training, aiming to democratize access to high-quality training data for researchers and startups.

Harvard's Ambitious AI Training Dataset Initiative

Harvard University has announced a groundbreaking initiative to release a dataset of approximately 1 million public domain books for training Artificial Intelligence (AI) models 1. This project, part of the Institutional Data Initiative (IDI) hosted within the Harvard Law School Library, aims to expand and enhance the data resources available for AI training 1.

Collaboration with Tech Giants

The initiative is a collaborative effort involving Google, with the dataset comprising works from Google's extensive book-scanning project 2. Notably, the project has secured funding from both Microsoft and OpenAI, highlighting the tech industry's interest in this resource 3.

Diverse Content and Accessibility

The dataset spans a wide array of genres, languages, and time periods, featuring classic works by renowned authors like Charles Dickens, Shakespeare, and Dante, alongside more obscure works such as Czech math textbooks and Welsh pocket dictionaries 1 4. This breadth is intended to address a current limitation of AI training data, in which various groups and perspectives are often underrepresented 1.

Leveling the Playing Field

Greg Leppert, IDI Executive Director, emphasized that the dataset is designed to "level the playing field" by providing access to a vast collection of high-quality training data for research labs and AI startups 3. This move is particularly significant given the current landscape where AI training data often comes with a hefty price tag, favoring deep-pocketed tech firms 3.

Addressing Legal Concerns

Using public domain books sidesteps the copyright infringement claims that AI companies currently face. Recent lawsuits from major publishers against AI firms highlight the ongoing tensions in the industry over the use of copyrighted materials for AI training 2. Harvard's initiative provides a legally safe pool of historical texts for responsible model training 4.

Future Implications and Limitations

While this dataset represents a significant step forward, questions remain about whether it is sufficient on its own for comprehensive AI model training. Because these historical texts lack contemporary references and current language, AI companies seeking to build competitive, up-to-date models may need additional data sources 2 4.

Global Context and Similar Initiatives

This project aligns with global efforts to ensure diverse representation in AI training data. For instance, Iceland has undertaken a national effort to ensure its language and culture are represented in AI models 1. In India, plans are underway to launch an open-source forum, "IndiaAI Datasets Platform," by January 2025, aimed at hosting datasets for AI development 1.

TheOutpost.ai

© 2025 Triveous Technologies Private Limited