5 Sources
[1]
Harvard to share dataset of 1M public domain books for AI training
Harvard University has announced that it will release about 1 million public domain books as a dataset for training Artificial Intelligence (AI) models. The release is part of the Institutional Data Initiative (IDI), a program hosted within the Harvard Law School Library to "expand and enhance the data resources available for AI training." The dataset includes books scanned at the Harvard Library during the Google Books project.

The data currently used for training AI models is limited in scale, scope and other parameters, with various groups and perspectives "massively underrepresented," said Greg Leppert, the IDI's executive director. He noted that AI systems are only as diverse as the data used to train them, and that public domain datasets like this one can prove crucial for future AI training. "As it stands, outliers will not be served by AI as well as they should be," Leppert explained. He cited the example of Iceland, which undertook a national effort to ensure its language and culture would be represented in AI models.

The dataset spans genres, languages and time periods, with classical texts from writers like Charles Dickens, Shakespeare and Dante featured alongside Czech math textbooks and Welsh pocket dictionaries, according to a report by Wired.

The effort follows a similar approach Harvard took with its Caselaw Access Project, a multi-year effort that began in 2015 and, over the next three years, scanned, parsed and structured more than 360 years of United States case law into a single dataset -- something Leppert called the "backbone of legal AI training sets." The new initiative stems from the university's participation in the Google Books project, which it joined as an early participant two decades ago. Notably, the Wired report said the IDI received funding from both Microsoft and OpenAI while creating the dataset.

India, for its part, is set to launch an open-source forum, the "IndiaAI Datasets Platform," for hosting datasets by January 2025, National e-Governance Division (NeGD) Chief Executive Nand Kumarum said in October this year. The platform, set to be one of the foundations of the government's Rs 10,000 crore IndiaAI Mission, will work somewhat like Hugging Face, an AI community that lets individuals use already available datasets to create models. It is aimed at creating a space for developers to create, train and deploy their own models.
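The IDI hasn't said how the books will be distributed, but if the corpus ends up on a hub like Hugging Face, pulling it into a training pipeline could look roughly like the sketch below. This is a minimal Python example using the Hugging Face `datasets` library; the dataset identifier and the "text" field name are hypothetical, since no official listing or schema has been published.

    # Minimal sketch: streaming a public-domain book corpus from the
    # Hugging Face Hub. The dataset ID below is hypothetical -- the IDI
    # has not announced an official name or schema.
    from datasets import load_dataset

    books = load_dataset(
        "institutional-data-initiative/public-domain-books",  # hypothetical ID
        split="train",
        streaming=True,  # stream instead of downloading ~1M books up front
    )

    # Peek at a few records; the "text" field name is an assumption.
    for record in books.take(3):
        print(record["text"][:200])

Streaming matters here because a million books is far too large to materialize on disk for a quick experiment; the same call pattern works for any corpus hosted on the Hub.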
[2]
Google and Harvard drop 1 million books to train AI models
Harvard University, in collaboration with Google, will release a dataset of approximately one million public-domain books for use in training AI models, according to WIRED. The initiative, known as the Institutional Data Initiative, has secured funding from both Microsoft and OpenAI. The dataset comprises works that are no longer under copyright protection, drawn from Google's extensive book-scanning efforts.

The announcement came on December 12, 2024. The dataset encompasses a wide array of genres, languages, and authors, including notable figures like Dickens, Dante, and Shakespeare. Harvard's executive director for the initiative, Greg Leppert, emphasized that the dataset aims to "level the playing field" by giving research labs and AI startups access to data for their language model development efforts. The dataset is intended for anyone looking to train large language models (LLMs), although the specific release date and method have yet to be disclosed.

As AI technologies increasingly rely on vast amounts of text, this dataset serves as a crucial resource: foundational models like the one behind ChatGPT benefit significantly from high-quality training data. That need for data has also caused problems for companies like OpenAI, which face legal scrutiny over the unauthorized use of copyrighted materials. Lawsuits from major publishers, including the Wall Street Journal and the New York Times, highlight ongoing tensions over content use and copyright infringement in AI training.

While the forthcoming dataset will be advantageous, it is unclear whether one million books will be sufficient to meet the demands of AI model training, especially since contemporary references and updated slang are not covered within these historical texts. AI companies will continue to seek additional data sources, particularly exclusive or up-to-date information, to distinguish their models from competitors.

Developers in the AI sector are not limited to historical texts alone, but several platforms, including Reddit and X, have begun restricting access to their data as they recognize its increasing value. Reddit has entered licensing deals with companies like Google, while X maintains exclusive content arrangements for real-time data. This shift in content accessibility reflects a competitive landscape in which AI companies struggle to acquire adequate and relevant training data without facing legal repercussions.

The Institutional Data Initiative is a step toward easing these pressures by providing a legally safe pool of historical texts for responsible model training. However, broader strategies will still be necessary to ensure AI models are competitive and capable of understanding contemporary language and references, and how effectively this resource will fulfill the ongoing demand for comprehensive, diverse data remains an open question.
[3]
Harvard and Google to release 1 million public-domain books as AI training dataset
AI training data has a big price tag, one best suited to deep-pocketed tech firms. This is why Harvard University plans to release a dataset of roughly 1 million public-domain books -- spanning genres, languages, and authors including Dickens, Dante, and Shakespeare -- which are no longer copyright-protected due to their age.

The new dataset isn't available yet, and it's not clear when or how it will be released. However, it contains books derived from Google's longstanding book-scanning project, Google Books, and thus Google will be involved in releasing "this treasure trove far and wide."

Harvard first teased the Institutional Data Initiative (IDI) back in March, outlining its plans to create a "trusted conduit for legal data for AI." However, little had been heard from it until its formal launch today, which came with confirmation that the IDI has financial backing from Microsoft and OpenAI. The IDI's executive director, Greg Leppert, says the dataset is designed to "level the playing field" by opening up such a huge resource to anyone -- from research labs to AI startups -- who wants to train large language models (LLMs).
[4]
Harvard Makes 1 Million Books Available to Train AI Models
The dataset includes books that are in the public domain and no longer protected by copyright. Data is the new oil, as they say, and perhaps that makes Harvard University the new Exxon. The school announced Thursday the launch of a dataset containing nearly one million public domain books that can be used for training AI models. Organized under the newly formed Institutional Data Initiative, the project has received funding from both Microsoft and OpenAI, and the dataset contains books scanned by Google Books that are old enough for their copyright protection to have expired. Wired, in a piece on the new project, says the dataset includes a wide variety of books, with "classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries." As a general rule, copyright protection lasts for the lifetime of the author plus an additional 70 years.

Foundational language models, like the one behind ChatGPT, that convincingly imitate real humans require an immense amount of high-quality text for their training -- generally, the more information they ingest, the better the models perform at imitating humans and serving up knowledge. But that thirst for data has caused problems as the likes of OpenAI have hit walls on how much new information they can find -- without stealing it, at least. Publishers including the Wall Street Journal and the New York Times have sued OpenAI and competitor Perplexity for ingesting their data without permission.

Proponents of AI companies have made various arguments to defend their activities. They will sometimes say that humans themselves produce new works by studying and synthesizing material from other sources, and that AI isn't any different: everyone goes to school, reads books, and then produces new work using the knowledge they gained. Remixing can be considered fair use when the new creation is materially different. But that fails to account for the fact that humans cannot ingest billions of pieces of text at the speed a computer can, so it's not exactly a fair comparison. The Wall Street Journal, in its lawsuit against Perplexity, has said the startup "copies on a massive scale." Players in the space have also argued that any content made available on the open web is essentially fair game, and that the user of a chatbot is the one accessing copyrighted content by requesting it through a prompt -- basically, that a chatbot like Perplexity is akin to a web browser. It will be some time before these arguments play out in court.

OpenAI has struck deals with some content providers in response to the criticism, and Perplexity has rolled out an ad-supported partner program with publishers, but it is clear they have done so begrudgingly. At the same time as AI companies are running out of new content to utilize, commonly used web sources already included in training sets have quickly begun restricting access. Companies including Reddit and X have been aggressive about limiting the use of their data as they have recognized its immense value, especially the value of real-time data for augmenting foundational models with more up-to-date information about the world. Reddit makes hundreds of millions of dollars licensing its corpus of subreddits and comments to Google for training its models. Elon Musk's X has an exclusive arrangement with his other company, xAI, giving its models access to the social network's content for training and for retrieval of current information.

It's somewhat ironic that these companies closely guard their own data while insisting that content from media publishers has no value and should be free. One million books won't be enough to supply any AI company's training needs, especially considering these books are old and don't contain modern information, like the slang Gen Z kids are using. To differentiate themselves from competitors, AI companies will want to continue accessing other data -- especially the exclusive kind -- so they are not all creating models that are the same. The Institutional Data Initiative's dataset can at least offer some assistance to AI companies trying to train their initial foundational models without getting into legal trouble.
[5]
Harvard adds copyright-free fuel to the AI fire.
With funding from Microsoft and OpenAI, the university's Institutional Data Initiative (IDI) is releasing a dataset for training AI that contains nearly one million public-domain books -- around five times larger than the controversial Books3 dataset. The aim is to "level the playing field" for smaller AI developers who don't have access to the massive datasets used by tech giants, according to IDI executive director Greg Leppert.
Harvard University, in collaboration with Google, announces the release of a dataset containing approximately 1 million public domain books for AI model training, aiming to democratize access to high-quality training data for researchers and startups.
Harvard University has announced a groundbreaking initiative to release a dataset of approximately 1 million public domain books for training Artificial Intelligence (AI) models [1]. This project, part of the Institutional Data Initiative (IDI) hosted within the Harvard Law School Library, aims to expand and enhance the data resources available for AI training [1].
The initiative is a collaborative effort involving Google, with the dataset comprising works from Google's extensive book-scanning project [2]. Notably, the project has secured funding from both Microsoft and OpenAI, highlighting the tech industry's interest in this resource [3].
The dataset spans a wide array of genres, languages, and time periods, featuring classical texts from renowned authors like Charles Dickens, Shakespeare, and Dante, alongside more obscure works such as Czech math textbooks and Welsh pocket dictionaries [1][4]. This diversity aims to address the current limitations in AI training data, where various groups and perspectives are often underrepresented [1].
Greg Leppert, IDI Executive Director, emphasized that the dataset is designed to "level the playing field" by providing access to a vast collection of high-quality training data for research labs and AI startups [3]. This move is particularly significant given the current landscape where AI training data often comes with a hefty price tag, favoring deep-pocketed tech firms [3].
The use of public domain books circumvents the legal challenges faced by AI companies regarding copyright infringement. Recent lawsuits from major publishers against AI firms highlight the ongoing tensions in the industry over the use of copyrighted materials for AI training [2]. Harvard's initiative provides a legally safe pool of historical texts for responsible model training [4].
While this dataset represents a significant step forward, questions remain about its sufficiency for comprehensive AI model training. The lack of contemporary references and updated language in these historical texts may necessitate additional data sources for AI companies seeking to create competitive and up-to-date models [2][4].
This project aligns with global efforts to ensure diverse representation in AI training data. For instance, Iceland has undertaken a national effort to ensure its language and culture are represented in AI models [1]. In India, plans are underway to launch an open-source forum, the "IndiaAI Datasets Platform," by January 2025, aimed at hosting datasets for AI development [1].
Summarized by Navi