On Sat, 9 Nov, 4:01 PM UTC
2 Sources
[1]
New Data Shows AI Companies Love 'Premium Publisher' Content
OpenAI, Google, Meta and Anthropic all rely deeply on content from premium publishers to train the large language models, or LLMs, at the heart of their AI efforts, even as these companies have regularly underplayed their use of such copyrighted content, according to new research released this week from online publishing giant Ziff Davis. Ziff Davis owns CNET, as well as a host of other brands, including IGN, PCMag, Mashable and Everyday Health.
A paper detailing the research, authored by Ziff Davis' George Wukoson, lead attorney on AI, and Chief Technology Officer Joey Fortuna, reports that AI companies intentionally filtered out low-quality content in favor of high-quality, human-made content to train their models. Given that AI companies want their models to perform well, it makes sense they'd favor quality content in their training data. AI companies used websites' domain authority, essentially their ranking in Google search, to make those distinctions. Generally, sources that rank higher on Google tend to be of higher quality and trustworthiness.
The companies behind popular AI chatbots like ChatGPT and Gemini have been secretive about where they're sourcing the information that powers the answers the bots are giving you. That's not helpful for consumers, who don't get visibility into the sources, their reliability, or whether the training data might be biased or perpetuate harmful stereotypes. But it's also a point of significant dispute with publishers, who say AI companies are basically pirating the copyrighted work they own, without permission or compensation. Though OpenAI has licensed content from some publishers as it transforms from a nonprofit into a for-profit company, other media companies are suing the maker of ChatGPT for copyright infringement.
"Major LLM developers no longer disclose their training data as they once did. They are now more commercial and less transparent," Wukoson and Fortuna wrote. OpenAI, Google, Meta and Anthropic didn't immediately respond to requests for comment.
Publishers including The New York Times have sued Microsoft and OpenAI for copyright infringement, while Wall Street Journal and New York Post publisher Dow Jones is suing Perplexity, another generative AI startup, on similar grounds.
Big Tech has seen tremendous valuations amid the AI revolution. Google is currently valued at about $2.2 trillion, and Meta at about $1.5 trillion, in part because of their work with generative AI. Investors currently value startups OpenAI and Anthropic at $157 billion and $40 billion, respectively. News publishers, meanwhile, are struggling in a highly competitive online media environment and have been forced into waves of layoffs over the past few years, trying to navigate the noise of online search, AI-generated "slop" and social media to find audiences. Meta CEO Mark Zuckerberg said creators and publishers "overestimate the value of their specific content" in an interview with The Verge earlier this year.
Meanwhile, some AI companies have inked licensing deals with publishers to feed their LLMs with up-to-date news articles. OpenAI signed a deal with the Financial Times, DotDash Meredith, Vox and others earlier this year.
Meta and Microsoft have also cut deals with publishers. Ziff Davis hasn't signed a similar deal.
Based on an analysis of disclosures made by AI companies for their older models, Wukoson and Fortuna found that URLs from top-end publishers such as Axel Springer (Business Insider, Politico), Future PLC (TechRadar, Tom's Guide), Hearst (San Francisco Chronicle, Men's Health), News Corp (The Wall Street Journal), The New York Times Company, The Washington Post and others made up 12.04% of training data, at least for the OpenWebText2 dataset. OpenWebText2 was used to train GPT-3, the technology underlying ChatGPT, though the latest version of ChatGPT isn't built directly on GPT-3. None of OpenAI, Google, Anthropic or Meta has disclosed the training data used for their most recent models.
Each of the several trends discussed in the research paper "reflects decisions made by LLM companies to prioritize high-quality web text datasets in training LLMs, resulting in revolutionary technological breakthroughs driving enormous value for those companies," Wukoson and Fortuna wrote.
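The 12.04% figure comes from tallying which of a dataset's source URLs resolve to domains owned by those publishers. As a rough illustration of that kind of tally (not the paper's actual method or code; the domain list and sample URLs below are hypothetical), here is a minimal Python sketch:

```python
# Illustrative sketch only -- not the Ziff Davis authors' methodology or code.
# Estimates what share of a dataset's source URLs point at a hand-picked list
# of "premium publisher" domains.
from urllib.parse import urlparse

PREMIUM_DOMAINS = {
    "nytimes.com",          # The New York Times Company
    "wsj.com",              # News Corp
    "washingtonpost.com",   # The Washington Post
    "businessinsider.com",  # Axel Springer
    "tomsguide.com",        # Future PLC
}

def premium_share(urls):
    """Fraction of URLs whose host matches (or is a subdomain of) a premium domain."""
    hits = 0
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[len("www."):]
        if any(host == d or host.endswith("." + d) for d in PREMIUM_DOMAINS):
            hits += 1
    return hits / len(urls) if urls else 0.0

# Tiny illustrative sample; in practice you'd feed in the full URL list
# extracted from a dataset dump such as OpenWebText2's.
sample_urls = [
    "https://www.nytimes.com/2020/01/01/technology/example.html",
    "https://www.tomsguide.com/reviews/example-laptop",
    "https://some-random-blog.example.net/post/123",
]
print(f"Premium-publisher share: {premium_share(sample_urls):.2%}")
```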
[2]
Google, OpenAI Heavily Weight News Content in AI Training Without Payment
AI giants like Google, OpenAI, and Meta are placing greater emphasis on content from reputable news sources when training large language models, according to a new study by Ziff Davis. The findings could help the public understand where chatbots get their information and give media companies like Ziff Davis more leverage when seeking copyright protection or payment for their material when it's gobbled up by AI.
"Our work shows that key LLM training datasets are disproportionately composed of high-quality content owned by commercial publishers of news and media websites," the study says. "Major LLM companies have quantifiably prioritized this content in training the most important LLMs over the short history of the technology."
Ziff Davis is the parent company of PCMag. The study was conducted by the company's lead AI attorney, George Wukoson, and its chief technology officer, Joey Fortuna. It examined open-source replicas of datasets that AI companies have admitted to using, including Common Crawl, C4, OpenWebText, and OpenWebText2.
OpenAI admits to giving more weight to datasets it deems high-quality, including news media, copyrighted books, and links embedded in popular Reddit posts. This is a way of ranking all the content LLMs scrape from the web with the goal of producing better answers for users. For example, OpenAI gave WebText2 a 22% weight in training GPT-3 even though it accounted for only 3.8% of tokens. Nearly 13.5% of the URLs embedded in WebText2 come from a group of 15 top media publishers, including News Corp, The New York Times, Gannett, Ziff Davis, Vox Media, Axel Springer, Alden Capital, Hearst, The Washington Post, BuzzFeed, Future, IAC, and Bustle. The contents of the datasets also change over time: OpenAI placed a high emphasis on content from The Washington Post in OpenWebText, for example, but decreased its prominence for the release of OpenWebText2.
Ziff Davis says the findings quantify how important the news media is to the future of AI chatbots, even though AI companies are under no obligation to pay publishers for that content. This "long-running exploitation of high-quality publisher content (extremely lucrative for the LLM companies) [implies] lost licensing revenue from some of the world's most highly valued companies." Without payment for content, publishers could be put out of business, threatening the continuous flow of high-quality information in the AI era.
The report comes after a federal judge dismissed a lawsuit against OpenAI from Raw Story and AlterNet, which said the AI company used their content to train LLMs without permission, Reuters reports. A related case filed by The New York Times is ongoing. OpenAI has also signed licensing deals with many top media companies, and its latest product launch, ChatGPT search, now cites some of its sources in addition to summarizing the content within them.
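The weighting point is worth unpacking: upweighting means a small, high-quality corpus supplies far more training examples than its raw size implies. A minimal Python sketch of that kind of weighted sampling follows, using only the WebText2 figures cited in the article; the rest of the training mix is collapsed into a single hypothetical bucket, which is a simplification, not OpenAI's actual pipeline.

```python
# Minimal sketch of weighted dataset sampling, illustrating the
# "22% weight vs. ~3.8% of tokens" point. Only the WebText2 numbers come
# from the article; the second bucket is a hypothetical catch-all.
import random

MIXTURE = [
    ("WebText2",        0.22),  # ~3.8% of tokens, but 22% of the sampling mix
    ("everything else", 0.78),  # remaining corpora, lumped together here
]

def sample_source():
    """Pick which corpus the next training document is drawn from, by weight."""
    names, weights = zip(*MIXTURE)
    return random.choices(names, weights=weights, k=1)[0]

# Simulate 100,000 draws: WebText2 documents show up ~22% of the time,
# i.e. roughly 0.22 / 0.038 = ~5.8x as often as its raw token share
# would suggest under uniform sampling.
counts = {name: 0 for name, _ in MIXTURE}
for _ in range(100_000):
    counts[sample_source()] += 1
print({name: round(n / 100_000, 3) for name, n in counts.items()})
```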
New research reveals that major AI companies like OpenAI, Google, and Meta prioritize high-quality content from premium publishers to train their large language models, sparking debates over copyright and compensation.
A new study by Ziff Davis has revealed that major AI companies, including OpenAI, Google, Meta, and Anthropic, heavily rely on content from premium publishers to train their large language models (LLMs). This practice has raised concerns about copyright infringement and fair compensation for content creators [1].
The research, conducted by Ziff Davis' George Wukoson and Joey Fortuna, found that AI companies intentionally filter out low-quality content in favor of high-quality, human-made content for their training data. They use websites' domain authority, essentially their ranking in Google search, to make these distinctions [1].
Analysis of older model disclosures showed that URLs from top-end publishers made up 12.04% of training data in the OpenWebText2 dataset, which was used to train GPT-3 [1].
This revelation has significant implications for both publishers and AI companies. Publishers argue that AI companies are using their copyrighted work without permission or compensation. Several media companies, including The New York Times, have sued AI companies for copyright infringement [1].
On the other hand, AI companies have seen tremendous valuations amid the AI revolution. Google is currently valued at about $2.2 trillion, and Meta at about $1.5 trillion, partly due to their work with generative AI. Startups OpenAI and Anthropic are valued at $157 billion and $40 billion, respectively [1].
The use of copyrighted content for AI training has led to legal challenges. While a federal judge recently dismissed a lawsuit against OpenAI from Raw Story and AlterNet, a related case filed by The New York Times is ongoing [2].
Some AI companies have started to address these concerns by signing licensing deals with publishers. OpenAI, for instance, has inked deals with the Financial Times, DotDash Meredith, and Vox, among others [1].
The lack of transparency from AI companies about their training data sources has been a point of contention. This opacity not only affects publishers but also raises concerns for consumers who don't have visibility into the reliability and potential biases of the information powering AI chatbots [1].
As the AI industry continues to evolve, the findings of this study could provide media companies with more leverage when seeking copyright protection or compensation for their content used in AI training. It also highlights the importance of high-quality journalism in the AI era and the potential threats to the continuous flow of reliable information if publishers are not fairly compensated [2].
OpenAI has formed a significant content partnership with Hearst, allowing integration of Hearst's newspaper and magazine content into OpenAI's AI products, including ChatGPT. This move marks a growing trend of collaboration between AI companies and traditional media publishers.
12 Sources
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is shrinking the pool of available training data, potentially impacting the development of future AI models.
3 Sources
OpenAI has signed a groundbreaking deal with Condé Nast to incorporate content from prestigious publications like Vogue and The New Yorker into its AI models. This partnership aims to enhance AI-generated content and improve information access.
13 Sources
Major Canadian news organizations have filed a lawsuit against OpenAI, claiming copyright infringement and seeking billions in damages for the unauthorized use of their content in training AI models like ChatGPT.
22 Sources
HarperCollins has reached an agreement with an unnamed AI company to use select nonfiction books for AI model training, offering authors $2,500 per book. The deal highlights growing tensions between publishers, authors, and AI firms over copyright and compensation.
7 Sources