Curated by THEOUTPOST
On Sun, 4 May, 8:00 AM UTC
4 Sources
[1]
Publisher opt-outs of AI training cut Google's DeepMind training data in half.
During Google's Search antitrust trial yesterday, a DOJ attorney produced a document showing that Google removed "80 billion of 160 billion 'tokens' -- snippets of content -- after filtering out the material that publishers had opted out of allowing Google to use for training its AI," according to Bloomberg. But that opt-out only applies to DeepMind models, Bloomberg reports -- when asked if "the search org has the ability to train on the data that publishers had opted out of training," DeepMind VP Eli Collins replied, "Correct -- for use in search."
[2]
Google's AI Is Scraping Even Sites That Ask to Be Ignored
Don't want a tech conglomerate to train its AI model on your website? Too bad -- Google will do it anyway, thanks to a very convenient workaround. At least, that's more or less what the Silicon Valley behemoth just admitted to in court. As Bloomberg reports, Google said that while it does give publishers the option to opt out of large language model training done by its AI lab, DeepMind, that option doesn't extend to AI efforts by other parts of the company -- including the unit in charge of its dominant search engine, which has its own AI products like the much-maligned AI Overviews. The admission was made by Eli Collins, a vice president at DeepMind, when he was called as a witness during a federal antitrust trial in Washington. Diana Aguilar, a Department of Justice lawyer, grilled Collins about the glaring loophole being used to develop the company's chatbot, Gemini. "Once you take the Gemini and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?" Aguilar asked, per Bloomberg. "Correct -- for use in search," Collins confirmed. The scale of this scraping is staggering. An internal document from 2024 cited by Aguilar showed that Google had collected a total of 160 billion tokens -- short units of text -- in AI training data. Half of the tokens were stated to have been removed since they came from publishers who opted out of AI training. But based on Collins's new testimony, those 80 billion tokens can still be used to train AI at Google, just not at DeepMind itself. In another example of Google slipperiness, there actually is one way to opt out of having your website trawled by an AI: by opting out of being indexed in Google's search engine entirely. That's a death sentence for any website, a choice that's really no choice at all.
Google implies this is merely a consequence of how the widely used "robots.txt" file works, which instructs web crawlers -- the bots that collect data for search engines and now AI training efforts -- on what parts of a website they can access. "Google has a separate way for publishers to manage their content in Search via the well-established robots.txt web standard," a Google spokesperson said in a statement, per Bloomberg. Last year, a federal judge ruled that Google holds an illegal monopoly over the search engine market, abusing its dominance to shut out competitors -- like by paying companies billions of dollars to set Google as the default search engine on their devices and services -- and unfairly raising ad prices. US regulators are still deciding how to break up the monopoly. Some of the options being considered include forcing Google to sell its popular Chrome browser -- with its AI competitor OpenAI circling like a vulture -- or banning the default search engine agreements made with other companies, or forcing Google to share some of its data. Now, the federal suit is also highlighting how Google leverages its search engine dominance -- consistently maintaining a roughly 90 percent market share in the US -- to get what it wants with its AI initiatives. If Google is telling websites that the only way to avoid its AI data scraping is to not show up in Google search, cutting them off from that 90 percent of search traffic, that might be evidence of a monopoly. The education website Chegg argued as much in a recent lawsuit, claiming that Google was using its monopoly to pressure it to let Google train its AI tools on its content for free.
[3]
Google May Train AI on Content for Search Even If Publishers Opt Out
Google Search manages content via the robots.txt web standard
Google Search products can reportedly use content from publishers even if they have opted out of artificial intelligence (AI) training. As per the report, a Google DeepMind executive revealed the information during testimony in the US Justice Department's ongoing antitrust case against the company. The executive reportedly highlighted that such content is not used in the AI models developed by DeepMind. The Mountain View-based tech giant reportedly explained that content for search is managed by a separate mechanism that uses the robots.txt web standard. According to a Bloomberg report, Eli Collins, the Vice President of Product at Google DeepMind, confirmed that the rules for adhering to publishers' decision to opt out from AI training are different for AI models from DeepMind and the company's Search products. Diana Aguilar, an attorney representing the Department of Justice in the antitrust case, reportedly produced a document highlighting that 80 billion out of 160 billion tokens collected to train Google's AI models came from content whose publishers had opted out of AI training. Collins reportedly responded that DeepMind's models do not use the content once a publisher has opted out of AI training. However, when Aguilar reportedly questioned if the Gemini AI model could use the same content if it was put inside the Search product, Collins confirmed that as "correct," as long as the use case was within Search. Notably, this would include Gemini models powering Google's AI Overviews and recently launched AI Mode. This means traditional opt-out methods aren't enough to keep Google from using content from publishers. The tech giant had updated its privacy policy in June 2023 to reflect that it will use all publicly available Internet data to train its language models.
Here, publicly available Internet data refers to any website that does not have a paywall or a mandatory sign-up page restricting public access. A Google spokesperson later told Bloomberg that the rules for Search-based AI tools are different, as publishers can "only decline having their data used in Search AI if they opt out of being indexed for search." Publishers can do this by using the robots.txt web standard to block Google's crawler bots from accessing the content to index it in search results. However, this would also ensure that these web pages do not show up when a user uses Google's search engine to search for a topic. This effectively leaves publishers with no option but to accept the company training its AI models on said data. The ongoing antitrust case follows a ruling last year that Google holds an illegal monopoly in search. Amit Mehta, a US District Judge presiding over the case, is being urged by the Department of Justice to force the tech giant to sell Google Chrome and to share the data that it uses to generate search results. However, no such measure has been suggested for the company's AI products.
[4]
Google can train search AI with web content after AI opt-out
During a trial examining Google's search dominance, a Google VP testified that the company trains its AI models on web content, even if publishers opt out. This data usage, particularly for AI Overviews, raises concerns about revenue loss for publishers. The Justice Department is pushing for measures to restore competition, including potential restrictions on Google's AI practices.
Google can train its search-specific AI products, like AI Overviews, on content across the web even when the publishers have chosen to opt out of training Google's AI products, a vice-president of product at the company testified in court on Friday. That's because Google's controls for publishers to opt out of AI training cover work by Google DeepMind, the company's AI lab, said Eli Collins, a DeepMind vice president. Other organisations at the company can further train the models for their products. "Once you take the Gemini" AI model "and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?" asked Diana Aguilar, a Department of Justice lawyer. "Correct -- for use in search," Collins responded. Google summarises answers to search queries using its AI at the top of results, which may result in users not clicking on independent websites for answers -- a trend that's hurting their revenue, website publishers have said. Google is using data from those same sites to generate the information powering AI answers. Publishers can only decline having their data used in search AI if they opt out of being indexed for search, Google clarified. "Google has a separate way for publishers to manage their content in Search via the well-established robots.txt web standard," a Google spokesperson said in a statement. Robots.txt is the file embedded within websites that tells bots made by AI companies and web indexers whether they can crawl a site.
Google called Collins to the witness stand as part of a three-week trial in federal court in Washington, held to determine how Google should restore competition to online search. Last year, US District Judge Amit Mehta ruled that the tech giant illegally monopolised the search market and is now weighing a set of changes proposed by antitrust enforcers to address its control. The Justice Department is urging the court to force Google to sell its widely-used Chrome browser and to share key data it uses to generate search results. The agency is also asking Judge Mehta to bar Google from paying to be the default search engine on other apps and devices -- a restriction that would extend to its AI offerings, including Gemini, which the government argues have benefited from the company's unlawful dominance in search. Aguilar, the DOJ lawyer, asked Collins whether he knew how much more additional data Google's search organisation had access to beyond the content that Google DeepMind had trained its AI models on. When Collins answered that he did not know, Aguilar produced a document from August 26, 2024 titled, "Search GenAI<> Gemini v3." According to that document, Google removed 80 billion of 160 billion "tokens" -- snippets of content -- after filtering out the material that publishers had opted out of allowing Google to use for training its AI. The document also listed search "sessions data," or data collected during a period of time in which a user interacted with Google Search, as well as YouTube videos, as data that could augment Google's AI models. After viewing the document, Mehta asked Collins for clarification. "The 80 billion out of 160 billion tokens, 50% is removed by publishers opting out?" "That is correct," Collins responded. Later, Google's lawyer sought to show that the tech company's dominance of search did not prevent other AI companies from competing fiercely to provide accurate, real-time results within their chatbot services. 
If a user asked an AI chatbot for sports scores, for example, Collins testified that the chatbot would likely return the correct answer because the company that made the bot had a commercial arrangement with a sports score provider -- it wouldn't need to rely on a web index. But Google has explored how its AI models could be much improved by the data it has already gathered through years of operating the world's most popular search engine, testimony also showed. At another point during the cross-examination of Collins, the DOJ lawyer Aguilar showed the Google VP a briefing document meant for Demis Hassabis, chief executive officer of Google DeepMind. In a comment, Hassabis had mused about training an unidentified Google AI model with a wealth of search data -- including search rankings -- to see how much more the AI model was improved by the data, compared to one that wasn't trained with it. "Did Google end up building a model using search data?" Aguilar asked Collins. "Not that I'm aware," he responded. "But at least Mr. Hassabis has thought it would be interesting to look at?" she pressed. "Yes," Collins said.
Google's DeepMind VP reveals that the company's search organization can train AI on publisher content even after opt-outs, sparking debates on data usage and monopolistic practices.
In a recent antitrust trial, Google's DeepMind Vice President Eli Collins revealed that the company's search organization can train its AI models on web content even when publishers have opted out of AI training [1]. This admission has sparked concerns about Google's data usage practices and its potential monopolistic behavior in the AI and search markets.
Collins confirmed that while publishers can opt out of AI training for DeepMind models, this doesn't extend to other parts of Google, including its search organization [2]. This means that Google's search-specific AI products, such as AI Overviews and the recently launched AI Mode, can still use content from publishers who have opted out of AI training [3].
An internal document from 2024 cited during the trial showed that Google had collected 160 billion tokens for AI training data. Half of these tokens were removed due to publisher opt-outs, but based on Collins' testimony, these 80 billion tokens may still be used to train AI within Google's search organization [2].
This revelation has raised concerns about revenue loss for publishers. As Google summarizes answers to search queries using AI at the top of results, users may not click through to independent websites, potentially hurting publishers' ad revenue [4]. The irony is that Google is using data from these same sites to generate AI-powered answers.
Google maintains that publishers can manage their content in Search via the robots.txt web standard [1]. However, opting out of being indexed for search is seen as a "death sentence" for websites, effectively leaving publishers with no real choice but to allow their content to be used for AI training [2].
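To make the two-tier opt-out concrete, here is a minimal sketch of how a robots.txt file separates crawler permissions. It assumes the AI-training opt-out is expressed through Google's Google-Extended product token, which the sources above do not name; the file contents and the example.com URL are hypothetical. It uses Python's standard urllib.robotparser only to show how per-agent rules resolve, not how Google's own systems evaluate them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption, not taken from the sources):
# it opts out of AI training via the Google-Extended token while
# leaving Googlebot, the search indexing crawler, allowed.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""


def crawl_permissions(robots_txt, url, agents):
    """Return {agent: allowed} for each user agent under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = crawl_permissions(
    ROBOTS_TXT,
    "https://example.com/article",  # placeholder URL
    ["Google-Extended", "Googlebot"],
)
print(permissions)
```

Under these rules the training token is refused while the search crawler is still admitted. Per the testimony, content reachable through search indexing remains available to the search organization, so the only rule that would close that path is one disallowing Googlebot itself, which also removes the site from search results.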
The ongoing antitrust trial follows last year's ruling that Google illegally monopolized the search market, and it now centers on how to restore competition. The U.S. Department of Justice is urging the court to take measures such as forcing Google to sell its Chrome browser, share key search data, and stop paying for default search engine status on devices and services [4].
The trial has also revealed Google's exploration of using its vast search data to improve AI models. A document shown in court indicated that Demis Hassabis, CEO of Google DeepMind, had considered training an AI model with search data, including rankings, to assess the improvement over models not trained with such data [4].
This case highlights the complex relationship between AI development, web content, and publisher rights. It raises questions about the future of AI training practices, the value of web content in the AI era, and the balance between technological advancement and fair competition in the digital landscape.
As the trial continues, the outcome could have significant implications for how tech giants like Google use web data for AI training and potentially reshape the landscape of search and AI technologies.
The DOJ's antitrust case against Google, initially focused on search engine dominance, now emphasizes the company's potential AI monopoly. The trial explores how Google's vast search data could give it an unfair advantage in the emerging AI market.
7 Sources
New research reveals that major AI companies like OpenAI, Google, and Meta prioritize high-quality content from premium publishers to train their large language models, sparking debates over copyright and compensation.
2 Sources
The US Department of Justice has proposed significant remedies to address Google's monopoly in search and search text advertising, including potential divestiture of Chrome and Android, data sharing with competitors, and restrictions on AI development.
18 Sources
Apple's efforts to train its AI models using web content are meeting opposition from prominent publishers. The company's web crawler, Applebot, has been increasingly active, raising concerns about data usage and copyright issues.
3 Sources
AI firms are encountering a significant challenge as data owners increasingly restrict access to their intellectual property for AI training. This trend is causing a shrinkage in available training data, potentially impacting the development of future AI models.
3 Sources