Curated by THEOUTPOST
On Tue, 16 Jul, 4:03 PM UTC
27 Sources
[1]
Nvidia, Apple, and others allegedly trained AI using 173,000 YouTube videos -- professional creators frustrated by latest AI training scandal: Report
Some of the world's wealthiest companies, including Apple and Nvidia, are among countless parties who allegedly trained their AI using scraped YouTube videos as training data. The YouTube transcripts were reportedly accumulated through means that violate YouTube's Terms of Service and have some creators seeing red. The story was first reported in a joint investigation by Proof News and Wired. While major AI companies and producers often keep their AI training data secret, heavyweights like Apple, Nvidia, and Salesforce have revealed their use of "The Pile", an 800GB training dataset created by EleutherAI, and the YouTube Subtitles dataset within it. The YouTube Subtitles training data is made up of 173,536 plaintext YouTube transcripts scraped from the site, including 12,000+ videos that have been removed since the dataset's creation in 2020. Affected parties whose work was purportedly scraped for the training data include education channels like Crash Course (1,862 videos taken for training) and Philosophy Tube (146 videos taken), YouTube megastars like MrBeast (two videos) and PewDiePie (337 videos), and TechTubers like Marques Brownlee (seven videos) and Linus Tech Tips (90 videos). Proof News created a tool you can use to survey the entirety of the YouTube videos allegedly used without consent. EleutherAI is a respectably sized force in the AI training space. The non-profit AI research lab is one of many aiming to "democratize" AI for the masses, with its website stating a goal to "ensure that the ability to study foundation models is not restricted to a handful of companies". The Pile and the YouTube Subtitles dataset were created for this purpose: to provide high-quality training data to even the scrappiest of at-home AI coders. However, this idyllic dream of supporting the little guy has instead made The Pile another fuel source for major corporations, rather than DIYers, to train AI. 
However, YouTube Subtitles violates YouTube's Terms of Service through its use of YouTube's content without permission and its use of "automated means" to access the data. In the research paper about The Pile and YouTube Subtitles, EleutherAI acknowledges the TOS violation but claims that the tools used to scrape YouTube data were already widespread enough that no additional harm was caused. Many of those affected have reacted strongly against the use of their content. Abigail Thorn, producer of the YouTube channel Philosophy Tube and an actress on House of the Dragon, shared on X (formerly Twitter), "When I was told about this I lay on the floor and cried, it's so violating, it made me want to quit writing forever. The reason I got back up was because I know my audience come to my show for real connection and ideas, not cheapfake AI garbage." She continued, "I'd like to see YouTube do more to prevent theft like this from happening." Thorn and other YouTubers confirm that no one ever requested permission to initially scrape or later use any of the videos as training data. Assigning fault is made difficult by the fact that no one will accept blame or responsibility for the use of the transcripts. Apple and other major tech companies who used the training data avoid blame because they weren't the ones doing the scraping, although conversations must be had within such companies about the ethical sourcing of training data. EleutherAI, creator of the dataset, has not responded to any publications' requests for comment and rejects any claim of wrongdoing or harm in its initial research paper on The Pile. The tech industry is spending on AI hardware at an unhealthy rate, with the AI market needing to turn $600 billion in profit per year to keep up with its insane hardware purchasing. As companies seek to spend less on AI, instances of illicitly obtained data become more likely, like this YouTube theft and Google's Gemini reading files without permission. 
Before long, it may not be shocking to see web content end with "You have exceeded the GPT rate limit. Don't forget to smash that like button!"
[2]
Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI
AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube's rules against harvesting materials from the platform without permission. Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce. The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live. Proof News also found material from YouTube megastars, including MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train AI also promoted conspiracies such as the "flat-earth theory." Proof News created a tool to search for creators in the YouTube AI training dataset. "No one came to me and said, 'We would like to use this,'" said David Pakman, host of The David Pakman Show, a left-leaning politics channel with more than 2 million subscribers and more than 2 billion views. Nearly 160 of his videos were swept up into the YouTube Subtitles training dataset. Four people work full time on Pakman's enterprise, which posts multiple videos each day in addition to producing a podcast, TikTok videos, and material for other platforms. If AI companies are paid, Pakman said, he should be compensated for the use of his data. 
He pointed out that some media companies have recently penned agreements to be paid for use of their work to train AI. "This is my livelihood, and I put time, resources, money, and staff time into creating this content," Pakman said. "There's really no shortage of work."
[3]
Apple, NVIDIA, Anthropic allegedly used unauthorized YouTube data for AI training: Report
An investigation by Proof News, co-published with Wired, reveals that several tech giants, including Apple, NVIDIA, Anthropic, and others, have employed contentious methods to fuel their AI models, gathering data from books, websites, photos, and social media posts without the creators' knowledge. In case you missed it, last year Zoom clarified in updated terms that customer content won't be used as AI training data without explicit user consent, addressing concerns over potential privacy invasion. Proof News uncovered that these companies utilized subtitles from 173,536 YouTube videos, sourced from over 48,000 channels, despite YouTube's policies against such data harvesting. The YouTube Subtitles dataset comprises transcripts from educational channels such as Khan Academy, MIT, and Harvard, along with content from media outlets like The Wall Street Journal and NPR, and entertainment shows including The Late Show and Last Week Tonight. The collection, totaling 5.7GB in size, comprises 489 million words and encompasses videos from prominent YouTubers like MrBeast and PewDiePie, and even content promoting conspiracy theories like the flat-earth theory. Creators like David Pakman, nearly 160 of whose videos were included in the dataset, expressed frustration over the unauthorized use of their content. Pakman, who produces daily content for his political channel, emphasized the financial and creative investments involved in his work and called for compensation from AI companies using his data. Critics, including Dave Wiskus of Nebula, argue that using creators' work without consent is unethical and could potentially harm artists and content creators. Concerns also extend to the dataset's content, which includes profanity and biases that may influence AI models trained on it. Big tech companies like Apple and NVIDIA acknowledged using the Pile dataset, which includes YouTube Subtitles, to train AI models. 
Apple utilized it for its OpenELM model, released shortly before announcing new AI features for iPhones and MacBooks. Anthropic defended its use of the Pile dataset, stating it included only a small subset of YouTube subtitles and that its use was distinct from direct use of the YouTube platform, referring queries to the dataset's authors. Salesforce also confirmed using the Pile for AI research purposes, having released an AI model for public use in 2022. It acknowledged the dataset's inclusion of profanity and biases against certain groups, highlighting potential vulnerabilities and safety concerns. In previous interviews, YouTube CEO Neal Mohan and Google CEO Sundar Pichai have both affirmed that using video content, including transcripts, to train AI violates YouTube's terms of service. AI companies, striving for an edge in model quality, often guard their data sources, raising ethical concerns about using creators' content without consent and prompting calls for regulation and fair compensation. The use of such datasets underscores ongoing debates about data ethics and copyright in AI development. As AI technologies evolve, questions persist about fair compensation for content used and the responsibility of tech giants in safeguarding creators' rights. This investigation highlights the complex landscape where technological advancement intersects with ethical and legal considerations, prompting calls for greater transparency and accountability in AI data sourcing and usage.
[4]
Investigation finds companies are training AI models with YouTube content without permission
YouTube video transcripts were funneled into model training data without alerting content creators. Artificial intelligence models require as much useful data as possible to perform, but some of the biggest AI developers are relying partly on transcribed YouTube videos used without permission from the creators, in violation of YouTube's own rules, as discovered in an investigation by Proof News and Wired. The two outlets revealed that Apple, Nvidia, Anthropic, and other major AI firms have trained their models with a dataset called YouTube Subtitles, incorporating transcripts from nearly 175,000 videos across 48,000 channels, all without the video creators knowing. The YouTube Subtitles dataset comprises the text of video subtitles, often with translations into multiple languages. The dataset was built by EleutherAI, which described its goal as lowering barriers to AI development for those outside big tech companies. It's only one component of the much larger EleutherAI dataset called the Pile. Along with the YouTube transcripts, the Pile has Wikipedia articles, speeches from the European Parliament, and, according to the report, even emails from Enron. Even so, the Pile has a lot of fans among the major tech companies. For instance, Apple employed the Pile to train its OpenELM AI model, while the Salesforce AI model released two years ago trained with the Pile and has since been downloaded more than 86,000 times. The YouTube Subtitles dataset encompasses a range of popular channels across news, education, and entertainment, including content from major YouTube stars like MrBeast and Marques Brownlee. All of them have had their videos used to train AI models. Proof News set up a search tool that will comb through the collection to see if any particular video or channel is in the mix. There are even a few TechRadar videos in the collection. 
The YouTube Subtitles dataset seems to contradict YouTube's terms of service, which explicitly forbid automated scraping of its videos and associated data. That's exactly what the dataset relied on, however, with a script downloading subtitles through YouTube's API. The investigation reported that the automated download collected the videos using nearly 500 search terms. The discovery provoked a lot of surprise and anger from the YouTube creators Proof and Wired interviewed. The concerns about the unauthorized use of content are valid, and some of the creators were upset at the idea their work would be used without payment or permission in AI models. That's especially true for those who found out the dataset includes transcripts of deleted videos; in one case, the data comes from a creator who has since removed their entire online presence. The report didn't have any comment from EleutherAI. It did point out that the organization describes its mission as democratizing access to AI technologies by releasing trained models. That may conflict with the interests of content creators and platforms, if this dataset is anything to go by. Legal and regulatory battles over AI were already complex. This kind of revelation will likely make the ethical and legal landscape of AI development more treacherous. It's easy to suggest a balance between innovation and ethical responsibility for AI, but producing it will be a lot harder.
[5]
YouTubers Furious After Apple and Anthropic Steal Their Data to Train AI
"This is my livelihood, and I put time, resources, money, and staff time into creating this content." A giant dataset of YouTube subtitles has, per a new investigation, been used to train countless AI models without the permission of the tens of thousands of creators whose work was scraped. As Wired reports with the help of the data-driven Proof News project, a dataset known as "YouTube Subtitles" has been used by everyone from Apple and Anthropic to Nvidia and Salesforce to train AI models since it was released in 2020. Compiled by the open-source nonprofit EleutherAI, the YouTube Subtitles dataset doesn't include any actual video, but instead subtitle data from 173,536 videos gleaned from more than 48,000 channels. Among those channels were everything from MIT and Harvard to MrBeast and the BBC, among many others. Of all the channel owners that Proof managed to speak with for the story, none had been made aware ahead of time that EleutherAI had used subtitles from their videos. One of the impacted creators, the progressive vlogger David Pakman, was mighty peeved when he learned from Proof about his videos being included in the dataset. "No one came to me and said, 'We would like to use this,'" the commentator, who had nearly 160 videos used in the dataset, told Wired. "This is my livelihood, and I put time, resources, money, and staff time into creating this content." According to AI policy researcher Jai Vipra of Brazil's Fundação Getulio Vargas Law School, the YouTube Subtitles dataset is a "gold mine" because it can teach models how to replicate human speech. To science vlogger Dave Farina of the popular "Professor Dave Explains" series, however, that gold mine comes at a cost to creators. "It's still the sheer principle of it," Farina told Wired. "If you're profiting off of work that I've done that will put me out of work or people like me out of work, then there needs to be a conversation on the table about compensation or some kind of regulation." 
When Proof reached out to YouTube owner Google, EleutherAI, and the companies that had used the dataset, only a Google spokesperson chose to respond publicly to say that the company has taken "action over the years to prevent abusive, unauthorized scraping." It's a provocative state of affairs -- and it's hard to tell at this juncture how to fix it if companies won't even speak on the record about it.
[6]
Apple, Nvidia, and other tech companies trained AI with thousands of YouTube videos
With the generative artificial intelligence boom underway, tech companies are looking for training data to improve their models -- and some are taking it without permission. Apple, Nvidia, and Anthropic are among the tech companies found to have trained AI models with subtitles from tens of thousands of YouTube videos, despite the platform's rules against downloading and using its content without permission, according to an investigation by Proof News that was co-published with Wired. The investigation found that the companies were using a dataset called YouTube Subtitles that included transcripts of 173,536 YouTube videos from over 48,000 channels. Videos in the dataset span from educational channels such as Khan Academy and MIT, to news sites including The Wall Street Journal, to some of the platform's top creators like MrBeast and Marques Brownlee. "Apple has sourced data for their AI from several companies," Brownlee wrote in a post on X addressing the investigation. "One of them scraped tons of data/transcripts from YouTube videos, including mine." Brownlee added that while "Apple technically avoids 'fault' here because they're not the ones scraping," "this is going to be an evolving problem for a long time." Proof News also created a tool for creators to search for their content in the dataset, which included a handful of videos from Quartz. The YouTube Subtitles dataset does not include imagery from videos, but does include some translated subtitles in languages such as German and Arabic. The dataset was created by EleutherAI, "a non-profit AI research lab" that is focused on "promoting open science norms," and is part of the nonprofit's compilation of material from other places, including the European Parliament and English Wikipedia, called the Pile, according to Proof News. 
"The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes," a spokesperson for Salesforce, one of the companies named in the investigation for using the dataset, said in a statement shared with Quartz. "The dataset was publicly available and released under a permissive license." Neither Apple, Nvidia, nor Anthropic immediately responded to a request for comment. In April, YouTube chief executive Neal Mohan told Bloomberg that companies using YouTube videos, including transcripts or video bits, to train AI models such as OpenAI's text-to-video generator, Sora, would be a "clear violation" of the platform's policies. However, the New York Times reported days later that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.
[7]
Apple was among the companies that trained its AI on YouTube videos
Once again, EleutherAI's data frustrates professional content creators. Large language models at Apple, Salesforce, Anthropic, and other major technology players were trained on tens of thousands of YouTube videos without the creators' consent and potentially in violation of YouTube's terms, according to a new report appearing in both Proof News and Wired. The companies trained their models in part by using "the Pile," a collection by nonprofit EleutherAI that was put together as a way to offer a useful dataset to individuals or companies that don't have the resources to compete with Big Tech, though it has also since been used by those bigger companies. The Pile includes books, Wikipedia articles, and much more. That includes YouTube captions collected via YouTube's captions API, scraped from 173,536 YouTube videos across more than 48,000 channels. Among them are videos from big YouTubers like MrBeast, PewDiePie, and popular tech commentator Marques Brownlee. On X, Brownlee called out Apple's usage of the dataset, but acknowledged that assigning blame is complex when Apple did not collect the data itself. He wrote: "Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine. Apple technically avoids 'fault' here because they're not the ones scraping. But this is going to be an evolving problem for a long time." The dataset also includes the channels of numerous mainstream and online media brands, including videos written, produced, and published by Ars Technica and its staff and by numerous other Condé Nast brands like Wired and The New Yorker. Coincidentally, one of the videos used in the dataset was an Ars Technica-produced short film whose joke was that it was already written by AI. Proof News' article also mentions that the dataset includes videos of a parrot, so AI models are parroting a parrot parroting human speech, as well as parroting other AIs parroting humans. 
As AI-generated content continues to proliferate on the Internet, it will be increasingly challenging to put together datasets to train AI that don't include content already produced by AI. To be clear, some of this is not new news. The Pile is often used and referenced in AI circles and has been known to be used by tech companies for training in the past. It has been cited in multiple lawsuits by intellectual property owners against AI and tech companies. Defendants in those lawsuits, including OpenAI, say that this kind of scraping is fair use. The lawsuits have not yet been resolved in court. However, Proof News did some digging to identify specifics about the use of YouTube captions and went so far as to create a tool that you can use to search the Pile for individual videos or channels. The work exposes just how robust the data collection is and calls attention to how little control owners of intellectual property have over how their work is used if it's on the open web. Reactions from creators: Proof News also reached out to several of these creators for statements, as well as to the companies that used the dataset. Most creators were surprised their content had been used this way, and those who provided statements were critical of EleutherAI and the companies that used its dataset. For example, David Pakman of The David Pakman Show said: "No one came to me and said, 'We would like to use this'... This is my livelihood, and I put time, resources, money, and staff time into creating this content. There's really no shortage of work." Julia Walsh, CEO of Complexly, the production company responsible for SciShow and other Hank and John Green educational content, said: "We are frustrated to learn that our thoughtfully produced educational content has been used in this way without our consent." There's also the question of whether the scraping of this content violates YouTube's terms, which prohibit accessing videos by "automated means." 
EleutherAI founder Sid Black said he used a script to download the captions via YouTube's API, just like a web browser does. Anthropic is one of the companies that has trained models on the dataset, and for its part, it claims there's no violation here. Spokesperson Jennifer Martinez said: "The Pile includes a very small subset of YouTube subtitles... YouTube's terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to The Pile authors." A Google spokesperson told Proof News that Google has taken "action over the years to prevent abusive, unauthorized scraping" but didn't provide a more specific response. This is not the first time that AI and tech companies have been criticized for training models on YouTube videos without permission. Notably, OpenAI (the company behind ChatGPT and the video generation tool Sora) is believed to have used YouTube data to train its models, though not all allegations of this have been confirmed. In an interview with The Verge's Nilay Patel, Google CEO Sundar Pichai suggested that the use of YouTube videos to train OpenAI's Sora would have violated YouTube's terms. Granted, that usage is distinct from scraping captions via the API.
[8]
Tech Firms Including Apple Caught Using YouTube Data to Train AI Models
Apple, Nvidia, Anthropic, and Salesforce have all been caught using YouTube data to build their AI models. An investigation by Proof News, co-published with Wired, found that YouTube subtitle data was ripped from the video-sharing platform without permission and used to train AI models. It does not involve video imagery. The data was used to train large language models (LLMs), like ChatGPT, but it raises the issue of tech companies pilfering YouTube data to train models. YouTube has expressly stated that such usage of videos to train AI is an infraction of the platform's terms of service (ToS). But it is widely acknowledged that YouTube is a data goldmine for generative AI at a time when the race for text-to-video models is heating up. More than 173,000 YouTube videos were found in the dataset being used by Apple et al. The data was compiled by a nonprofit and is part of a collection called The Pile, which does not just contain YouTube data but also Wikipedia articles, books, and Enron emails. "The Pile includes a very small subset of YouTube subtitles," Jennifer Martinez, a spokesperson for Anthropic, tells Proof News. "YouTube's terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to The Pile authors." Apple, Nvidia, and others haven't commented. Neither has YouTube. After some early burns, tech firms do not want to talk about where they get the training data used to build generative AI models. With OpenAI's video generator Sora on the horizon, CTO Mira Murati has repeatedly refused to reveal the training data for the much-hyped app. "I'm not going to go into the details of the data that was used, but it was publicly available or licensed data," she told The Wall Street Journal in March. Google CEO Sundar Pichai told The Verge that using video content from the platform -- including subtitles -- is a violation of the ToS. 
"We have terms and conditions, and we would expect people to abide by those terms and conditions when you build a product, so that's how I felt about it," Pichai said.
[9]
Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models
The dataset includes transcripts of YouTube videos from the platform's biggest creators. Some of the world's largest tech companies trained their AI models on a dataset that included transcripts of more than 173,000 YouTube videos without permission, a new investigation from Proof News has found. The dataset, which was created by a nonprofit called EleutherAI, contains transcripts of YouTube videos from more than 48,000 channels and was used by Apple, NVIDIA and Anthropic, among other companies. The findings of the investigation spotlight AI's uncomfortable truth: the technology is largely built on the backs of data siphoned from creators without their consent or compensation. The dataset doesn't include any videos or images from YouTube, but contains video transcripts from the platform's biggest creators, including Marques Brownlee and MrBeast, as well as large news publishers like The New York Times, the BBC, and ABC News. Subtitles from videos belonging to Engadget are also part of the dataset. "Apple has sourced data for their AI from several companies," Brownlee posted on X. "One of them scraped tons of data/transcripts from YouTube videos, including mine," he added. "This is going to be an evolving problem for a long time." YouTube, Apple, NVIDIA, Anthropic and EleutherAI did not respond to a request for comment from Engadget. So far, AI companies haven't been transparent about the data used to train their models. Earlier this month, artists and photographers criticized Apple for failing to reveal the source of training data for Apple Intelligence, the company's own spin on generative AI coming to millions of Apple devices this year. YouTube, the world's largest repository of videos, is a goldmine of not only transcripts but also audio, video, and images, making it an attractive dataset for training AI models. 
Earlier this year, OpenAI's chief technology officer, Mira Murati, evaded questions from The Wall Street Journal about whether the company used YouTube videos to train Sora, OpenAI's upcoming AI video generation tool. "I'm not going to go into the details of the data that was used, but it was publicly available or licensed data," Murati said at the time. Both YouTube CEO Neal Mohan and Alphabet CEO Sundar Pichai have said that companies using data from YouTube to train their AI models would be violating the platform's terms of service. If you want to see whether subtitles from your YouTube videos or from your favorite channels are part of the dataset, head over to Proof News' lookup tool.
[10]
Apple and AI companies accused of scanning YouTube data without permission -- here's what we know
Leading AI labs and big tech companies have been accused of using captions from tens of thousands of YouTube videos without permission to train artificial intelligence models. Google has strict rules in place banning the harvesting of material from YouTube without permission. A new investigation by Proof News found Apple, Nvidia and Anthropic were among those using the subtitles from more than 170,000 videos. The captions were part of 'the Pile', a massive dataset compiled by non-profit EleutherAI. Originally intended to give smaller companies and individuals a quick way to train their models, this vast reservoir of information has also been adopted by big tech and AI companies. While Apple, Nvidia and Anthropic didn't directly scrape the YouTube videos themselves, the AI models they operate, including Claude and Apple Intelligence, were trained on the information because they used 'the Pile' as a source. Several studies have now found that two things are essential in making more advanced AI models -- data and computing power. Increasing one or both leads to better responses, improved performance and scale. But data is an increasingly scarce and expensive commodity. Companies like OpenAI and Google have a combination of their own massive data repositories and deals with major publishing companies or Reddit. Meta has Facebook, Instagram, Threads and WhatsApp -- although it is facing pushback from users. Apple has a vast amount of user data, but its own privacy policies make this less useful in initial model training. This lack of available data is leading companies to look for new sources of information to train next-generation models, and not all of those sources are willing to part with their data, or even aware that the information they're creating is being used to train AI. There are several lawsuits underway against AI image and music generation companies over whether training on copyrighted data qualifies as fair use. 
While Apple and Anthropic are not directly responsible for the use of these YouTube captions in their model training dataset, the inclusion does raise questions about data provenance and just how hard big tech is checking when assessing rights. It wasn't just small creators' videos that were included. The BBC, NPR, The Wall Street Journal, MrBeast and Marques Brownlee all had videos in the dataset. A total of 48,000 channels and 173,536 videos were in the YouTube Subtitles dataset. Some of the videos included conspiracy theories and parody, which could impact the integrity of the final model. This isn't the first time YouTube has been at the center of an AI training data controversy, with OpenAI CTO Mira Murati unable to confirm or deny whether YouTube was used in training the company's advanced -- but as yet unreleased -- AI video model Sora. Speaking to Wired, Dave Wiskus, CEO of Nebula, described it as "theft" and "disrespectful" to use data without consent, especially as studios are already using generative AI to "replace as many of the artists" as they can. Anthropic said in a statement to Ars Technica that the Pile includes only a small subset of YouTube subtitles and that YouTube's terms cover only direct use of its platform, which is distinct from use of the Pile dataset. "On the point about potential violations of YouTube's terms of service, we'd have to refer you to The Pile authors." Google says it has taken action over the years to prevent abuse but has given no additional detail of what that might be, or even whether this violates the terms. However, Google isn't entirely blameless, having been caught scanning user documents saved in Google Drive with its Gemini AI even when the user hasn't given permission. Creators are annoyed at the discovery, but with the question of data provenance and copyright in training models still very much up for debate, their only likely recourse is if Google decides this violates the YouTube terms. 
This instance of potential data misuse will likely be bundled into the wider question of whether training data falls under fair use or requires specific licensing. I suspect we won't get a final decision on that for years.
[11]
Apple and Salesforce AI training datasets co-opt MrBeast, Marques Brownlee videos
A dataset of subtitles from 173,536 YouTube videos, part of a compilation called The Pile, also included content from Harvard, NPR, and 'The Late Show With Stephen Colbert.' A new investigation claims that tech companies used subtitles from more than 48,000 YouTube channels -- including from top creators like MrBeast and Marques Brownlee and higher learning institutions like MIT and Harvard -- to train their AI models, even though YouTube prohibits the harvesting of platform content without permission. The investigation, conducted by Proof News and published in conjunction with Wired, found that companies like Anthropic, Nvidia, Apple, and Salesforce used a dataset of 173,536 YouTube videos including those from Khan Academy, MIT, Harvard, The Wall Street Journal, NPR, the BBC and late night shows like The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live. Marques Brownlee posted an Instagram Reel noting that, in his opinion, "the real story is Apple and a whole bunch of other tech companies are training their AI models using data that they buy from third party data scraping companies some of which get their data in slightly illegal ways... Apple can technically say they're not at fault for this." Wired says that representatives for EleutherAI, the non-profit AI research lab that scraped and disseminated the YouTube dataset, did not respond to the publication's requests for comment. The dataset is part of a compilation the nonprofit calls The Pile, which also includes material from the European Parliament, English Wikipedia, and emails from Enron Corporation employees released during the federal investigation into the company in the early 2000s. Wired reports that most of the collections that make up The Pile are accessible to "anyone on the internet with enough space and computing power to access them." Users include Apple, Nvidia, Salesforce, Bloomberg and Databricks, all of which have publicly acknowledged their use of The Pile to train AI models.
Jennifer Martinez, a spokesperson for AI startup Anthropic, said in a statement that while the company had used The Pile to train its generative AI assistant, "YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to the Pile authors." In his Instagram Reel, Brownlee added, "The double whammy is that I actually pay for more accurate manual transcriptions on every video that we put out... so that means the stolen transcriptions specifically are paid content that's being stolen more than once." His concerns echo those of creators across the world who worry that their work will be consumed or exploited by AI without compensation or permission. Many are currently suing tech companies for unapproved use of their work. Wired reports that The Pile is still available on file-sharing services but has been removed from its official download site. Proof News has created a tool to search for creators in the YouTube AI training dataset.
[12]
Apple trained AI models on YouTube content without consent
A number of tech giants, including Apple, trained AI models on YouTube videos without the consent of the creators, according to a new report today. They did this by using subtitle files downloaded by a third party from more than 170,000 videos. Creators affected include tech reviewer Marques Brownlee (MKBHD), MrBeast, PewDiePie, Stephen Colbert, John Oliver, and Jimmy Kimmel ... The subtitle files are effectively transcripts of the video content. Wired reports: An investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube's rules against harvesting materials from the platform without permission. Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce. The downloads were reportedly performed by a non-profit called EleutherAI, which says it helps developers train AI models. While the aim appears to have been to provide training materials to small developers and academics, the dataset has also been used by several tech giants, including Apple. According to a research paper published by EleutherAI, the dataset is part of a compilation the nonprofit released called the Pile [...] Most of the Pile's datasets are accessible and open for anyone on the internet with enough space and computing power to access them. Academics and other developers outside of Big Tech made use of the dataset, but they weren't the only ones. Apple, Nvidia, and Salesforce -- companies valued in the hundreds of billions and trillions of dollars -- describe in their research papers and posts how they used the Pile to train AI. Documents also show Apple used the Pile to train OpenELM, a high-profile model released in April, weeks before the company revealed it will add new AI capabilities to iPhones and MacBooks.
Wired says Apple hadn't responded to a request for comment at the time of writing. It's important to emphasize here that Apple didn't download the data itself, but this was instead performed by EleutherAI. It is this organization which appears to have broken YouTube's terms and conditions. All the same, while Apple and the other companies named likely used a publicly-available dataset in good faith, it's a good illustration of the legal minefield created by scraping the web to train AI systems. There have been multiple examples of AI systems plagiarizing entire paragraphs of text when asked about niche topics, and the dangers of using material without permission are only increased when companies use datasets compiled by third parties. We've reached out to Apple for comment, and will update with any response.
[13]
Apple, Anthropic, other tech companies under scanner for using YouTube videos to train AI, report says
The dataset, called YouTube Subtitles, reportedly included video transcripts from educational and online learning channels As artificial intelligence advances, the demand for huge datasets is growing, and some companies appear to be sourcing that data illegitimately. As reported by Proof News, companies such as Apple, Nvidia, Anthropic, and Salesforce used subtitles from YouTube videos to train generative AI models. The dataset, called YouTube Subtitles, reportedly included video transcripts from educational and online learning channels such as Khan Academy, MIT, and Harvard. Additionally, the Wall Street Journal, NPR and the BBC also had their videos used to train AI, as did "The Late Show With Stephen Colbert," "Last Week Tonight With John Oliver," and "Jimmy Kimmel Live." The illegal data market According to Proof News, Apple used data from the Wall Street Journal, NPR and the BBC to train its OpenELM model, released in April just before its WWDC event. In addition, Bloomberg, Databricks, and Anthropic also used the dataset to train AI models. Salesforce used the Pile to build an AI model it claimed was for "academic and research" purposes, but later released it for public use in 2022. It has been downloaded over 85,000 times. But what is the 'Pile' and why does its misuse matter? YouTube Subtitles is a dataset created by EleutherAI, part of a larger compilation called the Pile. The Pile also includes material from Wikipedia and the European Parliament, and is generally accessible to anyone with internet access and the know-how to find it. However, its misuse can expose sensitive or personal data. The creation of the dataset may also have violated YouTube's terms of service, which prohibit using "automated means" to access its videos. "The Pile had been used to train Claude, Anthropic's generative AI assistant," a spokesperson from Anthropic explained.
However, representatives for Nvidia, Apple, Bloomberg, and Databricks declined to comment on their use of the Pile, and EleutherAI did not respond to Proof News' request for comment. The safety nets A case against EleutherAI was voluntarily dismissed by the plaintiffs. The Pile has since been removed from its official download site, but it is still available on file-sharing services. Early reports suggested that YouTube Subtitles, which was published in 2020, also contained subtitles from more than 12,000 videos that have since been deleted from YouTube.
[15]
Apple trained its AI on YouTube videos without consent | Digital Trends
Apple is the latest in a long line of generative AI developers -- a list that's nearly as old as the industry -- to be caught scraping copyrighted content from social media in order to train its artificial intelligence systems. According to a new report from Proof News, Apple has been using a dataset containing the subtitles of 173,536 YouTube videos to train its AI. However, Apple isn't alone in that infraction, despite YouTube's specific rules against exploiting such data without permission. Other AI heavyweights have been caught using it as well, including Anthropic, Nvidia, and Salesforce. Recommended Videos The data set, known as YouTube Subtitles, contains the video transcripts from more than 48,000 YouTube channels, from Khan Academy, MIT, and Harvard to The Wall Street Journal, NPR, and the BBC. Even transcripts from late-night variety shows like "The Late Show With Stephen Colbert," "Last Week Tonight with John Oliver," and "Jimmy Kimmel Live" are part of the YouTube Subtitles database. Videos from YouTube influencers like Marques Brownlee and MrBeast, as well as a number of conspiracy theorists, were also lifted without permission. The data set itself, which was compiled by the nonprofit EleutherAI, does not contain any video files, though it does include a number of translations into other languages including Japanese, German, and Arabic. The YouTube Subtitles set is part of a larger dataset, dubbed the Pile, which the nonprofit assembled from sources including not just YouTube but also European Parliament records and Wikipedia. Bloomberg, Anthropic and Databricks also trained models on the Pile, the companies' respective publications indicate. "The Pile includes a very small subset of YouTube subtitles," Jennifer Martinez, a spokesperson for Anthropic, said in a statement to Proof News. "YouTube's terms cover direct use of its platform, which is distinct from use of The Pile dataset.
On the point about potential violations of YouTube's terms of service, we'd have to refer you to The Pile authors." Technicalities aside, AI startups helping themselves to the contents of the open internet has been an issue since ChatGPT made its debut. Stability AI and Midjourney are currently facing a lawsuit by content creators over allegations that they scraped copyrighted works without permission. Google itself, which operates YouTube, was hit with a class-action lawsuit last July and then another in September, which the company argues would "take a sledgehammer not just to Google's services but to the very idea of generative AI." "Me: What data was used to train Sora? YouTube videos? OpenAI CTO: I'm actually not sure about that... (I really do encourage you to watch the full @WSJ interview where Murati did answer a lot of the biggest questions about Sora. Full interview, ironically, on YouTube.)" — Joanna Stern (@JoannaStern), March 14, 2024 What's more, these same AI companies have severe difficulty actually citing where they obtain their training data. In a March 2024 interview with The Wall Street Journal's Joanna Stern, OpenAI CTO Mira Murati stumbled repeatedly when asked whether her company utilized videos from YouTube, Facebook, and other social media platforms to train its models. "I'm just not going to go into the details of the data that was used," Murati said. And this past July, Microsoft AI CEO Mustafa Suleyman made the argument that an ethereal "social contract" means anything found on the web is fair game. "I think that with respect to content that's already on the open web, the social contract of that content since the '90s has been that it is fair use," Suleyman told CNBC. "Anyone can copy it, re-create with it, reproduce with it. That has been freeware, if you like, that's been the understanding."
[16]
Recent Investigation Suggests Apple Trained Its AI Models From YouTube Videos Without Authorization, Includes MKBHD Videos
Earlier, OpenAI, Meta, and Google were criticized for transcribing YouTube videos to train their AI models, violating the copyrights of content creators. Now, a new report has surfaced suggesting that Apple has followed in the footsteps of other tech giants, training its LLMs on transcripts of video content without the consent of the video creators, including some well-known tech reviewers. Lately, tech giants have been using YouTube videos to train AI models without creators' consent, which has stirred many concerns. Now Apple, along with other big companies, finds itself in the middle of the controversy for using creators' content without their permission. Wired reported that third parties downloaded the videos as subtitle files, which were then used to train LLMs. It claimed that over 170,000 videos were utilized, including content from well-known YouTubers such as MKBHD, Jimmy Kimmel, PewDiePie, and MrBeast, among many other content creators. The report highlights that these big AI companies have been using the content in their training processes even though this extraction technique violates YouTube's rules against independent applications using its videos and against automated access without permission. An investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube's rules against harvesting materials from the platform without permission. Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.
Although Apple did not perform the transcription itself (the dataset was compiled by the nonprofit EleutherAI for educational and academic purposes, such as helping developers train models), the company still ended up in controversy for using a dataset gathered without consent. The compilations are openly available to academics and developers, but tech giants have used them to train their high-profile models. Apple is said to have used the third-party data compilation, namely the Pile, to train OpenELM, which was launched in April. The situation raises questions about consent and ethical AI practices, and the implications could be multi-faceted if it is not handled with caution. We are yet to hear Apple's take on the ongoing concerns.
[17]
Apple might be using your favourite YouTubers to train AI
As the race continues for tech companies to create the most advanced AI models they can, some shifty methods may be going into training them. A new report from Proof News shows that Apple is apparently using YouTube creators' content to train its AI. "No one came to me and said, 'We would like to use this,'" said YouTuber David Pakman, who hosts a channel with 2 million subscribers and more than 2 billion views. "This is my livelihood, and I put time, resources, money, and staff time into creating this content," he said. Other creators described the practice as "theft," and believe it will be used to harm and exploit artists. And it's not just Apple that might be making use of these videos: Nvidia and Anthropic are also accused in the report.
[18]
Nvidia, Apple AI Scraped Dataset With 173K YouTube Videos, Taylor Swift Lyrics
Companies like Nvidia, Apple, Anthropic, and Salesforce are training their AI tools on transcripts of YouTube videos they don't own and aren't licensed to use, according to a new investigation from nonprofit news studio Proof and Wired. In the world of generative AI, it might not come as a surprise that tech firms are scraping and using as much data as they can find to train AI models for profit, leaving creators and artists unpaid and in the dark. This group of four tech firms has been using an AI dataset called YouTube Subtitles, which consists of 173,000 YouTube video transcripts from nearly 50,000 channels, for AI model ingestion and training. Videos from popular influencers and TV shows uploaded to YouTube were swiped and included in the dataset, like those of MrBeast, John Oliver, Jimmy Kimmel, and Stephen Colbert, to name a few. PCMag found two of our videos included in the dataset by using Proof's database search tool: one is an explainer video on two-factor authentication from 2019, and another is an unboxing video of the Samsung Galaxy Z Fold 2 in 2020. The search tool also reveals that the dataset includes copyrighted music videos on YouTube, like those on Katy Perry's Vevo channel and Taylor Swift's official YouTube channels, PCMag found. The dataset also includes pro-conspiracy videos promoting the flat Earth theory, Proof notes. The YouTube Subtitles dataset is part of a larger 800GB dataset called "The Pile," which was first released in 2021 by AI startup EleutherAI. In a research paper about The Pile, EleutherAI says, "The YouTube Subtitles dataset was created by us...using a very popular unofficial API that is both widely used and easily obtainable." An image in the research paper shows a breakdown of The Pile, which includes PubMed, FreeLaw, Wikipedia, HackerNews, and GitHub, among other destinations. Roughly a third of the data comes from academic sources. 
Another third of the dataset was scraped from "the internet" at large, and YouTube specifically is shown at the bottom-right corner of the diagram. EleutherAI argues that the "processing applied and the difficulty of identifying particular files in the Pile" means its dataset "does not constitute significantly increased harm beyond that which has already been done by the widespread publication of these datasets." Apple's OpenELM is trained on The Pile, as is a Salesforce AI model, according to Proof. Anthropic also confirmed its Claude AI was trained with The Pile. Nvidia did not immediately respond to a request for comment. But earlier this year, a group of authors sued Nvidia for using The Pile, whose "Books3" section contains their novels. The authors allege that Nvidia is committing copyright infringement by using their work in the dataset to train its NeMo AI without payment or their consent. As for YouTube, CEO Neal Mohan previously said that training OpenAI's Sora on YouTube videos would be a "clear violation." But shortly after, news broke that YouTube parent company Google had been training its AI tools on YouTube videos (the company claims it had permission to do this via existing creator agreements). YouTube's Terms of Service state that users can't download or use any videos or YouTube content unless "expressly authorized by the Service [...] with prior written permission from YouTube" and "if applicable, the respective rights holders," but it's unclear to what extent this applies to AI data scraping. YouTube's policies also state that no one can "view or listen to Content other than for personal, non-commercial use," which data scraping to develop AI tools for profit could potentially violate. It's also unclear why music, which is also "publicly available," might get more copyright protections from sites like YouTube than other types of creative work, like videos, short films, or text-based writing.
PCMag has reached out to YouTube for comment. "The Pile includes a very small subset of YouTube subtitles," Anthropic spokesperson Jennifer Martinez told Proof. "YouTube's terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to The Pile authors."
[19]
Why Apple, Nvidia and others using YouTube to train their AI models are really not at fault despite breaking Google rules - Times of India
It seems YouTube is the most popular 'teacher' for the AI models of big tech companies. The names reportedly include Nvidia, Salesforce, Anthropic, and Apple. These companies are said to have used YouTube videos to train their AI systems. According to a report from Wired, based on an investigation by Proof News, some of the richest AI companies in the world have used material from thousands of YouTube videos to train their AI models. "Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce," said the report. The report claims that the dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also reportedly had their videos used to train AI. Among the YouTubers, the names include MKBHD, PewDiePie and MrBeast. Why Apple, Nvidia and others cannot be blamed However, it seems that these companies may not really be to blame: according to a research paper published by EleutherAI and quoted in the report, the dataset they used is part of a compilation the nonprofit released called the Pile. The developers of the Pile included material from not just YouTube but also the European Parliament. This means the subtitles that Apple and others used came from this larger collection: EleutherAI gathered the subtitles, put them into the Pile, and then made the collection available online for anyone to use, like a free library. Apple and others probably thought it was okay to use this data because it was freely available. Popular YouTubers on the controversy Marques Brownlee, aka MKBHD, took to social media to express his frustration over the news. "Apple technically avoids 'fault' here because they're not the ones scraping," Brownlee wrote in a post on X.
"But this is going to be an evolving problem for a long time." Brownlee noted that the transcriptions allegedly used for AI training by Apple and others are paid work of his. He wrote, "Fun fact, I pay a service (by the minute) for more accurate transcriptions of my own videos, which I then upload to YouTube's back-end. So companies that scrape transcripts are stealing *paid* work in more than one way. Not great." The situation highlights the tricky problems with AI training: it's not clear who owns the rights to use online content for AI training, and there aren't good rules yet about how companies should obtain training data. A balance has to be found between making better AI and protecting people's work. What YouTube says on data harvesting YouTube says that using videos like this breaks its rules. In an interview earlier this year, YouTube CEO Neal Mohan said that using the platform's videos to train AI isn't allowed, and Google CEO Sundar Pichai agreed with this view.
[20]
Report claims that Anthropic, Nvidia, Apple and Salesforce used YouTube transcripts to train AI - SiliconANGLE
Report claims that Anthropic, Nvidia, Apple and Salesforce used YouTube transcripts to train AI A new report released today claims that companies including Anthropic PBC, Nvidia Corp., Apple Inc. and Salesforce Inc. have used subtitles from YouTube videos to help train their artificial intelligence services without permission, raising questions about the ethical implications of using publicly available material and facts without consent. The report from Proof News claims that the companies allegedly used subtitles from 173,536 YouTube videos taken from over 48,000 channels to train their AI. Anthropic, Nvidia, Apple and Salesforce are not alleged to have scraped the content themselves, but instead are claimed to have used a dataset from a non-profit called EleutherAI. EleutherAI is a non-profit artificial intelligence research group that focuses on the interpretability and alignment of large models. Founded in 2020, the group aims to democratize access to advanced AI technologies by developing and releasing open-source AI models like GPT-Neo and GPT-J. The organization also advocates for open science norms in natural language processing and ensures that independent researchers can study and audit AI technologies, promoting transparency and ethical AI development. The dataset from EleutherAI used by the four companies is called "YouTube Subtitles" and is said to contain video transcripts from education and online learning channels, along with transcripts from several media outlets and YouTube stars. The transcripts from YouTubers in the dataset include those from MrBeast, tech reviewer Marques Brownlee, PewDiePie and left-wing political commentator David Pakman. Some of those who had their content in the dataset are offended, with Pakman, in particular, claiming that the use of his transcripts puts his livelihood and staff at risk. David Wiskus, the chief executive officer of streaming service Nebula, goes as far as to claim that the use of the data is "theft."
Despite the data being publicly available, the alleged offense amounting to nothing more than the data being read by large language models, and there being little risk that AI will replace YouTubers anytime soon, the seeming storm in a teacup comes as legal action has been taken over publicly available data being used to train AI models. Microsoft Corp. and OpenAI were sued for their use of nonfiction authors' work in AI training in November. The class-action lawsuit, led by a New York Times reporter, claimed that OpenAI scraped the content of hundreds of thousands of nonfiction books to train its AI models. The New York Times also accused OpenAI, Google LLC and Meta Platforms Inc. in April of skirting legal boundaries for AI training data. While some are calling the use of AI training data a grey area, whether it's legal or not is yet to be extensively tested in court. Should a case end up in court, the test likely to apply is whether facts, including publicly stated utterances, can be copyrighted. The closest U.S. case law on the repetition of facts comes from two cases: Feist Publications Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991), and International News Service v. Associated Press (1918). In both cases, the U.S. Supreme Court ruled that facts cannot be copyrighted.
[21]
Report: Anthropic, Nvidia, Apple and Salesforce used YouTube transcripts to train AI - SiliconANGLE
Report: Anthropic, Nvidia, Apple and Salesforce used YouTube transcripts to train AI A new report released today claims that companies including Anthropic PBC, Nvidia Corp., Apple Inc. and Salesforce Inc. have used subtitles from YouTube videos to help train their artificial intelligence service without permission, raising questions about the ethical implications of using publicly available material and facts without consent. The report from Proof News claims that the companies allegedly used subtitles from 173,536 YouTube videos taken from over 48,000 channels to train their AI. Anthropic, Nvidia, Apple and Salesforce are not alleged to have scraped the content, but instead are claimed to have used a dataset from a non-profit called EleutherAI. EleutherAI is a non-profit artificial intelligence research group that focuses on the interpretability and alignment of large models. Founded in 2020, the group aims to democratize access to advanced AI technologies by developing and releasing open-source AI models like GPT-Neo and GPT-J. The organization also advocates for open science norms in natural language processing and ensures that independent researchers can study and audit AI technologies, promoting transparency and ethical AI development. The dataset from EleutherAI used by the four companies is called "YouTube Subtitles" and is said to contain video transcripts from education and online learning channels, along with transcripts from several media outlets and YouTube stars. The transcripts from YouTubers in the dataset include those from Mr. Beast, electric car maker killer Marques Brownlee, PewDiePie and left-wing political commentator David Pakman. Some of those who had their content in the dataset are offended, with Pakman, in particular, claiming somehow that the use of his transcripts risks his livelihood and staff. David Wiskus, the chief executive officer of streaming service Nebula, goes as far as to claim that the use of the data is "theft." 
Despite the data being publicly available, the alleged wrong seemingly being nothing more than the data being read by large language models and there being little risk that AI will replace YouTubers anytime soon, the seeming storm in a teacup comes as legal action has already been taken over publicly available data being used to train AI models. Microsoft Corp. and OpenAI were sued in November for their use of nonfiction authors' work in AI training. The class-action lawsuit, led by a New York Times reporter, claimed that OpenAI scraped the content of hundreds of thousands of nonfiction books to train its AI models. The New York Times also accused OpenAI, Google LLC and Meta Platforms Inc. in April of skirting legal boundaries for AI training data. While some are calling the use of such training data a gray area, whether it's legal or not has yet to be extensively tested in court. Should a case end up in court, the test likely to apply is whether facts, including publicly stated utterances, can be copyrighted. The closest U.S. case law pertaining to the repetition of facts covers two cases: Feist Publications Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991) and International News Service v. Associated Press (1918). In both cases, the U.S. Supreme Court ruled that facts cannot be copyrighted.
[22]
Apple Intelligence may have been unwittingly trained using data pilfered from YouTube, report reveals
Some big-name YouTubers have had their videos scraped in the name of AI. As Apple works to ready Apple Intelligence for a beta launch later this year, a new report claims that the company used YouTube videos as a source of data when training its AI models. Apple is just one company thought to have used data collected by a third party when training AI, with Nvidia and Anthropic also among those thought to have used the same information. The dataset, called YouTube Subtitles, was collected by EleutherAI and created by taking the transcripts from videos created by some of the biggest names on the platform, including MKBHD and MrBeast. While the dataset was not created using the actual videos themselves but rather their transcripts, it's still thought that the act is against YouTube's terms of service. The Wired report notes that the dataset is part of a compilation the outfit released called the Pile, which is accessible and open to anyone on the internet. An investigation found that subtitles from 173,536 YouTube videos across 48,000 channels were used for training, with Apple among the companies that benefited. It's thought that Apple used the Pile to train OpenELM, a model that was announced in April, just weeks before Apple announced that Apple Intelligence would launch alongside iOS 18. The offering is made up of multiple new AI-powered features that generate text and images across multiple apps and services. Understandably, YouTubers are less than happy with the news. "No one came to me and said, 'We would like to use this,'" said David Pakman, the host of The David Pakman Show. Others suggested that the use of subtitle data in this way was theft, noting that the same technology could well be used to take creators' jobs in the future. Apple Intelligence will launch later this year, albeit in beta, alongside iOS 18 and software updates for the Mac, iPad, Apple Watch, Apple TV, and Apple Vision Pro.
[23]
Apple, Anthropic and other companies used YouTube videos to train AI
More than 170,000 YouTube videos are part of a massive dataset that was used to train AI systems for some of the biggest technology companies, according to an investigation by Proof News, co-published with Wired. Apple, Anthropic, Nvidia, and Salesforce are among the tech firms that used the "YouTube Subtitles" data that was ripped from the video platform without permission. The training dataset is a collection of subtitles taken from YouTube videos belonging to more than 48,000 channels -- it does not include imagery from the videos.
[24]
Supplier used controversial sources for training Apple Intelligence
Apple Intelligence may have been trained less legally and ethically than Apple believed. Apple has made a big deal out of paying for the data used to train its Apple Intelligence, but one firm it used is accused of allegedly ripping off YouTube videos. All generative AI works by training Large Language Models (LLMs) on enormous amassed datasets, and very often, the source of that data is controversial. So much so that Apple has repeatedly claimed that its sources are ethical, and it's known to have paid millions to publishers and licensed images from photo library firms. According to Wired, however, one firm whose data Apple has used appears to have been less scrupulous about its sources. EleutherAI reportedly created a dataset it calls the Pile, which Apple has reported using for its LLM training. Part of the Pile, though, is called YouTube Subtitles, and consists of subtitles downloaded from YouTube videos without permission. It's apparently also a breach of YouTube's terms and conditions, but that may be a grayer area than it should be. Alongside Apple, firms that have used the Pile include Anthropic, whose spokesperson claimed that there is a difference between using YouTube subtitles and using the videos. "The Pile includes a very small subset of YouTube subtitles," said Jennifer Martinez. "YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset." "On the point about potential violations of YouTube's terms of service," she continued, "we'd have to refer you to the Pile authors." Salesforce also confirmed that it had used the Pile in building an AI model for "academic and research purposes." Salesforce's vice president of AI research stressed that the Pile's dataset is "publicly available." Reportedly, developers at Salesforce also found that the Pile dataset includes profanity, plus "biases against gender and certain religious groups."
Salesforce and Anthropic are so far the only firms that have commented on their use of the Pile. Apple, Nvidia, Bloomberg, and Databricks are known to have used it, but they have not responded. The organization Proof News claims to have found that subtitles from 173,536 YouTube videos from over 48,000 channels were used in the Pile. The videos used include seven by Marques Brownlee (MKBHD) and 337 from PewDiePie. Proof News has produced an online tool to help YouTubers see whether their work has been used. However, it's not only YouTube subtitles that have been gathered without permission. It's claimed that Wikipedia has been used, as has documentation from the European Parliament. Academics and even mathematicians have previously used thousands of Enron staff emails for statistical analysis, and now it's claimed that the Pile used the text of those emails for its training. It's previously been argued that Apple's generative AI might be the only one that was trained legally and ethically. But despite Apple's intentions, Apple Intelligence has seemingly been trained on YouTube subtitles it had no right to use.
[25]
Apple, Nvidia, Anthropic Accused Of Using YouTube Videos To Train AI Models Without Creators' Consent: 'This Is Going to Be An Evolving Problem For A Long Time,' Says MKBHD - Apple (NASDAQ:AAPL), Salesforce (NYSE:CRM)
Apple Inc. AAPL has been accused of using videos from Alphabet Inc.'s GOOGL, GOOG subsidiary YouTube to train its AI models without the creators' consent. What Happened: Tech YouTuber Marques Brownlee, also known as MKBHD, took to social media to voice his concerns about Apple's use of YouTube content for AI training. Brownlee revealed that Apple sourced data from various companies, one of which scraped data and transcripts from YouTube videos, including his own. The companies are not at fault for the scraping, but this issue is likely to persist, Brownlee noted. "Apple technically avoids 'fault' here because they're not the ones scraping. But this is going to be an evolving problem for a long time," Brownlee wrote. MKBHD wrote in another post, "Fun fact, I pay a service (by the minute) for more accurate transcriptions of my own videos, which I then upload to YouTube's back-end. So companies that scrape transcripts are stealing paid work in more than one way. Not great." 9to5Mac's report, which Brownlee shared, disclosed that several tech giants, including Apple, trained their AI models using subtitle files downloaded by a third party from over 170,000 videos. This data included transcripts of videos from creators like Brownlee, MrBeast, PewDiePie, Stephen Colbert, John Oliver, and Jimmy Kimmel. A Proof News investigation revealed that EleutherAI's dataset, known as the Pile, was used by major companies like NVIDIA Corp. NVDA and Salesforce Inc. CRM for AI training. Companies pursued this practice despite YouTube's rules prohibiting the unauthorized harvesting of materials from the platform. Apple, Nvidia, Google, and Anthropic did not immediately respond to Benzinga's request for comment.
Why It Matters: The issue of unauthorized content scraping for AI training has been a growing concern in the tech industry. Recently, OpenAI and Anthropic were reported to be ignoring web scraping rules, stirring controversy. These companies have allegedly bypassed the robots.txt protocol, which is designed to tell automated crawlers which parts of a website they may not scrape. In response to such practices, Reddit Inc. RDDT recently updated its platform to block automated content scraping. This policy change led to a nearly 9% surge in Reddit's stock value, highlighting the market's sensitivity to data privacy issues. Earlier, Meta Platforms Inc. META also faced challenges with data scraping, which led to legal action against a Chinese company. This incident underscores the widespread nature of the problem across various social media platforms. Additionally, Elon Musk has cited AI scraping as a reason for implementing tweet paywalls on X, Inc. (formerly Twitter Inc.). Users now need an account to read tweets, and those who wish to view more than 600 posts per day must pay for Twitter Blue access. This story was generated using Benzinga Neuro and edited by Kaustubh Bagalkote.
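The robots.txt protocol mentioned above is nothing more than a plain-text file of per-crawler allow/deny rules that cooperative bots are expected to consult before fetching pages. As a minimal sketch using a made-up rule set (the site and bot names here are hypothetical, not any real service's rules), Python's standard library can evaluate such rules:

```python
from urllib import robotparser

# Hypothetical robots.txt contents, for illustration only.
RULES = """\
User-agent: *
Disallow: /private/

User-agent: ScraperBot
Disallow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(RULES)

# A generic crawler may fetch public pages, but not anything under /private/.
print(parser.can_fetch("*", "https://example.com/videos"))        # True
print(parser.can_fetch("*", "https://example.com/private/data"))  # False

# ScraperBot is denied the entire site.
print(parser.can_fetch("ScraperBot", "https://example.com/videos"))  # False
```

The catch, as the scraping disputes above show, is that robots.txt is purely advisory: nothing technically stops a crawler that chooses to ignore it.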
[26]
Probe reveals 174K YouTube vids' subtitles used for AI
Comment FYI: It's not just Reddit posts, books, articles, webpages, code, music, images, and so forth being used by multi-billion-dollar businesses for training neural networks. AI labs have been teaching models using subtitles scraped from at least tens of thousands of YouTube videos, much to the surprise of the footage creators. Those transcripts were compiled into what is termed the YouTube Subtitles dataset and incorporated into a larger repository of training material called the Pile, nonprofit nu-journo outfit Proof News highlighted this week. The YouTube Subtitles collection contains information from 173,536 YouTube videos including those of channels operated by Harvard University, the BBC, and web-celebs like Jimmy "MrBeast" Donaldson. The dataset is a 5.7GB slice of the Pile, a larger 825GB silo created by nonprofit outfit EleutherAI. The Pile includes data pulled from GitHub, Wikipedia, Ubuntu IRC, Stack Exchange, bio-medical and other scientific papers, internal Enron emails, and many other sources. Overall, the YouTube Subtitles dataset is one of the smallest collections in the Pile. Big names such as Apple, Salesforce, Nvidia, and others have incorporated the Pile, including the video subtitles, into their AI models during training. We're told the makers of those YouTube videos weren't aware this was happening. (There's also nothing stopping tech giants from using YouTube data in other dataset collections; the Pile is just one possible source.) It wasn't a secret EleutherAI had gathered up subtitles from YouTube videos, as the organization not only made the Pile publicly available, it detailed the thing in a research paper in 2020. The code that scraped the YouTube Subtitles dataset is on GitHub for all to see. The script can be told to pull in subtitles for videos that match certain search terms; in the Pile's case, those terms ranged from things like "quantum chromodynamics" to "flat earth."
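What ends up in such a dataset is plaintext, not video: subtitle files get their timing and markup boilerplate stripped away. EleutherAI's actual scraping code is the GitHub project noted above; the following is only a hypothetical sketch of that kind of cleanup step, reducing an SRT-style subtitle file to the plain text a model would train on:

```python
import re

def srt_to_plaintext(srt: str) -> str:
    """Strip SRT cue indices, timestamps, and tags, keeping only spoken text."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line:
            continue          # blank separator between cues
        if line.isdigit():
            continue          # cue index, e.g. "1"
        if "-->" in line:
            continue          # timing line, e.g. "00:00:01,000 --> 00:00:03,000"
        # Drop simple formatting tags like <i>...</i>
        kept.append(re.sub(r"</?[^>]+>", "", line))
    return " ".join(kept)

sample = """1
00:00:01,000 --> 00:00:03,000
Hello and welcome back

2
00:00:03,500 --> 00:00:05,000
to the channel."""

print(srt_to_plaintext(sample))  # Hello and welcome back to the channel.
```

Run at scale over tens of thousands of videos, output like this is exactly the sort of flat transcript text found in the YouTube Subtitles collection.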
The actual videos used to form the dataset aren't listed in the 2020 paper or on GitHub. Only now are people looking through the training data, since superseded by other collections, identifying the videos that were scraped, and tipping off YouTube creators. Proof News has offered an online search tool for inspecting the subtitle training material. What's interesting is that Google-owned YouTube's terms of service, today at least, explicitly ban the use of scrapers and other automated systems unless they are public search engines that obey YT's robots.txt rules or have specific permission from YouTube. The terms also seemingly prohibit the downloading and use of things like subtitles in AI training unless, again, YouTube and applicable rights holders give permission. So on the one hand, there is the potential for the automated scraping of subtitles to be against YouTube's rules, but there's also wiggle room for it to be totally fine. Well, as far as YouTube is concerned; creators feeling their work is being unethically exploited, however legal, by rich companies is another thing. It's something that everyone is dancing around. The PR folks at Google - which is itself in the AI game - have simply said, in response to this week's reporting, that the internet giant puts a lot of effort into thwarting unauthorized scraping, and declined to talk about individual organizations' use of its YouTube data. AI labs that used the Pile to build their models argued they simply incorporated a broad public dataset and that they weren't the ones doing any scraping; the training database conveniently acts as rocket fuel and a legal blast shield for their machine learning activities, in their view. "Apple has sourced data for their AI from several companies," tech reviewer Marques Brownlee said on Xitter in light of the findings. "One of them scraped tons of data/transcripts from YouTube videos, including mine."
Brownlee noted that "Apple technically avoids 'fault' here because they're not the ones scraping." The Register has asked EleutherAI, Apple, Nvidia, and others named in the report for further details and explanations. Using people's work to train AI without explicit permission has sparked big lawsuits. Microsoft and OpenAI were sued in April by a cohort of US newspapers, and two AI music generators got complaints from Sony, Warner, and Universal. A few things seem certain. Artificial intelligence developers can and will get their hands on all manner of information for training - as training data drives their neural networks' performance - and they don't always need explicit permission from creative types to do it, as permission may already have been quietly granted through platform T&Cs. And at least some of these development labs are highly reluctant to reveal where exactly they get their training data, for various reasons as you can imagine, including commercial secrecy. This is something we expect to see rumble on and on, with more and more revelations of info being exploited, no matter how legal or ethical, much to the exasperation of the people creating that material in the first place and being displaced by this technological work. ®
[27]
Dhruv Rathee, Marques Brownlee, PewDiePie YouTube video subtitles used to train AI models
Dhruv Rathee, Marques Brownlee, and PewDiePie YouTube video subtitles were used to train AI models, according to a tool shared by the Proof News outlet. Anthropic, Nvidia, Apple, and Salesforce were among the leading tech firms that used a YouTube video subtitle dataset to train their AI models, according to the outlet. The outlet said it found subtitles from 173,536 YouTube videos that were pulled from over 48,000 channels, but warned that the tool could return false negatives. Some of the videos that were used to train AI included uploads by tech reviewer Marques Brownlee, apart from content creators such as PewDiePie and Dhruv Rathee, as well as news publications and talk shows worldwide. Based on a search using the tool, a 2020 video by The Hindu was also seen in the results. Most of the videos were from 2020 or earlier, suggesting a cut-off of sorts. Brownlee criticised companies that scraped video transcripts for AI training content. "Fun fact, I pay a service (by the minute) for more accurate transcriptions of my own videos, which I then upload to YouTube's back-end. So companies that scrape transcripts are stealing *paid* work in more than one way. Not great," posted Brownlee on X on Tuesday. Anthropic and Salesforce confirmed using training datasets that included the scraped video subtitles, but did not accept any wrongdoing, per the outlet. Nvidia, Apple, Databricks, and Bloomberg did not confirm or deny the allegations. The question of scraping YouTube videos -- or their transcripts -- to train AI models is a contentious one. Earlier in the year, when OpenAI official Mira Murati was asked whether the ChatGPT-maker used YouTube videos for AI training, she struggled with the question and could not answer clearly.
Major tech companies, including Apple, Nvidia, and Anthropic, are facing allegations of using thousands of YouTube videos to train their AI models without proper authorization, sparking controversy and frustration among content creators.
A recent investigation has revealed that several major technology companies, including Apple, Nvidia, and Anthropic, allegedly used thousands of YouTube videos to train their artificial intelligence (AI) models without obtaining permission from content creators [1]. This revelation has sparked a new controversy in the AI industry, raising questions about data ethics and intellectual property rights.
According to reports, subtitles from 173,536 YouTube videos across more than 48,000 channels were utilized for AI training purposes [2]. These transcripts were allegedly scraped from the platform without the knowledge or consent of their creators. The dataset, known as "YouTube Subtitles," was compiled by the non-profit AI research lab EleutherAI as part of its larger "Pile" collection and has been used by various tech companies for AI development [3].
Among the companies implicated in this controversy are Apple, Nvidia, Anthropic, Salesforce, Bloomberg, and Databricks.
While some companies, such as Anthropic and Salesforce, have acknowledged using the dataset, others have remained silent on the matter. Anthropic, for instance, has argued that YouTube's terms cover direct use of the platform, which it says is distinct from use of the Pile dataset [4].
The unauthorized use of YouTube content for AI training has left many content creators feeling frustrated and violated. Professional YouTubers invest significant time, effort, and resources into producing high-quality videos, and the use of their content without permission or compensation raises serious ethical concerns [5].
This incident has brought to the forefront the ongoing debate about the legality and ethics of using publicly available data for AI training. While some argue that content posted on public platforms like YouTube is fair game for AI training, others contend that explicit permission should be obtained, especially when the data is used for commercial purposes [2].
As AI technology continues to advance, the industry faces increasing scrutiny over its data practices. This incident may lead to calls for more transparent and ethical AI training methods, as well as potential legal challenges from content creators seeking protection for their intellectual property [4].
© 2025 TheOutpost.AI All rights reserved