Curated by THEOUTPOST
On Fri, 26 Jul, 8:00 AM UTC
4 Sources
[1]
A leaked document indicates Runway's Gen-3 AI video generation tool may have been trained on YouTube videos and copyrighted content without permission
Here's a question that can throw a generative AI company into a twist: "What content has been used to train your models?" While some opt to dodge the question, and others bullishly front out the issue entirely, the question of whether an AI company has scraped content for its own business purposes without permission is a thorny one. At best, you're likely to get a mealy-mouthed explanation of "curated datasets", and at worst, a polemic about whether everything on the internet is essentially fair game. Now a document obtained by 404 Media appears to show that part of the data used to train Runway's latest AI video generation tool, Gen-3, may have come from the YouTube channels of thousands of popular media companies, including Pixar, Netflix, Disney and Sony. While 404 Media doesn't go into details as to how the document was obtained, nor could it verify that every video mentioned within was used to train Gen-3, it's potentially an insight into the sort of practices that an AI company might use to scrape copyrighted material to train its models. A former Runway employee spoke to 404 Media about the methodology involved. The 14 spreadsheets contained within the leaked document are said to feature terms like "beach" or "rain", with the names of Runway employees next to them. According to the source, the named employees were tasked with finding videos or channels related to these keywords, and would then use a YouTube video downloader tool via a proxy to scrape them from the site without being blocked by Google. It's not just YouTube content that looks to have been scraped, either. One spreadsheet contains 14 links to non-YouTube sources, including a link to a website dedicated to streaming popular cartoons and animated movies that has had thousands of copyright complaints logged against it. Essentially, pirated media looks to have been at least under consideration for training data, if not directly scraped and used.
404 Media went one step further and attempted to use Gen-3 to generate video using prompts that contained keywords based on the terms found in the spreadsheet, and was able to create clips that looked to be very much in the same style as the associated content. Runway was itself part-funded by Google, among others, so scraping content without permission from creators on Google's own platform, if true, is likely to land it in significant hot water. Never mind the potential wider legal repercussions. Still, while the issue of AI content theft is a thorny one, the model does still appear to have issues. Ars Technica tried creating some videos recently with Gen-3 Alpha, and it gave a cat a pair of human hands. I'm not sure what content was used to train that particular version of the model, but I'd suggest that no matter the methodology used here, it could do with some work one way or the other.
[2]
Leak Shows That Google-Funded AI Video Generator Runway Was Trained on Stolen YouTube Content, Pirated Films
A popular and powerful text-to-video AI generator developed by Runway was trained on copious amounts of pirated content and ripped off YouTube videos, according to a gigantic internal spreadsheet obtained by 404 Media. Last month, the company's Gen-3 Alpha video generation tool drew huge amounts of attention, with publications -- including Futurism -- lauding the almost photorealistic clips it could generate. At the time, Runway claimed that Gen-3 Alpha was "trained jointly on videos and images," but stopped far short of elaborating on the source of the data. Now, according to the document obtained by 404 Media, there may be a good reason for that coyness. The spreadsheet is chock full of popular content drawn from major YouTube channels, including those belonging to Disney, Netflix, and Sony, in addition to links to websites that are known to host pirated content. While 404 Media couldn't confirm that Gen-3 Alpha was trained on all of the listed assets, it seems circumstantially very likely -- and, as such, a striking new piece of evidence that AI companies are shamelessly stealing content to feed AI models with a complete disregard for copyright -- a consistently recurring pain point in the world of generative AI. While questions remain as to which videos actually made it into the training data, 404 Media was effortlessly able to generate believable videos of well-known YouTube personalities. Runway even reportedly went as far as to hide its tracks by using a proxy to avoid being blocked by YouTube. "The channels in that spreadsheet were a company-wide effort to find good quality videos to build the model with," an unnamed former employee told 404 Media. "This was then used as input to a massive web crawler which downloaded all the videos from all those channels, using proxies to avoid getting blocked by Google." 
Runway raised a whopping $141 million in funding last year, including from YouTube owner Google, Salesforce, and chipmaker NVIDIA -- for a heady valuation of $1.5 billion. And it's not just Runway that has come under fire for using copyrighted material without obtaining the necessary licenses to train its AI models. Earlier this year, OpenAI CTO Mira Murati claimed in an interview with the Wall Street Journal that she didn't know if training data for the company's upcoming Sora video generator included videos from YouTube, Instagram, or Facebook -- a bizarre admission that drew plenty of skepticism. A couple of weeks later, the New York Times revealed that OpenAI had ignored corporate policies to skirt copyright laws, relying on tools that transcribe YouTube videos to train its AI chatbots. Meanwhile, YouTube CEO Neal Mohan warned AI companies that training AI models on YouTube videos would be a "clear violation" of the video platform's terms of use. In other words, this latest report is yet more evidence that AI companies including Runway and OpenAI are playing fast and loose with copyrighted material. The topic of intellectual property will likely remain a major sticking point in the development of generative AI, perhaps especially when it comes to AI models that can generate entire videos. The tech is even forcing legislators to revisit "fair use," a doctrine that permits the limited use of copyrighted material under US law. While AI companies have previously argued that much of the scraped data is fair game in court, many copyright holders have cried foul, leading to a fierce and still growing legal battle. And by linking its work to ripped-off and pirated videos, Runway has vaulted itself into the hot seat.
[3]
In latest AI training drama, Runway accused of using publicly available YouTube videos - SiliconANGLE
In the latest drama surrounding the training of artificial intelligence models, video generation startup Runway AI Inc. is being accused of using publicly available YouTube videos to train its AI video generation model. The company, which launched its Gen-3 Alpha model for generating 10-second videos in June to generally positive reviews, is claimed by 404 Media to have scraped "thousands of videos from popular YouTube creators and brands, as well as pirated films." The claim is based on an internal spreadsheet allegedly obtained by the outlet. The YouTube channels allegedly used to train Runway's AI include those from The New Yorker, VICE News, Pixar, Disney, Netflix and Sony. Videos from YouTube creators, including Casey Neistat, Sam Kolder, Benjamin Hardman and Marques Brownlee, were also apparently used. Notably, the leaked spreadsheet is said to show that the company was trying to obtain videos that had a specific type of subject matter, camera work and a diverse set of people in them. In some cases, the videos targeted included those showing rain, beaches and even doctors. 404 Media claims that the use of such material to assist in training AI models is ripping off YouTube creators, with the underlying theme that reading or viewing publicly available material is somehow some sort of massive crime. And yet, it isn't. While there are arguably grey areas around AI that laws are yet to catch up with, if someone reads 100 books or watches 100 videos and then comes to a conclusion based on them, that's not copyright theft unless the knowledge learned - outside of facts - was copied verbatim. The closest 404 Media can get is that a video of a man skiing generated by Runway is somewhat similar to a video from a YouTube creator. Another video of a racing car was also similar.
Both of the examples used prompts specifically asking Runway to copy the original video - not regular user behavior - and the fact that the results were not identical (which 404 Media admits) means that they are not a breach of copyright. The drama around Runway followed a similar storm in a teacup on July 16, when Anthropic PBC, Nvidia Corp., Apple Inc., and Salesforce Inc. were accused of using subtitles from YouTube videos to help train their AI models. Legal action has also been taken in relation to AI training, with Microsoft Corp. and OpenAI sued in November for their use of nonfiction authors' work in AI training. The class-action lawsuit, led by a New York Times reporter, claimed that OpenAI scraped the content of hundreds of thousands of nonfiction books to train its AI models. The Times also accused OpenAI, Google LLC and Meta Platforms Inc. in April of skirting legal boundaries for AI training data.
[4]
This Google-backed billion-dollar AI startup has been accused of scraping YouTube for its video generating tool - Times of India
Runway, a billion-dollar AI startup backed by Google, is facing backlash over allegations that it scraped thousands of YouTube videos without permission to train its latest AI video generation model. The accusations stem from a leaked internal spreadsheet obtained by 404 Media. The document, reportedly shared by a former Runway employee, details plans to categorise and tag content from over 3,900 YouTube channels. These include major media companies like Disney and Netflix, as well as popular individual creators such as Casey Neistat and Marques Brownlee (MKBHD). According to 404 Media's report, the data was used to develop "Jupiter," now known as Runway's Gen-3 AI video creation model. The spreadsheet also allegedly includes links to pirated video websites. Runway, valued at $1.5 billion and having raised $141 million from investors including Google, has not confirmed the authenticity of the spreadsheet. The company previously stated it uses "curated, internal datasets" for training but declined to provide specifics. YouTubers expressed displeasure on social media, including MKBHD, who says over 1,600 of his videos were scraped. Another YouTuber, Mrwhosetheboss, also took to social media, calling the practice "scary stuff" and revealing that the company used 1,600 videos from his channel. YouTube aside, the report further suggests that the training data set also included piracy sites like KissCartoon, which has a vast library of anime and animated content. In April this year, YouTube CEO Neal Mohan told Bloomberg that training AI models on videos uploaded to the platform is a "clear violation" of company policies. Runway is already facing lawsuits from creators over unauthorised use of their content in AI training. Recent reports have implicated other tech giants like Apple, Anthropic and Nvidia in similar practices. OpenAI is also not sure whether its text-to-video generating tool has been trained on YouTube videos.
A leaked document suggests that Runway, a Google-backed AI startup, may have used publicly available YouTube videos and copyrighted content to train its Gen-3 AI video generation tool without proper authorization.
A recently leaked document has sparked controversy in the AI community, suggesting that Runway, a billion-dollar AI startup backed by Google, may have used publicly available YouTube videos and copyrighted content to train its Gen-3 AI video generation tool without proper authorization [1]. The document, reported to be an internal spreadsheet, lists sources of the company's training data and has raised questions about the ethical and legal implications of AI model training practices.
Runway's Gen-3 tool is a sophisticated AI-powered video generation system capable of creating, editing, and manipulating video content. The company has gained significant attention and investment, including backing from tech giant Google [4]. However, the leaked document has cast a shadow over the company's data acquisition methods.
According to the leaked information, Runway may have utilized a vast array of publicly available YouTube videos for training its AI model [2]. This practice, often referred to as "scraping," involves collecting large amounts of data from public sources without explicit permission from content creators or platform owners. The document reportedly lists various video categories used for training, including music videos, TV shows, and user-generated content.
The allegations have raised significant legal and ethical concerns within the tech industry. Using copyrighted content without permission for AI training purposes exists in a gray area of intellectual property law [3]. While some argue that such use falls under fair use doctrine, others contend that it infringes on creators' rights and could potentially harm their livelihoods.
This controversy is not unique to Runway and reflects a broader debate in the AI industry about data acquisition and usage for model training. Similar accusations have been leveled against other AI companies, highlighting the need for clearer guidelines and regulations surrounding AI training data [1].
As of now, Runway has not publicly commented on the leaked document or the allegations. The AI community and legal experts are closely watching this situation, as it could potentially set precedents for how AI companies approach data collection and usage in the future [2]. The outcome of this controversy may influence future AI development practices and potentially lead to more stringent regulations in the rapidly evolving field of artificial intelligence.
© 2024 TheOutpost.AI All rights reserved