Cleaning up their act

Nonprofit scrubs illegal content from controversial AI training dataset

After backlash, LAION cleans child sex abuse materials from AI training data.

Ashley Belanger – Aug 30, 2024 6:44 PM

Credit: Kirillm | iStock / Getty Images Plus

After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse materials (CSAM) in an AI training dataset tainting image generators, the controversial dataset was immediately taken down in 2023.

Now, the LAION (Large-scale Artificial Intelligence Open Network) team has released a scrubbed version of the LAION-5B dataset called Re-LAION-5B and claimed that it "is the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM."

To scrub the dataset, LAION partnered with the Internet Watch Foundation (IWF) and the Canadian Center for Child Protection (C3P) to remove 2,236 links that matched with hashed images in the online safety organizations' databases. Removals include all the links flagged by Thiel, as well as content flagged by LAION's partners and other watchdogs, like Human Rights Watch, which warned of privacy issues after finding photos of real kids included in the dataset without their consent.
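The hash-matching step described above can be illustrated with a minimal sketch. It is not LAION's actual pipeline: the `Pair` record, the `url_hash` helper, and the choice of SHA-256 over the link text are all assumptions made for illustration, and the real hash lists from IWF and C3P are not public.

```python
import hashlib
from typing import Iterable, Iterator, NamedTuple


class Pair(NamedTuple):
    """One text-link pair, as in a web-scale image-link dataset (hypothetical shape)."""
    url: str
    caption: str


def url_hash(url: str) -> str:
    """Hex SHA-256 of the link text. The hash scheme is an assumption, not LAION's."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def filter_pairs(pairs: Iterable[Pair], blocked_hashes: set[str]) -> Iterator[Pair]:
    """Yield only the pairs whose link hash is absent from the blocklist."""
    for pair in pairs:
        if url_hash(pair.url) not in blocked_hashes:
            yield pair


# Toy data: two placeholder links, one of which appears on the blocklist.
pairs = [
    Pair("https://example.com/a.jpg", "a cat"),
    Pair("https://example.com/b.jpg", "a dog"),
]
blocked = {url_hash("https://example.com/b.jpg")}
kept = list(filter_pairs(pairs, blocked))
```

Because only hashes are compared, a filter like this never needs to download or view the flagged images. It also catches only content that is already on the list.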

In his study, Thiel warned that "the inclusion of child abuse material in AI model training data teaches tools to associate children in illicit sexual activity and uses known child abuse images to generate new, potentially realistic child abuse content."

Thiel told LAION and other researchers scraping the Internet for AI training data that a new safety standard was needed to better filter out not just CSAM, but any explicit imagery that could be combined with photos of children to generate CSAM. (Recently, the US Department of Justice pointedly said that "CSAM generated by AI is still CSAM.")

While LAION's new dataset won't alter models that were trained on the prior dataset, LAION claimed that Re-LAION-5B sets "a new safety standard for cleaning web-scale image-link datasets." Where illegal content previously "slipped through" LAION's filters, the researchers have now developed an improved system "for identifying and removing illegal content," LAION's blog said.

Thiel told Ars that he would agree that LAION has set a new safety standard with its latest release, but "there are absolutely ways to improve it." However, "those methods would require possession of all original images or a brand new crawl," and LAION's post made clear that it only utilized image hashes and did not conduct a new crawl that could have risked pulling in more illegal or sensitive content. (On Threads, Thiel shared more in-depth impressions of LAION's effort to clean the dataset.)

LAION warned that "current state-of-the-art filters alone are not reliable enough to guarantee protection from CSAM in web scale data composition scenarios."

"To ensure better filtering, lists of hashes of suspected links or images created by expert organizations (in our case, IWF and C3P) are suitable choices," LAION's blog said. "We recommend research labs and any other organizations composing datasets from the public web to partner with organizations like IWF and C3P to obtain such hash lists and use those for filtering. In the longer term, a larger common initiative can be created that makes such hash lists available for the research community working on dataset composition from web."

According to LAION, the bigger concern is that some links to known CSAM scraped into a 2022 dataset are still active more than a year later.

"It is a clear hint that law enforcement bodies have to intensify the efforts to take down domains that host such image content on public web following information and recommendations by organizations like IWF and C3P, making it a safer place, also for various kinds of research related activities," LAION's blog said.

HRW researcher Hye Jung Han praised LAION for removing sensitive data that she flagged, while also urging more interventions.

"LAION’s responsive removal of some children’s personal photos from their dataset is very welcome, and will help to protect these children from their likenesses being misused by AI systems," Han told Ars. "It’s now up to governments to pass child data protection laws that would protect all children’s privacy online."

Although LAION's blog said that the content removals represented an "upper bound" of CSAM that existed in the initial dataset, AI specialist and Creative.AI co-founder Alex Champandard told Ars that he's skeptical that all CSAM was removed.

"They only filter out previously identified CSAM, which is only a partial solution," Champandard told Ars. "Statistically speaking, most instances of CSAM have likely never been reported nor investigated by C3P or IWF. A more reasonable estimate of the problem is about 25,000 instances of things you'd never want to train generative models on—maybe even 50,000."

Champandard agreed with Han that more regulations are needed to protect people from AI harms when training data is scraped from the web.

"There's room for improvement on all fronts: privacy, copyright, illegal content, etc.," Champandard said. Because "there are too many data rights being broken with such web-scraped datasets," Champandard suggested that datasets like LAION's won't "stand the test of time."

"LAION is simply operating in the regulatory gap and lag in the judiciary system until policymakers realize the magnitude of the problem," Champandard said.

“Gold standard” for AI training data

LAION aims to promote AI research by providing an open and transparent dataset, unlike closed models like OpenAI's GPT, which cannot be studied. Re-LAION-5B makes it easy for third parties who made derivatives of the original dataset to clean their derivatives, LAION's blog said.
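One way a third party might clean a derivative against a scrubbed reference is sketched below. This is an assumption about the workflow rather than a procedure LAION documents in the material quoted here: the function name `clean_derivative` and the use of the link as the matching key are hypothetical.

```python
from typing import Iterable, List, Tuple


def clean_derivative(
    derivative_urls: List[str], cleaned_reference_urls: Iterable[str]
) -> Tuple[List[str], int]:
    """Keep only derivative links that still appear in the cleaned reference.

    Returns the surviving links (original order preserved) and the number dropped.
    """
    reference = set(cleaned_reference_urls)
    kept = [url for url in derivative_urls if url in reference]
    return kept, len(derivative_urls) - len(kept)


# Toy data: the cleaned reference no longer contains b.jpg.
reference_urls = ["https://example.com/a.jpg", "https://example.com/c.jpg"]
derivative = ["https://example.com/c.jpg", "https://example.com/b.jpg"]
kept, removed = clean_derivative(derivative, reference_urls)
```

Matching against the cleaned list, rather than against a list of removed links, means the derivative's maintainer never has to handle identifiers of the flagged content.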

"Removing this small subset does not significantly impact the large scale of the dataset, while restoring its usability as a reference dataset for research purposes," LAION's blog said.

Thiel told Ars that he's "strongly in favor of transparent datasets and model cards," noting that the "entire reason" he was able to conduct his study on LAION-5B was because "the dataset was openly available."

"We can't vet anything about datasets used to train closed models," Thiel said.

LAION said that it takes "full accountability" for all of its projects and is "dedicated to building safe and legally compliant datasets and tools to advance research and promote widespread accessibility of AI for academia and technology." But as a small nonprofit research organization, LAION alone cannot "single-handedly rectify all publicly available online information."

Through its partnerships with IWF and C3P, LAION seemingly feels better prepared to prevent future releases from referencing illegal content that should be removed from the open web.

LAION thinks that "open datasets should be subject to continuous scrutiny by the broad community, in a common effort to make open datasets better and better."

Its blog noted that the organization "appreciate[s] very much the effort David Thiel from the Stanford Internet Observatory undertook to look closely at LAION-5B and are grateful to all partner organizations for working with us on making it a better, safer dataset for the research community to use."

Thiel told Ars that the "gold standard" for AI training data would be to never combine images of children with explicit content of any kind in a dataset. The purpose of his LAION-5B study was to propose a stronger safety standard so that AI models like Stable Diffusion 1.5, which was trained on LAION-5B, could never be exploited to generate CSAM.

Champandard believes that "the most responsible and forward-looking approach is to focus" on creating "datasets with public domain works (more than enough to do most academic research), or permissive licensed content, or mutual collaborations, or purpose-built datasets with opt-in only."

"Datasets like 5B are slowing down the development and adoption of more compliant datasets, as researchers will take the datasets that exist and consider them good enough," Champandard suggested. "From that perspective, the web-scale approach is harmful to humanity's progress in AI that fits better with human rights and data rights. New technology can also help train with less data, but that research is not prioritized because there are convenient datasets available, even if they are likely to still include illegal content."

AI safety research on LAION datasets continues

As LAION encourages researchers like Thiel to continue probing open datasets, the nonprofit has requested that researchers contact it directly about any issues they flag. In its blog, LAION claimed that Thiel's team gave it little time to react and mitigate potential harms of models training on LAION-5B.

Thiel told Ars that LAION had seemingly been aware of illegal content in the dataset since the issue was reported shortly after LAION-5B's release in 2022 but did nothing until his report was released in December 2023.

Champandard alleged to Ars that "LAION knew all along there was CSAM in their 5B dataset," which is "why they built a very naive classifier originally on their own to try to filter it out." Because he claimed that he and several other AI groups disclosed the issue to LAION as early as April 2023, Champandard considered LAION providing continued access to LAION-5B to be "willful misconduct" and "gross negligence." He also said that RunwayML nuking Stable Diffusion 1.5 from HuggingFace and GitHub "should tell you how bad this" is.

"They knew it wouldn't work perfectly (obviously), and other experts in the field documented the risk of their approach," Champandard told Ars. "Yet they released the dataset anyway, and their corporate partners quickly found out there was more than enough CSAM in the dataset for diffusion models to generate CSAM. Multiple other responsible AI groups disclosed to LAION the problem (who didn't care). I wrote to LAION on HuggingFace about the CSAM problem, but LAION only took action when the press got involved."

Champandard told Ars that it was "brave (to put it kindly)" of LAION "to release a new dataset, knowing that everything they publish can be used against them and anyone training models on their data." He also questioned why LAION didn't get IWF involved sooner.

According to Thiel, he shared his study exposing the CSAM problem with LAION shortly before it was published, without leaving time to fix the issue, because he believed that the "rapid release of the findings was appropriate because the datasets were still potentially being used to train, and others were unknowingly proliferating the imagery."

"There was also no fast and easy way to sanitize the datasets," Thiel said. "We very specifically did not want them to just do a simple prune of the dataset; taking it down was the correct response."

LAION co-founder and scientific lead Jenia Jitsev told Ars that "there was no evidence that any of the models" trained on LAION-5B "have indeed seen any of potential CSAM samples—as models were trained on filtered subsets of full LAION-5B and it is not clear whether any of the models were impacted. The only clear thing is that LAION-5B contained 1,008 links to suspected CSAM. Whether any models were ever trained on any of those 1,008 samples is not clear."

"Such proof would entail a generated sample that is similar to any of known CSAM content," Jitsev told Ars, similar to how The New York Times has shown "that pieces of their articles are generated by OpenAI models."

"No such proof were ever presented by anybody so far," Jitsev told Ars.

Ashley Belanger Senior Policy Reporter

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.
