AI Training Data Included Child Sexual Abuse Material, Say Stanford Researchers

12/20/2023 21:25
Illegal and inappropriate images are so pervasive, one researcher says, that some tools depict nude women and children by default.

Stability AI, creator of the open-source generative image tool Stable Diffusion, has removed a widely used AI training dataset after researchers found that the data scraper had ingested child sexual abuse material—or CSAM.

The discovery was made by Stanford scientists and reported by 404 Media.

Large language models and AI image generators like Stable Diffusion, Midjourney, and DALL-E rely on massive datasets to train and later generate content. Many of these datasets, like LAION-5B, include images scraped from the internet.

Many of those images depicted harm to minors—material that is illegal around the world.

“Many older models, for example, were trained on the manually labeled ImageNet corpus, which features 14 million images spanning all types of objects,” Stanford researcher David Thiel wrote. “However, more recent models, such as Stable Diffusion, were trained on the billions of scraped images in the LAION-5B dataset.

“This dataset, being fed by essentially unguided crawling, includes a significant amount of explicit material,” Thiel explained.

According to the Stanford report, the illegal images were identified using perceptual and cryptographic hash-based detection, which compares the hash of each image in the dataset against hashes of known CSAM. An image is flagged as potential CSAM if the hashes match or are sufficiently similar.
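
Hash-based matching of this kind is a standard technique. As a rough, hypothetical sketch—not the Stanford team’s actual pipeline—a perceptual-hash check might compare each image’s hash against hashes of known material and flag anything within a small Hamming distance. The Python snippet below assumes the Pillow and imagehash packages and a hypothetical known_hashes set:

    from PIL import Image   # Pillow, for loading images
    import imagehash        # perceptual hashing (pHash, aHash, etc.)

    HAMMING_THRESHOLD = 5   # assumed cutoff; real systems tune this carefully

    def is_potential_match(image_path, known_hashes):
        """Flag an image whose perceptual hash is close to any hash of known material."""
        candidate = imagehash.phash(Image.open(image_path))
        # Subtracting two ImageHash objects yields their Hamming distance.
        return any(candidate - known <= HAMMING_THRESHOLD for known in known_hashes)

Production systems such as Microsoft’s PhotoDNA work against vetted hash lists maintained by child-safety organizations rather than ad hoc sets like the one assumed here.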

Thiel noted that the datasets used to train the image generators did not include the images in question—but could still provide access to the illegal material.

“LAION datasets do not include the actual images; instead, they include a link to the original image on the site from which it was scraped,” Thiel said. In many cases, those images had already been removed.

Stability AI and LAION have not yet responded to Decrypt’s request for comment.

“Web‐scale datasets are highly problematic for a number of reasons, even with attempts at safety filtering,” Thiel said. “Apart from CSAM, the presence of non‐consensual intimate imagery (NCII) or ‘borderline’ content in such datasets is essentially certain—to say nothing of potential copyright and privacy concerns.”

While neither organization has issued an official statement, LAION told 404 Media that it works with universities, researchers, and NGOs to improve its filters and is working with the Internet Watch Foundation (IWF) to identify and remove content suspected of violating laws.

Thiel also highlighted the tendency of AI models to associate women with nudity, and noted that AI-powered NCII apps are becoming easier to create.

“We already know that many SD1.5 checkpoints are so biased that you have to put ‘child’ in the negative prompts to get them to not produce CSAM, and that they tend to correlate women with nudity,” Thiel wrote in a Bluesky follow-up thread. “This is why the ‘undress’ apps that have been driving so many NCII incidents are trivial to produce.”

Earlier this month, analytics firm Graphika released a report saying that NCII has skyrocketed by more than 2,408% since January, to around 32,000 images, thanks in part to AI undressing apps that let users remove the clothing from an image using AI deepfake technology.

The LAION datasets and the models trained on them, Thiel concluded, should be set aside.

“Ultimately, the LAION datasets and models trained on them should likely be deprecated, with any remaining versions sanitized as much as possible and confined to research contexts,” Thiel said. “Some child safety [organizations] are already helping with this, and we can help make connections for others who need help here.”

While Thiel sounded the alarm on CSAM in AI-model training data, he emphasized the importance of open-source development, arguing it is preferable to models “gatekept” by a small group of corporations.

“I know that some will use these results to argue against open-source ML, which is not my intent,” he said. “Open-source ML has many problems, but so does ML gatekept by a handful of megacorps and wealthy accelerationist creeps. Both were hastily deployed without proper safeguards.”

In October, the UK-based Internet Watch Foundation warned that AI-generated child abuse material could “overwhelm” the internet after the group found more than 20,000 such images in a single month.

Adding to the problem of combating CSAM online, IWF CTO Dan Sexton told Decrypt that as AI image generators grow more sophisticated, it is becoming harder to tell whether an image depicts a real person or is entirely AI-generated.

“So there's that ongoing thing of you can't trust whether things are real or not,” Sexton said in an interview. “The things that will tell us whether things are real or not are not 100%, and therefore, you can't trust them either.”

Sexton said the IWF’s mission of removing CSAM from the internet focuses primarily on the “open web,” also known as the surface web, due to the difficulty of getting child abuse material removed from the dark web.

“We spend slightly less time in dark web spaces than we do in the open web where we feel we can have a bit more of an effect,” Sexton said.

Edited by Ryan Ozawa.
