Nonprofit scrubs illegal content from controversial AI training dataset

50me12

Ars Tribunus Angusticlavius
7,546
Removing this small subset does not significantly impact the large scale of the dataset, while restoring its usability as a reference dataset for research purposes

Is there any testing that shows differences between before and after and how this content actually would impact the output?

Good idea to remove such content from the dataset, but it's not clear if there's a real tangible outcome from it being IN there or OUT.

Can't say you fixed the problem, or monitor it, if you can't measure whether ... you stopped anything.
 
Last edited:
Upvote
-11 (16 / -27)

invertedpanda

Ars Tribunus Militum
2,292
Subscriptor
I'm worried that this is just rearranging deck chairs on the Titanic.

It is not exactly challenging for an AI model trained on perfectly legal images to infer what CSAM might look like.
Not to mention it's become quite a bit simpler to do LoRA training with whatever images you want. Doesn't even take much for a model like Flux.
 
Upvote
26 (26 / 0)

Mustachioed Copy Cat

Ars Praefectus
4,801
Subscriptor++
This should have already taken place. The existence of that hash database is not a secret. The nature of the internet and the type of training being done make the necessity of such comparison obvious on its face. Its exclusion from the training process should be taken as willful, because the extra step was inconvenient.

And at this point, if any of the CSAM detection processes seem viable, they should probably be applied as well to anything that is flagged as NSFW (a detection step that does seem fairly reliable). If CSAM gets into your training data, it had better have been novel (unhashed) and have evaded a couple of other attempts to identify it.
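To make the hash-comparison step concrete, here is a rough sketch of what screening downloaded images against a vetted hash list could look like. The imagehash/Pillow usage is real, but KNOWN_BAD_HASHES is an empty stand-in: the actual lists (e.g. NCMEC's) are access-controlled and use their own hash formats, so treat this as an illustration rather than the pipeline LAION or anyone else actually runs.

```python
# Rough sketch of perceptual-hash screening during dataset curation.
# KNOWN_BAD_HASHES is a placeholder; real hash lists are access-controlled.
from pathlib import Path

from PIL import Image   # pip install Pillow
import imagehash        # pip install ImageHash

KNOWN_BAD_HASHES: set[str] = set()   # would be loaded from a vetted hash list


def screen_directory(image_dir: str, max_distance: int = 4) -> list[Path]:
    """Return paths whose perceptual hash is within max_distance of a known-bad hash."""
    bad_hashes = [imagehash.hex_to_hash(h) for h in KNOWN_BAD_HASHES]
    flagged = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        candidate = imagehash.phash(Image.open(path))
        # Subtracting two ImageHash objects gives their Hamming distance.
        if any(candidate - bad <= max_distance for bad in bad_hashes):
            flagged.append(path)   # exclude from training and escalate for review
    return flagged
```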
 
Upvote
26 (33 / -7)

50me12

Ars Tribunus Angusticlavius
7,546
I'm worried that this is just rearranging deck chairs on the Titanic.

It is not exactly challenging for an AI model trained on perfectly legal images to infer what CSAM might look like.
That's what I'm wondering as well.

What might lead AI to produce an undesirable outcome ... might not be directly similar data; it might be data "like" it, or data that we don't think of as exactly the same. So if you remove CSAM, you might still not prevent the AI from doing a bad thing if other data contributed to it.

Example:

People are weird, and the data surrounding our language etc. is also weird. On Reddit there were a series of "fashion" subs years ago. All the language used on those subs was about fashion.

The pictures tho ... they were all candid photos taken in public ... If you looked at them long enough you realized they were all candid photos of underaged kids ...

They weren't fashion subs at all, they were all about candid photos of young kids on the street wearing not a lot of clothing. Creepy as hell. (Reddit eventually shut them down in a series of user and sub bans)

What would AI do with that? Maybe that's more of an influence on bad AI outcomes than easily defined horrible material.
 
Upvote
43 (44 / -1)

leonwid

Ars Tribunus Militum
1,689
Subscriptor++
I'm worried that this is just rearranging deck chairs on the Titanic.

It is not exactly challenging for an AI model trained on perfectly legal images to infer what CSAM might look like.
If you describe, in detail, in your prompt what you’d like to see, how is that the problem of the LLM?
I fully agree that the dataset cannot have illegal content, or content without permission (a discussion unto itself). But if people use it for illegal content, then they, and not the tool, are most responsible.
I do think that for online tools it should be the website that is not allowed to generate illegal content, regardless of the LLM training set.
 
Upvote
18 (20 / -2)

50me12

Ars Tribunus Angusticlavius
7,546
If you describe, in detail, in your prompt what you’d like to see, how is that the problem of the LLM?
I fully agree that the dataset cannot have illegal content, or content without permission (a discussion unto itself). But if people use it for illegal content, then they, and not the tool, are most responsible.
I do think that for online tools it should be the website that is not allowed to generate illegal content, regardless of the LLM training set.

I don't disagree, but I do think there's a challenge: AI right now is trained on human content and requires a large amount of it to do its thing.

I fear that as long as that is the case AI will constantly discover "LOL you humans are a bunch of horn dogs and you want to see terrible shit sometimes. In fact terrible stuff drives engagement and gets your attention." and when we tell it not to do that "Yeah whatever I've seen all your content ... you want that. Everything you showed me indicates you want that."

And we're faced with the fact that AI is not entirely wrong ...
 
Upvote
20 (21 / -1)

Mad Klingon

Ars Tribunus Militum
1,556
Subscriptor++
With many sites upset that their data is being scraped for AI use, can't help but wonder how many sites have deliberately poisoned their sites with 'interesting' data. Should be fairly easy to include stuff on a site that a normal user could never access but a scraping bot might find if it is ignoring the robots file.
 
Upvote
1 (7 / -6)

metrometro

Wise, Aged Ars Veteran
183
Upvote
82 (83 / -1)

50me12

Ars Tribunus Angusticlavius
7,546
With many sites upset that their data is being scraped for AI use, can't help but wonder how many sites have deliberately poisoned their sites with 'interesting' data. Should be fairly easy to include stuff on a site that a normal user could never access but a scraping bot might find if it is ignoring the robots file.
There was some work that indicated that poisoning sites doesn't work.

However, some indications were that actual AI output on a site ... poisoned them fairly effectively.
 
Upvote
7 (7 / 0)

Dark Pumpkin

Ars Scholae Palatinae
1,178
I'll bet this makes no difference in the end.

I can type unrelated things you would never see together in the real world and these image synthesis models just figure out how to put those things into an image together. What's to stop the models from figuring out how to combine Child and [redacted] if they can combine two of practically anything else?

The only way to stop CSAM content from being generated is to detect it when it is being generated and stop the output. But that will only work for server side generation. It would be trivial for someone with a bit of programming knowledge to remove the detection from open source models and use those on an air gapped computer at home with a decent GPU.

This is not a problem which can be solved now that AI image generation has been released into the wild.
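For reference, the server-side gate described above amounts to scoring every generated image before it leaves the server. A minimal sketch, with the generator and the safety classifier as hypothetical stand-ins for whatever a hosted service actually runs:

```python
# Sketch of a server-side output gate: score each generated image with a
# safety classifier and refuse to return anything over a threshold.
# `generate` and `classifier` are hypothetical stand-ins, not a real API.
from typing import Callable, Optional

from PIL import Image

UNSAFE_THRESHOLD = 0.5   # assumed operating point; tuned per deployment


def generate_with_gate(
    generate: Callable[[str], Image.Image],
    classifier: Callable[[Image.Image], float],   # returns P(unsafe)
    prompt: str,
) -> Optional[Image.Image]:
    image = generate(prompt)
    if classifier(image) >= UNSAFE_THRESHOLD:
        # Block the output; a real service would also log the attempt and,
        # where required, report it rather than silently drop it.
        return None
    return image
```

As the comment notes, nothing like this survives once the weights run locally: the gate lives in the serving code, not the model.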
 
Upvote
19 (21 / -2)
There was some work that indicated that poisoning sites doesn't work.

However, some indications were that actual AI output on a site ... poisoned them fairly effectively.
I'm deeply curious about that. Do you have links to that research handy?

I've read about model collapse, which -- I understand, perhaps incorrectly -- is not exactly the same thing.

But I also know that fine-tuning diffusion models on human-curated AI outputs can improve the models dramatically... at a substantial cost of lost "creativity". It's what many user-created LoRAs do, after all.
 
Upvote
1 (1 / 0)
I'll bet this makes no difference in the end.

I can type unrelated things you would never see together in the real world and these image synthesis models just figure out how to put those things into an image together. What's to stop the models from figuring out how to combine Child and [redacted] if they can combine two of practically anything else?

The only way to stop CSAM content from being generated is to detect it when it is being generated and stop the output. But that will only work for server side generation. It would be trivial for someone with a bit of programming knowledge to remove the detection from open source models and use those on an air gapped computer at home with a decent GPU.

This is not a problem which can be solved now that AI image generation has been released into the wild.
I don't believe that's strictly true. Diffusion models are shockingly resistant to creating things that don't appear fairly frequently in their data set. And the more fine-tuned the models are to producing realistic (anatomically correct) output, the harder it gets. Just ask anyone who has tried to create centaurs and cyclopes and six-armed Kalis. Typical diffusion models just won't do it unless you fine-tune these anatomically "aberrant" concepts back in.

Edit: But I agree that the proverbial cat escaped quite some time ago. Edit again: And very young looking but anatomically normal people engaged in positions that older people are often engaged in isn't much of a stretch of the diffusion models' "imagination". So, yes, CSAM, probably easy. "Anything you can think of"... much harder.
 
Last edited:
Upvote
30 (30 / 0)

ajg

Seniorius Lurkius
25
With many sites upset that their data is being scraped for AI use, can't help but wonder how many sites have deliberately poisoned their sites with 'interesting' data. Should be fairly easy to include stuff on a site that a normal user could never access but a scraping bot might find if it is ignoring the robots file.
Many sites do something like that to the images, but if you're thinking of deliberately injecting CSAM into AI training sets, that would be massively illegal. The closest equivalent to that I can think of is when some South Korean ISP deliberately planted malware on torrents going over its network:
https://www.tomshardware.com/tech-i...issing-files-strange-folders-and-disabled-pcs
 
Upvote
15 (15 / 0)

blankdiploma

Wise, Aged Ars Veteran
152
I don't believe that's strictly true. Diffusion models are shockingly resistant to creating things that don't appear fairly frequently in their data set. And the more fine-tuned the models are to producing realistic (anatomically correct) output, the harder it gets. Just ask anyone who has tried to create centaurs and cyclopes and six-armed Kalis. Typical diffusion models just won't do it unless you fine-tune these anatomically "aberrant" concepts back in.

Edit: But I agree that the proverbial cat escaped quite some time ago. Edit again: And very young looking but anatomically normal people engaged in positions that older people are often engaged in isn't much of a stretch of the diffusion models' "imagination". So, yes, CSAM, probably easy. "Anything you can think of"... much harder.

Yeah, I think in practice the difference between legal and illegal images of naked humans is dramatically smaller than the difference between a horse and a centaur.
 
Upvote
11 (11 / 0)

julesverne

Ars Scholae Palatinae
1,162
Is there any testing that shows differences between before and after and how this content actually would impact the output?

Good idea to remove such content from the dataset, but it's not clear if there's a real tangible outcome from it being IN there or OUT.

Can't say you fixed the problem, or monitor it, if you can't measure whether ... you stopped anything.
The downvoting of this comment reflects a belief that it missed the point?

But the question is nevertheless valid. A couple thousand images (or links) will likely have a homeopathic effect on output, given that the vast amount of text and image data in the complete training set far outweighs the CSAM.

The expressed fears that a couple thousand images could lead the LLM to generally associate children with illegal activities are understandable, but they don't reflect the huge volume disparity in the data.

Irrespective of that, and at the risk of stating the obvious, it was right to delete the data no matter how small the effect. One image is too much.
 
Upvote
8 (10 / -2)

IncorrigibleTroll

Ars Tribunus Angusticlavius
9,228
That's what I'm wondering as well.

What might lead AI to produce an undesirable outcome ... might not be directly similar data; it might be data "like" it, or data that we don't think of as exactly the same. So if you remove CSAM, you might still not prevent the AI from doing a bad thing if other data contributed to it.

That seems categorically different to me. If the model has CSAM in the training data, then any CSAM it produces is essentially revictimizing the original kids in the training CSAM. It’s a sort of fruit of the poisonous tree argument, and there’s some merit to it. If there is no genuine CSAM in the training corpus, then any noncey stuff it spits out strikes me as essentially the same as something like lolicon hentai: gross, but not harming any actual kids.
 
Upvote
3 (7 / -4)

richten

Ars Scholae Palatinae
946
Anyone making an AI model should be responsible for every single image they use in their training.
You don't know if there is CSAM in your training data? Well, you should.
You don't know if you are using copyrighted material? Figure it out. Don't have permission? Ask, and pay.
Why is doing these things at a small scale illegal, but when these organizations do it at a massive scale suddenly it's OK?
 
Upvote
15 (23 / -8)

Kjella

Ars Tribunus Militum
1,992
While I generally think that people like Thiel cross the line into witch-hunting and are glad to find something that generates notoriety, why did it take a backlash for the LAION team to actually clean up their dataset?
Primarily because LAION is actually just metadata about images crawled from the Internet. This is the first question of their FAQ:
Q: Does LAION datasets respect copyright laws?
A: LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images. While we downloaded and calculated CLIP embeddings of the pictures to compute similarity scores between pictures and texts, we subsequently discarded all the photos. Any researcher using the datasets must reconstruct the images data by downloading the subset they are interested in. For this purpose, we suggest the img2dataset tool.
These are the dataset columns:
  • URL: the image url
  • TEXT: captions, in english for en, other languages for multi and nolang
  • WIDTH: picture width
  • HEIGHT: picture height
  • LANGUAGE: the language of the sample
  • similarity: cosine between text and image ViT-B/32 embeddings
  • pwatermark: probability of being a watermarked image
  • punsafe: probability of being an unsafe image
There's not even a hash here that could be used for later comparison. All the later discoveries were made by researchers who followed these URLs, downloaded the images again, and found that a few of them were not only unsafe but actually illegal. But that kind of in-depth checking was far beyond the scope of the initial project, which was simply to provide an open alternative to all the proprietary internal datasets held by large companies.
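To illustrate, the `similarity` column above can be reproduced for a single row by recomputing the cosine between the CLIP ViT-B/32 text and image embeddings; a full rebuild of the image data itself would go through img2dataset as the FAQ says. A rough sketch, assuming OpenAI's clip package and network access to the URL:

```python
# Recompute the LAION `similarity` value for one (URL, TEXT) row:
# cosine between CLIP ViT-B/32 image and text embeddings.
# Assumes: pip install torch requests pillow git+https://github.com/openai/CLIP.git
import io

import requests
import torch
import clip                      # OpenAI CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def clip_similarity(url: str, caption: str) -> float:
    raw = requests.get(url, timeout=10).content
    image = preprocess(Image.open(io.BytesIO(raw))).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        image_emb = model.encode_image(image)
        text_emb = model.encode_text(text)
    # Normalise, then the dot product is the cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())
```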
 
Upvote
13 (15 / -2)

aiken_d

Ars Tribunus Militum
2,022
So this company possessed and used CSAM? Why aren’t they under arrest?
Super clueless comment but just in case: there is no evidence and no allegations that the content was collected, stored, or identified because it was CSAM. Crimes require intent. This is no different than if you put CSAM on a web server that Google crawled: there isn’t a person in the world that would argue for prosecuting Google rather than the person hosting the content.
 
Upvote
8 (11 / -3)

aiken_d

Ars Tribunus Militum
2,022
Anyone making an AI model should be responsible for every single image they use in their training.
You don't know if there is CSAM in your training data? Well, you should.
You don't know if you are using copyrighted material? Figure it out. Don't have permission? Ask, and pay.
Why is doing these things at a small scale illegal, but when these organizations do it at a massive scale suddenly it's OK?
Interesting take. How did you validate that your comment here is not accidentally copyright infringing?
 
Upvote
-3 (7 / -10)

RGMBill

Ars Scholae Palatinae
1,237
Sounds like they are doing it wrong. They need to actually train a "bad" LLM with CSAM and have it learn how to tell CSAM from non-CSAM, then use that as a filter before pulling the linked images into the "good" LLM's training set. If the filtering AI tags the image, it doesn't go in without a review by a real human (plus the added benefit that the appropriate legal authorities can be brought into the loop to have it taken down and the perpetrator punished).
 
Upvote
-7 (0 / -7)

JoHBE

Ars Tribunus Militum
2,565
Subscriptor++
I'll bet this makes no difference in the end.

I can type unrelated things you would never see together in the real world and these image synthesis models just figure out how to put those things into an image together. What's to stop the models from figuring out how to combine Child and [redacted] if they can combine two of practically anything else?

The only way to stop CSAM content from being generated is to detect it when it is being generated and stop the output. But that will only work for server side generation. It would be trivial for someone with a bit of programming knowledge to remove the detection from open source models and use those on an air gapped computer at home with a decent GPU.

This is not a problem which can be solved now that AI image generation has been released into the wild.
So much this.

The problem with this entire project and the articles about it is that it laser-focusses on a particular problem that deserves being addressed, but at the same time it is shockingly oblivious to an entire surrounding context that totally DWARFS the issue they are addressing. When reading I was almost getting the impression they were actively avoiding expanding from their extremely narrow perspective; it is baffling.

People who aren't properly informed and read this stuff will come away with the idea that this issue is the main enabling factor for artificial CSAM, that effectively scrubbing all of the potential real CSAM from datasets will solve the issue, and that there IS a practically enforceable way to achieve all this.

But in reality it's all a sideshow, and none of this will achieve anything, besides enabling (scrubbed) dataset providers to claim that they themselves acted responsibly.
 
Upvote
7 (8 / -1)

JoHBE

Ars Tribunus Militum
2,565
Subscriptor++
That seems categorically different to me. If the model has CSAM in the training data, then any CSAM it produces is essentially revictimizing the original kids in the training CSAM. It’s a sort of fruit of the poisonous tree argument, and there’s some merit to it. If there is no genuine CSAM in the training corpus, then any noncey stuff it spits out strikes me as essentially the same as something like lolicon hentai: gross, but not harming any actual kids.
I can see the "philosophical" problem with this.

But the PRACTICAL situation more than likely is that any of the output of a "tainted" model like that is totally impossible to meaningfully associate directly with the pictures/people. Any sense of "victimization" ENTIRELY depends on even being aware that the source material was in there, because the output alone is just remixed into genericness.

Look at it like this: let's say you have conclusive evidence that some pictures of a particular CSAM victim you know, got into the dataset. In practice, it's just a drop in the ocean that will never leave telltale recognizable signals in the output of the model. What's going to victimize the CSAM victim MORE: keeping the information for yourself, or following some (I would argue seriously misdirected) noble "right to know" principle, and rubbing in their noses that "you should know the model was also trained on your abuse pics".

The victimization is boosted by the knowledge, more than by the situation itself.
 
Upvote
0 (5 / -5)
I can see the "philosophical" problem with this.

But the PRACTICAL situation more than likely is that any of the output of a "tainted" model like that is totally impossible to meaningfully associate directly with the pictures/people. Any sense of "victimization" ENTIRELY depends on even being aware that the source material was in there, because the output alone is just remixed into genericness.

Look at it like this: let's say you have conclusive evidence that some pictures of a particular CSAM victim you know, got into the dataset. In practice, it's just a drop in the ocean that will never leave telltale recognizable signals in the output of the model. What's going to victimize the CSAM victim MORE: keeping the information for yourself, or following some (I would argue seriously misdirected) noble "right to know" principle, and rubbing in their noses that "you should know the model was also trained on your abuse pics".

The victimization is boosted by the knowledge, more than by the situation itself.
True. But one could still scrub the training data of the known CSAM without informing the victims portrayed in the CSAM, or even without telling anyone external to the organization. I would think that would be a best-of-both-worlds situation, yes?
 
Upvote
5 (5 / 0)

Psyborgue

Ars Tribunus Angusticlavius
7,637
Subscriptor++
Anyone making an AI model should be responsible for every single image they use in their training.
We're talking billions of images that are scraped, which is what's required. The best one can do is use image hashes, and there isn't a guarantee even then. The hashes are constantly updated.
 
Upvote
-7 (1 / -8)

IncorrigibleTroll

Ars Tribunus Angusticlavius
9,228
I can see the "philosophical" problem with this.

But the PRACTICAL situation more than likely is that any of the output of a "tainted" model like that is totally impossible to meaningfully associate directly with the pictures/people. Any sense of "victimization" ENTIRELY depends on even being aware that the source material was in there, because the output alone is just remixed into genericness.

Look at it like this: let's say you have conclusive evidence that some pictures of a particular CSAM victim you know, got into the dataset. In practice, it's just a drop in the ocean that will never leave telltale recognizable signals in the output of the model. What's going to victimize the CSAM victim MORE: keeping the information for yourself, or following some (I would argue seriously misdirected) noble "right to know" principle, and rubbing in their noses that "you should know the model was also trained on your abuse pics".

The victimization is boosted by the knowledge, more than by the situation itself.

I don’t disagree with any of that. I was addressing the very real possibility that a model completely and utterly devoid of training CSAM could still output pedo images. The notion of a tainted model is statistically dubious, as you say, but the fact remains that the model is utilizing images of abuse. Sure, it gets a bit into transitive culpability, but there is at least some connection to actual victims. It’s tenuous, but slightly more direct than, say, Dropbox’s.

There’s definitely room for argument about how serious it is. Personally, I think anybody training a model should make a good-faith best effort at filtering every bit of CSAM possible, and that the general public should be understanding that no effort is perfect.
 
Upvote
2 (2 / 0)

zdanee

Ars Scholae Palatinae
730
If you describe, in detail, in your prompt what you’d like to see, how is that the problem of the LLM?
I fully agree that the dataset cannot have illegal content, or content without permission (a discussion unto itself). But if people use it for illegal content, then they, and not the tool, are most responsible.
I do think that for online tools it should be the website that is not allowed to generate illegal content, regardless of the LLM training set.
And this is why AI is a misnomer; it's really an "image-generating neural network". If it were "intelligent" you could set up barriers, like not generating CSAM: it could understand from the prompt what you expect and deny it, no matter how you word it. Imagine talking to your secretary: "I need pictures of Spiderman!" "Yes sir!" vs. "I need images of Peter Parker, middle school student, changing into spandex." "Sir, I have called the FBI; they are here."
 
Upvote
4 (4 / 0)

zdanee

Ars Scholae Palatinae
730
Sounds like they are doing it wrong. They need to actually train a "bad" LLM with CSAM and have it learn how to tell CSAM from non-CSAM, then use that as a filter before pulling the linked images into the "good" LLM's training set. If the filtering AI tags the image, it doesn't go in without a review by a real human (plus the added benefit that the appropriate legal authorities can be brought into the loop to have it taken down and the perpetrator punished).
Oh man, you want to create the true monster? The only reason it does not exist yet is the cost of making a new model, but each day we edge closer to the point where a group of criminals could afford it. When they do, I don't even know what could be done once that gets made and released into the wild... Close the old internet and start anew, probably.
 
Upvote
0 (1 / -1)

JoHBE

Ars Tribunus Militum
2,565
Subscriptor++
True. But one could still scrub the training data of the known CSAM without informing the victims portrayed in the CSAM, or even without telling anyone external to the organization. I would think that would be a best-of-both-worlds situation, yes?
Well, yes, obviously. :)

Anyway, I think there's a more general point to be made that it is possible to try TOO hard when assessing and declaring potential harm and "victims", which then ends up having a bigger "harm effect size" than just letting it go, and also diverts attention from the real problems.

Remember that there were also complaints that the dataset contained links to private pics of children in some country somewhere. And it was literally claimed that they were being victimized by this, and that their rights needed to be protected.

If you're somewhat aware of how all of this works, that is just such hyperbole, and an insult to everyone who is ACTUALLY victimized in some way by something. I wish they would focus on some of the ACTUAL problems, instead of going hysterical over stuff that is somewhere at the nth digit behind the decimal point in causing actual issues.
 
Upvote
0 (3 / -3)

JoHBE

Ars Tribunus Militum
2,565
Subscriptor++
I don’t disagree with any of that. I was addressing the very real possibility that a model completely and utterly devoid of training CSAM could still output pedo images. The notion of a tainted model is statistically dubious, as you say, but the fact remains that the model is utilizing images of abuse. Sure, it gets a bit into transitive culpability, but there is at least some connection to actual victims. It’s tenuous, but slightly more direct than, say, Dropbox’s.

There’s definitely room for argument about how serious it is. Personally, I think anybody training a model should make a good-faith best effort at filtering every bit of CSAM possible, and that the general public should be understanding that no effort is perfect.
No substantial disagreement here.

I just think that there's definitely such a thing as "trying too hard to identify harm". Something can be "not right" and "unfortunate" and "avoid next time" WITHOUT actually causing harm or justifying a call for removing it from the Internet.
 
Upvote
2 (2 / 0)