> Removing this small subset does not significantly impact the large scale of the dataset, while restoring its usability as a reference dataset for research purposes.
> I'm worried that this is just rearranging deck chairs on the Titanic.
> It is not exactly challenging for an AI model trained on perfectly legal images to infer what CSAM might look like.

Not to mention it's become quite a bit simpler to do LoRA training with whatever images you want. It doesn't even take much for a model like Flux.
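For context on why the bar is low: the LoRA idea itself is tiny. Below is a minimal sketch in plain PyTorch, not tied to Flux or any particular training script, and with purely illustrative rank/alpha values: the base weights are frozen and only two small low-rank matrices are trained, which is why fine-tuning on a small custom image set is cheap.

```python
# Minimal sketch of the LoRA idea in plain PyTorch (no specific trainer or
# model assumed): the frozen base weight W is augmented with a low-rank
# update B @ A, and only A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the original weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")
```

With rank 8 on a 1024x1024 linear layer, that is roughly 16k trainable parameters against about a million frozen ones.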
> I'm worried that this is just rearranging deck chairs on the Titanic.
> It is not exactly challenging for an AI model trained on perfectly legal images to infer what CSAM might look like.

That's what I'm wondering as well.
> After backlash, LAION cleans child sex abuse materials from AI training data

I mean, duh!!
> I'm worried that this is just rearranging deck chairs on the Titanic.
> It is not exactly challenging for an AI model trained on perfectly legal images to infer what CSAM might look like.

If you describe, in detail, in your prompt what you'd like to see, how is that the problem of the LLM?

I fully agree that the dataset cannot have illegal content, or content used without permission (a discussion unto itself). But if people use it for illegal content, then they, and not the tool, are most responsible.

I do think that for online tools it should be the website that is not allowed to generate illegal content, regardless of the LLM training set.
> With many sites upset that their data is being scraped for AI use, can't help but wonder how many sites have deliberately poisoned their sites with 'interesting' data. Should be fairly easy to include stuff on a site that a normal user could never access but a scraping bot might find if it is ignoring the robots file.

There was some work that indicated that poisoning sites doesn't work.
> There was some work that indicated that poisoning sites doesn't work. However, some indications were that actual AI output on a site ... poisoned them fairly effectively.

I'm deeply curious about that. Do you have links to that research handy?
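For what it's worth, the "stuff a normal user could never access" setup from the quoted comment is simple to sketch. The following is a hypothetical Flask example (route names and behavior are made up for illustration): the path is disallowed in robots.txt and never linked for humans, so only a crawler that ignores robots.txt ever requests it.

```python
# Hypothetical sketch of the idea in the quoted comment: a path that is
# disallowed in robots.txt and never linked for human visitors, so only a
# crawler that ignores robots.txt ever requests it.
from flask import Flask, Response

app = Flask(__name__)

ROBOTS_TXT = """User-agent: *
Disallow: /crawler-trap/
"""

@app.route("/robots.txt")
def robots():
    return Response(ROBOTS_TXT, mimetype="text/plain")

@app.route("/crawler-trap/<path:page>")
def trap(page):
    # A well-behaved crawler never reaches this handler; anything that does
    # can be logged, rate-limited, or served deliberately low-value decoy text.
    app.logger.warning("robots.txt ignored for /crawler-trap/%s", page)
    return "Decoy content for non-compliant crawlers.", 200

if __name__ == "__main__":
    app.run()
```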
> I'll bet this makes no difference in the end.
> I can type unrelated things you would never see together in the real world and these image synthesis models just figure out how to put those things into an image together. What's to stop the models from figuring out how to combine Child and [redacted] if they can combine two of practically anything else?
> The only way to stop CSAM content from being generated is to detect it when it is being generated and stop the output. But that will only work for server-side generation. It would be trivial for someone with a bit of programming knowledge to remove the detection from open-source models and use those on an air-gapped computer at home with a decent GPU.
> This is not a problem which can be solved now that AI image generation has been released into the wild.

I don't believe that's strictly true. Diffusion models are shockingly resistant to creating things that don't appear fairly frequently in their data set. And the more fine-tuned the models are to producing realistic (anatomically correct) output, the harder it gets. Just ask anyone who has tried to create centaurs, cyclopes, and six-armed Kalis. Typical diffusion models just won't do it unless you fine-tune these anatomically "aberrant" concepts back in.

Edit: But I agree that the proverbial cat escaped quite some time ago.

Edit again: And very young-looking but anatomically normal people engaged in positions that older people are often engaged in isn't much of a stretch of the diffusion models' "imagination". So, yes, CSAM, probably easy. "Anything you can think of"... much harder.
> With many sites upset that their data is being scraped for AI use, can't help but wonder how many sites have deliberately poisoned their sites with 'interesting' data. Should be fairly easy to include stuff on a site that a normal user could never access but a scraping bot might find if it is ignoring the robots file.

Many sites do something like that to the images, but if you're thinking of deliberately injecting CSAM into AI training sets, that would be massively illegal. The closest equivalent to that I can think of is when some South Korean ISP deliberately planted malware on torrents going over its network:
> Is there any testing that shows differences between before and after and how this content actually would impact the output?
> Good idea to remove such content from the dataset, but it's not clear if there's a real tangible outcome from it being IN there or OUT.
> Can't say you fixed the problem / monitor it if you can't measure if ... you stopped anything.

The downvoting of this comment reflects a belief that it missed the point?
That's what I'm wondering as well.
What might lead AI to produce an undesirable outcome ... might not be directly similar data; it might be data "like" it, or data that we don't think of as exactly the same. So if you remove CSAM, you might still not prevent the AI from doing a bad thing if other data contributed to it.
> While I generally think that people like Thiel cross the line into witch-hunting and are glad to find something that generates notoriety, why did it take a backlash for the LAION team to actually clean up their dataset?

Primarily that LAION is actually just metadata of crawled images on the Internet. This is the first question of their FAQ:

> Q: Does LAION datasets respect copyright laws?
> A: LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images. While we downloaded and calculated CLIP embeddings of the pictures to compute similarity scores between pictures and texts, we subsequently discarded all the photos. Any researcher using the datasets must reconstruct the images data by downloading the subset they are interested in. For this purpose, we suggest the img2dataset tool.

These are the dataset columns:

- URL: the image url
- TEXT: captions, in english for en, other languages for multi and nolang
- WIDTH: picture width
- HEIGHT: picture height
- LANGUAGE: the language of the sample
- similarity: cosine between text and image ViT-B/32 embeddings
- pwatermark: probability of being a watermarked image
- punsafe: probability of being an unsafe image

There's not even a hash here that could be used for later comparison. All the later discoveries were by scientists who followed these URLs, downloaded the images again, and found that a few of them were not only unsafe but actually illegal. But that kind of in-depth checking was far beyond the scope of the initial project, which was simply to provide an open alternative to all the proprietary internal datasets held by large companies.
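For anyone curious what working with this metadata looks like in practice, here is a rough sketch of pre-filtering a metadata shard on the columns listed above before downloading anything. The parquet filename and thresholds are made up rather than LAION's recommendations, and the filtered URL/TEXT list would then be handed to a downloader such as img2dataset, as the FAQ suggests.

```python
# Sketch: filter a LAION metadata shard on the columns listed above before
# downloading anything. The parquet filename is hypothetical; thresholds are
# illustrative, not LAION's official recommendations.
import pandas as pd

df = pd.read_parquet("laion_shard_0000.parquet")

subset = df[
    (df["punsafe"] < 0.1)        # drop likely-unsafe images
    & (df["pwatermark"] < 0.8)   # drop likely-watermarked images
    & (df["similarity"] > 0.3)   # keep reasonably well-matched caption/image pairs
]

# The remaining rows (URL + TEXT) can then be handed to a downloader such as
# the img2dataset tool mentioned in the FAQ.
subset[["URL", "TEXT"]].to_parquet("filtered_shard.parquet", index=False)
print(f"kept {len(subset)} of {len(df)} rows")
```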
> So this company possessed and used CSAM? Why aren’t they under arrest?

Super clueless comment but just in case: there is no evidence and no allegations that the content was collected, stored, or identified because it was CSAM. Crimes require intent. This is no different than if you put CSAM on a web server that Google crawled: there isn’t a person in the world that would argue for prosecuting Google rather than the person hosting the content.
> Anyone making an AI model should be responsible for every single image they use in their training.
> You don't know if there is CSAM in your training data? Well, you should.
> You don't know if you are using copyrighted material? Figure it out. Don't have permission? Ask, and pay.
> Why is doing things at a small scale illegal, but when these organizations do it at a massive scale it's suddenly OK?

Interesting take. How did you validate that your comment here is not accidentally copyright infringing?
> I'll bet this makes no difference in the end.
> I can type unrelated things you would never see together in the real world and these image synthesis models just figure out how to put those things into an image together. What's to stop the models from figuring out how to combine Child and [redacted] if they can combine two of practically anything else?
> The only way to stop CSAM content from being generated is to detect it when it is being generated and stop the output. But that will only work for server-side generation. It would be trivial for someone with a bit of programming knowledge to remove the detection from open-source models and use those on an air-gapped computer at home with a decent GPU.
> This is not a problem which can be solved now that AI image generation has been released into the wild.

So much this.
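A sketch of the server-side check the quoted comment describes: score every generated image with a safety classifier before returning it. `generate_image` and `safety_model` are placeholders here, not any specific library's API.

```python
# Sketch of the server-side check described above: run a safety classifier on
# each generated image before it leaves the server. `generate_image` and
# `safety_model` are placeholders, not any specific library's API.
from PIL import Image

UNSAFE_THRESHOLD = 0.5  # assumed operating point for the hypothetical classifier

def moderated_generate(prompt: str, generate_image, safety_model) -> Image.Image:
    image = generate_image(prompt)                 # e.g. a diffusion pipeline call
    unsafe_score = safety_model(image)             # assumed to return a probability in [0, 1]
    if unsafe_score >= UNSAFE_THRESHOLD:
        raise ValueError("Generation blocked by the server-side safety filter.")
    return image
```

As the comment itself notes, this only helps for hosted generation; someone running an open model locally can simply delete the check.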
> That seems categorically different to me. If the model has CSAM in the training data, then any CSAM it produces is essentially revictimizing the original kids in the training CSAM. It’s a sort of fruit-of-the-poisonous-tree argument, and there’s some merit to it. If there is no genuine CSAM in the training corpus, then any noncey stuff it spits out strikes me as essentially the same as something like lolicon hentai: gross, but not harming any actual kids.

I can see the "philosophical" problem with this.

But the PRACTICAL situation more than likely is that any of the output of a "tainted" model like that is totally impossible to meaningfully associate directly with the pictures/people. Any sense of "victimization" ENTIRELY depends on even being aware that the source material was in there, because the output alone is just remixed into genericness.

Look at it like this: let's say you have conclusive evidence that some pictures of a particular CSAM victim you know got into the dataset. In practice, it's just a drop in the ocean that will never leave telltale recognizable signals in the output of the model. What's going to victimize the CSAM victim MORE: keeping the information to yourself, or following some (I would argue seriously misdirected) noble "right to know" principle and rubbing in their noses that "you should know the model was also trained on your abuse pics"?

The victimization is boosted by the knowledge, more than by the situation itself.
> I can see the "philosophical" problem with this.
> But the PRACTICAL situation more than likely is that any of the output of a "tainted" model like that is totally impossible to meaningfully associate directly with the pictures/people. Any sense of "victimization" ENTIRELY depends on even being aware that the source material was in there, because the output alone is just remixed into genericness.
> Look at it like this: let's say you have conclusive evidence that some pictures of a particular CSAM victim you know got into the dataset. In practice, it's just a drop in the ocean that will never leave telltale recognizable signals in the output of the model. What's going to victimize the CSAM victim MORE: keeping the information to yourself, or following some (I would argue seriously misdirected) noble "right to know" principle and rubbing in their noses that "you should know the model was also trained on your abuse pics"?
> The victimization is boosted by the knowledge, more than by the situation itself.

True. But one could still scrub the training data of the known CSAM without informing the victims portrayed in it, or even without telling anyone external to the organization. I would think that would be the best-of-both-worlds situation, yes?
> Anyone making an AI model should be responsible for every single image they use in their training.

We're talking billions of images that are scraped, which is what's required. The best one can do is use image hashes, and there isn't a guarantee even then. The hashes are constantly updated.
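For reference, hash-based screening of the kind mentioned above usually means perceptual hashes rather than exact hashes, so near-duplicates still match. A rough sketch using the imagehash library follows; the blocklist and distance threshold are placeholders, and real systems rely on curated hash databases.

```python
# Sketch of hash-based screening as described above, using perceptual hashes
# (the imagehash library). KNOWN_BAD_HASHES is a hypothetical placeholder for
# a vetted hash list; real pipelines use curated databases and stricter checks.
from PIL import Image
import imagehash

KNOWN_BAD_HASHES = set()           # would be loaded from a vetted hash list
MAX_HAMMING_DISTANCE = 4           # assumed tolerance for near-duplicates

def is_flagged(path: str) -> bool:
    candidate = imagehash.phash(Image.open(path))
    return any(candidate - bad <= MAX_HAMMING_DISTANCE for bad in KNOWN_BAD_HASHES)

print(is_flagged("example.jpg"))   # False with an empty blocklist
```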
"Such proof would entail a generated sample that is similar to any of known CSAM content"
> If you describe, in detail, in your prompt what you'd like to see, how is that the problem of the LLM?
> I fully agree that the dataset cannot have illegal content, or content used without permission (a discussion unto itself). But if people use it for illegal content, then they, and not the tool, are most responsible.
> I do think that for online tools it should be the website that is not allowed to generate illegal content, regardless of the LLM training set.

And this is why "AI" is a misnomer; it's an image-generating neural network. If it were "intelligent" you could set up barriers, like not generating CSAM, and it could understand from the prompt what you expect and deny it, no matter how you word it. Imagine talking to your secretary: "I need pictures of Spiderman!" "Yes sir!" vs. "I need images of Peter Parker, middle school student, changing into spandex." "Sir, I have called the FBI, they are here."
> Sounds like they are doing it wrong. They need to actually train a "bad" LLM with CSAM and have it learn how to tell CSAM from non-CSAM, then use that as a filter before pulling the linked images into the "good" LLM's training set. If the filtering AI tags the image, it doesn't go in without a review by a real human (plus the added benefit that the appropriate legal authorities can be brought into the loop to have it taken down and the perpetrator punished).

Oh man, you want to create the true monster? The only reason it does not exist yet is the cost of making a new model, but each day we edge closer to the point where a group of criminals could afford it, and when they do, I don't even know what could be done after that gets made and released into the wild... Close the old internet and start anew, probably.
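Mechanically, the filter-before-ingest pipeline the quoted comment proposes is straightforward to outline; the hard (and legally fraught) part is the classifier itself. A sketch with a placeholder scoring function:

```python
# Sketch of the filter-before-training pipeline proposed in the quoted comment:
# a classifier screens every candidate image, and anything it flags is routed
# to human review / reporting instead of the training set. `classifier` is a
# placeholder scoring function, not a real model.
from typing import Callable, Iterable, List, Tuple

FLAG_THRESHOLD = 0.5  # assumed operating point

def split_candidates(
    image_paths: Iterable[str],
    classifier: Callable[[str], float],
) -> Tuple[List[str], List[str]]:
    accepted, needs_review = [], []
    for path in image_paths:
        if classifier(path) >= FLAG_THRESHOLD:
            needs_review.append(path)   # human review + report to authorities
        else:
            accepted.append(path)       # eligible for the training set
    return accepted, needs_review
```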
> True. But one could still scrub the training data of the known CSAM without informing the victims portrayed in it, or even without telling anyone external to the organization. I would think that would be the best-of-both-worlds situation, yes?

Well, yes, obviously.
> I don’t disagree with any of that. I was addressing the very real possibility that a model completely and utterly devoid of training CSAM could still output pedo images. The notion of a tainted model is statistically dubious, as you say, but the fact remains that the model is utilizing images of abuse. Sure, it gets a bit into transitive culpability, but there is at least some connection to actual victims. It’s tenuous, but slightly more direct than, say, Dropbox’s.
> There’s definitely room for argument about how serious it is. Personally, I think anybody training a model should make a good-faith best effort at filtering out every bit of CSAM possible, and the general public should understand that no effort is perfect.

No substantial disagreement here.