Photos of Brazilian kids—sometimes spanning their entire childhood—have been used without their consent to power AI tools, including popular image generators like Stable Diffusion, Human Rights Watch (HRW) warned on Monday.
The practice poses urgent privacy risks to kids and appears to increase the risk of non-consensual AI-generated images bearing their likenesses, HRW's report said.
An HRW researcher, Hye Jung Han, helped expose the problem. She analyzed "less than 0.0001 percent" of LAION-5B, a dataset built from Common Crawl snapshots of the public web. The dataset does not contain the actual photos but includes image-text pairs derived from 5.85 billion images and captions posted online since 2008.
Among those images linked in the dataset, Han found 170 photos of children from at least 10 Brazilian states. These were mostly family photos uploaded to personal and parenting blogs most Internet surfers wouldn't easily stumble upon, "as well as stills from YouTube videos with small view counts, seemingly uploaded to be shared with family and friends," Wired reported.
LAION, the German nonprofit that created the dataset, has worked with HRW to remove the links to the children's images in the dataset.
That may not completely resolve the problem, though. HRW's report warned that the removed links are "likely to be a significant undercount of the total amount of children’s personal data that exists in LAION-5B." Han told Wired she fears the dataset may still be referencing personal photos of kids "from all over the world."
Removing the links also does not remove the images from the public web, where they can still be referenced and used in other AI datasets, particularly those relying on Common Crawl, LAION's spokesperson, Nate Tyler, told Ars.