I guess this is growing up?

AI trained on photos from kids’ entire childhood without their consent

Kids "easily traceable" from photos used to train AI models, advocates warn.

Ashley Belanger – Jun 10, 2024 10:37 PM

Photos of Brazilian kids—sometimes spanning their entire childhood—have been used without their consent to power AI tools, including popular image generators like Stable Diffusion, Human Rights Watch (HRW) warned on Monday.

This act poses urgent privacy risks to kids and seems to increase risks of non-consensual AI-generated images bearing their likenesses, HRW's report said.

An HRW researcher, Hye Jung Han, helped expose the problem. She analyzed "less than 0.0001 percent" of LAION-5B, a dataset built from Common Crawl snapshots of the public web. The dataset does not contain the actual photos but includes image-text pairs derived from 5.85 billion images and captions posted online since 2008.

Among those images linked in the dataset, Han found 170 photos of children from at least 10 Brazilian states. These were mostly family photos uploaded to personal and parenting blogs most Internet surfers wouldn't easily stumble upon, "as well as stills from YouTube videos with small view counts, seemingly uploaded to be shared with family and friends," Wired reported.

LAION, the German nonprofit that created the dataset, has worked with HRW to remove the links to the children's images in the dataset.

That may not completely resolve the problem, though. HRW's report warned that the removed links are "likely to be a significant undercount of the total amount of children’s personal data that exists in LAION-5B." Han told Wired that she fears that the dataset may still be referencing personal photos of kids "from all over the world."

Removing the links also does not remove the images from the public web, where they can still be referenced and used in other AI datasets, particularly those relying on Common Crawl, LAION's spokesperson, Nate Tyler, told Ars.

"This is a larger and very concerning issue, and as a nonprofit, volunteer organization, we will do our part to help," Tyler told Ars.

Han told Ars that "Common Crawl should stop scraping children’s personal data, given the privacy risks involved and the potential for new forms of misuse."

According to HRW's analysis, many of the Brazilian children's identities were "easily traceable," due to children's names and locations being included in image captions that were processed when building the LAION dataset.

And at a time when middle and high school-aged students are at greater risk of being targeted by bullies or bad actors turning "innocuous photos" into explicit imagery, it's possible that AI tools may be better equipped to generate AI clones of kids whose images are referenced in AI datasets, HRW suggested.

"The photos reviewed span the entirety of childhood," HRW's report said. "They capture intimate moments of babies being born into the gloved hands of doctors, young children blowing out candles on their birthday cake or dancing in their underwear at home, students giving a presentation at school, and teenagers posing for photos at their high school’s carnival."

There is less risk that the Brazilian kids' photos are currently powering AI tools since "all publicly available versions of LAION-5B were taken down" in December, Tyler told Ars. That decision came out of an "abundance of caution" after a Stanford University report "found links in the dataset pointing to illegal content on the public web," Tyler said, including 3,226 suspected instances of child sexual abuse material.

Han told Ars that "the version of the dataset that we examined pre-dates LAION’s temporary removal of its dataset in December 2023." The dataset will not be available again until LAION determines that all flagged illegal content has been removed.

"LAION is currently working with the Internet Watch Foundation, the Canadian Centre for Child Protection, Stanford, and Human Rights Watch to remove all known references to illegal content from LAION-5B," Tyler told Ars. "We are grateful for their support and hope to republish a revised LAION-5B soon."

In Brazil, "at least 85 girls" have reported classmates harassing them by using AI tools to "create sexually explicit deepfakes of the girls based on photos taken from their social media profiles," HRW reported. Once these explicit deepfakes are posted online, they can inflict "lasting harm," HRW warned, potentially remaining online for their entire lives.

“Children should not have to live in fear that their photos might be stolen and weaponized against them,” Han said. “The government should urgently adopt policies to protect children’s data from AI-fueled misuse.”

Ella Irwin, the SVP of Integrity for Stable Diffusion maker Stability AI, provided Ars with a statement confirming that "Stability AI models were trained on a filtered subset of the LAION-5B dataset. In addition, we subsequently fine-tuned these models to mitigate residual behaviours."

"Stability AI is committed to preventing the misuse of AI," Irwin said. "We prohibit the use of our image models and services for unlawful activity, including attempts to edit or create non-consensual content."

Safeguards to keep kids’ data away from AI

When LAION-5B was introduced in spring 2022, it was described as an attempt to replicate OpenAI's dataset and touted as "the largest freely available image-text dataset." With its release, AI researchers cut off from private companies' proprietary datasets had a way to experiment more freely with AI.

Around that time, LAION researchers released a paper that said that LAION anticipated some "potential problems arising from an unfiltered dataset" and "introduced an improved inappropriate content tagging" to make it easier to flag harmful content and update and improve the dataset.

Back when the dataset was publicly available, users were encouraged "to explore and, subsequently, report further not yet detected content and thus contribute to the improvement of our and other existing approaches," the paper said.

This is essentially what happened with HRW's report this week and is one reason why LAION sees its dataset as more transparent than other large AI datasets.

"In our opinion, this process is not supposed to be a non-transparent closed-door avenue," LAION's paper said. "It should be approached by [a] broad research community, resulting in open and transparent datasets and procedures for model training."

Other researchers could potentially help flag more URLs linking to real kids' images to keep improving the dataset off the back of HRW's research once the dataset is again publicly available.

When HRW contacted LAION about the images roughly a month ago, LAION told HRW that AI models trained on LAION-5B could not reproduce kids' personal data verbatim. But acknowledging other privacy and security risks, LAION began removing links to photos from the dataset while also advising that "children and their guardians were responsible for removing children’s personal photos from the Internet." That, LAION said, would be "the most effective protection against misuse."

Han told Wired that she disagreed, arguing that previously, most of the people in these photos enjoyed "a measure of privacy" because their photos were mostly "not possible to find online through a reverse image search." Likely the people posting never anticipated their rarely clicked family photos would one day, sometimes more than a decade later, become fuel for AI engines.

"Children and their parents shouldn't be made to shoulder responsibility for protecting kids against a technology that's fundamentally impossible to protect against," Han said. "It's not their fault."

Instead, "LAION should take action to prevent the ingestion of children’s personal data into its datasets, and it should also regularly scan for and remove children’s data," Han said.

And lawmakers should urgently intervene to protect children's privacy as AI technologies emerge and proliferate, HRW reported.

In Brazil, legal changes are expected as soon as July.

Last April, the National Council for the Rights of Children and Adolescents published a resolution directing the Ministry of Human Rights and Citizenship "to develop a national policy to protect the rights of children and adolescents in the digital environment within 90 days," HRW reported.

Through that initiative, the children's rights body said that, among other provisions, the policy should specifically cover AI, protect against harassment, establish a right to privacy, and only allow processing of kids' personal data when consent is freely given "in advance" of data collection. Seemingly, that means that soon children could get the right to revoke consent for any AI training on their data, should those provisions be upheld.

In the US, laws have been introduced in Congress to narrowly prevent the spread of non-consensual explicit deepfakes—including those that target children and adults—through the DEFIANCE Act and the “Preventing Deepfakes of Intimate Images Act.” But in Brazil, HRW is advocating for lawmakers to go further and specifically cut kids' personal data entirely out of AI systems.

Brazil's "new policy should prohibit scraping children’s personal data into AI systems, given the privacy risks involved and the potential for new forms of misuse as the technology evolves," HRW recommended. "It should also prohibit the nonconsensual digital replication or manipulation of children’s likenesses. And it should provide children who experience harm with mechanisms to seek meaningful justice and remedy." And Brazil's General Personal Data Protection Law should be updated to adopt "additional, comprehensive safeguards for children’s data privacy," HRW said.

Ars could not immediately reach the Ministry for comment.

This story was updated on June 11 to add comments from HRW researcher Hye Jung Han.

Listing image: RicardoImagen | E+

Ashley Belanger Senior Policy Reporter

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.
