How one YouTuber is trying to poison the AI bots stealing her content

Artemis-kun

Wise, Aged Ars Veteran
115
Huh. I remember when a fansub group once used this exact same method to force VLC to fix its subtitle support, which at the time was notoriously bad. I had a friend with a Mac for which only VLC was available to handle subs at the time, and playing files with poisoned subs would cause it to crash, while MPC using the community codec pack would play the files just fine.
 
Upvote
117 (117 / 0)

Thom Kidd

Ars Praetorian
487
Subscriptor++
While I adore this approach, personally, I can already hear the disingenuous pushback replies from LLM makers: "How dare content creators poison our learning models?!"

Well, maybe don't build your learning models on EVERYONE ELSE'S hard work and then treat it as your own?

"But... how will we beat competitors to market if we have to do the time-consuming initial legwork? Our sales director told the engineering folks to just stea... er, scrape everything on the internet."
 
Upvote
163 (183 / -20)

ropo153

Wise, Aged Ars Veteran
112
I came across F4mi's channel at some point last year, and it turned out to be one of my favorite YouTube discoveries in years. I find all of her content worth watching, even if I'm not interested in the topic.

Also, even though she's Gen Z, she likes to upload in 4:3 with VHS filters, which amuses this aging Millennial.
 
Upvote
91 (92 / -1)

"Obese Chess"

Smack-Fu Master, in training
3
Like every other so-called "AI poisoning", such as Glaze, or Nightshade, it doesn't work as intended, is easily circumvented, and makes things worse for actual humans.

Google / YouTube should be investing in ways to rank these garbage low-effort channels down into oblivion, or at least be as responsive to these complaints as they are to DMCA takedowns.
After a year of lurking this is the first comment that's made me roll my eyes so much I registered an account to comment.

Did we read the same article or watch the same video? Her channel is incredibly fun, creative, and high-effort, and her priority the entire time is to mess with AI data scrapers while maintaining accessibility for her human audience.

I'm not sure where you're getting the idea that this "makes things worse for actual humans" or that this is somehow "low-effort" "garbage" that should be ranked "down into oblivion" or handled the same way as a DMCA takedown request.
 
Upvote
182 (206 / -24)

David Mayer

Wise, Aged Ars Veteran
1,154
After a year of lurking this is the first comment that's made me roll my eyes so much I registered an account to comment.

Did we read the same article or watch the same video? Her channel is incredibly fun, creative, and high-effort, and her priority the entire time is to mess with AI data scrapers while maintaining accessibility for her human audience.

I'm not sure where you're getting the idea that this "makes things worse for actual humans" or that this is somehow "low-effort" "garbage" that should be ranked "down into oblivion" or handled the same way as a DMCA takedown request.
I think @Stamped_Fish is complaining about the same "low-effort" "garbage" that the video we are talking about was complaining about.
 
Upvote
55 (62 / -7)

AdamM

Ars Praefectus
5,804
Subscriptor
After a year of lurking this is the first comment that's made me roll my eyes so much I registered an account to comment.

Did we read the same article or watch the same video? Her channel is incredibly fun, creative, and high-effort, and her priority the entire time is to mess with AI data scrapers while maintaining accessibility for her human audience.

I'm not sure where you're getting the idea that this "makes things worse for actual humans" or that this is somehow "low-effort" "garbage" that should be ranked "down into oblivion" or handled the same way as a DMCA takedown request.

I interpreted OP’s comment to be referring to the video scrapers, not the AI poisoning channel.

As for making things worse for humans, from the article:

F4mi said she was able to get around this wrinkle by writing a Python script to hide her junk captions as black-on-black text, which can fill the screen whenever the scene fades to black. But in the video description, F4mi notes that "some people were having their phone crash due to the subtitles being too heavy," showing there is a bit of overhead cost to this kind of mischief.
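
The excerpt doesn't say how the script works or which subtitle format it targets. As a rough sketch only, assuming an ASS-style subtitle file (where `\c` and `\3c` override tags set the fill and outline colours in BGR hex), junk lines could be emitted for each fade-to-black interval like this. The function names, the `Default` style and the list of fade intervals are all hypothetical, not taken from her script:

```python
def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"


def hidden_caption(start: float, end: float, text: str) -> str:
    """Build one Dialogue line whose fill and outline are both black."""
    # Assumption: the frame behind the caption is black for the whole interval.
    tags = r"{\c&H000000&\3c&H000000&}"
    return (
        f"Dialogue: 0,{ass_time(start)},{ass_time(end)},"
        f"Default,,0,0,0,,{tags}{text}"
    )


def junk_for_fades(fades, junk_lines):
    """Pair each (start, end) fade-to-black interval with one junk line."""
    return [hidden_caption(s, e, t) for (s, e), t in zip(fades, junk_lines)]
```

Every extra Dialogue line still has to be parsed and rendered by the player, which fits the description note about phones struggling with subtitles that were "too heavy."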
 
Upvote
36 (41 / -5)

starglider

Ars Scholae Palatinae
999
Subscriptor++
I sympathize with content creators, although I selfishly hope that this doesn't catch on. One of my absolute favorite features of Kagi is the "summarize this youtube video" feature, which "reads" the titling. It's absolutely incredible for when you want a quick answer to something, and the best answer is buried in some 20 minute long YT video with 19.5 minutes of "sponsored content" and the actual answer consists of three words at index 17:31.

"What is the command to get the CPU temperature on my Raspberry Pi?"

"CPU temperatures are HOT HOT HOT these days! I'm always trying to find my CPU temperature, and every time I do, I see that it's really high! That's why I'm so delighted to be sponsored today by Coolermaster, the best coolers! Coolermaster is the best! [10 more minutes]. About a decade ago, I began my journey to finding out how to measure CPU temperatures. I once hiked up to K2's peak to see what CPU temperatures were at 8,000 meters! On the way, Black Diamond crampons were my go-to crampons, and I thank them for sponsoring today's video! You can check the temperature by typing in vcgen . . . [youtube ad interrupts]."
 
Upvote
57 (83 / -26)

Don Reba

Ars Tribunus Militum
2,980
Subscriptor++
I sympathize with content creators, although I selfishly hope that this doesn't catch on. One of my absolute favorite features of Kagi is the "summarize this youtube video" feature, which "reads" the titling. It's absolutely incredible for when you want a quick answer to something, and the best answer is buried in some 20 minute long YT video with 19.5 minutes of "sponsored content" and the actual answer consists of three words at index 17:31.
Learned something useful today. I think this will come in handy.
 
Upvote
18 (18 / 0)

robrob

Ars Tribunus Angusticlavius
7,632
Subscriptor
Is it too much to ask Google to engage meaningfully to stop the scraping?

/rhetoricquestion

Considering they can detect a small snippet of a song in the background and divert royalties based on that, you'd think it'd be pretty easy to detect a video that's based entirely off the transcript of another popular video.

But content creators would probably expect them to do outlandish things like take those videos down, which hurts advertising revenue for Google. So that's not going to happen.
 
Upvote
79 (80 / -1)

chaos215bar2

Ars Tribunus Militum
2,266
Subscriptor++
After a year of lurking this is the first comment that's made me roll my eyes so much I registered an account to comment.

Did we read the same article or watch the same video? Her channel is incredibly fun, creative, and high-effort, and her priority the entire time is to mess with AI data scrapers while maintaining accessibility for her human audience.

I'm not sure where you're getting the idea that this "makes things worse for actual humans" or that this is somehow "low-effort" "garbage" that should be ranked "down into oblivion" or handled the same way as a DMCA takedown request.
Uh, I think you and a lot of other people completely misread that comment. I'm pretty sure the "low effort" channels are the ones using AI.

The main point also stands: This technique, like other "poisoning" attacks, is ultimately easily defeated by AI companies. In this case, just ignore captions that are out of bounds or invisible.

While the point about interfering with human viewers is a bit more subtle, this is also true, for instance, if someone were to be translating the subtitles. If you're editing subtitles directly as a human, there's a good chance your tool will not filter out the ones intended to poison AIs, since most people don't put invisible nonsense in their subtitles.

This may be an unpleasant truth, but that's the reality. The only effective protections here would be legal, and currently do not exist. If something is readable to a human, it's readable by AI with relatively little effort needed to work around obfuscations. Just like any DRM-adjacent technique.
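
To make "just ignore captions that are out of bounds or invisible" concrete, here is a minimal sketch of such a filter. It assumes ASS-style `Dialogue:` lines, a 1920x1080 canvas, and that black or fully transparent text is hidden; none of that comes from the article, and the helper names are made up for illustration. Treating every black caption as hidden is a crude heuristic that would also drop legitimate black text on a light background.

```python
import re

# Override tags of interest inside {...} blocks.
COLOR_RE = re.compile(r"\\1?c&H([0-9A-Fa-f]{6})&")   # primary (fill) colour, BGR
ALPHA_RE = re.compile(r"\\alpha&H([0-9A-Fa-f]{2})&")  # FF means fully transparent
POS_RE = re.compile(
    r"\\pos\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)
TAG_BLOCK_RE = re.compile(r"\{[^}]*\}")


def is_suspicious(dialogue_line, width=1920, height=1080):
    """Return True if the caption looks hidden or is placed off-screen."""
    # The text is the 10th comma-separated field and may itself contain commas.
    text = dialogue_line.split(",", 9)[9]
    tags = "".join(TAG_BLOCK_RE.findall(text))

    color = COLOR_RE.search(tags)
    if color and color.group(1).upper() == "000000":
        return True

    alpha = ALPHA_RE.search(tags)
    if alpha and alpha.group(1).upper() == "FF":
        return True

    pos = POS_RE.search(tags)
    if pos:
        x, y = float(pos.group(1)), float(pos.group(2))
        if not (0 <= x <= width and 0 <= y <= height):
            return True
    return False


def visible_text(lines):
    """Keep the plain text of Dialogue lines that pass the checks."""
    kept = []
    for line in lines:
        if not line.startswith("Dialogue:") or is_suspicious(line):
            continue
        kept.append(TAG_BLOCK_RE.sub("", line.split(",", 9)[9]))
    return kept
```

A scraper that only reads YouTube's plain-text transcript would never see these tags, so a filter like this only applies where the styled subtitle file itself is available.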
 
Upvote
24 (30 / -6)

jjpaq

Smack-Fu Master, in training
23
I wonder if you could use a similar technique against OpenAI's Whisper using ultrasound.
I know nothing about Whisper but a fair bit about audio. A band pass filter should take care of that in less than a minute—not to mention that, depending on the encoding of the input audio files themselves, inaudible frequencies may not be preserved to begin with.

On the other hand, if it's possible to add inaudible perturbation within audible frequency ranges (like glazing photos but for audio) that makes them useless as training data, that might be a viable approach.
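
The filtering idea above can be sketched in a few lines. This is a plain low-pass filter rather than a true band-pass (ultrasound sits above the speech band, so cutting the top is the relevant half), written as a windowed-sinc FIR in pure Python. The 16 kHz cutoff, 48 kHz sample rate and 101 taps are illustrative assumptions, and nothing here reflects how Whisper actually preprocesses audio.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Windowed-sinc low-pass FIR coefficients using a Hamming window."""
    fc = cutoff_hz / sample_rate_hz  # normalized cutoff, cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def fir_filter(samples, taps):
    """Direct convolution; output has the same length as the input."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * samples[i - k]
        out.append(acc)
    return out


def tone(freq_hz, sample_rate_hz, n):
    """Unit-amplitude sine test tone."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


taps = lowpass_fir(cutoff_hz=16_000, sample_rate_hz=48_000)
audible = fir_filter(tone(1_000, 48_000, 2_000), taps)
ultrasonic = fir_filter(tone(21_000, 48_000, 2_000), taps)
```

Comparing `rms(audible[200:])` with `rms(ultrasonic[200:])` (skipping the filter's start-up samples) shows the 1 kHz tone passing essentially untouched while the 21 kHz tone is strongly attenuated.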
 
Upvote
38 (38 / 0)

jjpaq

Smack-Fu Master, in training
23
So why are we inventing new stupid computer problems for plucky humans to overcome, rather than just ... not doing that?
Couple of reasons, I guess. For this creator it's probably about solving an interesting problem while protecting her own content, which is a perfectly understandable motivation, even if in the long run the problem gets solved.

It's also part of a larger trend of projects aiming to prevent indiscriminate, unlicensed scraping of data from the Web, which could contribute to a future where it's easier and cheaper to properly license training data than to steal it.

Short answer, it's worth a try.
 
Upvote
41 (41 / 0)
This is a topic where I have some expertise! My friend and I as a hobby have been making a somewhat successful product that basically lets people pull out pros and cons from YouTube videos and jump right to the exact timestamp mentioned and does some other stuff too (but also presents the video and makes it the center of attention!).

I’m not going to plug it here because it’s more of a beer money type thing.

The relevance to this topic is that we had noticed more and more of these slop videos (with surprisingly high views and subs) that were cluttering our product search and creating a terrible experience. So we built an algorithm that detects these videos with ~95% accuracy. As a result, I now have a great set of statistics about how many videos per category are AI slop.

For general product reviews, it is around 40% AI content. It used to be that the YouTube UI did a better job of filtering these out than the official API, but that is no longer the case.

The worst by far is hospitality though. Literally >90% of the videos are slop. Try it yourself by looking up a nice tropical getaway at your favorite destination.

The latest version of this grift that uses AI generated video will be very difficult to detect, and I don’t know that I care enough to fight that battle, so it’s fortunate that it is relatively expensive to generate.

It’s been a very fun project while it lasted.
 
Upvote
42 (42 / 0)

DrewW

Ars Scholae Palatinae
1,467
Subscriptor++
Google / YouTube should be investing in ways to rank these garbage low-effort channels down into oblivion, or at least be as responsive to these complaints as they are to DMCA takedowns.
YouTube doesn’t care about the amount of effort, it’s all about engagement and time on site. People watch the slop, so YouTube propagates it. People also watch high quality videos, and YouTube recommends them.

Most likely people will get bored of the slop and it will fade like ASMR videos, mukbangs, creator houses, and putting rubber bands on watermelons until they explode.
 
Upvote
31 (31 / 0)

tcowher

Ars Tribunus Militum
1,753
From what I hear from the legitimate creators I watch, it seems that it is too easy to get your channel banned from iffy DMCA notices, or demonetized. Why aren't some of these creators forming an LLC or S corp to hire a lawyer to file DMCA notices against these infringers?

As a side comment, what's to stop some country like east bumfarkistan :ROFLMAO: from changing their laws to allow training on publicly accessible data, with AI companies then moving their training operations to that country, sort of how some companies dodge taxes by having subsidiaries in certain tax-friendly countries?
 
Upvote
-4 (10 / -14)