April 23rd, 2009
Testing YouTube's Audio Content ID System
Commentary by Fred von Lohmann
An enterprising YouTube user has completed a fascinating set of tests to figure out how sensitive the audio fingerprinting tools are in YouTube's Content ID system. (This is the system being used by Warner Music Group to do wholesale censorship of music, including clear fair uses, on YouTube.) After uploading 82 videos that include altered versions of The Waitresses' hit, "I Know What Boys Like," the experimenter comes up with a number of interesting conclusions:
It's everywhere: It scans every newly uploaded video, regardless of whether its title or description looks suspicious, and it generally flags matches mere minutes after the upload completes. Videos uploaded before the system was installed aren't immune either; it appears to be working through every video ever uploaded to the site, looking for copyright problems. That sounds ludicrous, but remember that YouTube is backed by Google, and Google has plenty of hardware to throw around. I have no doubt they'll eventually trudge through every single video, if they haven't already finished. I wonder how much CPU time (and electricity) they squandered on this?
It's surprisingly resilient: I really thought it would fail some of the amplification tests, especially the +/-48 dB ones. One was so quiet as to be inaudible, and the other was so distorted it was completely unlistenable; it found both. It could also detect the song amid constant background noise until the noise level passed the 45% mark, at which point the noise overpowers the song you're trying to hide. Likewise, it catches subtle changes in pitch and tempo, requiring shifts of roughly 5% before it consistently fails to identify the material.
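To get a feel for how extreme a +/-48 dB change is, here is a minimal sketch of what such a gain does to a signal. The function names, test tone, and parameters are my own illustration; the experimenter's actual tooling isn't published.

```python
import numpy as np

RATE = 44100
t = np.arange(RATE) / RATE                   # one second of audio
tone = 0.5 * np.sin(2 * np.pi * 440 * t)     # 440 Hz stand-in "song"

def apply_gain_db(samples, db):
    """Scale samples by a gain in decibels, clipping to the [-1, 1]
    range the way a fixed-point encoder would."""
    factor = 10 ** (db / 20.0)
    return np.clip(samples * factor, -1.0, 1.0)

quiet = apply_gain_db(tone, -48)   # scaled by ~1/251: peak around 0.002
loud = apply_gain_db(tone, +48)    # hard-clipped into a near-square wave
```

A -48 dB cut divides the amplitude by about 251, leaving a signal buried near the noise floor; a +48 dB boost clips almost every sample into square-wave distortion. That the fingerprinter caught both says a lot about its tolerance for level changes.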
It's rather finicky: I can't explain why it was able to detect the camcorder-recorded audio at 5' and 31', but not at 12'. Similarly, the vocal removal and vocal isolation tests should have produced similar results. Then again, the effectiveness of the Stereo Imagery tests depends entirely on how the song itself was engineered -- just because it turned out one way for this song doesn't mean other songs will react the same way to the same modification.
It's downright dumb: Wrap your head around this: when I muted the song from the beginning up to 0:30 (leaving the rest to play), the fingerprinter missed it. When I kept the beginning up to 0:30 and muted everything after, the fingerprinter caught it. That indicates the content database only knows about something in the first 30 seconds of the song. As long as you cut that part off, you can theoretically use the remainder of the song without being detected. I don't know whether all the samples in the content database suffer from similar weaknesses, but it's something that merits further research.
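The two muting variants above are easy to sketch. This is my own minimal reconstruction, assuming a sample buffer in memory; the random array is just a stand-in for the song.

```python
import numpy as np

RATE = 44100
# Stand-in for a 60-second song (random noise, for illustration only).
song = np.random.default_rng(0).uniform(-1, 1, RATE * 60)

def mute_before(samples, seconds, rate=RATE):
    """Zero out everything up to `seconds` (the variant that was missed)."""
    out = samples.copy()
    out[: int(seconds * rate)] = 0.0
    return out

def mute_after(samples, seconds, rate=RATE):
    """Zero out everything after `seconds` (the variant that was caught)."""
    out = samples.copy()
    out[int(seconds * rate):] = 0.0
    return out

missed = mute_before(song, 30)  # intro silenced -> no match reported
caught = mute_after(song, 30)   # intro intact -> match reported
```

If the reference fingerprint really only covers the first 30 seconds, then `missed` contains none of the fingerprinted material while `caught` contains all of it, which would explain the asymmetric results.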
It seems to hear in mono: When I uploaded the files with out-of-phase audio, the tests consistently passed (i.e., went undetected). Played back in mono, the first out-of-phase test sounds exactly like the Vocal Remove test (which also passed). When the mono-converted/out-of-phase test is played back in mono, the two channels cancel each other out and the result is (theoretically) silence. This is what the fingerprinter hears, and what it bases its conclusions on.
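The cancellation is simple arithmetic: if the right channel is an inverted copy of the left, a standard mono downmix of (L + R) / 2 sums to zero. A minimal sketch (the tone and names are my own illustration):

```python
import numpy as np

RATE = 44100
t = np.arange(RATE) / RATE
left = 0.5 * np.sin(2 * np.pi * 440 * t)  # stand-in for the song's audio
right = -left                             # phase-inverted copy

# In stereo, this sounds (roughly) like the original song.
# A standard mono downmix averages the channels:
mono = (left + right) / 2.0               # every sample cancels to zero
```

A stereo listener still hears the song, but anything that downmixes to mono before analysis, as the fingerprinter apparently does, is handed pure silence.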