AI image training dataset found to include child sexual abuse imagery

misk@sopuli.xyz · 11 months ago

Communist@lemmy.ml · 11 months ago

How could this even happen by accident?

kromem@lemmy.world · 11 months ago

Because it has five billion images?

The images potentially at issue make up less than one percent of one percent of one percent of the total, i.e. under 0.0001%, or fewer than about 5,000 out of five billion.

Communist@lemmy.ml · 11 months ago

Don’t they need to label the data?

kromem@lemmy.world · 11 months ago

No, it’s not manually labeled. Each image is paired with text taken from things like its alt text or the comment next to it in a social media post. The pairs were then run through a different AI (CLIP), which rated how well the text description matched the image, and the ones with a low score were filtered out.

The point of the OP research is that they should add another step that checks images against CSAM databases, rather than relying on social media curation to have already kept illegal material out (which they should, even though it’s a very, very small portion of the overall dataset).

But at no time was a human reviewing CSAM, labeling it, and including it in the data.
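
To make the pipeline above concrete, here is a rough sketch of the two automated steps: a text-image similarity cutoff and a blocklist lookup. It is only an illustration under stated assumptions, not the dataset's actual code. `Candidate`, `filter_candidates`, `min_score` and `blocked_hashes` are made-up names, the embeddings are assumed to come precomputed from a CLIP-style model, and SHA-256 stands in for the perceptual hashes that real hash-matching services typically use (an exact hash would miss resized or re-encoded copies).

```python
import hashlib
import math
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Set


@dataclass
class Candidate:
    """One scraped image-text pair (hypothetical structure)."""
    url: str
    caption: str                       # alt text or nearby comment
    image_bytes: bytes
    image_embedding: Sequence[float]   # assumed precomputed by a CLIP-style model
    text_embedding: Sequence[float]    # assumed precomputed by the same model


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity score used to rate how well the caption matches the image."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_candidates(
    candidates: Iterable[Candidate],
    min_score: float,
    blocked_hashes: Set[str],
) -> List[Candidate]:
    """Keep pairs that are not on the blocklist and whose caption fits the image."""
    kept = []
    for c in candidates:
        # Extra step argued for in the research: check a known-bad hash list.
        # SHA-256 is a stand-in here; it only catches byte-identical files.
        if hashlib.sha256(c.image_bytes).hexdigest() in blocked_hashes:
            continue
        # Existing step: drop pairs whose text and image embeddings disagree.
        if cosine_similarity(c.image_embedding, c.text_embedding) < min_score:
            continue
        kept.append(c)
    return kept
```

Note that no step here involves a person looking at an image: both checks are purely automated, which is how material can slip through when the blocklist step is missing.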

sir_reginald@lemmy.world · edited 11 months ago

Removing these images from the open web has been a headache for years for webmasters and admins of sites that host user-uploaded images.

If the millions of images in the training data were automatically scraped from the internet, I don’t find it surprising that there was CSAM in there.

Communist@lemmy.ml · 11 months ago

Don’t they need to label the data?

Alex@feddit.ro · 11 months ago

Not manually