How AI Datasets ACTUALLY Work
I love the irony of tech bros shouting “YOU NEO-LUDDITES JUST DON’T KNOW HOW THE TECH WORKS” when they obviously don’t know how the AI datasets behind image generators work.
This is not true “artificial intelligence.” It doesn’t see images, form an understanding of them, then create something new. It’s not like a person looking at photos of frogs and then making a new painting of a frog.
That’s what tech bros seem to not get. True artificial intelligence WOULD be able to do that.
But the machine being fed real art is NOT artificially intelligent. What it does is take hundreds of images and break them up into teeny tiny pieces, like a 5000-piece jigsaw puzzle but on a near-pixel level. It also takes its understanding of those images from what people say those images are.
Puzzle makers often reuse the same die cut for multiple puzzles, which means you can take pieces from different puzzles and put them together into something new. I won’t link it here so this post reaches the most people, but look up:
Puzzle Montage Art by Tim Klein
[Images: examples of montage puzzle art]
What he did is EXACTLY what AI image generators do, except instead of using two or three artworks, one AI-generated image might use hundreds. And this is what those who actually understand the technology are trying to get across.
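To make the puzzle comparison concrete, here’s a tiny Python sketch of the montage trick itself. It only models the analogy (pieces swapped between two puzzles that share a cut). It is not code from any actual image generator, and the names `same_die_cut`, `montage`, and the toy grids are made up for illustration.

```python
# Toy sketch of the puzzle-montage analogy. Everything here is hypothetical:
# each "puzzle" is just a grid of labeled pieces, and two puzzles share a
# "die cut" when their grids have the same layout.

def same_die_cut(puzzle_a, puzzle_b):
    """Pieces are interchangeable only if both grids have the same layout."""
    return (len(puzzle_a) == len(puzzle_b)
            and all(len(ra) == len(rb) for ra, rb in zip(puzzle_a, puzzle_b)))


def montage(puzzle_a, puzzle_b, take_from_b):
    """Build a new grid, taking the piece at each position from A or B.

    take_from_b is a set of (row, col) positions to fill from puzzle B;
    every other position is filled from puzzle A.
    """
    if not same_die_cut(puzzle_a, puzzle_b):
        raise ValueError("pieces only swap between puzzles with the same cut")
    return [
        [puzzle_b[r][c] if (r, c) in take_from_b else puzzle_a[r][c]
         for c in range(len(puzzle_a[r]))]
        for r in range(len(puzzle_a))
    ]


# Two made-up 2x2 puzzles cut with the same die.
frog = [["frog-0", "frog-1"], ["frog-2", "frog-3"]]
train = [["train-0", "train-1"], ["train-2", "train-3"]]

mixed = montage(frog, train, take_from_b={(0, 1), (1, 0)})
print(mixed)  # [['frog-0', 'train-1'], ['train-2', 'frog-3']]
```

Every piece in `mixed` still comes straight from one of the two source puzzles, which is the whole point of the comparison.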
Right now, most things that exist are being fed into image datasets. The works in the datasets behind Midjourney and Stable Diffusion number in the literal billions. Datasets have stolen so much art that most people can’t fathom the statistic, because we’re just not capable of thinking in those kinds of numbers.
That’s why tech bros think an art generator just takes inspiration from the works it has.
In reality, the reason tech bros say it takes them hours to get a particular set of prompts “just right” is that they are educating the machine. Basically, if you give it a set of prompts and it gives you something you don’t want, it is internally assigning labels to the pieces of the puzzle it gave you. Eventually, when you DO get what you want, the same datasets that produced those images will scrape them again and assign labels to pieces based on what is perceived as “correct.”
This is also why it is impossible to remove images from a dataset. Any image used to create an AI work must also exist within the work itself. Removing any one particular image necessitates finding and removing any child data produced from that image, or else the machine can literally just re-scrape it.
Currently, because a dataset uses the data from hundreds of images to create a new work and does not compensate the original artists for the use of their art, this qualifies as theft under international copyright law.
I hope providing a real-world example of how datasets work is helpful.