

Art and the Machine
Copyright, fair use, and the relationship between generative AI's inputs and outputs
In a previous post about the WGA strike and AI, an important question came up, a question whose answer will have a very direct effect on the future of generative AI: what constitutes a copy?
As mentioned in that post, there are lawsuits currently working their way through the courts against some of the companies behind generative AI systems, including Stable Diffusion and Midjourney for image generation and GPT-4 and Llama 2 for text generation. The lawsuits accuse the companies that developed these systems of illegally using copyrighted material as well as illegally copying it.
While those two accusations are similar, they are subtly different. In this post I’ll discuss them both and explain why most of the media coverage of these lawsuits, and of related concerns about AI raised by various groups, doesn’t really address the actual issues at hand.
A Brief Look Inside LLMs and Diffusion Models
While this is not intended to be a particularly technical blog, a brief (and very rough) description of how LLMs and diffusion models function is necessary for the discussion. LLMs are the models used in text-centric systems like GPT-4, while diffusion models are used in image-generation systems like Midjourney.
LLM systems are composed of various internal models that simplistically mimic the neuronal structures inside the brain. The basic building blocks of these models are artificial neurons, which are very simplified versions of the biological neurons in our brains.
Very roughly speaking, when a person reads written material, that material subtly affects the relationships between the neurons in their brain. The more material they’ve read, the better they’re able to interpret subsequent material they read and the better able they are to create their own new material. We typically refer to this as learning.
Similarly, when an LLM ingests written material, that material also affects the relationship between the artificial neurons within its internal model. This is also referred to in computer science as learning, but it’s a very simplified version of the human version.
Instead of being based on comprehending the material, this machine learning is based on statistical analysis of the material. This learning process is referred to as training when it comes to both LLMs and diffusion models. For LLMs, the system ingests a small chunk of material, then makes a probabilistic “guess” at what the next small chunk of material will be and compares its guess to the actual next chunk. If the system makes an incorrect guess, then its internal model is adjusted to reflect this failure. If it makes a correct guess, then its internal structure is strengthened to reflect this success.
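To make that concrete, here is a minimal sketch of a single “guess the next chunk” training step, written in PyTorch with toy sizes and random placeholder data. It illustrates the general idea only; it is not anyone’s actual training code.

```python
# A minimal, illustrative "guess the next chunk" training step (toy sizes,
# random placeholder data). Real LLMs use far larger models and real text.
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 32   # toy sizes

model = nn.Sequential(                               # stand-in for a real LLM
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),
    nn.Linear(embed_dim * context_len, vocab_size),  # scores for the next chunk
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

context = torch.randint(0, vocab_size, (1, context_len))  # preceding chunks (token IDs)
target = torch.randint(0, vocab_size, (1,))               # the actual next chunk

logits = model(context)                              # the probabilistic "guess"
loss = nn.functional.cross_entropy(logits, target)   # how wrong the guess was
loss.backward()                                      # adjust internal weights accordingly
optimizer.step()
optimizer.zero_grad()
```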
It’s worth noting that these chunks are usually pretty small, frequently less than a single word; they represent the most common sequences of letters. When engaging in this statistical analysis, the LLM system is limited in the number of chunks it can consider at one time. This limit is called the context window, and it constrains both the scope of input the system can analyze and the length of conversation and output it can keep track of.
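For a sense of what these chunks look like, here is a short example using OpenAI’s open-source tiktoken tokenizer; the exact splits vary from tokenizer to tokenizer, and this implies nothing about any particular model’s training setup.

```python
# How text becomes chunks (tokens): integer IDs standing for common
# sub-word letter sequences, often smaller than a whole word.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Copyright law meets tokenization")
print(ids)                              # a short list of integer token IDs
print([enc.decode([i]) for i in ids])   # the sub-word pieces those IDs stand for

# The context window is simply a cap on how many of these IDs the model
# can consider at one time.
```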
Diffusion models work by ingesting many images, degrading them, then trying to recreate them. As the system tries to successively recreate the degraded images, it adjusts its internal model, particularly the relationship between the artificial neurons in the model. Initially, this training dataset consists of images paired with captions so that the system will eventually be able to create images based on text prompts.
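For the technically inclined, here is a rough sketch of that degrade-and-recreate training step, again in PyTorch with placeholder data. Real systems like Stable Diffusion add many refinements (noise schedules, text conditioning, latent-space encoding) that are omitted here.

```python
# Illustrative degrade-and-recreate training step for a diffusion-style model
# (placeholder image and a trivial stand-in for the real denoising network).
import torch
import torch.nn as nn

denoiser = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in for a real U-Net
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

image = torch.rand(1, 3, 64, 64)          # a training image (random placeholder)
noise = torch.randn_like(image)
degraded = image + 0.5 * noise            # degrade the image with noise

predicted_noise = denoiser(degraded)      # the model tries to undo the degradation
loss = nn.functional.mse_loss(predicted_noise, noise)
loss.backward()                           # adjust the internal weights
optimizer.step()
```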
Like LLM models, diffusion models can be fine-tuned to more accurately create a desired type of image, and this process frequently involves ingesting more images and image/text pairs.
Once the training is complete, these systems no longer ingest text or images to adjust their internal structures. The internal models have been created, and the systems are ready to create new text or images.
Words Going In
One complaint in some of the lawsuits, as well as by many creators, is that the data used to train these systems is copyrighted and thus these systems are violating that copyright.
According to a law firm filing a class action suit against OpenAI and Meta with several authors as plaintiffs:
Today, on behalf of two wonderful book authors—Paul Tremblay and Mona Awad—we’ve filed a class-action lawsuit against OpenAI challenging ChatGPT and its underlying large language models, GPT-3.5 and GPT-4, which remix the copyrighted works of thousands of book authors—and many others—without consent, compensation, or credit.
The use of the word “remix” here is likely not arbitrary, as it compares what LLMs do to musical artists who digitally sample other artists’ work in their own. That practice has been litigated a number of times over the years, and it’s been pretty firmly established that a musical artist needs permission to use another artist’s work.
The law firm goes on to describe the degree of copying they allege these systems engage in to create their output.
Rather, a large language model is “trained” by copying massive amounts of text from various sources and feeding these copies into the model. (This corpus of input material is called the training dataset).
During training, the large language model copies each piece of text in the training dataset and extracts expressive information from it. The large language model progressively adjusts its output to more closely resemble the sequences of words copied from the training dataset. Once the large language model has copied and ingested all this text, it is able to emit convincing simulations of natural written language as it appears in the training dataset.
While these paragraphs make it clear that the lawsuit alleges a lot of copying is going on, they also provide a pretty inaccurate description of the technology at hand. There are multiple issues with this description, and most of them involve the use of the term copy, some form of which appears repeatedly in the paragraphs quoted above.
Is ingesting data, dividing it up into successive small blocks, breaking those small blocks into much smaller chunks, and then tokenizing those chunks according to statistical analysis equivalent to creating a copy?
Is ingesting information itself a form of copying that information? While the exact parameters of training GPT-4 and Llama 2 are not completely known outside of OpenAI and Meta, it seems likely that the source data was ingested directly rather than the entire internet and other data sources being copied somewhere first and then fed into the system.
It is, however, likely that the data within the current context window is held within the system as it’s analyzed. Yet, this is not very different from going to a website and having the data from that website in your computer’s memory while it’s displayed in a browser. In fact, browsers frequently save a lot of that information to your hard drive so that the website can open more quickly the next time you visit it.
So if you look at a copyrighted image in a browser, are you copying it and violating the copyright of its creator?
It’s hard to know exactly what is meant by the phrase “extracts expressive information.” Whatever that is intended to mean, it’s worth keeping in mind that what the LLM does is cold, hard statistical analysis with no comprehension or analysis of expressiveness. It guesses at which chunks of data, chunks usually smaller than a single word, are statistically likely to follow one another and measures its success.
The output from this analysis is also not a “simulation of natural written language,” it is natural written language. Natural language is a defined term, and natural in this context describes the nature of the language, not the nature of the entity creating it.
In the end, to make the argument that they’re making, it becomes necessary to stretch the definition of copy well beyond its usual meaning. This is not to say that the issue of copyright isn’t important in relation to LLMs, but instead that you can’t apply the law properly or create new ones without understanding the parameters of what you’re trying to litigate or regulate. It may be that a more expansive definition of copy will need to be codified, but it is likely the case that doing so will have lots of repercussions in areas that are not immediately obvious.
Taking a step back, it’s worth considering what human writers are legally allowed to do. A human novelist can, and usually does, read many books before writing their own novel. Each book the novelist reads affects the relationship between the neurons in that novelist’s brain. Each one increases that novelist’s ability to string words together in a way that makes sense and is engaging.
So the question then becomes: is what LLMs do when ingesting previous works functionally different from what humans do? If so, it’s probably going to require very careful differentiation to avoid potential legal pitfalls in the future.
Most of the information used to train these systems is readily available to the public on the Internet. However, this may not be the case for all of the training material, and that could definitely be a legal issue. The lawsuit alleges that the training of Meta’s system used what are termed shadow libraries, online digital repositories that frequently contain illegal copies of copyrighted material. If this is the case, and Meta seems to have indicated that it is, there could be legal repercussions.
Images Going In
The same lawyers that filed the suit above also filed a suit against Stability AI, DeviantArt, and Midjourney for using copyrighted artwork in their generative AI systems.
All three companies use systems based on Stable Diffusion, a generative AI system released to the public.
The lawyers refer to Stable Diffusion as “a 21st-century collage tool that remixes the copyrighted works of millions of artists whose work was used as training data.”
That’s not a good start, as this is in no way an accurate description of Stable Diffusion. One would have to stretch the definitions of collage and remix well beyond their breaking point to use them when referring to Stable Diffusion or any system that uses a similar diffusion model to create images.
As with the LLMs, the diffusion model systems do not keep copies of the images internally nor do they create new images by making a collage or remixing those training images. Instead, the now trained model is able to create new images based on text prompts and randomization parameters. This process is somewhat equivalent to a human artist viewing a lot of artwork and then creating new artwork based on that experience. The new artwork will likely be influenced by the previously viewed artwork.
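As an illustration, here is roughly what using a trained diffusion model looks like with Hugging Face’s diffusers library. Note that the inputs are a text prompt and a random seed, not any stored training image; the model identifier is just an example.

```python
# Sketch of generation with a trained diffusion model: it starts from
# seeded random noise plus a text prompt, not from a stored training image.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
generator = torch.Generator().manual_seed(42)        # the randomization parameter

image = pipe(
    "a lighthouse at dusk, oil painting",            # the text prompt
    generator=generator,
).images[0]
image.save("output.png")
```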
Here are several more quotes from the lawyers’ description of the basis for their lawsuit:
Stable Diffusion contains unauthorized copies of millions—and possibly billions—of copyrighted images. These copies were made without the knowledge or consent of the artists.
Stable Diffusion belongs to a category of AI systems called generative AI. These systems are trained on a certain kind of creative work—for instance text, software code, or images—and then remix these works to derive (or “generate”) more works of the same kind.
Having copied the five billion images—without the consent of the original artists—Stable Diffusion relies on a mathematical process called diffusion to store compressed copies of these training images, which in turn are recombined to derive other images. It is, in short, a 21st-century collage tool.
These resulting images may or may not outwardly resemble the training images. Nevertheless, they are derived from copies of the training images, and compete with them in the marketplace. At minimum, Stable Diffusion’s ability to flood the market with an essentially unlimited number of infringing images will inflict permanent damage on the market for art and artists.
The lawyers’ website then goes into a more detailed description of the technology behind Stable Diffusion. Unfortunately, the above paragraphs and the further description of the technology are both confused jumbles of technical inaccuracies, bad analogies, and muddled terminology that seem unlikely to bolster their case.
One of the biggest problems is that they repeatedly confuse the training phase of the system, in which it tries to reconstruct degraded source images so as to adjust the parameters of its internal model, with the actual final output of the system. In doing so, they seem to be claiming that the final output is a copy of the input source images.
They state:
In short, diffusion is a way for an AI program to figure out how to reconstruct a copy of the training data through denoising. Because this is so, in copyright terms it’s no different than an MP3 or JPEG—a way of storing a compressed copy of certain digital data.
There is really nothing accurate in this paragraph. Diffusion is not a way to reconstruct training images, but instead a way to train a system in how to make images in general. It is also not remotely analogous to an MP3 or JPEG, nor a way of storing a compressed copy of the original data. They are confusing one step of the ingestion process during the training phase with the overall functioning of the system.
They claim that Stable Diffusion stores latent images of its training data to create new images. This seems to be a confusion of the term latent space. During training, Stable Diffusion converts an input image from its normal pixel space into what’s termed latent space so that it can manipulate the image in a more useful form. While this could be considered a form of compression, it is only done during the training period while the system is adjusting its internal model.
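To give a sense of the sizes involved, here is a quick comparison, assuming the dimensions used by Stable Diffusion v1-style models (a 512×512 RGB image and a 4×64×64 latent); the figures are approximate and purely illustrative.

```python
# Rough pixel-space vs. latent-space size comparison (assumed SD v1-style
# dimensions). The latent exists only while a training step is running.
pixel_values  = 512 * 512 * 3     # a 512x512 RGB image: ~786,000 numbers
latent_values = 64 * 64 * 4       # its latent representation: ~16,000 numbers

print(pixel_values / latent_values)   # roughly a 48x reduction
```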
After training on that particular image, there is no copy of it, compressed or otherwise, in the final system. In other words, you could examine every bit of the system and you would not find the original pattern of the image or a compressed version of it. It’s the same as a person memorizing a poem: you won’t find the letters and words of that poem if you examine their brain. Memorizing the poem has affected the structure and functioning of the neurons in the person's brain in such a way that they can recreate the poem (although not always accurately).
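A bit of back-of-the-envelope arithmetic makes the same point. Assuming a model on the order of a billion parameters stored as 32-bit floats (an approximation, not an exact figure), and taking the lawsuit’s own number of five billion training images, there simply isn’t room for compressed copies:

```python
# Back-of-the-envelope estimate (all figures approximate and assumed): if the
# trained model really stored compressed copies of its training images,
# how much room would each image get?
parameters      = 1.0e9        # assumed: on the order of a billion parameters
bytes_per_param = 4            # 32-bit floating point weights
model_size      = parameters * bytes_per_param     # roughly 4 GB

training_images = 5.0e9        # the figure cited in the lawsuit
print(model_size / training_images)                # well under 1 byte per image
```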
Complications
Of course, nothing is simple. Although the original written materials are not stored in LLMs and training images are not stored inside diffusion models, this does not necessarily guarantee that the original images or text can’t be recreated by the system.
There have been several papers detailing how it’s possible to use an “adversarial attack” on these sorts of systems to get around their internal safeguards and coax them into recreating parts of their training data.
This paper showed how it was possible to coax GPT-3 into recreating some of its input data, including text and personal data that was supposed to be anonymized. This paper showed how it was possible to recreate training images in slightly degraded form using systems like Stable Diffusion and DALL-E 2.
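Conceptually, what these papers measure looks something like the sketch below: prompt a model with the opening of a known passage and check how much of the real continuation it reproduces verbatim. The generate function is a hypothetical stand-in for whichever model is being probed; this is a sketch of the measurement, not a working attack.

```python
# Conceptual sketch of a training-data extraction test. `generate` is a
# hypothetical callable that returns the model's continuation of a prompt.
def verbatim_overlap(generate, passage: str, prefix_len: int = 200) -> float:
    prefix, continuation = passage[:prefix_len], passage[prefix_len:]
    output = generate(prefix)                      # the model's continuation
    matched = 0
    for produced, original in zip(output, continuation):
        if produced != original:
            break
        matched += 1
    return matched / max(len(continuation), 1)     # fraction reproduced exactly
```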
So does this mean that the lawyers quoted above are correct in their assessments?
No, but these papers do point out an issue that should be addressed. The use of the term copy in the above quotes is factually incorrect, but that doesn’t mean it’s impossible to recreate some of the training materials used in these systems. These are not copies in the normal sense of the word. Instead, they are akin to someone memorizing a picture or poem and then recreating it from memory. It may not be a perfect recreation, but it’s close enough to be recognizable as the original image or poem.
The question now becomes one of law rather than technology. Is the system’s ability to recreate an image or text enough to make the system itself a copyright infringement, or is it necessary to actually recreate a specific image or body of text to trigger the infringement?
The developers of these systems are aware of the issue, and each new version of the systems has more safeguards in place to prevent a user from doing this. It’s difficult, though not impossible, for an average user to figure out a way around those safeguards. However, due to the nature of the technology, it is very difficult to completely prevent a technically skilled adversary from getting around them.
The issue itself is not new. It came up with photocopying and with audio and video cassette recording and eventually with digital recording. In other words, is a photocopier itself a copyright infringement or does someone need to copy something copyrighted and distribute it to trigger the infringement? Is it the possibility of making a copy that is the infringement or the actual making of a copy that is the infringement? In the past, the courts have most often ruled that the infringement is triggered only in the latter case.
Use and Fair Use
Finally, there is another question to consider: even if the systems don’t actually copy images, are the images they create still derivative of their input images and therefore copyright infringement?
This gets into legal areas like Fair Use, something beyond the scope of this blog to delve too far into and frequently a matter of contention in the courts. In fact, a major fair use case reached the Supreme Court, where it was argued on October 12, 2022. It involved a series of Andy Warhol images of Prince (the musician’s name at the time) that were based on a photograph of him. The Andy Warhol Foundation had licensed the Warhol image without the permission of the original photographer. The Warhol image was a stylized silkscreen of the original photograph.
The Andy Warhol Foundation considered this fair use because Warhol’s image had a different message than the original photograph. On May 18, 2023, the Supreme Court disagreed, upholding a lower court ruling that the images were too similar for the use to qualify as fair use.
There are various considerations for what constitutes fair use, including aspects related to the public interest, but one that could potentially be a major issue for generative AI systems is the effect of that use upon the potential market for or value of the copyrighted works used. This will likely come up a lot in the future.
Moving Forward
Three possible paths forward that stand out to me are the following:
We could specify in detail how what these systems do is functionally different from what humans do and decide that what they do is illegal because of those differences. For example, one important difference could be that these systems do what humans do, but at such a vastly larger scale that it is no longer fair use.
We could instead decide that even though what these systems do is functionally very close to what humans do, it’s illegal simply because an AI system is doing it rather than a human.
Lastly, we could judge whether copyright infringement has taken place for any particular output image or text that’s sold or distributed on a case-by-case basis. This is what we do with human-based copyright infringement cases.
It seems that at some point a definition of copy might have to include recreation of original data that approaches a certain level of similarity to the original regardless of whether there is an actual copy of the original stored in the system. Again, this is how we judge human produced copies. However, this may mean that as long as the system isn’t coaxed into recreating a copy, then there is no copyright infringement.
If you memorize a book, then you’re only guilty of copyright infringement if you create and distribute a copy of it. Should the same be true for AI systems?