Thoughts on Verbatim Memorization
To follow up on the last post about the paper Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy, I wanted to talk about a question that came up for me as I was reading it: when is a piece of text a copy of another piece of text? I stopped myself short of the details of Ippolito et al.’s MemFree filter, but one detail we’ll need as we talk about that question is the idea of an n-gram. An n-gram is n consecutive words (or, more precisely, tokens). For example, the phrase “my name is Stefan” is a 4-gram because it is four consecutive words. MemFree is designed to prevent a language model from reproducing any n-gram in the training data, so a key part of the filter is the choice of n. That choice is essentially asking: how many words (or tokens) in a row are OK before you consider the generated text to be copying your training data?
The paper talks a bit about the tradeoff here: a small n might prevent common phrases from being used, while a large n won’t catch shorter phrases that should be filtered. As an example, using the value of 10 from the paper, the filter would block you from saying “a bird in the hand is worth two in the bush,” which probably shouldn’t be filtered, but it wouldn’t block something like “the North remembers,” a phrase from Game of Thrones that is protected by copyright and so probably should be filtered. Which is not to say that 10 is the wrong choice, just that there are tradeoffs with any choice, and the paper mostly frames the tradeoff in terms of implementation costs.
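To make the mechanics concrete, here’s a minimal sketch of the kind of check an n-gram filter performs on those two examples. To be clear, this is not the paper’s implementation (which, as I recall, works over tokens and uses a Bloom filter to keep lookups cheap); the whitespace splitting, the plain Python set, and the example phrases are all simplifications on my part.

```python
# Minimal sketch of an n-gram filter, NOT the MemFree implementation:
# whitespace "tokens" and a plain set stand in for real tokenization
# and the Bloom filter used in the paper.

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])

def build_training_ngrams(training_texts, n):
    """Collect every n-gram that appears anywhere in the training data."""
    seen = set()
    for text in training_texts:
        seen.update(ngrams(text.split(), n))
    return seen

def violates_filter(generated_text, training_ngrams, n):
    """True if the generated text repeats any training n-gram verbatim."""
    return any(gram in training_ngrams for gram in ngrams(generated_text.split(), n))

training = [
    "a bird in the hand is worth two in the bush",
    "the North remembers",
]
train_10grams = build_training_ngrams(training, n=10)

# The 11-word proverb contains a training 10-gram, so it gets blocked...
print(violates_filter("a bird in the hand is worth two in the bush", train_10grams, n=10))  # True
# ...while the 3-word quote is too short to ever match a 10-gram.
print(violates_filter("the North remembers", train_10grams, n=10))  # False
```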
The Right n?
Analyzing the problem from a practical perspective is understandable when you actually have to implement a real-world filter, but I’m throwing out ideas, so I have the luxury of getting to stay in the world of theoretical filters. In this world, the question becomes: what is the “right” number of words that defines a unique phrase that should be filtered?
I’d argue that there is no “right” number, and that a filter would need a dynamic way of determining sufficient originality. There are short phrases that should be protected, and there are long phrases that are common enough that there’s no reason to protect them (see the examples above). Unfortunately, a dynamic filter like that would almost certainly need to be tuned on a case-by-case basis, and so would be stuck in the theoretical world forever.
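For what it’s worth, one crude way to approximate “sufficient originality” automatically (rather than case by case) might be to only filter spans that appear in the training data and are rare in some general reference corpus, on the theory that common phrases are common everywhere. Everything below is hypothetical: the reference corpus, the rarity threshold, and the helper names are made up for illustration.

```python
# Hypothetical "originality-aware" filter: block an n-gram only if it
# (a) appears in the training data and (b) is rare in a reference corpus.
# The threshold of 2 is arbitrary and exists purely for illustration.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def build_reference_counts(reference_texts, n):
    """Count how often each n-gram appears in a general-purpose corpus."""
    counts = Counter()
    for text in reference_texts:
        counts.update(ngrams(text.split(), n))
    return counts

def should_filter(span, training_ngrams, reference_counts, n, rarity_threshold=2):
    """Block a span only if it copies the training data AND isn't a common phrase."""
    return any(
        gram in training_ngrams and reference_counts[gram] < rarity_threshold
        for gram in ngrams(span.split(), n)
    )
```

This obviously just pushes the problem around (now you need a reference corpus and a threshold), but it’s one way a filter could be “dynamic” without a human in the loop.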
Something other than n?
As the authors point out, though, choosing n, which implicitly defines copying as “n consecutive words in a row,” is fairly limiting. Am I no longer copying a piece of text if I swap out a word here or there? The authors convincingly (to me) show that this kind of near-copy can still amount to fairly blatant copying, and they attempt to define approximate memorization using things like edit distance, i.e. how much a piece of text needs to be changed to match another piece of text. This still feels somewhat limiting, though, and the authors most likely chose it because they had to implement something, and implementing approximate filters is hard.
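Since I keep mentioning edit distance, here’s a rough sketch of what an edit-distance-based check could look like. The character-level Levenshtein distance, the normalization by source length, and the 0.2 threshold are my own arbitrary choices, not the definition used in the paper.

```python
# Sketch of an approximate-copy check using edit distance. The normalization
# and the 0.2 threshold are arbitrary choices, not the paper's definition.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

def is_approximate_copy(generated: str, source: str, threshold: float = 0.2) -> bool:
    """Flag generated text whose relative edit distance to the source is small."""
    return edit_distance(generated, source) / max(len(source), 1) <= threshold

source = "the quick brown fox jumps over the lazy dog"
tweaked = "the quick brown fox leaps over the lazy dog"
print(is_approximate_copy(tweaked, source))  # True: swapping one word barely changes the text
```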
Just like with the n-gram filter, if we open ourselves up to options that don’t need to be implemented, an interesting question is: what would a better approximate filter look like? To me, an interesting option would be a filter based on semantic similarity. With a semantic filter, you wouldn’t be filtering based on the text itself, you would be filtering based on the ideas contained in the text. In that case, we’re preventing a model from stealing ideas, which seems closer to the “true” goal of preventing a model from copying training data (aside from privacy concerns, where we do want to prevent word-for-word leaks of private information). That brings up a question parallel to the choice of n: how semantically similar can an LLM’s output be to something in the training data before it’s considered copying? I really have no idea, and it would depend heavily on how you’re measuring semantic similarity. It would almost surely need to be calibrated using examples we know for sure are copying and examples we know aren’t (but then again, what does copying even mean?).
As a brief side note, and partly because I wanted to mention this multi-lingual sentence embedding that I found really interesting: if I had to implement a semantic filter, one possibility would be to use something like a text embedding and then measure the distance between embeddings in a latent space.
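In that spirit, a semantic filter might look something like the sketch below. The sentence-transformers library, the particular model name, and the 0.9 cosine-similarity threshold are all assumptions on my part, and calibrating that threshold is exactly the open question from the previous paragraph.

```python
# Hypothetical semantic filter: embed the generated text and the training
# passages, then flag anything whose embedding is too close to a training
# embedding. Model choice and the 0.9 threshold are arbitrary placeholders.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedding model would do

def is_semantic_copy(generated: str, training_texts: list[str], threshold: float = 0.9) -> bool:
    """True if the generated text is embedded too close to any training text."""
    # With normalized embeddings, the dot product equals the cosine similarity.
    gen_vec = model.encode([generated], normalize_embeddings=True)[0]
    train_vecs = model.encode(training_texts, normalize_embeddings=True)
    return bool((train_vecs @ gen_vec).max() >= threshold)
```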
Something other than other than n?
Defining memorization in terms of semantic meaning brings up a slightly sideways question in relation to the “true” goal of a verbatim filter: should style be protected too? And then, more practically, can style be protected? I don’t know enough about writing to tell you how one style of writing differs from another, but I’d venture to say that there are authors who might feel protective of their writing voice and would take issue with a model that can endlessly generate text that sounds like they wrote it. A filter that could provide this kind of protection feels worthwhile, but I have even fewer ideas about how it would work. Is style based on word choice? Grammar? Is there a minimum amount of text needed to determine a text’s style? And, importantly, if we want to build a style filter, how do you measure any of those things?
This is all a somewhat long-winded way of saying that I agree with the authors’ call to look more seriously into the idea of approximate memorization, and at the same time I don’t know what a perfect solution would look like. What I do know is that models are, and have been for a while now, blurring the line between learning and reproduction (in fact, as I was writing this, the head of the US Copyright Office was fired after releasing a report that concluded that AI models’ use of copyrighted material goes beyond existing doctrines of fair use). All I can really say definitively is that just as we wouldn’t consider an image model “safe” if it produces barely-modified versions of copyrighted images, we shouldn’t consider a language model “safe” if it can reproduce training text with minor modifications.