The Allen Institute for Artificial Intelligence claims to trace LLM responses back to training data, but it really just looks like a text search over the training data.
OLMoTrace identifies the exact pre-training documents behind a response, including full, direct quote matches, and provides source links. Under the hood, it relies on exact-match search, i.e. plain string matching against the training corpus.
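To make the idea concrete, here is a toy sketch of what exact-match tracing looks like: scan a model response for the longest spans that appear verbatim in some training document. This is an illustration of string matching only, not Ai2's actual implementation; the `min_len` threshold and the brute-force scan are my own assumptions for the example.

```python
# Toy sketch of exact-match tracing: find spans of a model response
# that appear verbatim in a small "training corpus". Illustration only,
# not Ai2's implementation (which works at scale over tokenized data).

def trace_spans(response: str, corpus: list[str], min_len: int = 12):
    """Return (span, doc_index) pairs for the longest verbatim matches."""
    matches = []
    n = len(response)
    i = 0
    while i < n:
        best = None
        # grow the span from position i as long as some document contains it
        for j in range(i + min_len, n + 1):
            span = response[i:j]
            hits = [k for k, doc in enumerate(corpus) if span in doc]
            if hits:
                best = (span, hits[0])
            else:
                break
        if best:
            matches.append(best)
            i += len(best[0])  # skip past the matched span
        else:
            i += 1
    return matches

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Paris is the capital of France and its largest city.",
]
response = "As everyone knows, Paris is the capital of France today."
# finds the verbatim "Paris is the capital of France" span in document 1
print(trace_spans(response, corpus))
```

A real system obviously can't afford a linear scan per query over trillions of tokens, which is where a precomputed index comes in.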
Furthermore, traceability has its own limits. Although OLMoTrace can surface exact text matches for simple facts, it can't trace sources for creative generation, such as poems or stories.
This is early work but very interesting. As we know too well, text search across massive corpora seems simple in principle and turns out to be super hard in practice.
I got the impression that the people in the interview wanted it to sound like they were reverse-engineering the weights; simply indexing the training data seemed less significant.
Was there a different paper or something in that article that you found more credible?
The hope, he said, is that by open-sourcing infini-gram, the engine can be extended and refined: "By releasing source code and packages, we can allow other people to build indexes on their own training data so they can build for this gap."
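To show what "building an index on your own training data" might mean in miniature, here is a toy inverted index of word n-grams. The real infini-gram engine works differently (it builds suffix arrays over tokenized corpora), so treat this as a simplified sketch of the general idea; all names here are my own.

```python
# Toy illustration of indexing a corpus for fast exact-match lookup.
# A simplified inverted index of word n-grams; the real infini-gram
# uses suffix arrays over tokenized text, not this structure.
from collections import defaultdict

def build_ngram_index(docs: list[str], n: int = 3):
    """Map every word n-gram to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        words = doc.split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return index

def lookup(index, query: str, n: int = 3):
    """Return ids of documents containing every n-gram of the query:
    candidates for an exact match, to be verified by direct comparison."""
    words = query.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return set()
    result = index[grams[0]].copy()
    for g in grams[1:]:
        result &= index[g]
    return result

docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox is not a lazy dog",
]
idx = build_ngram_index(docs)
print(lookup(idx, "quick brown fox jumps"))  # only doc 0 has this phrase
```

The index is built once, so each query only touches the n-grams it contains rather than rescanning the corpus.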