Searching Training Data

The Allen Institute for Artificial Intelligence claims to trace LLM responses back to training data, but it really just looks like a text search over the training data.

OLMoTrace identifies the exact pre-training documents behind a response, including full, direct quote matches, and provides links to the sources. Under the hood, it uses a process called “exact-match search” or “string matching.”
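To make the “exact-match search” idea concrete, here is a minimal, hypothetical sketch: scan a model response for the longest contiguous span that appears verbatim in some training document and report that document. The corpus, document ids, and function name are all made up for illustration; the real system operates over trillions of tokens with specialized indexes, not a naive scan.

```python
# Minimal sketch of exact-match ("string matching") tracing over a
# tiny hypothetical in-memory corpus.
corpus = {
    "doc-001": "the quick brown fox jumps over the lazy dog",
    "doc-002": "a stitch in time saves nine",
}  # hypothetical training documents

def longest_exact_match(response: str, corpus: dict) -> tuple:
    """Return the longest contiguous word span of `response` that appears
    verbatim in some corpus document, plus that document's id."""
    words = response.split()
    # Try spans from longest to shortest; return the first hit.
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            span = " ".join(words[start:start + length])
            for doc_id, text in corpus.items():
                if span in text:
                    return span, doc_id
    return "", None

span, doc_id = longest_exact_match(
    "he saw the quick brown fox jumps over a fence", corpus
)
# span is "the quick brown fox jumps over", traced to doc-001
```

The naive nested loop is quadratic in the response length; the point is only to show what “exact match” means here, namely verbatim substring containment, nothing fuzzier.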

Furthermore, traceability has its own limits. For instance, although OLMoTrace can provide exact text matches for simple facts, it’s not possible to trace sources for creative generation, such as poems or stories.


This is early work but very interesting. As we know too well, text search across massive corpora seems simple in principle but turns out to be very hard in practice.


I got the impression that the people in the interview wanted it to sound like they were reverse engineering the weights; simply indexing the training data seemed less significant.

Was there a different paper or something in that article that you found more credible?

I didn’t get past https://infini-gram.io in the article, and went straight to the paper: [2401.17377] Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens


The paper you cited was helpful. I hadn’t realized how many useful applications the infini-gram engine has.
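The core trick the paper describes is counting arbitrary-length n-grams with a suffix array over the tokenized corpus, rather than precomputing fixed-size n-gram tables. A toy word-level sketch of that idea (the corpus and function name here are invented; the real engine works on token ids at trillion-token scale):

```python
import bisect

# Toy word-level sketch of the suffix-array idea behind infini-gram:
# sort every suffix of the corpus once, then count any n-gram with two
# binary searches instead of building fixed-size n-gram tables.
tokens = "the cat sat on the mat the cat ran".split()  # hypothetical corpus

# Suffix array: suffix start positions, sorted lexicographically by suffix.
suffixes = sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(query: list) -> int:
    """Count occurrences of `query` via binary search over the suffix array."""
    # Truncating each sorted suffix to len(query) keeps the list sorted,
    # so all matches form one contiguous range.
    keyed = [tokens[i:i + len(query)] for i in suffixes]
    lo = bisect.bisect_left(keyed, query)
    hi = bisect.bisect_right(keyed, query)
    return hi - lo

print(count_ngram(["the", "cat"]))  # "the cat" occurs twice in the toy corpus
```

Because the n-gram length is just the length of the query, the same index answers counts for n of any size, which is what makes the “unbounded n-gram” framing of the paper possible.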

Do you think that Mojeek would benefit from building its own infini-gram index?

The hope is that by open sourcing infini-gram, he said, the engine can be added to and refined: “By releasing source code and packages, we can allow other people to build indexes on their own training data so they can build for this gap.”