Talkie could silence the tech bros by taking AI back to the 1930s

A ‘vintage’ LLM trained on English texts from before 1931 provides a riposte to firms that insist on ingesting the copyrighted web

John Naughton, columnist

Here’s some refreshing news. Instead of burbling endlessly about an imagined (and unknowable) future, a trio of distinguished AI researchers decided to explore something that we do know about; namely, the past.

They built a “vintage” large language model (LLM) called Talkie that was trained exclusively on pre-1931 English texts. So, unlike most LLMs, which are trained on everything that their makers can scrape from the internet, Talkie has a hard cut-off date – 31 December 1930 – in its knowledge base.

What’s the significance of that date? Simply this: everything published before it is in the public domain, even under the warped US legislation crafted over decades by Hollywood to ensure that Mickey Mouse stayed copyrighted for as long as possible.

That places Talkie squarely inside the copyright fight that is now engulfing the AI companies. It is, by design, the thing that the plaintiffs in the current piracy lawsuits are arguing that the big labs should have built – instead of ripping off the intellectual property of millions of authors.

The model was trained on 260 billion tokens (words or fragments of words) of historical pre-1931 English text, including books, newspapers, periodicals, scientific journals, patents and case law. What’s really charming about it is that its conversational interface was built from pre-1931 reference works including etiquette manuals, letter-writing manuals, encyclopedias and poetry collections.

So Talkie’s notion of how to respond to the user is reconstructed from Edwardian and Victorian conventions of correspondence and conduct rather than the conversational guff one gets from 21st-century LLMs. At times, you half-expect it to reply to a prompt with: “With regard to your inquiry of the 15th… ”

It would be tempting, but wrong, to regard Talkie as just a retro curiosity. In fact, it provides a vivid confirmation of Alison Gopnik’s insight that LLMs are not artificial “minds” but cultural technologies, in the same family as writing, printing and libraries.

In other words, they are tools we use in order to access the accumulated knowledge of our species. Talkie isn’t pretending to be a human from the 1930s. It’s just providing a view into the collective written knowledge of the period and, via that, an insight into its culture.

In that sense, it enables us to interact with texts written by people who did not possess the 20/20 vision that hindsight bestows. Ask Talkie what will be the likely effect of the automobile on public morality, for example, and it will dig out this from Blackwood’s magazine: “The automobile has had an unquestionable effect in democratising pleasure-seeking, and enlarging the sphere of popular recreations. It has popularised holiday-making, and added to the number of those who spend their leisure in outdoor amusements. The consequence has been to improve public morality, inasmuch as it has substituted innocent for vicious pleasures, and has set up a healthier standard of enjoyment.”

Talkie is a good example of the value of curiosity-driven research. It’s truly generative in the best sense of the word, in that it prompts people to think, to daydream about having conversations with people in the past.

What would you ask someone with no knowledge of what was to come?

It also enables us to do thought experiments. How good were people before 1931 at predicting events that occurred later in the period? Did ordinary people in Germany in the 1920s foresee the possibility of a fascist takeover?

Or, as Demis Hassabis, the boss of Google DeepMind, wondered, could a language model trained on data up to 1911 independently discover general relativity, as Einstein did in 1915? And so on.

What Talkie quietly demonstrates is that Gopnik’s framing isn’t just a corrective to the breathless mythology of AI; it’s a guide to what the technology should be allowed to be. A library, after all, doesn’t justify its holdings by claiming they constitute a “mind”. An LLM is just a special kind of library: one with an interactive, infinitely patient, combinatorial index to our accumulated written past.

Talkie also provides a neat, understated riposte to the AI companies’ insistence that their technology cannot work without ingesting the contemporary copyrighted web. It shows that they have been outflanked, intellectually if not commercially, by three researchers working with material from before colour film was invented.

The smallness of the team, the modesty of the claim, and the public-domain training corpus together represent a quietly impressive rebuke to the current Gadarene stampede in search of AI supremacy.

So maybe the best slogan for Talkie would be: the shock of the old.