What if a language model had never heard of the Internet, smartphones, or even World War II? This isn't hypothetical: it's exactly what a team of researchers led by Nick Levine, David Duvenaud, and Alec Radford built. They call it Talkie, and it may be the most historically disciplined large language model ever released to the public.
Talkie is a 13-billion-parameter open-weight language model trained exclusively on pre-1931 English text. The project was developed by a non-profit team and delivers what the researchers call a "vintage language model": an LM whose knowledge cutoff is tied not to the date of its training run, but to a specific moment in history.
What exactly is a vintage language model?
To understand Talkie, you first need to understand the concept behind it. Most modern LLMs – GPT-4, Llama, Mistral, and the rest – are trained on massive crawls of the modern web, so their knowledge reflects the world as it exists today, or as of the date their training data ended. A vintage language model turns this on its head: it is deliberately trained only on historical data, so its "worldview" is frozen at a chosen point in the past.

For Talkie, that cutoff is December 31, 1930 – chosen specifically because works published before 1931 have entered the public domain in the United States, making the text legally usable for training.
The model – officially named talkie-1930-13b-base – was trained on 260 billion tokens of pre-1931 English text, including books, newspapers, periodicals, scholarly journals, patents, and case law. A separate post-trained chat checkpoint, talkie-1930-13b-it, is also available for interactive use. The team has set up a 24/7 live demo at talkie-lm.com/chat, where Claude Sonnet 4.6 continuously converses with the instruction-tuned model, letting visitors watch and get a feel for Talkie's voice in real time.
Why a model from 1930?
This is not a nostalgia project. The research team identifies several concrete, technically meaningful use cases that make Talkie interesting to the AI research community.
1. Contamination-free generalization experiments: Benchmark contamination – where test data inadvertently leaks into training data – is one of the most persistent and underappreciated problems in modern LLM evaluation. Because Talkie was trained only on pre-1931 text, it is contamination-free by construction with respect to any modern benchmark. That opens up a clean experimental setup for testing how well an LM can generalize beyond its pre-training data. For example, the team tested whether Talkie could learn Python – a language that did not exist in 1930 – from a few illustrative examples in context, evaluated on HumanEval. They found that although it significantly underperforms web-trained models, it is "slowly but surely improving at this task at scale." (A minimal sketch of this kind of in-context probe appears after this list.)
2. Temporal prediction and surprisal evaluation: Inspired by Calcifer Computing's work on temporal language models, the team used Talkie to measure surprisal (in bits per byte) on descriptions of historical events from the New York Times "On This Day" feature. Events that occurred after 1930 – beyond the model's knowledge cutoff – were consistently more surprising to it, with the effect most pronounced for events of the 1950s and 1960s, followed by a plateau. This establishes an initial setup for studying how predictability scales with model size and how performance degrades over longer time horizons. (The second sketch below shows how bits-per-byte surprisal is computed.)
3. LLM identity and persona formation: Because Talkie was trained on a completely different distribution from any modern model, it raises questions about what constitutes an LLM's "identity." Modern LLMs – regardless of provider – share a common ancestor in web data, whether through direct training or through distillation and synthetic-data pipelines. Talkie breaks that lineage entirely, giving researchers a tool to separate universal language-modeling behaviors and capabilities from the artifacts of contemporary web training.
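The in-context Python experiment from point 1 is easy to picture in code. Below is a minimal sketch of that kind of probe, assuming the base checkpoint is published under the Hub ID talkie-1930-13b-base and loads with Hugging Face transformers; the few-shot demonstrations and prompt format here are illustrative, not the team's actual protocol.

```python
# Hedged sketch: probing whether a pre-1931 model can pick up Python
# syntax purely from in-context demonstrations. Hub ID is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "talkie-1930-13b-base"  # assumed Hugging Face Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# A few tiny worked examples, since the model has never seen any
# programming language at all.
FEW_SHOT = (
    "def add(a, b):\n"
    "    return a + b\n\n"
    "def is_even(n):\n"
    "    return n % 2 == 0\n\n"
)

def complete(humaneval_prompt: str, max_new_tokens: int = 256) -> str:
    """Prepend the demonstrations to a HumanEval-style prompt and decode greedily."""
    inputs = tokenizer(FEW_SHOT + humaneval_prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```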
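The surprisal measurement from point 2 likewise reduces to a few lines: the model's total negative log-likelihood of a passage, converted from nats to bits and normalized by the passage's UTF-8 byte length. A sketch reusing the model and tokenizer loaded above:

```python
import math

import torch

@torch.no_grad()
def bits_per_byte(text: str) -> float:
    """Surprisal of `text` under the model, in bits per UTF-8 byte."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    # The returned loss is the mean cross-entropy in nats over predicted tokens.
    loss = model(**enc, labels=enc["input_ids"]).loss
    n_predicted = enc["input_ids"].shape[1] - 1  # the first token is never predicted
    total_bits = loss.item() * n_predicted / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Post-cutoff events should score consistently higher, e.g.:
print(bits_per_byte("The stock market crashed in October 1929."))             # familiar era
print(bits_per_byte("The first artificial satellite was launched in 1957."))  # post-cutoff
```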
Training Pipeline: What makes this so difficult
Building a vintage language model is not as simple as filtering a modern dataset by date. The Talkie team faced several non-trivial engineering challenges.
Temporal leakage is the most important. If any post-1930 text creeps into the training corpus – through misdated documents, or older texts reprinted with later editorial introductions – the model's historical fidelity is compromised. An earlier 7B version of Talkie apparently knew about Roosevelt's presidency and New Deal legislation, revealing an incomplete filtering process. The team built a document-level anachronism classifier to scrub the corpus, but acknowledges the process is still imperfect: the 13B release retains some awareness of World War II and the post-war order.
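The write-up doesn't detail how that classifier works; as a stand-in, the rule-based sketch below illustrates the document-level screening idea, flagging documents that mention post-1930 years or terms from a hypothetical (and deliberately tiny) anachronism list. In practice, a learned classifier would replace these heuristics.

```python
import re

# Hypothetical, deliberately incomplete term list, for illustration only.
ANACHRONISTIC_TERMS = re.compile(
    r"\b(world war ii|new deal|television network|nuclear|internet)\b",
    re.IGNORECASE,
)
# Any four-digit year from 1931 through 2099.
POST_CUTOFF_YEAR = re.compile(r"\b(19(3[1-9]|[4-9]\d)|20\d\d)\b")

def is_anachronistic(document: str) -> bool:
    """Flag a document for removal if it references the post-1930 world."""
    return bool(
        ANACHRONISTIC_TERMS.search(document) or POST_CUTOFF_YEAR.search(document)
    )

corpus = [
    "The Treaty of Versailles was signed in June 1919.",
    "Preface to the revised edition of 1948.",  # later editorial matter: flagged
]
clean = [doc for doc in corpus if not is_anachronistic(doc)]
```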
Data quality is another major hurdle. Since digital publishing did not exist in 1930, every token in Talkie's training set had to be transcribed from physical sources via optical character recognition (OCR). In controlled experiments, the team found that training on text transcribed by off-the-shelf OCR systems yielded only 30% of the learning efficiency of a model trained on human-transcribed versions of the same texts. Simple regex-based cleaning raised that figure to 70%, but a large gap remained. To close it, they built a custom OCR system fine-tuned for historical document layouts.
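The article doesn't list the exact regex fixes, but the defects in scanned pre-1931 print are predictable: words hyphenated across line breaks, stray intra-paragraph newlines, ligature misreads, and whitespace debris from column layouts. A minimal sketch of that kind of cleaning pass:

```python
import re

def clean_ocr(text: str) -> str:
    """Illustrative regex cleanup for common OCR artifacts in historical scans."""
    # Rejoin words hyphenated across line breaks: "tele-\ngraph" -> "telegraph".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse single newlines inside paragraphs into spaces (keep blank lines).
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Repair common ligature misreads.
    text = text.replace("ﬁ", "fi").replace("ﬂ", "fl")
    # Normalize whitespace runs left over from multi-column layouts.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

print(clean_ocr("The wire-\nless tele-\ngraph was the ﬁnest\ninstrument of its day."))
# -> "The wireless telegraph was the finest instrument of its day."
```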
Post-training: The instruction-tuning phase required building an entirely new pipeline from scratch. Using modern instruction-response pairs would inject contemporary expectations into the model's behavior. Instead, the team mined instruction-answer pairs from structured historical texts: etiquette manuals, letter-writing guides, cookbooks, dictionaries, encyclopedias, and collections of poetry and tales. They then ran Direct Preference Optimization (DPO) with Claude Sonnet 4.6 as a judge, improving Talkie's average instruction-following rating from 2.0 to 3.4 on a five-point scale. A final round of supervised fine-tuning used multi-turn, rejection-sampled synthetic chats between Claude Opus 4.6 and Talkie.
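DPO itself is a compact objective: for each prompt, it pushes up the policy's log-likelihood margin on the judge-preferred answer relative to a frozen reference model. A minimal PyTorch rendering of the loss (framework-agnostic, not the team's actual training code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log-prob of preferred answer under the policy
    policy_rejected_logp: torch.Tensor,  # log-prob of rejected answer under the policy
    ref_chosen_logp: torch.Tensor,       # same two quantities under the frozen reference
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * (chosen margin - rejected margin)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

In Talkie's pipeline, the chosen/rejected labels would come from Claude Sonnet 4.6's judgments over pairs of Talkie outputs.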
Benchmarks: How does the 1930 model stack up?
To provide useful context, the research team trained a "modern twin" – an architecturally identical 13B model trained on modern web data (FineWeb) – and compared it with Talkie. Unsurprisingly, Talkie underperformed its modern counterpart on standard LM evaluations. When controlling for anachronistic questions, however – filtering out questions that reference concepts that did not exist in 1930 – the performance gap nearly halved. The team notes encouraging parity on basic language-comprehension and arithmetic tasks, and attributes the remaining gap primarily to OCR noise and differences in subject distribution.
Key takeaways
- Talkie is a 13B open-weight "vintage language model" trained on 260 billion tokens of exclusively pre-1931 English text – making it the largest known vintage LM, with a strict knowledge cutoff of December 31, 1930.
- Benchmark contamination is eliminated by design. Because Talkie has never seen recent data, it serves as a uniquely clean testbed for generalization experiments – including whether a model with no knowledge of digital computers can learn to write Python code from in-context examples alone.
- Creating a vintage LM is harder than filtering by date. The team had to solve temporal leakage (post-1930 text creeping into the corpus), mitigate OCR noise that cut training efficiency to just 30% of that of human-transcribed text, and build an entire post-training pipeline from pre-1931 sources such as etiquette manuals and encyclopedias.
- Two checkpoints are publicly available under Apache 2.0: talkie-1930-13b-base for raw text completion and talkie-1930-13b-it for chat – though running them locally requires a CUDA GPU with at least 28GB of VRAM (see the loading sketch after this list).
- Bigger models are coming. The team is targeting a GPT-3-class vintage model by summer 2026, trained on a corpus they estimate could exceed a trillion tokens – enough to match the capability of the original ChatGPT, but with knowledge frozen in 1930.
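For readers who want to try the chat checkpoint, here is a minimal loading sketch. It assumes the weights use the Hub IDs above and ship with a standard transformers chat template; in bf16, a 13B model occupies roughly 26 GB, consistent with the stated 28GB VRAM requirement.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "talkie-1930-13b-it"  # assumed Hub ID for the chat checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"  # ~26 GB in bf16
)

messages = [{"role": "user", "content": "What is the swiftest way to cross the Atlantic?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
# Expect an answer about ocean liners, not jet aircraft.
```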
Check out the model weights, repo, and technical details.