AI’s punctuation is only human – I love an em dash too

The proliferation of the grammatical device is closely correlated with the rise of large language models. But don’t forget, they learn from us

John Naughton, Columnist

Do you have an em dash problem? No? Lucky you. I do and, until the other day, I didn’t even know it was a problem. What is an em dash, you ask? Sorry, I should have explained. It’s a punctuation mark that’s roughly the width of the letter “m” (hence the name) and I use it quite a lot because it’s very versatile. I can use it sometimes to avoid putting explanatory text in brackets, for example. On occasion, it can stand in for commas. And it’s less pompous than putting in a colon before listing a number of things that I want to include in a sentence.

So why is it a problem? Well, it turns out that large language models (LLMs) – the term often used at the moment as a proxy for AI – seem to be very fond of em dashes. And some eagle-eyed scrutineers have noticed that the proliferation of the mark seems to be closely correlated with the arrival and use of LLMs such as ChatGPT, Claude and Gemini.

One such linguistic detective, François Keck, used an analytical tool to pull 10,000 abstracts from online archives of English-language scientific journals published between 2021 (before the launch of ChatGPT) and 2025 and then calculated the frequency of various punctuation and special characters in those abstracts. He found that the relative frequency of em dashes more than doubled over the four-year period and that no other character came close to that magnitude of change.
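For the curious, the kind of count described above can be sketched in a few lines of Python. This is not Keck's actual tool or data: the function names and the three sample "abstracts" are invented for illustration, and the rate is expressed per 10,000 characters simply as one reasonable way of comparing years that contain different amounts of text.

```python
EM_DASH = "\u2014"  # the em dash character


def char_rate_per_10k(texts, char):
    """Occurrences of `char` per 10,000 characters across all `texts`."""
    total_chars = sum(len(t) for t in texts)
    if total_chars == 0:
        return 0.0
    hits = sum(t.count(char) for t in texts)
    return 10_000 * hits / total_chars


def rate_by_year(abstracts, char):
    """Group (year, text) pairs by year and return {year: rate}."""
    by_year = {}
    for year, text in abstracts:
        by_year.setdefault(year, []).append(text)
    return {
        year: char_rate_per_10k(texts, char)
        for year, texts in sorted(by_year.items())
    }


# Invented toy data, standing in for the real abstracts.
sample = [
    (2021, "We measured growth rates in diatoms."),
    (2021, "Results were consistent across sites."),
    (2025, "We measured growth rates\u2014across sites\u2014in diatoms."),
]

rates = rate_by_year(sample, EM_DASH)
for year, rate in rates.items():
    print(year, round(rate, 1))
```

Running the same function over every punctuation mark, rather than the em dash alone, would show whether any other character changed as much between the two years.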

Now, of course, correlation is not causation, but that hasn’t stopped a lot of folks – in education and elsewhere – from getting a firm hold on the wrong end of the stick. As the hysteria about “AI cheating” grew, a preponderance of em dashes in a text came to be regarded as an indicator that it had been generated by a machine.

Which is, to put it politely, baloney. As Nick Potkalitsky puts it in an insightful Substack analysis, Why AI Can’t Stop Using Em Dashes, the em dash phenomenon represents something far more complex: “a convergence of linguistic patterns, training methodologies, technical constraints, and stylistic inheritance that reveals how AI systems process and generate human language. Understanding why AI gravitates toward this particular punctuation mark offers a window into the deeper mechanics of how these systems work and what happens when human writing patterns meet algorithmic optimisation.”

Let’s unpack that. LLMs are essentially machines that construct sentences one “token” (word, or part of a word) at a time. Often those sentences are complex, with lots of subordinate clauses. The em dash is a convenient tool for linking such clauses. But unlike human writers, who also use em dashes a lot, the machines don’t have any stylistic overview: they don’t realise that you can sometimes have too much of a good thing.

And then there’s the fact that LLMs are trained on human-written text, where em dashes appear frequently in high-quality writing – literature, journalism and formal prose, for example. So maybe the models associate em dashes with authoritative, polished writing. And if there’s one thing that LLMs crave, it’s an authoritative tone – which is why they sound confident even when they’re hallucinating.

Like all machines, LLMs prioritise efficiency. Their currency is tokens – which is how the cost of using them is generally calculated. OpenAI, for example, charges by token usage. (I’ve just checked and that last sentence comes out at 36 tokens.) Accordingly, says Potkalitsky, LLMs “throw in em dashes and often an em dash lets the model condense information into fewer tokens. Instead of writing verbose connective phrases that might require multiple tokens, the model can use a single em dash token to seamlessly attach a clause.”

Before LLMs are released to the public, they are usually tuned by RLHF, or “reinforcement learning from human feedback”, in which human evaluators reward outputs that are clear, well structured and thorough.

This, Potkalitsky believes, inadvertently encourages em dash usage. “When a model wants to add clarification or an example to an ongoing sentence, a dash is often the clearest way to do so without creating run-on sentences or confusing comma usage.”

There’s a delicious irony emerging from this analysis: humans are criticising AI for imitating patterns they have “learned” from accomplished human authors – and from the humans who have been employed to vet their outputs before they are released.

And, in a sense, this is an old story. When the internet first arrived, it rapidly became clear that it held a mirror up to human nature, and some of the resulting images were not, er, flattering.

Now we have a new technology that likewise faithfully mirrors what humans do. We don’t like what we see, so we use the em dash as a way of judging it – or, at any rate, as a rationale for denying the authenticity of its outputs.

So – full disclosure – this column contains 11 em dashes, and yet it was definitely written by a human.

What I’m reading

Sells like teen spirit

The AI Bubble and the Extinction of the Mallrat is a perceptive essay by Casey Mock about the cultural significance of shopping centres for teenagers of earlier generations.

Rolling stock

An interesting blogpost by Dave Birch is Tulips, Steam and Decentralised Finance, arguing that “cryptocurrencies are more like railway shares in Victorian Britain than tulips in the Dutch golden age”.

Academic paper

Harvard vs Trump vs the Media: An Update is a thoughtful critique by James Fallows of the New York Times’s coverage of the conflict between Donald Trump and the oldest US university.
