Using Brain Scans to increase parity

39% average word error rate, across subjects

28% of sentences word-perfect, best subject

~90h of brain data, and still no ceiling

Zero incisions: a scanner, not surgery

§ 0

The direction problem

Every frontier lab is pointed at the same horizon. The word changes — superintelligence, AGI, capability — but the vector is identical: build a mind that is more than a human one. Faster, broader, less wrong. That is a defensible goal for a lab. It is the wrong goal for us.

A synthetic user is not supposed to be brilliant. It is supposed to be a person — specific, distractible, a little impatient, confused by your onboarding flow, anchored to a price it saw three years ago. The thing we are trying to copy is not intelligence. It is behaviour, and most human behaviour is not very intelligent at all. It is habitual, emotional, context-bound, and gloriously inconsistent.

This puts us at right angles to the labs we depend on. The smarter a base model gets, the further its default voice drifts from the ordinary person we need it to be. So we have spent the company’s life walking the other way: not toward a cleverer model, but toward the source of the behaviour we are trying to reproduce. In “So where does the data come from?” we laid that source out as a stack: census microdata, the big social surveys, decades of panel studies, and the world model a frontier LLM has already built from internet-scale text. And we said we were adding a new layer: brain scans. This post is about a paper that made the case for it better than we could have.

For the generalist: Labs build models to be smarter than people; we build them to behave like people. The way to stay human is to anchor to where behaviour comes from: the brain.

§ 1

Two examples, before this gets abstract

This is easier to see than to argue. So before the neuroscience, two cases any researcher will recognise.

B2C · Consumer

The £2 price rise

A streaming service raises its price, and you want to know who churns.

The smart synthetic answer: “I’d assess whether the library still justifies the cost versus competitors, and cancel if it doesn’t.”

Sensible — and wrong about almost everyone. The real subscriber grumbles, means to cancel, and doesn’t: mid-season on one show, can’t face redoing the watchlist. What you need to predict is the gap between what people say and what they do.

B2B · Business

The vendor renewal

An IT director chooses between the incumbent and a cheaper, better-reviewed challenger.

The smart synthetic answer: “I’d score both on cost, security and integration, and pick the higher score.”

The real director renews the incumbent — switching is career risk, the challenger left one box blank on the security questionnaire, and nobody gets fired for renewing. What you’re modelling is the human reason the optimal choice doesn’t get made.

In both, the failure is identical: a model pointed at intelligence answers like a brilliant consultant, when you needed it to answer like a tired person on a sofa or a nervous manager covering himself. Closing that gap means anchoring the simulation to where the behaviour actually comes from. Which is the long way of explaining why we keep walking toward the brain — and why the result below matters.

§ 2

What Meta just did

On 25 June, a team from Meta AI, the École Normale Supérieure, the Basque Center on Cognition, Brain and Language, and Inria released Brain2Qwerty v2 — a model that reads natural sentences directly out of a non-invasive brain scan.

The setup is almost mundane. Nine healthy volunteers sat under a magnetoencephalography (MEG) scanner, a machine that picks up the faint magnetic fields thrown off by firing neurons, and typed sentences they had just heard. No surgery, no electrodes in the skull. Each was recorded for about ten hours, and together they typed some twenty-two thousand sentences. The model’s only input is the raw brain signal; its output is the sentence.

The headline number is a 39% word error rate on average: roughly two words in five come out wrong. That sounds modest until you see the distribution. For the best participant, the model decoded half of all sentences with at most one word error, and more than a quarter of them perfectly, word for word. For two decades, non-invasive brain reading was stuck on toy tasks because the signal was thought too noisy to carry language. Brain2Qwerty v2 is the first result to pull fluent, natural sentences out of a non-invasive recording.
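For readers who have not met the metric: word error rate is the word-level edit distance between the decoded sentence and the reference, divided by the number of reference words. The sketch below is the standard textbook computation, not the paper’s evaluation code; the function name and the example sentences are ours, made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: two of five words decoded wrongly.
reference = "the cat sat down quietly"
decoded = "the bat sat town quietly"
print(word_error_rate(reference, decoded))  # 0.4
```

A score of 0.4 on this toy pair is the same “two words in five” picture as the 39% average.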

Word error rate by decoder — lower is better

Adding a fine-tuned language model that reads the brain signal beats both the raw character decoder and a classical language-model correction. Source: Zhang, Lévy et al., 2026, Fig. 3B.

For the generalist: A scanner you sit under, not an implant, just read full sentences out of brain activity. Many of them came out word-perfect, with no surgery required.

§ 3

How it reads, and the proof it isn’t bluffing

The architecture is the part UX researchers should care about, because it is the same shape we use to instantiate a respondent: characters, then words, then meaning, stacked.

The pipeline, in three levels

Signal (brain activity) → Character (keystroke guess) → Word (brain-to-word alignment) → Sentence (fine-tuned LLM writes it)

One module reads keystroke-level signal, a second ties chunks of brain activity to words, and a fine-tuned LLM composes the sentence from both.

That last block raises the obvious suspicion. Put a language model on the end of anything and it produces fluent English — so is it reading the brain, or just autocompleting over a noisy guess? The authors ran the clean experiment: they cut the brain-derived signal out of the language model’s input and let it work from the rough text alone. Accuracy dropped. The model is genuinely leaning on the neural signal to choose its words — it is a language model that has learned to read.

Switch the brain off, and error rises (WER)

Removing the brain-derived “neuro tokens” and conditioning only on the rough text raises the word error rate by about 6 points (0.43 → 0.49, both on a Qwen3-0.6B backbone) — evidence the model decodes from the signal, not from language priors alone. Source: Fig. S4 / §2.2.
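The size of that gap is easier to judge in both absolute and relative terms. Here is a small worked calculation using only the two figures quoted in the caption (0.43 with neuro tokens, 0.49 without); the variable names are ours.

```python
wer_with_brain = 0.43     # fine-tuned LLM sees neuro tokens plus rough text
wer_without_brain = 0.49  # same backbone, rough text only

absolute_points = (wer_without_brain - wer_with_brain) * 100
relative_increase = (wer_without_brain - wer_with_brain) / wer_with_brain

print(f"absolute: +{absolute_points:.0f} points")    # absolute: +6 points
print(f"relative: +{relative_increase:.0%} errors")  # relative: +14% errors
```

Put differently, taking the brain signal away adds roughly one extra error for every seven the full model already makes.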

A third detail is close to home. To tune the pipeline, the team turned loose autonomous coding agents — Cursor running on Claude — to improve the model by rewriting its own code. Inside a tight search space they beat classical optimisation and found tricks the humans kept. Handed an open-ended brief, they fell apart. A force multiplier, not a replacement; the humans stayed in the loop.

For the generalist: Switch the brain signal off and accuracy drops, so the model is genuinely reading the brain, not just autocompleting plausible text.

§ 4

Why this is a source, not a stunt

One result, however good, is a demo. What makes Brain2Qwerty v2 a source is its shape. First, it scales. Retrained on growing amounts of brain data, the model’s error falls log-linearly with recording hours, with no sign of flattening at the roughly ninety hours the team could afford to record. Pour in more brain, get more signal, predictably.

More brain data, lower error — and no plateau (CER)

Character error rate against total pooled recording hours (log scale). The decline is near-perfectly log-linear — about −0.39 CER per decade — and shows no sign of saturating at ~90 hours (EnglishBCBL: ≈0.52 at 20h → 0.25 at full data). Source: Fig. 1F, §2.1.
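“Log-linear” here means the error falls by a fixed amount each time the recording hours are multiplied by ten. As a rough sanity check, and not a reproduction of the paper’s fit, the sketch below takes the two EnglishBCBL points quoted in the caption and computes the slope per decade. It assumes “full data” means roughly 90 hours, which the caption does not state explicitly; the function name is ours.

```python
import math


def slope_per_decade(hours_a: float, cer_a: float,
                     hours_b: float, cer_b: float) -> float:
    """Change in character error rate per tenfold increase in hours."""
    return (cer_b - cer_a) / (math.log10(hours_b) - math.log10(hours_a))


# Caption values; the 90 h endpoint is our assumption for "full data".
slope = slope_per_decade(20, 0.52, 90, 0.25)
print(f"{slope:.2f} CER per decade")  # -0.41 CER per decade
```

That lands close to the quoted figure of about −0.39. A two-point estimate from rounded values should not be expected to match a full fit exactly.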

Second — and this maps directly onto how we think — diversity matters as much as volume. Holding the total amount of data fixed but widening the variety of sentences made the model dramatically more accurate. Quantity and variety are separate axes of quality. Anyone who has watched a synthetic panel collapse toward a single bland voice knows why that matters: variety is not a nice-to-have, it is the signal. A source behaves like this — it rewards more data, it rewards more varied data, and it does not saturate. That is the property we are buying when we reach for the brain instead of for a bigger prompt.

For the generalist: More brain data keeps improving accuracy with no ceiling in sight, and variety matters as much as volume. That is how a real data source behaves, not a one-off trick.

§ 5

Where it still breaks

We make a habit of saying where a result breaks, and it would be dishonest to drop it here. The variation between people is large: the model that is near-perfect for the best subject is shaky for the worst. These were healthy volunteers typing, not patients trying to communicate. The signal read is, to a real degree, motor: the brain driving the hands, cleaned up by a model that already knows English. This is not telepathy, and it is not reading a silent thought off a resting mind. And the hardware is a three-hundred-sensor machine cooled with liquid helium in a shielded room. Nobody is wearing that to a usability test.

It degrades gently as you remove sensors (WER)

A ~150-sensor array — the size of an emerging room-temperature wearable helmet — loses only a few points of word accuracy versus the full lab rig. Source: Table 2.
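A sensor-ablation test of this kind can be set up by discarding channels before the data reaches the decoder and then re-scoring. The sketch below shows only the subsetting step, on toy data. Random selection is our assumption, since we have not reproduced how the paper chose which sensors to drop, and `keep_sensor_subset` is a hypothetical helper, not part of any MEG library.

```python
import random


def keep_sensor_subset(recording, n_keep, seed=0):
    """Return (kept sensor indices, recording restricted to those sensors).

    recording: list of per-sensor time series, one inner list per sensor.
    """
    if not 0 < n_keep <= len(recording):
        raise ValueError("n_keep must be between 1 and the sensor count")
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    kept = sorted(rng.sample(range(len(recording)), n_keep))
    return kept, [recording[i] for i in kept]


# Toy stand-in for a 300-sensor recording with 5 time samples per sensor.
full = [[float(s)] * 5 for s in range(300)]
kept_ids, half = keep_sensor_subset(full, n_keep=150)
print(len(half), len(half[0]))  # 150 5
```

Repeating the draw with several seeds is one way to check that a result does not hinge on a few special channels.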

Two facts point forward. The model stays robust when you throw away half the sensors — the information is not hiding in a few special channels — and wearable, room-temperature MEG sensors are arriving. The lab proof and the deployable device are converging. But the honest line is the same one we drew before: the gap to surgical implants, which type at under 2% error, is still real, and the bridge gets crossed at the speed of the weaker side.

For the generalist: Lab-only, healthy people, a big machine, mostly motor signal. A proof of concept, but the kind that scales and shrinks, not the kind that stays a party trick.

§ 6

What this means if you run user research

Here is the part to take back to your team, and it is not “Synthetic Users is going to scan your customers.” The promise is quieter and more structural. When you ask a synthetic panel what your users would do, the answer has to come from somewhere. Today it comes from text — surveys, transcripts, a language model’s compressed account of millions of humans it read about. That is a strong foundation and a leaky one: the model’s instinct is to be smarter, calmer, and more coherent than the person you are modelling. The drift toward the bland average is the central engineering problem of this entire field.

You fight that drift at the source. Every layer anchored to measured human signal — rather than a model’s best guess at being human — pulls the simulation back toward the messy, specific person you need. Brain2Qwerty v2 is one more brick in that wall: it shows, with numbers, that brain activity is rich enough to read, scales like a real data source, and rewards variety the way good data should. It tells us the layer we bet on is load-bearing.

We do not need this exact model in your next study. We need the world it points to — one where “what a human would do” stops being purely inferred from text about humans and starts being read, in part, from humans themselves. The labs are building a mind that leaves the human behind. We are using the same tools to build one that stays. The whole difference is which way you point the data. And the data, more and more, runs through the brain.

For the generalist: You won’t be scanning your users. But the model standing in for them gets anchored to measured human signal instead of a model’s guess at being human, which is how you stop synthetic respondents drifting into a bland, over-smart average.

The fMRI-versus-LLM analogy was a metaphor in 2024. In 2026 it is a measurement. The same shift is now happening to the oldest assumption in our field — that you can only learn what a person will do by asking them. You can also, it turns out, read it off the source.

Cited

Zhang, Lévy, Rommel, Rapin, Bel, Bonnaire, Nieto, Bourdillon, Pinet, d’Ascoli, Moreau, King. Accurate Decoding of Natural Sentences from Non-Invasive Brain Recordings. Meta AI, ENS-PSL, BCBL, Inria. June 2026. (Brain2Qwerty v2.)
Lévy, Zhang, Pinet, Rapin, Banville, d’Ascoli, King. Brain-to-text decoding: a non-invasive approach via typing. Nature Neuroscience, 2025. (Brain2Qwerty v1.)
d’Ascoli, Bel, Rapin, Banville, Benchetrit, Pallier, King. Towards decoding individual words from non-invasive brain recordings. Nature Communications, 2025.
Défossez, Caucheteux, Rapin, Kabeli, King. Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 2023.
Jude, Levi-Aharoni, et al. Restoring rapid natural bimanual typing with a neuroprosthesis after paralysis. Nature Neuroscience, 2026. (Invasive benchmark, <2% WER.)
Boto, Holmes, et al. Moving magnetoencephalography towards real-world applications with a wearable system. Nature, 2018. (Optically-pumped magnetometers.)