Believe what repeats

Synthetic Users

Here is something you may have noticed in Synthetic Users. You show a panel three concepts, they rank them, and you get a clear winner. You rebuild the panel with the same parameters, ask the same questions, and the order has shifted. Nothing is broken — in fact, it is a sign of how faithfully the platform mirrors real human research. You have simply found the one place where a brilliant qualitative instrument and a quantitative one part ways, and knowing it makes you a sharper researcher overnight.

This is for anyone using Synthetic Users to count things — to rank concepts, score features, or read off percentages. Those exercises are genuinely useful, and the platform is wonderful at the thinking behind them. The only craft is to read the numbers the way the instrument produces them. Do that, and your quantitative reads become as dependable as the rich qualitative insight Synthetic Users already hands you.

§ 1 Why a rerun can reshuffle the order

Two separate engines drive the variation, and it helps to see them apart.

The first is small samples, and it would bite organic users too. A ranking is fragile when the panel is small. With twenty respondents, a fifty-fifty split carries a standard error of about eleven points, which means a ninety-five percent confidence interval of roughly plus or minus twenty-two points, some forty-four points from end to end. Two concepts that genuinely sit close together simply cannot be separated at this sample size, by synthetic respondents or by organic ones.
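The arithmetic behind those figures is short enough to check by hand. The sketch below uses the usual normal approximation for a sample proportion; the function name `proportion_margin` is illustrative, not part of any platform API.

```python
import math

def proportion_margin(p, n, z=1.96):
    """Return (standard error, margin of error) for a sample proportion.

    p: observed share (0.5 for a fifty-fifty split)
    n: number of respondents
    z: critical value; 1.96 corresponds to a 95% interval
    """
    se = math.sqrt(p * (1 - p) / n)
    return se, z * se

se, margin = proportion_margin(0.5, 20)
print(f"standard error: {se * 100:.1f} points")
print(f"95% interval: plus or minus {margin * 100:.1f} points")
```

Calling `proportion_margin(0.5, 200)` instead shows how slowly the margin shrinks: ten times the respondents narrows it by a factor of about three, not ten.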

Make it concrete. Suppose three concepts have true appeal scores of 7.2, 6.9 and 5.0 on a ten-point scale, and the ordinary person-to-person spread within a panel is a standard deviation of about 1.8 points. That spread means the average from any one twenty-person panel carries a standard error near 0.4 (1.8 divided by the square root of 20). The gap between the top two concepts is 0.3, smaller than the error bar on either one. The gap from the second concept down to the third is 1.9, and from the first it is 2.2, both far larger. Rerun fresh panels thousands of times and the top two trade places on roughly one run in three, while the third concept lands last every single time. The instrument is not vaguely noisy. It is precise about the distinction you didn't care about and noisy about exactly the close call you wanted resolved.

Figure 1. Three concepts, twenty respondents. The error bars on A and B overlap, so which one "wins" is close to a coin-flip across reruns. C sits clear of both, so it stays last every time. The gap you can trust is the one wider than the whiskers.
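The worked example can be reproduced with a short simulation. This is a minimal sketch, not the simulation behind the figures quoted in this piece: it assumes scores are normally distributed, that each concept's twenty ratings are independent draws, and that scores are not clipped to the ten-point scale. The names `TRUE_APPEAL`, `run_panel` and `simulate` are illustrative.

```python
import random
import statistics

# Values taken from the worked example in the text.
TRUE_APPEAL = {"A": 7.2, "B": 6.9, "C": 5.0}
WITHIN_PANEL_SD = 1.8
PANEL_SIZE = 20

def run_panel(rng):
    """Average score per concept from one fresh twenty-person panel."""
    return {
        name: statistics.fmean(
            rng.gauss(mu, WITHIN_PANEL_SD) for _ in range(PANEL_SIZE)
        )
        for name, mu in TRUE_APPEAL.items()
    }

def simulate(runs=10_000, seed=0):
    """Return (share of runs where B beats A, share of runs where C is last)."""
    rng = random.Random(seed)
    b_beats_a = 0
    c_last = 0
    for _ in range(runs):
        means = run_panel(rng)
        b_beats_a += means["B"] > means["A"]
        c_last += means["C"] < min(means["A"], means["B"])
    return b_beats_a / runs, c_last / runs

swap_rate, c_last_rate = simulate()
print(f"A and B swap places in {swap_rate:.0%} of runs")
print(f"C finishes last in {c_last_rate:.2%} of runs")
```

Under these assumptions the swap rate should land at around three runs in ten, with C last in all or very nearly all of them. Changing `PANEL_SIZE` is the quickest way to see how large a panel has to be before the A and B ordering settles down.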

The second engine is unique to synthetic respondents, and it is inseparable from the magic. Every Synthetic User is generated by a model drawing from a probability distribution — the very mechanism that lets the platform conjure a fresh, believable person on demand. So when you rebuild a panel "with the same parameters," you have not re-interviewed the same twenty people; you have drawn twenty new ones from the same cloud. That is the mental model to keep: same population, fresh sample, every single time. The same generative spark that makes each respondent feel real is what produces honest, lifelike variation.

And underneath both engines sits the happy truth about what Synthetic Users is for. Qualitative research is built for depth — the why behind a preference, the language people reach for, the objection you didn't anticipate, the need nobody had named. This is exactly where Synthetic Users shines, and its measure of confidence is saturation: hearing the same theme recur until new interviews stop surprising you. Asking it instead for "Concept A won, eleven votes to nine" is using a superb microscope as a kitchen scale. The microscope is extraordinary — it simply measures something richer than a single number.

In plain terms

Twenty respondents can't separate two close options — that's just small-sample statistics, and it would happen with organic users too. On top of that, each synthetic panel is a fresh random draw, not the same people re-asked. So a flipped ranking usually isn't a finding changing. It's the noise you'd expect, finally showing itself.

§ 2 Get clean, trustworthy signal

A handful of small habits turn Synthetic Users into a remarkably dependable quantitative read. The theme is simple: play to its strengths, and hold steady everything that doesn't need to move.

Trust gaps, not orders. A ranking is just a chain of comparisons, and most chains contain one or two links that are within the margin of error. Act only on gaps wider than that margin. If two concepts finish neck and neck, the honest finding is "indistinguishable," not "A beat B."
Hold everything constant that you can. Same prompt wording, same persona parameters, same model settings. Every degree of freedom you leave open adds variance that has nothing to do with the concepts you're testing.
Ask why, not which. "Which of these wins" is the fragile question. "What would make someone choose this, and what would stop them" is robust — and it is the thing you actually need in order to act. Let the instrument do what it is best at.
Widen what you test. Don't ask a twenty-person panel to separate a 7.2 from a 6.9. Test concepts that are genuinely different, or sharpen them until they are. Big, real differences survive small samples; small ones never do.
Triangulate. Use Synthetic Users to generate and prioritise hypotheses at a speed nothing else can match, then send the genuine close calls to a larger panel for the final count. It is the most powerful front end to quant you can have — it makes sure every dollar of fieldwork lands on a question worth answering.
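How big the "larger panel" needs to be can be estimated roughly. The sketch below asks for the panel size at which the 95% margin of error on the difference between two averages drops below the gap you want to resolve. It assumes a separate, independent panel of the same size for each concept and a known person-to-person standard deviation; `panel_size_to_resolve` is an illustrative name, and a formal power analysis would usually ask for more respondents than this.

```python
import math

def panel_size_to_resolve(gap, sd, z=1.96):
    """Smallest panel size per concept at which the margin of error on the
    difference between two panel averages falls below `gap`.

    Margin of error on the difference: z * sd * sqrt(2 / n).
    Solving z * sd * sqrt(2 / n) < gap for n gives the expression below.
    """
    return math.ceil((z * math.sqrt(2) * sd / gap) ** 2)

# Gaps and spread from the worked example (7.2 vs 6.9 vs 5.0, SD 1.8).
print(panel_size_to_resolve(0.3, 1.8))  # close call between the top two
print(panel_size_to_resolve(1.9, 1.8))  # second concept versus the third
```

With the example's numbers, the close call needs a few hundred respondents per concept while the big gap is resolved by a handful, which is the whole case for testing concepts that are genuinely different.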

In plain terms

Most of the fix is restraint. Believe the big gaps, hold your setup steady, and lean on the tool for the reasons behind a choice rather than the photo finish at the top of the ranking.

§ 3 The Replication Test

Here is the pro move that folds all of this into one simple, repeatable habit — and makes the findings you pull from Synthetic Users genuinely hard to argue with. Before you trust any quantitative-looking result, run the study at least twice — three times is better — each with a fresh panel. Then read the runs together, like this.

What survives every run is signal. If Concept C is last in all three runs, C is genuinely behind. You can take that to a meeting.
What moves between runs is noise. If A and B trade places, they are tied. Report them as tied — not as a winner with an asterisk.
The spread between runs is your working confidence interval. You don't have to calculate anything. The variation you can see with your own eyes across three runs is an honest, if rough, margin of error on the study. If a "winner" sits inside that spread, you don't have a winner yet.

Figure 2. The Replication Test in one picture. Across three fresh panels, A and B keep swapping inside the same band — that band is your margin of error, and the verdict is "tie." C sits below it in every run, so being last is a real result, not an artefact.
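The three reading rules above can be sketched as a small function. It takes the average scores from each run and labels every pair of concepts as either consistently ordered or tied. The function name and the scores in `runs` are hypothetical, made up purely to illustrate the pattern in Figure 2.

```python
from itertools import combinations

def replication_test(runs):
    """Compare every pair of concepts across runs.

    runs: list of dicts mapping concept name to its average score in one run.
    A pair counts as signal only if its order is the same in every run;
    any flip (or exact tie) between runs makes the pair "tied".
    """
    verdicts = {}
    for a, b in combinations(sorted(runs[0]), 2):
        # +1 if a is ahead, -1 if b is ahead, 0 if level, for each run
        signs = {(run[a] > run[b]) - (run[a] < run[b]) for run in runs}
        if signs == {1}:
            verdicts[(a, b)] = f"{a} ahead of {b} in every run"
        elif signs == {-1}:
            verdicts[(a, b)] = f"{b} ahead of {a} in every run"
        else:
            verdicts[(a, b)] = "tied"
    return verdicts

# Hypothetical averages from three fresh panels.
runs = [
    {"A": 7.4, "B": 6.8, "C": 5.1},
    {"A": 6.7, "B": 7.1, "C": 4.8},
    {"A": 7.0, "B": 6.9, "C": 5.3},
]
verdicts = replication_test(runs)
for pair, verdict in verdicts.items():
    print(pair, "->", verdict)
```

One caution on reading the output: with only three runs, two concepts that are truly level will still happen to keep the same order in all three about one time in four, so a consistent but narrow gap is a reason to add runs rather than to declare a winner.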

This is the quiet brilliance of working this way. The variation between two supposedly identical studies is not a defect to hide — it is the measurement. A single run conceals its own uncertainty behind a confident-looking ranking. Three runs show you precisely how much of that ranking you are allowed to believe, and hand you findings you can defend in any room.

And make no mistake about what you are holding. Synthetic Users will tell you, in an afternoon and for a rounding error of the usual cost, which ideas are alive and which are dead, what people will feel about them, and why — an enormous head start before a cent is spent on fieldwork. Read it the way it wants to be read and it is one of the most powerful instruments in modern research. Just let a bigger room certify the photo finish.

Believe what repeats. Send the close calls to a bigger room.

Notes

The eleven-point figure is the standard error of a 50% proportion at n = 20, √(p(1−p)/n) ≈ 0.112; the 95% confidence interval is about ±22 points (1.96 standard errors), or roughly 44 points from end to end. This is basic sampling theory: it is a property of the sample size, not of synthetic respondents.
The worked example (true appeal 7.2 / 6.9 / 5.0, within-panel SD ≈ 1.8) was simulated over 10,000 fresh twenty-person panels: the top two swapped on roughly one run in three, while the third concept finished last in 100% of runs.
On saturation as the confidence measure in qualitative work, see Guest, Bunce & Johnson, How Many Interviews Are Enough? (Field Methods, 2006). Synthetic respondents add generation-temperature variance on top of ordinary sampling error.