Teaching Synthetic Users What Real People Actually Think

Synthetic Users without calibration are individually believable but collectively wrong. The missing piece is calibration, not better models.

How we're grounding synthetic research in real-world sentiment.

When you ask an LLM to generate 20 synthetic users from Lyon, France, you don't get 20 Lyonnais. You get 20 reflections of the internet's aggregate opinion circa 2024: WEIRD-skewed, mode-collapsed, and frozen in time. If 55% of actual French citizens support a Gaza ceasefire, your synthetic population has no mechanism to know that, let alone reflect it. The personas feel plausible. The distribution is fiction unless you put in the work.

This is the core problem with synthetic user research as it's commonly practiced: the individuals are often believable, but the population is uncalibrated. And as three recent papers demonstrate, from very different angles, who you simulate determines what you find.

At Synthetic Users, we believe the solution isn't a better model. It's a better calibration stack.


The evidence

Park et al. (2023) built generative agents with memory, reflection, and planning. Twenty-five characters in a Sims-like town who autonomously spread party invitations, formed relationships, and coordinated activities. The architecture produces individually believable behavior. But there's no mechanism ensuring the 25 agents represent anyone in particular. They're fictional characters whose attitudes emerged from hand-authored seed descriptions, not from real population data.

Paglieri et al. (2026) attacked the opposite problem: population-level diversity. They showed that naive LLM prompting produces stereotypical clustering. Everyone sounds the same. Their evolved Persona Generators achieved 80%+ coverage across attitudinal dimensions, which is a significant technical achievement. But they explicitly optimize for support coverage (spanning what's possible), not density matching (reproducing what's probable). A population where half the users hold rare fringe views might score well on their metrics, but it would be useless for market research or sentiment modeling.

Kirk et al. (2024) provided the empirical proof that sample composition matters. PRISM's 1,500 participants from 75 countries, rating 21 LLMs across 8,011 conversations, revealed that model rankings shift by geography, demographic group, and topic. Sampling exclusively from one demographic reduced welfare for out-groups. Even the best-performing model only achieved roughly 45% preference. No single LLM satisfies everyone, and who you ask determines what you conclude.

The gap across all three: nobody built the bridge between real-world sentiment data and synthetic population generation.

That bridge is the calibration stack. And it's what we're building at Synthetic Users.


The architecture

The calibration stack has three layers, each solving a distinct problem: Measure, Calibrate, Generate. For each layer, we walk through the engineering choices that make it work.


Layer 1: Measure

The measurement layer ingests real-world attitudinal data and collapses it into a structured, multi-dimensional sentiment profile for a given region and topic. The key insight is that no single data source is sufficient. You need to triangulate across sources that operate at different temporal frequencies and capture different facets of sentiment.



What the measurement layer produces: For a given region (say, France) and topic (say, Gaza ceasefire), the output is a multi-dimensional sentiment vector. Not a single number, but a structured profile across K attitudinal axes. For European sentiment toward Gaza, those axes might include stance on ceasefire, stance on arms exports, attitude toward humanitarian aid, trust in media narratives, and degree of engagement with the issue.
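To make the output concrete, here is a minimal sketch of what that sentiment profile could look like as a data structure. The axis names and values are illustrative assumptions, not real measurements:

```python
from dataclasses import dataclass, field

@dataclass
class SentimentProfile:
    """Layer 1 output for one (region, topic) pair. A sketch:
    axis names and values below are illustrative, not measured."""
    region: str
    topic: str
    axes: dict[str, float] = field(default_factory=dict)  # stance in [-1, 1]

profile = SentimentProfile(
    region="FR",
    topic="gaza_ceasefire",
    axes={
        "ceasefire_support":  0.55,   # hypothetical values throughout
        "arms_export_halt":   0.30,
        "humanitarian_aid":   0.70,
        "media_trust":       -0.10,
        "issue_engagement":   0.25,
    },
)
```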

The critical design choice is temporal weighting. Fast-moving signals (prediction market price shifts, protest frequency spikes, social media volume) detect sentiment changes days or weeks before slow-moving signals (biannual Eurobarometer surveys, UN voting patterns) catch up. The system should use fast signals as leading indicators and slow signals as anchors. Trust the direction of movement from real-time data but calibrate the magnitude against more methodologically rigorous polling.
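One way to implement that direction-from-fast, magnitude-from-slow rule is to let the fast signal shift the slow anchor by a bounded amount. A minimal sketch, where the gain and cap parameters are our assumptions, not published constants:

```python
import numpy as np

def blended_estimate(slow_anchor: float,
                     fast_series: np.ndarray,
                     drift_gain: float = 0.5,
                     max_drift: float = 0.15) -> float:
    """Anchor the level on slow, rigorous polling; let fast signals
    (markets, protest counts, search volume) move the estimate
    between polling waves. fast_series holds recent normalized
    values of one fast signal, oldest first."""
    drift = fast_series[-1] - fast_series[0]
    # Trust the direction of movement, but cap the magnitude: fast
    # signals are noisy proxies, not calibrated opinion measurements.
    adjustment = np.clip(drift_gain * drift, -max_drift, max_drift)
    return float(np.clip(slow_anchor + adjustment, 0.0, 1.0))

# Example: a survey anchor of 55% support, prediction markets drifting up.
print(blended_estimate(0.55, np.array([0.48, 0.52, 0.60])))  # ~0.61
```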

The data sources, by frequency:

Real-time (daily). Prediction markets like Polymarket and Kalshi provide price histories via public APIs. Google Trends captures search volume by country and language. Social media sentiment analysis, ideally in local languages (French Twitter, German TikTok), tracks volume and framing shifts.

Weekly. ACLED protest data provides geolocated, categorized records of demonstrations across Europe. GDELT's Global Knowledge Graph tracks media tone across thousands of outlets in near-real time.

Monthly. National polling firms fill the gaps between major surveys. YouGov (UK and expanding), IFOP and Elabe (France), Forsa (Germany), SWG (Italy). Parliamentary motion data and voting records capture elite positioning that both reflects and shapes public opinion.

Biannual. Eurobarometer provides cross-country comparisons with consistent methodology. The European Council on Foreign Relations (ECFR) runs pan-European foreign policy polling specifically designed to capture attitudes toward issues like Gaza.

The engineering challenge is normalization: converting heterogeneous data sources into a common attitudinal space. Each source measures something slightly different (search interest is not the same as stated opinion is not the same as revealed behavior), and the system needs explicit weights for how much to trust each signal type.
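In code, the simplest version of this is a weighted fusion over sources that have already been rescaled to the common stance scale. The reliability weights below are illustrative assumptions about how much each signal type should count as evidence of stated opinion:

```python
# Hypothetical per-source reliability weights (assumptions, not
# measured values): how much each signal type is trusted as
# evidence of stated opinion on the common -1..+1 stance scale.
SOURCE_WEIGHTS = {
    "eurobarometer":     1.0,   # rigorous and representative, but slow
    "national_poll":     0.8,
    "prediction_market": 0.5,   # revealed behavior, thin markets
    "gdelt_tone":        0.3,   # media framing, not public opinion
    "search_trends":     0.2,   # interest, not direction of opinion
}

def fuse(readings: dict[str, float]) -> float:
    """Weighted average of source readings, each already rescaled
    upstream to the common -1..+1 attitudinal space."""
    num = sum(SOURCE_WEIGHTS[s] * v for s, v in readings.items())
    den = sum(SOURCE_WEIGHTS[s] for s in readings)
    return num / den

print(fuse({"eurobarometer": 0.4,
            "prediction_market": 0.6,
            "search_trends": 0.1}))  # ~0.42
```

Making the weights explicit and inspectable, rather than burying them in a model, is the point: when the fused estimate looks wrong, you can see exactly which source pulled it there.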


Layer 2: Calibrate

This is the hardest engineering problem in the stack. You have a continuous, multi-dimensional distribution from Layer 1, and you need to map it onto N discrete synthetic personas (say, 20) while preserving three properties that naive sampling destroys.


Challenge 1: Match the marginals. If 55% of the population supports a ceasefire, 11 of your 20 users should reflect that. If 70% are under 45, 14 of your 20 should be. This is the simplest requirement, and iterative proportional fitting (also called raking) handles it well. You start with an initial sample, then iteratively adjust weights until the marginal distribution on each axis matches the target.
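A minimal sketch of raking over boolean group memberships, assuming target shares strictly between 0 and 1:

```python
import numpy as np

def rake(weights: np.ndarray,
         memberships: list[np.ndarray],
         targets: list[float],
         iters: int = 50) -> np.ndarray:
    """Iterative proportional fitting: rescale sample weights until
    each marginal matches its target share.

    memberships[k]: boolean array, True where unit i is in group k.
    targets[k]: desired weighted share of group k, in (0, 1).
    """
    w = weights.astype(float).copy()
    for _ in range(iters):
        for member, target in zip(memberships, targets):
            share = w[member].sum() / w.sum()
            w[member] *= target / share            # scale group in
            w[~member] *= (1 - target) / (1 - share)  # scale rest out
    return w / w.sum()

# Example: 8 of 20 initial personas support a ceasefire; target is 55%.
supports = np.array([True] * 8 + [False] * 12)
w = rake(np.ones(20), [supports], [0.55])
print(w[supports].sum())  # ~0.55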

Challenge 2: Preserve the correlations. This is where most synthetic population systems fail. Views on ceasefire, arms exports, humanitarian aid, and trust in institutions are correlated in complex ways that vary by country, age, and political affiliation. Sampling each axis independently produces incoherent people. Someone who strongly opposes a ceasefire but also strongly supports cutting arms to Israel doesn't exist in meaningful numbers.

The technical approach: build demographic archetypes from polling crosstabs. Most serious polls publish cross-tabulations showing how attitudes break down by age, gender, education, and political leaning. These crosstabs give you the conditional distribution of attitudes given demographics. Use copula-based sampling to generate attitude profiles that preserve the joint structure from these crosstabs while allowing natural variation within each demographic bucket.
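A Gaussian copula is the standard construction here: draw correlated normals, map them to correlated uniforms, then push each axis through its own marginal distribution recovered from the crosstabs. A sketch, with a hypothetical correlation and two binary marginals standing in for real crosstab data:

```python
import numpy as np
from scipy.stats import norm

def sample_attitudes(n: int, corr: np.ndarray,
                     marginal_quantile_fns: list) -> np.ndarray:
    """Gaussian-copula sampling: correlated uniforms pushed through
    per-axis marginals (the marginals come from polling crosstabs).

    corr: K x K attitude correlation matrix.
    marginal_quantile_fns: per-axis inverse CDFs, [0,1] -> stance.
    """
    z = np.random.multivariate_normal(np.zeros(len(corr)), corr, size=n)
    u = norm.cdf(z)  # correlated uniforms in [0, 1]
    cols = [f(u[:, k]) for k, f in enumerate(marginal_quantile_fns)]
    return np.column_stack(cols)

# Hypothetical: ceasefire support and arms-embargo support, corr 0.7,
# with 55% and 40% support respectively.
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])
inv_cdfs = [lambda u: np.where(u < 0.45, -1.0, 1.0),
            lambda u: np.where(u < 0.60, -1.0, 1.0)]
samples = sample_attitudes(1000, corr, inv_cdfs)
```

The copula preserves the joint structure, so the "opposes ceasefire but supports an arms embargo" combination stays as rare in the sample as the correlation says it should be.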

Challenge 3: Encode salience and intensity. A person who "supports ceasefire" at 6/10 intensity behaves very differently from someone at 10/10. The 6 might not bring it up unprompted. The 10 might define their identity around it. And for most Europeans, Gaza isn't a top-five issue. Your synthetic population should reflect that.

This means your 20 personas might break down as: 7 who barely think about Gaza, 5 with mild opinions they'd share if asked, and 8 with strong views they'd volunteer. The "barely think about it" group is systematically underrepresented in most synthetic populations, but they're often the majority in real life. Salience data comes from Google Trends (search volume as a proxy for engagement), issue-importance questions in polls, and media consumption surveys.
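Turning salience shares into integer persona counts is a small apportionment problem; largest-remainder allocation is one simple way to do it. The shares below are illustrative, not measured:

```python
def allocate(n: int, shares: dict[str, float]) -> dict[str, int]:
    """Largest-remainder apportionment of n personas across salience
    buckets, so integer counts track the target shares."""
    exact = {k: n * s for k, s in shares.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = n - sum(counts.values())
    # Hand remaining slots to the largest fractional remainders.
    for k in sorted(exact, key=lambda k: exact[k] - counts[k],
                    reverse=True)[:leftover]:
        counts[k] += 1
    return counts

print(allocate(20, {"low_salience": 0.35, "mild": 0.25, "strong": 0.40}))
# {'low_salience': 7, 'mild': 5, 'strong': 8}
```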

The output of Layer 2 is a set of N structured persona specifications. Each contains a demographic profile, a position on each attitude axis (with intensity), a salience level, and a set of attitude correlations that are internally consistent.
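As a data structure, one Layer 2 output row could look like the sketch below. The field names and values are illustrative, not our production schema:

```python
from dataclasses import dataclass

@dataclass
class PersonaSpec:
    """One calibrated persona specification from Layer 2 (a sketch)."""
    demographics: dict    # e.g. {"age": 34, "region": "Lyon", ...}
    attitudes: dict       # axis -> stance in [-1, 1]
    intensities: dict     # axis -> intensity in [0, 10]
    salience: str         # "low" | "mild" | "strong"

spec = PersonaSpec(
    demographics={"age": 34, "gender": "F", "region": "Lyon",
                  "occupation": "nurse"},
    attitudes={"ceasefire_support": 0.6, "arms_export_halt": 0.4},
    intensities={"ceasefire_support": 6, "arms_export_halt": 4},
    salience="mild",
)
```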


Layer 3: Generate

The generation layer takes calibrated persona specs and produces rich, believable synthetic users via an LLM. This is where the architectural insights from Park et al. and Paglieri et al. become directly useful, but in service of a different objective.


What the generation prompt should not do is simply label-stamp: "You are a 34-year-old French woman who supports a ceasefire." That produces a stereotype. What it should do is construct a person who holds that view for coherent, specific reasons.

Stage A: Identity construction. Give the persona a reason for their calibrated position that's consistent with their demographic profile. A 34-year-old French nurse who supports a ceasefire because she's seen medical colleagues volunteer in conflict zones holds the same position as a 34-year-old French teacher who supports it because her students include Palestinian families. Same attitude, different expression, different engagement pattern, different responses to counterarguments.

Critically, inject at least one idiosyncratic detail that doesn't perfectly track the archetype. Real people are inconsistent. A devout Catholic who supports abortion access. A climate activist who works in oil and gas. A right-wing voter who supports universal healthcare. These contradictions are the difference between a stereotype and a character.

Paglieri et al. found that formative-memory approaches (backstory-driven personas) were consistently outperformed by action-oriented approaches: personas defined by how they decide and act rather than what happened to them as a child. We've incorporated this into our generation pipeline.

Stage B: Behavioral expansion. Following the "logic of appropriateness" framework, expand the persona by answering three questions: What kind of situation is this? What kind of person am I? What does a person like me do in a situation like this? This gives the LLM a decision-making framework rather than a static description, which produces more diverse downstream behavior.

The best-performing generators in Paglieri et al.'s experiments produced first-person reasoning personas or rule-based behavioral descriptions, not third-person biographical sketches. We've adopted the same approach.
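Putting Stage A and Stage B together, a generation prompt could look like the sketch below (reusing the hypothetical PersonaSpec from Layer 2). The wording is illustrative, not our production prompt:

```python
def build_generation_prompt(spec) -> str:
    """Two-stage prompt sketch: Stage A asks the model to invent a
    coherent, specific reason (plus one off-archetype detail) for the
    calibrated stance; Stage B frames behavior via the logic of
    appropriateness rather than a biography."""
    return f"""
Construct a first-person persona, not a biographical sketch.

Stage A (identity): You are a {spec.demographics['age']}-year-old
{spec.demographics['occupation']} in {spec.demographics['region']}.
Your stance on a ceasefire is {spec.attitudes['ceasefire_support']:+.1f}
(intensity {spec.intensities['ceasefire_support']}/10). Invent a
specific, demographically plausible reason you hold this view, and one
personal detail that does NOT fit the obvious archetype.

Stage B (behavior): Answer in the first person:
1. What kinds of situations make this issue feel relevant to you?
2. What kind of person are you when it comes up?
3. What does a person like you do in a situation like that?
Salience: this issue is of {spec.salience} importance to you; behave
accordingly (do not volunteer it unprompted if salience is low).
""".strip()
```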


The refresh cycle

Static calibration degrades. The KellyBench study, where every frontier AI model lost money betting on Premier League matches over a full season, illustrates the failure mode: models given rich historical data but no mechanism for updating against a changing reality will systematically underperform. Your calibration stack needs a refresh trigger.

We recommend a dual-trigger architecture. Time-based refresh recalibrates weekly using the fastest signals in Layer 1. Event-based refresh recalibrates immediately when a fast signal crosses a threshold: a prediction market price shifting more than 10 points in 48 hours, or ACLED detecting a protest volume spike exceeding 2 standard deviations from the regional baseline. Event-based triggers catch sentiment shifts the moment they happen. Time-based triggers catch the gradual drift that no single event causes.
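The trigger logic itself is simple; a minimal sketch using the thresholds named above (the function and parameter names are illustrative):

```python
from datetime import datetime, timedelta

def needs_refresh(last_refresh: datetime,
                  market_move_48h: float,
                  protest_z_score: float,
                  now: datetime | None = None) -> bool:
    """Dual-trigger refresh check. Time trigger: weekly recalibration.
    Event triggers: a prediction-market move of more than 10 points
    in 48 hours, or a protest spike more than 2 SD above baseline."""
    now = now or datetime.utcnow()
    time_trigger = now - last_refresh >= timedelta(days=7)
    event_trigger = abs(market_move_48h) > 0.10 or protest_z_score > 2.0
    return time_trigger or event_trigger
```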


What this doesn't solve

We want to be direct about the limitations, because overconfidence here is worse than the problem itself.

The stated-vs-revealed preference gap. Both PRISM and Paglieri et al. found that what people say they believe and what they actually do can diverge. PRISM found weak correlations between survey-stated preferences and in-conversation behavioral preferences. Paglieri et al. found that questionnaire-based diversity only partially transferred to downstream behavioral diversity. Calibrating synthetic users' stated attitudes does not guarantee calibrated behavior. This is a fundamental limitation of any survey-grounded approach.

Correlation structure from sparse data. The copula-based sampling in Layer 2 requires polling crosstabs that many topics simply don't have. If no one has polled the joint distribution of ceasefire support, arms embargo support, and media trust for your specific region, you're estimating correlations from adjacent data or the LLM's priors. Be explicit about where the calibration data ends and the assumptions begin.

The ethical line. Kirk et al.'s PRISM paper argues that participation in AI feedback processes has intrinsic value as an act of justice. People should not only speak but be heard. Synthetic Users can approximate populations, but they cannot replace participation. A calibrated synthetic population is a prototyping tool and a scaling mechanism. It is not a democratic process. It should never replace direct consultation with the communities it claims to represent.

If you're using Synthetic Users to model how residents of a conflict zone feel about a ceasefire, you should be deeply aware of the epistemological distance between your simulation and their reality. The calibration stack narrows that distance. It does not eliminate it.


The punchline

Every paper in the available literature solves one piece of the puzzle. Park et al. proved that LLMs can produce believable individual agents with the right architecture. Paglieri et al. proved that evolutionary optimization can break through mode collapse to produce diverse populations. Kirk et al. proved that who you include in your sample determines what you find, and that no single model satisfies everyone.

None of them built the calibration stack that connects real-world sentiment data to synthetic population generation. That's the infrastructure that turns synthetic user research from an expensive hallucination into a useful tool.

The model is a commodity. The calibration stack is the product. And at Synthetic Users, that's exactly what we're building.
