Training models to be useful assistants may come at a cost
Large language models are increasingly used as stand-ins for human participants. Researchers test them as proxies for public reaction, educational behavior, and even clinical interaction. But a large new study suggests that the very training that makes models more useful as assistants may make them less accurate as simulations of human behavior.
The work, as reported by The Decoder, draws on Psych-201, a dataset built from behavioral experiments covering about 208,000 participants and roughly 26 million responses. Researchers compared base models with post-trained variants across the Qwen3, Llama3, and OLMo 3 families. Their central finding was consistent: base models predicted human responses better than the assistant-style versions created through additional training.
Why that result matters
Assistant models are designed to be safer, more helpful, more structured, and often more explicit in their reasoning. Those traits are valuable for everyday product use. But they are not the same as behaving like a typical person in an experiment. If a model has been tuned to answer clearly, politely, and with task-optimized consistency, it may drift away from the variability and messiness that characterize real human responses.
That makes the study important for any field treating chatbots as substitutes for human subjects. If the objective is to simulate how people actually answer, decide, or react, a more polished assistant may be the wrong tool.
Base models outperformed post-trained versions
The report says the pattern held across model families and sizes. Base models, which are trained only to predict the next word in text, outperformed their post-trained descendants at predicting the answers humans actually gave. The degradation appeared across common post-training objectives, with reasoning models showing the strongest decline, followed by instruction-tuned versions and vision-extended variants.
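To make "predicting the answers humans actually gave" concrete, here is a minimal sketch of one way such a comparison could be scored. It assumes the metric is the average negative log-likelihood a model assigns to each observed human choice (lower is better); the report does not specify the study's actual metric, and the function name, trial data, and probabilities below are all hypothetical.

```python
import math

def mean_nll(predicted_probs, human_choices):
    """Average negative log-likelihood of the observed human choices.

    predicted_probs: one dict per trial mapping each option to the
                     probability the model assigns to it.
    human_choices:   the option the human actually picked on each trial.
    Lower values mean the model's predictions fit the human data better.
    """
    total = 0.0
    for probs, choice in zip(predicted_probs, human_choices):
        total += -math.log(probs[choice])
    return total / len(human_choices)

# Hypothetical data: three trials of a two-option task.
human_choices = ["A", "B", "A"]

# Hypothetical base model: hedged predictions that leave room for variability.
base_probs = [
    {"A": 0.6, "B": 0.4},
    {"A": 0.5, "B": 0.5},
    {"A": 0.7, "B": 0.3},
]

# Hypothetical post-trained model: confident, consistent, and badly wrong
# on the one trial where the human deviated from the "optimal" answer.
tuned_probs = [
    {"A": 0.95, "B": 0.05},
    {"A": 0.90, "B": 0.10},
    {"A": 0.99, "B": 0.01},
]

base_score = mean_nll(base_probs, human_choices)
tuned_score = mean_nll(tuned_probs, human_choices)
print(f"base: {base_score:.3f}  post-trained: {tuned_score:.3f}")
```

In this toy case the overconfident model scores worse because a single unexpected human choice is penalized heavily, which illustrates how task-optimized consistency could hurt fit to variable human behavior.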
That finding is especially striking because it cuts against a common intuition in AI product development: that later, more refined versions should be broadly better. They may be better assistants. The study argues they may be worse psychological mirrors.
A dataset built for behavior, not just benchmarks
Psych-201 appears to be a major part of the contribution. The report describes it as several times larger than any previous collection of its kind, with complete experiment runs and participant metadata including age, nationality, and questionnaire responses. That matters because judging human-likeness requires a broad base of behavioral evidence, not a narrow benchmark.
With a dataset this large, researchers can compare models to human distributions across many tasks rather than cherry-picking a few examples where model behavior happens to look plausible. The scale strengthens the case that this is a systematic training tradeoff rather than a quirk of one model or one experiment.
What this means for AI research and policy use
The finding is inconvenient because simulated participants are attractive. They are cheap, fast, and scalable. Governments, companies, and researchers may be tempted to use them to forecast reactions to policies, test interventions, or prototype studies before going to real people. But if post-trained assistant models systematically distort human behavior, then convenience can become false confidence.
The study does not say language models are useless for behavioral work. It says model choice matters, and the design target matters. A model optimized to help a user finish a task may not be the model best suited to imitate how a population thinks or responds. Those are different objectives, and the gap may widen with each generation of assistant tuning.
The larger lesson
AI systems are often discussed as though capability improves along a single axis. This study points to a more complicated reality. Making a model better for one role can weaken it in another. A more aligned assistant may become a less human-like subject. That is not a failure of training so much as a reminder that training objectives encode values and tradeoffs.
For researchers who want synthetic participants, the takeaway is straightforward: do not assume the most polished chatbot is the most realistic one. The most useful assistant in a product may be precisely the wrong model to trust as a proxy for human behavior.
This article is based on reporting by The Decoder.
Originally published on the-decoder.com





