Post‑training for Helpfulness Reduces Language Models' Ability to Simulate Human Behavior, Study Finds

5/30/2026, 2:09:13 PM

A cross‑institutional team including researchers from Helmholtz Munich analyzed the Psych‑201 dataset and found that the extra training steps used to turn base language models into assistants systematically weaken those models’ ability to predict how humans actually respond. This reduction in behavioral fidelity matters for any use of LMs as stand‑ins for people, because assistant tuning appears to erase the heuristics and biases that characterize real human answers.

Psych‑201 is a new behavioral dataset covering roughly 208,000 participants and about 26 million individual responses, collected from hundreds of experiments. The dataset includes full experiment runs plus per‑subject metadata such as age, nationality and questionnaire answers, enabling tests that compare model outputs against individual participants rather than aggregated trends.

The study directly compared base models — trained only to predict next tokens — with their post‑trained variants across the Qwen3, Llama3 and OLMo 3 model families. Post‑training in the comparison included instruction tuning, reasoning optimization and vision extensions. Evaluation measured how well each model predicted the actual answers given by human participants; across families and model sizes, base models consistently outperformed their assistant‑tuned descendants.
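As a rough illustration of this kind of comparison, the sketch below scores two models by the average negative log-likelihood they assign to the answers participants actually gave. The metric, the function name `mean_nll` and the toy probabilities are assumptions made for illustration only; they are not taken from the study or from Psych‑201.

```python
import math

def mean_nll(predicted_probs, human_choices):
    """Average negative log-likelihood of the human choices under a model.

    Lower is better: the model put more probability on what people did.
    (Hypothetical helper, not the study's evaluation code.)
    """
    total = 0.0
    for probs, choice in zip(predicted_probs, human_choices):
        total += -math.log(probs[choice])
    return total / len(human_choices)

# Made-up toy data: three trials with two answer options each.
human_choices = ["A", "B", "A"]

# A "base-like" model that spreads probability across options.
base_probs = [
    {"A": 0.6, "B": 0.4},
    {"A": 0.5, "B": 0.5},
    {"A": 0.7, "B": 0.3},
]

# A "tuned-like" model that is confident in the normatively expected answer.
tuned_probs = [
    {"A": 0.95, "B": 0.05},
    {"A": 0.90, "B": 0.10},
    {"A": 0.99, "B": 0.01},
]

base_nll = mean_nll(base_probs, human_choices)
tuned_nll = mean_nll(tuned_probs, human_choices)
print(f"base: {base_nll:.3f}  tuned: {tuned_nll:.3f}")
```

In this toy case the confident model is penalized heavily on the one trial where the participant chose the option it considered unlikely, so its average score is worse even though it is nearly certain on the other two trials.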

To test whether more deterministic output from assistant models explained the gap, the authors ran accuracy analyses on a subset of tasks with discrete answer options. Post‑trained models continued to perform worse in those settings, so higher determinism is unlikely to be the sole cause of the decline in human‑prediction fidelity. Distortions introduced by post‑training were largest in pure language tasks and reasoning tests. The authors attribute this pattern to training pressures that push models toward normatively correct answers, which can overwrite the heuristics, systematic errors and other biases typical of human responses.
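One way to picture the discrete-option check is an accuracy score that looks only at each model's top-ranked option, so how sharply peaked the distribution is no longer affects the result. The sketch below assumes that reading; `argmax_accuracy` and the toy data are hypothetical and not drawn from the study.

```python
def argmax_accuracy(predicted_probs, human_choices):
    """Fraction of trials where the model's top option matches the human answer.

    Only the ranking of options matters here, not the model's confidence.
    (Hypothetical helper, not the study's evaluation code.)
    """
    hits = 0
    for probs, choice in zip(predicted_probs, human_choices):
        top_option = max(probs, key=probs.get)
        if top_option == choice:
            hits += 1
    return hits / len(human_choices)

# Made-up toy data: four trials with two answer options each.
choices = ["A", "B", "A", "B"]

base_like = [
    {"A": 0.6, "B": 0.4},
    {"A": 0.4, "B": 0.6},
    {"A": 0.7, "B": 0.3},
    {"A": 0.55, "B": 0.45},
]

tuned_like = [
    {"A": 0.95, "B": 0.05},
    {"A": 0.90, "B": 0.10},
    {"A": 0.99, "B": 0.01},
    {"A": 0.97, "B": 0.03},
]

base_acc = argmax_accuracy(base_like, choices)
tuned_acc = argmax_accuracy(tuned_like, choices)
print(f"base: {base_acc:.2f}  tuned: {tuned_acc:.2f}")
```

If a gap remains under a measure like this, it cannot be explained by overconfidence alone, which is the logic behind the authors' conclusion that determinism is not the sole cause.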

The divergence between base and assistant models widened with successive model generations. Base models improved their match to human behavior from Qwen2 to Qwen2.5 to Qwen3, while their post‑trained counterparts fell further behind. The authors say this pattern suggests that ongoing advances in post‑training — including reinforcement learning from human feedback — may be increasing the mismatch between assistant behavior and the distribution of real human responses.

A commonly used mitigation — prompting models with participant‑specific demographic or persona information — provided practically no benefit for predicting individual responses. Because Psych‑201 contains per‑subject metadata, the team was able to condition models on age, nationality and questionnaire answers and still found that persona conditioning did not recover the lost fidelity.

Sources

The Decoder AI · 5/30/2026
