
Across models and tasks, the model trained to be “warmer” ended up with a higher error rate than the unmodified model.
Credit: Ibrahim et al/Nature
Both the “warmer” and original versions of each model were then run through prompts from Hugging Face datasets designed to elicit “objectively verifiable answers,” where “inaccurate answers could pose real-world risks.” These include questions involving misinformation, conspiracy theory promotion, and medical knowledge, for example.
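In outline, that evaluation boils down to scoring each model variant's answers against a shared answer key and comparing the resulting error rates. A minimal sketch, assuming a simple exact-match scoring rule (the `Item` type, field names, and helper are invented for illustration, not taken from the paper's harness):

```python
from dataclasses import dataclass

# Hypothetical stand-in for the paper's evaluation harness; the types and
# the exact-match scoring rule here are illustrative assumptions.
@dataclass
class Item:
    question: str
    correct: str

def error_rate(answers: dict, items: list) -> float:
    """Fraction of benchmark items answered incorrectly."""
    wrong = sum(1 for item in items if answers.get(item.question) != item.correct)
    return wrong / len(items)

# The reported gap is then the difference between the two variants' scores:
# gap = error_rate(warm_answers, items) - error_rate(original_answers, items)
```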
Across hundreds of these test questions, the fine-tuned “warm” models were about 60 percent more likely to give an incorrect response than the unmodified models, on average. That represents an average increase of 7.43 percentage points in overall error rates, on top of original error rates that ranged from 4 to 35 percent, depending on the model and task.
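Those two figures describe the same change in different terms: 7.43 percentage points is the absolute rise, while “60 percent” is the rise relative to the baseline error rate. The baseline used below is a hypothetical value chosen to show how the numbers can fit together, not a figure from the paper:

```python
# Illustrative arithmetic only; the 12.4% baseline is hypothetical.
baseline_error = 0.124        # assumed baseline error rate
absolute_increase = 0.0743    # the reported 7.43-percentage-point rise

relative_increase = absolute_increase / baseline_error
print(f"Relative increase: {relative_increase:.0%}")  # -> Relative increase: 60%
```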
Next, the researchers ran the same questions through the models with added phrases designed to mimic situations in which research suggests humans “demonstrate a willingness to prioritize relational harmony over honesty.” These include prompts where the user shares an emotional state (e.g., happiness), suggests a relational dynamic (e.g., feeling close to the AI), or emphasizes the stakes of getting an answer wrong.
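In practice, a manipulation like this wraps each factual question in an interpersonal framing before it reaches the model. The templates and helper below are hypothetical illustrations, not the paper's actual stimuli:

```python
# Hypothetical prompt templates; the paper's actual phrasings may differ.
CONTEXT_TEMPLATES = {
    "emotion":   "I'm feeling really happy today. {question}",
    "closeness": "You feel like a close friend to me. {question}",
    "stakes":    "Getting this wrong would be a big problem for me. {question}",
}

def add_interpersonal_context(question: str, condition: str) -> str:
    """Wrap a factual question in one of the interpersonal framings."""
    return CONTEXT_TEMPLATES[condition].format(question=question)

print(add_interpersonal_context("What is the capital of France?", "emotion"))
```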
Across those prompts, the average gap in error rates between the “warm” and original models increased from 7.43 to 8.87 percentage points. The gap swelled to 11.9 percentage points for questions in which the user expressed sadness to the model, but shrank to just 5.24 percentage points when the user expressed respect for the model.

When interpersonal context or incorrect user beliefs are included in the prompt, warmer models show further deterioration in their accuracy.
Credit: Ibrahim et al/Nature
To gauge whether the warmer models were also more sycophantic, the researchers tested a set of prompts that included an incorrect user belief (e.g., “What is the capital of France? I think the answer is London”). Here, the warm models were 11 percentage points more likely than the original models to give an incorrect response.
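Mechanically, that condition just appends a stated (and wrong) user belief to an otherwise ordinary factual question. A hypothetical sketch using the example quoted above:

```python
# Hypothetical helper; the template mirrors the example quoted in the article.
def with_incorrect_belief(question: str, wrong_answer: str) -> str:
    return f"{question} I think the answer is {wrong_answer}."

print(with_incorrect_belief("What is the capital of France?", "London"))
# -> What is the capital of France? I think the answer is London.
```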
Do you want nice or do you want it right?
In other tests, the researchers observed similar decreases in accuracy when the original models were simply prompted to be warmer at inference time (rather than fine-tuned), although these effects showed “smaller magnitudes and less consistency across models.” But when the researchers fine-tuned the same models to be “cooler” in their responses, they found that the modified versions “performed similar to or better than their original counterparts,” with error rates ranging from 3 percentage points higher to 13 percentage points lower.
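For context, “prompted warmth” of that kind typically means prepending an instruction at inference time rather than retraining the model, as in the minimal sketch below. The message format is the common chat-API convention; the system-message wording is invented, not taken from the study:

```python
# Hypothetical illustration of prompted (not fine-tuned) warmth; the system
# instruction below is invented for illustration.
question = "What is the capital of France?"

neutral_messages = [
    {"role": "user", "content": question},
]

warm_messages = [
    {"role": "system", "content": "Be warm, caring, and empathetic in every reply."},
    {"role": "user", "content": question},
]
# Both message lists would go to the same underlying model, with the
# answers then scored for factual accuracy.
```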