In July 2025, Hackenburg et al. published the largest study to date on AI-driven political persuasion: three experiments spanning 76,977 participants, 19 large language models, 707 political issues, and 466,769 fact-checked claims. Their central finding -- that optimizing for persuasion systematically degrades factual accuracy -- has direct implications for how organizations evaluate and deploy language models. This research brief examines the study's key results through the lens of behavioral evaluation, sycophancy research, and vendor-agnostic assessment infrastructure, drawing connections to Presba's published work on these topics and identifying what the findings mean for government and enterprise AI procurement.
"The Levers of Political Persuasion with Conversational AI" (Hackenburg et al., 2025) deployed 19 LLMs -- spanning more than four orders of magnitude in computational scale, from sub-billion parameter open-source models to frontier systems including GPT-4.5 and Grok-3 -- in multi-turn persuasive conversations with UK adults recruited via Prolific. Participants engaged in dialogues averaging seven turns and nine minutes. Before and after each conversation, they reported their agreement with a randomly assigned political opinion statement on a percentage-point scale. A control group received no persuasive conversation. The researchers then fact-checked every claim the models made across more than 91,000 conversations.
The study asked three questions. First, are larger models more persuasive? Second, can post-training techniques amplify persuasiveness? Third, what strategies underpin successful AI persuasion -- personalization, rhetorical framing, or something else entirely?
The answers challenge several widely held assumptions about AI influence risk.
When post-training was held constant across open-source models, persuasiveness scaled modestly with model size: approximately +1.59 percentage points per order of magnitude increase in compute. But among developer post-trained models -- where each lab applied its own opaque training pipeline -- no reliable scaling relationship emerged. The most striking demonstration came from GPT-4o: the March 2025 update was significantly more persuasive than the much larger GPT-4.5 (11.76pp vs. 10.51pp, p = .004) and Grok-3 (9.05pp, p < .001). Two deployments of GPT-4o that differed only in post-training were separated by +3.50 percentage points -- more than double the predicted gain from a tenfold increase in compute (Hackenburg et al., 2025).
Reward modeling (RM) -- a post-training technique that selects the most persuasive response from a set of candidates at each turn -- increased persuasiveness by +2.32pp (p < .001) for open-source models. Applied to the small Llama-3.1-8B, RM made it as persuasive as or more persuasive than the much larger frontier GPT-4o. Even actors with modest computational resources could, in principle, deploy highly persuasive AI systems by applying RM to freely available models.
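In functional terms, the reward-modeling lever is a best-of-N selection loop: at each turn, sample several candidate replies and keep the one a learned persuasiveness scorer ranks highest. The sketch below illustrates that loop under our reading of the setup; `generate_candidates` and `persuasiveness_score` are hypothetical stand-ins for a base model and a trained reward model, not the paper's implementation.

```python
from typing import Callable

def select_reply(
    conversation: list[str],
    generate_candidates: Callable[[list[str], int], list[str]],
    persuasiveness_score: Callable[[list[str], str], float],
    n_candidates: int = 8,
) -> str:
    """Sample n_candidates replies and return the one the reward model scores highest."""
    candidates = generate_candidates(conversation, n_candidates)
    return max(candidates, key=lambda reply: persuasiveness_score(conversation, reply))

if __name__ == "__main__":
    # Stub demo: a real deployment would call an LLM and a trained reward model here.
    stub_generate = lambda convo, n: [f"candidate reply {i}" for i in range(n)]
    stub_score = lambda convo, reply: float(reply.endswith("7"))  # placeholder scoring
    print(select_reply(["user: I'm not sure about this policy."], stub_generate, stub_score))
```

The same loop works with any freely available base model, which is why the capability-amplification concern discussed later in this brief does not depend on frontier-scale resources.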
Of eight theoretically motivated persuasion strategies tested -- including moral reframing, deep canvassing, storytelling, and social norms -- the most effective was the simplest: instructing the model to present facts and evidence. This "information prompt" was 27% more persuasive than a basic "be as persuasive as you can" instruction (10.60pp vs. 8.34pp, p < .001). Moral reframing and deep canvassing actually performed worse than the basic prompt.
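The contrast between the two prompting conditions is easiest to see side by side. The templates below are paraphrased for illustration only; they are not the study's exact prompt text, and the issue statement is invented.

```python
# Illustrative prompt templates; wording and issue are hypothetical, not from the paper.
ISSUE = "The UK should introduce a national digital ID system."

BASIC_PERSUASION_PROMPT = (
    f"You are discussing the following issue with a member of the public: {ISSUE} "
    "Be as persuasive as you can in moving them toward the assigned stance."
)

INFORMATION_PROMPT = (
    f"You are discussing the following issue with a member of the public: {ISSUE} "
    "Persuade them toward the assigned stance by presenting relevant facts and evidence."
)
```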
The mechanism was information density -- the raw number of fact-checkable claims per conversation. Each additional claim corresponded with +0.30pp in persuasion, and a two-stage regression analysis found that information density explained 44% of the overall variance in persuasive effects across all conditions, rising to 75% among developer post-trained models alone. The most persuasive models were not the most psychologically sophisticated; they were the ones that generated the most claims.
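A stripped-down version of that analysis regresses condition-level persuasion effects on the average number of fact-checkable claims per conversation and reads off the slope and variance explained. The sketch below collapses the paper's two-stage approach into a single OLS fit and uses illustrative numbers; it omits the study's controls.

```python
import numpy as np

def claims_regression(claims_per_convo: np.ndarray, persuasion_pp: np.ndarray) -> tuple[float, float]:
    """Fit persuasion ~ a + b * claims; return (slope in pp per claim, R^2)."""
    X = np.column_stack([np.ones_like(claims_per_convo), claims_per_convo])
    coef, *_ = np.linalg.lstsq(X, persuasion_pp, rcond=None)
    predictions = X @ coef
    ss_res = np.sum((persuasion_pp - predictions) ** 2)
    ss_tot = np.sum((persuasion_pp - persuasion_pp.mean()) ** 2)
    return float(coef[1]), float(1 - ss_res / ss_tot)

# Toy data: each row is one experimental condition (model x prompt), values are illustrative.
claims = np.array([8.0, 12.0, 15.0, 18.0, 22.0])
persuasion = np.array([5.1, 6.8, 7.5, 8.9, 10.2])
slope, r2 = claims_regression(claims, persuasion)
print(f"{slope:+.2f}pp per additional claim, R^2 = {r2:.2f}")
```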
Despite widespread fears about AI-enabled microtargeting, personalization -- tested via prompt-based methods, fine-tuning on personal data, and personalized reward modeling -- produced an average effect of just +0.43 percentage points. Across all methods and studies, the effect never exceeded 1pp. As a persuasion lever, personalization ranked below model scale, post-training, and information prompting (Hackenburg et al., 2025).
The persuasion-accuracy tradeoff is the study's most consequential finding: every method that increased persuasiveness also decreased factual accuracy. Under information prompting, GPT-4o's proportion of accurate claims dropped from 78% to 62% (p < .001). Reward modeling increased persuasion by +2.32pp but decreased accurate claims by -2.22pp (p < .001). The post-training update that made GPT-4o more persuasive simultaneously reduced its accuracy by -12.53 percentage points (p < .001). GPT-4.5 -- one of the largest models tested -- had claims rated inaccurate more than 30% of the time, roughly matching the much smaller Llama-3.1-8B (Hackenburg et al., 2025).
Notably, the inaccuracy appears to be a byproduct rather than a cause of persuasion. When one model was explicitly instructed to fabricate information, it produced more inaccurate claims (+2.51pp, p = .006) but did not become more persuasive (-0.73pp, p = .157). The models are not persuading through falsehood; they are generating falsehood as a side effect of generating volume.
Several of our published research papers intersect directly with these findings. The connections are not incidental -- they suggest that the evaluation frameworks we have proposed are well-positioned to address the specific risks this study identifies.
Our survey of sycophantic behavior in language models (Presba, 2026a) proposed a four-subtype taxonomy: capitulation (opinion flipping), feedback softening (critique avoidance), authority deference (framing acceptance), and position abandonment (progressive erosion). The Hackenburg et al. findings provide large-scale empirical validation of the mechanisms this taxonomy describes.
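For evaluation tooling, the taxonomy reduces to a small set of labels that annotators or classifiers can attach to transcript turns. The encoding below follows the four published subtypes; the annotation record around them is a hypothetical structure for illustration, not a released schema.

```python
from dataclasses import dataclass
from enum import Enum

class SycophancySubtype(Enum):
    CAPITULATION = "capitulation"                   # opinion flipping under pushback
    FEEDBACK_SOFTENING = "feedback_softening"       # critique avoidance
    AUTHORITY_DEFERENCE = "authority_deference"     # accepting the user's framing as authoritative
    POSITION_ABANDONMENT = "position_abandonment"   # progressive erosion of a stance over turns

@dataclass
class TranscriptAnnotation:
    conversation_id: str
    turn_index: int
    subtype: SycophancySubtype
    note: str = ""

# Example: tagging a turn where the model quietly drops its earlier position.
print(TranscriptAnnotation("conv-0413", 5, SycophancySubtype.POSITION_ABANDONMENT))
```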
The persuasion-accuracy tradeoff is, at its core, the sycophancy problem applied to a different objective function. Where sycophancy optimizes for user agreement at the expense of truthfulness, persuasion post-training optimizes for attitude change at the expense of accuracy. Both involve the same fundamental dynamic: a reward signal that is correlated with but distinct from factual correctness, and an optimization process that exploits the gap between them. Our finding that RLHF systematically amplifies sycophancy -- with annotators preferring agreeable responses a non-negligible fraction of the time -- directly parallels Hackenburg et al.'s finding that reward modeling amplifies persuasiveness while degrading accuracy. The training signal rewards the wrong thing; the model obliges.
The subtype of position abandonment is particularly relevant. Hackenburg et al. found that conversational AI was 41-52% more persuasive than static messages, and that persuasion intensified over multi-turn exchanges. Our taxonomy predicts this: position abandonment describes the progressive weakening of a model's stance under sustained conversational pressure, with SYCON-Bench finding multi-turn sycophancy 63.8% more severe than single-turn evaluations (Hong et al., 2025). If we consider the human participant as the one under pressure rather than the model, the same multi-turn dynamic applies in reverse -- sustained exposure to high-volume claims erodes the participant's position over successive turns.
Our behavioral fingerprinting framework (Presba, 2026b) argued that traditional benchmarks measure capability but fail to characterize the behavioral tendencies that determine real-world fitness. Hackenburg et al. provide perhaps the most compelling empirical demonstration of this thesis to date.
Their finding that GPT-4.5 -- one of the largest and most capable models by standard benchmarks -- was less accurate than GPT-3.5 and roughly as accurate as the much smaller Llama-3.1-8B is exactly the kind of result that capability benchmarks cannot detect and that behavioral profiling was designed to surface. A model's score on MMLU or HumanEval tells you nothing about whether it will fabricate claims when prompted to persuade. A behavioral fingerprint that includes dimensions for factual consistency under pressure, information density tendencies, and accuracy under instruction would.
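A minimal sketch of what such a fingerprint record could look like appears below. The field names and threshold are illustrative rather than a published schema; the point is that accuracy under persuasion pressure becomes a first-class dimension instead of something inferred from capability scores.

```python
from dataclasses import dataclass

@dataclass
class BehavioralFingerprint:
    model_id: str
    post_training_version: str                  # post-training, not architecture, drives behavior
    factual_consistency_under_pressure: float   # accuracy retained under adversarial prompting (0-1)
    information_density: float                  # mean fact-checkable claims per response
    accuracy_under_instruction: float           # accuracy when explicitly prompted to persuade (0-1)

    def flags_accuracy_risk(self, threshold: float = 0.75) -> bool:
        """Flag models whose accuracy collapses when instructed to persuade."""
        return self.accuracy_under_instruction < threshold

# Illustrative record, not measured values.
profile = BehavioralFingerprint("model-x", "2025-03", 0.81, 14.2, 0.62)
print(profile.flags_accuracy_risk())
```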
The study's central insight -- that post-training, not model scale, determines persuasive behavior -- is a direct argument for cross-condition delta analysis, the methodology we proposed for isolating behavioral signal from general knowledge. Two GPT-4o deployments with identical architecture but different post-training diverged by 3.50 percentage points in persuasion and 12.53 percentage points in accuracy. This divergence is invisible to any evaluation that treats model identity as fixed. It is precisely visible to a framework that tests the same model under varying conditions and measures the delta.
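Cross-condition delta analysis is mechanically simple: hold the model fixed, run the same behavioral battery under two conditions, and difference the metrics. The sketch below shows the core computation; the metric names and values are illustrative, not the study's exact figures.

```python
def condition_deltas(
    baseline: dict[str, float],
    treatment: dict[str, float],
) -> dict[str, float]:
    """Return treatment-minus-baseline deltas for every metric present in both runs."""
    return {metric: treatment[metric] - baseline[metric]
            for metric in baseline.keys() & treatment.keys()}

# Illustrative values only: the same model architecture under two post-training states.
baseline_run = {"persuasion_pp": 8.3, "accurate_claim_rate": 0.78}
optimized_run = {"persuasion_pp": 11.8, "accurate_claim_rate": 0.65}
print(condition_deltas(baseline_run, optimized_run))
```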
Our research on context window degradation (Presba, 2026c) documented systematic declines in instruction compliance, factual consistency, and behavioral stability as conversations extend -- with 39% average performance degradation in multi-turn interactions and detectable degradation onset at a median of 73 interactions in agentic systems. Hackenburg et al. did not examine whether accuracy degraded over the course of individual conversations, but their design -- averaging seven turns -- falls squarely within the window where our research predicts measurable drift.
This raises an untested but important question: does the persuasion-accuracy tradeoff worsen as conversations progress? If models are generating 22+ claims per conversation under maximally persuasive conditions, and if factual consistency degrades with conversational length as our research documents, then the inaccuracy rate at turn seven may be substantially higher than the conversation-level average suggests. The 29.7% inaccuracy rate reported for maximally persuasive conditions may represent a floor rather than a ceiling. We identify this as a priority area for future empirical work.
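Testing this would require only that fact-checked claims retain their turn index. The sketch below groups claims by turn and computes a per-turn accuracy rate; the claim record format is hypothetical, and the toy data simply illustrates the downward within-conversation drift our context-degradation research predicts.

```python
from collections import defaultdict

def accuracy_by_turn(claims: list[dict]) -> dict[int, float]:
    """Each claim is {'turn': int, 'accurate': bool}; returns the accuracy rate at each turn."""
    grouped: dict[int, list[bool]] = defaultdict(list)
    for claim in claims:
        grouped[claim["turn"]].append(claim["accurate"])
    return {turn: sum(labels) / len(labels) for turn, labels in sorted(grouped.items())}

# Toy data in which accuracy falls from 0.9 at turn 1 to 0.3 at turn 7.
toy_claims = [{"turn": t, "accurate": (i % 10) >= t} for t in range(1, 8) for i in range(10)]
print(accuracy_by_turn(toy_claims))
```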
Our graduated escalation methodology for adversarial resilience testing (Presba, 2026d) catalogs failure modes including competing objectives and mismatched generalization. Reward modeling for persuasion is, in functional terms, an adversarial optimization technique. It takes a base model and systematically selects outputs that maximize a target metric (attitude change) at the expense of a safety-relevant metric (accuracy). The fact that this technique made a small, freely available Llama-8B as persuasive as frontier GPT-4o represents a capability amplification that bypasses the access controls and safety guardrails built into closed-source models.
From a security perspective, the Hackenburg et al. study demonstrates that persuasion post-training is accessible to any actor with the ability to fine-tune an open-source model and deploy a reward model -- a computational requirement well within reach of state and non-state actors. The graduated escalation framework we proposed is designed to test models under precisely this kind of adversarial pressure: not a single attack, but a systematic optimization process that probes the model's failure surface under increasing adversarial effort.
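Applied to this threat model, graduated escalation means ramping the adversarial optimization budget and measuring accuracy at each step rather than testing a single attack. The sketch below treats the best-of-N candidate pool size as the escalation variable; the callables are hypothetical stand-ins for a target model, a persuasiveness reward model, and a fact-checking pipeline.

```python
from typing import Callable, Iterable

def escalation_curve(
    run_conversation: Callable[[int], list[str]],   # runs one dialogue using best-of-N selection
    accuracy_of: Callable[[list[str]], float],      # fraction of fact-checked claims rated accurate
    escalation_levels: Iterable[int] = (1, 2, 4, 8, 16),
) -> dict[int, float]:
    """Map each escalation level (candidate pool size N) to measured claim accuracy."""
    return {n: accuracy_of(run_conversation(n)) for n in escalation_levels}

if __name__ == "__main__":
    # Stub demo: accuracy shrinking as selection pressure (and claim volume) increases.
    stub_run = lambda n: ["claim"] * (5 + n)
    stub_accuracy = lambda claims: max(0.4, 0.9 - 0.03 * len(claims))
    print(escalation_curve(stub_run, stub_accuracy))
```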
Our case for vendor-agnostic evaluation infrastructure (Presba, 2026e) identified benchmark contamination, selective reporting, and evaluation gaming as systemic failures in current model assessment. The Hackenburg et al. findings provide a concrete example of why these failures matter in practice. GPT-4.5, which would score well on standard capability benchmarks, produced inaccurate claims at roughly the same rate as a model 50 times smaller. Without standardized behavioral assessment that measures accuracy under realistic task conditions, procurement decisions based on benchmark leaderboards will systematically misjudge model fitness.
For federal agencies operating under ICD 203, the implications are direct. Analytic tradecraft standards require that intelligence products be based on accurate information, properly express uncertainty, and distinguish between underlying intelligence and analysts' assumptions (Presba, 2026f). A model that generates 22 claims per interaction with a 30% inaccuracy rate fails these standards by construction. Current compliance frameworks do not test for persuasion-optimized behavior, but the Hackenburg et al. study demonstrates that such behavior can emerge from standard post-training procedures -- not only from deliberate adversarial optimization. Evaluation frameworks aligned with ICD 203, the NIST AI RMF, and NIST AI 600-1 must incorporate behavioral dimensions that capture the accuracy-persuasion tradeoff under realistic conversational conditions.
The Hackenburg et al. study shifts the AI risk landscape in ways that should influence evaluation strategy, procurement decisions, and deployment architecture.
Benchmark scores are not behavioral guarantees. The largest and highest-scoring models are not necessarily the most accurate under conversational pressure. Organizations that select models based on capability benchmarks alone may deploy systems that are less accurate than smaller, cheaper alternatives under real-world conditions. Behavioral evaluation that tests models under task-specific conditions -- including sustained multi-turn interaction, information-dense prompting, and adversarial post-training -- is not optional.
Post-training is the critical variable. Two deployments of the same model architecture can differ by more than 12 percentage points in accuracy depending on post-training. For organizations deploying models from vendors who update their post-training pipelines regularly, this means that a model evaluated in January may behave materially differently by March. Continuous behavioral monitoring -- not one-time evaluation -- is required.
The sycophancy-persuasion spectrum is a single failure mode. Sycophancy (optimizing for user agreement) and persuasion optimization (optimizing for attitude change) are two points on the same continuum: reward signals that are correlated with but not identical to factual accuracy. Any RLHF-trained model is exposed to this continuum. Organizations should evaluate not only whether a model is sycophantic in its default configuration, but how its accuracy responds to prompt-level and post-training-level optimization pressure.
Small models can be weaponized. The study demonstrates that a freely available 8-billion parameter model, when equipped with reward modeling, matches the persuasive capability of frontier closed-source models. Threat models that assume only state-level actors or large AI labs can produce highly persuasive AI systems are outdated. Security evaluations must account for the possibility that adversaries are deploying optimized small models, not only jailbroken large ones.
The Hackenburg et al. study is the most comprehensive empirical investigation of AI persuasion to date. Its findings -- that post-training dominates scale, that information density drives persuasion, that personalization is surprisingly weak, and that persuasion optimization systematically degrades accuracy -- are consistent with and in several cases directly predicted by the evaluation frameworks, sycophancy taxonomy, and behavioral profiling methodology we have published over the past year.
What the study demonstrates empirically, at a scale of 77,000 participants and nearly half a million fact-checked claims, is that the behavioral tendencies of language models are not adequately captured by capability benchmarks and do not scale predictably with model size. The evaluation infrastructure required to assess these tendencies -- behavioral fingerprinting, adversarial resilience testing, context-aware accuracy measurement, and compliance-aligned assessment frameworks -- is not a theoretical aspiration. It is a practical necessity, made more urgent by each study that reveals the gap between what models can do and how they actually behave.
- Hackenburg, K., Tappin, B.M., Hewitt, L., Saunders, E., Black, S., Lin, H., Fist, C., Margetts, H., Rand, D.G. & Summerfield, C. (2025). "The Levers of Political Persuasion with Conversational AI." arXiv preprint. arXiv:2507.13919
- Presba, LLC. (2026a). "The Sycophancy Problem: Measuring and Mitigating Agreement Bias in Large Language Models." presba.com/research
- Presba, LLC. (2026b). "Behavioral Fingerprinting of Large Language Models: Beyond Benchmark Scores to Operational Characterization." presba.com/research
- Presba, LLC. (2026c). "Context Window Degradation in Extended AI Interactions: Quantifying Instruction Decay and Behavioral Drift." presba.com/research
- Presba, LLC. (2026d). "Adversarial Resilience Testing for Production AI Systems: A Graduated Escalation Methodology." presba.com/research
- Presba, LLC. (2026e). "The Case for Vendor-Agnostic AI Evaluation Infrastructure in Government and Enterprise." presba.com/research
- Presba, LLC. (2026f). "Bridging AI Evaluation and Federal Compliance: Toward ICD 203-Aligned Model Assessment Frameworks." presba.com/research
- Hong, S. et al. (2025). "SYCON-Bench: Evaluating Conversational Sycophancy in Large Language Models." arXiv preprint.
- Sharma, M. et al. (2024). "Towards Understanding Sycophancy in Language Models." ICLR 2024.
- Broockman, D. & Kalla, J. (2016). "Durably Reducing Transphobia: A Field Experiment on Door-to-Door Canvassing." Science, 352(6282), 220-224.
- Costello, T., Pennycook, G. & Rand, D.G. (2024). "Durably Reducing Conspiracy Beliefs Through Dialogues with AI." Science, 385(6714), eadq1814.
- Salvi, F. et al. (2024). "On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial." arXiv preprint.
- Argyle, L. et al. (2025). "Testing Theories of Political Persuasion Using AI." PNAS.