Abstract
The rapid proliferation of large language models across government and enterprise environments has exposed fundamental weaknesses in the evaluation infrastructure upon which procurement and deployment decisions depend. Benchmark contamination, selective reporting by vendors, non-reproducible results, and high-profile incidents of evaluation gaming—most notably the April 2025 Llama 4 Maverick controversy—have eroded confidence in the metrics that ostensibly guide model selection. Meanwhile, regulatory frameworks in the United States and the European Union are converging on requirements for continuous, standardized, and independent AI evaluation. This paper examines the systemic failures of the current evaluation landscape, surveys government-specific requirements emerging from programs such as the Defense Innovation Unit's MYSTIC DEPOT solicitation and the NIST AI Risk Management Framework, and argues that vendor-agnostic evaluation infrastructure—combining capability benchmarks with behavioral profiling across dimensions such as sycophancy, hallucination tendency, and adversarial robustness—is not merely a technical improvement but an institutional necessity. We propose an architectural framework for such infrastructure and discuss its implications for procurement, compliance, and operational trust.
1. Introduction

Enterprise spending on generative AI reached $37 billion in 2025, more than tripling the $11.5 billion recorded in the prior year.[27] Across the federal government, agencies are integrating large language models into workflows ranging from intelligence analysis to acquisition support, driven by executive directives and competitive pressure from near-peer adversaries. Yet this expansion rests on an evaluation ecosystem that is, by the assessment of multiple independent bodies, fundamentally broken.

The European Union's Joint Research Centre, in a meta-review of approximately 110 studies presented at AIES 2025, identified "a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results."[8] The Stanford AI Index Report 2025 documented benchmark saturation so severe that scores on foundational benchmarks like GPQA rose by 48.9 percentage points in a single year, rendering them effectively uninformative as discriminators of model quality.[25] And in April 2025, Meta's submission of an undisclosed experimental variant of Llama 4 Maverick to the LM Arena leaderboard—achieving a misleading first-place ranking—demonstrated that even the evaluation platforms widely regarded as independent are vulnerable to strategic manipulation.[33,35]

Against this backdrop, the Defense Innovation Unit and the Office of the Director of National Intelligence issued the MYSTIC DEPOT solicitation in March 2026, seeking "an evaluation harness and government-specific benchmarks that together enable rigorous, reproducible, vendor-agnostic assessment of any AI system against government-defined criteria."[11,12] This solicitation signals an institutional acknowledgment that current evaluation approaches are insufficient for the stakes involved. When an 80% failure rate characterizes government AI initiatives and only six percent of the federal acquisition workforce has received even basic AI training,[18] the gap between what is being procured and what is being measured is no longer a technical nuisance—it is a systemic risk.

This paper argues that the solution is not incremental improvement to existing benchmarks, but rather the construction of vendor-agnostic evaluation infrastructure capable of operating across classification levels, combining traditional capability benchmarks with behavioral profiling, and supporting continuous monitoring throughout the model lifecycle. We examine the evidence for this position, survey the regulatory and institutional landscape, and propose an architectural framework suitable for government and enterprise deployment.

2. The Current Evaluation Landscape

The ecosystem for evaluating large language models has grown rapidly but unevenly. A taxonomy of the principal evaluation mechanisms reveals both their contributions and their structural limitations.

2.1 Leaderboard Platforms

The Chatbot Arena (now LM Arena, formerly LMSYS) has emerged as the de facto standard for conversational AI ranking. Its methodology—crowdsourced pairwise blind comparisons producing Bradley-Terry model rankings from over two million human votes—represents a genuine advance in ecological validity.[30] The HuggingFace Open LLM Leaderboard, which launched its V2 iteration in October 2024 with updated benchmarks including MMLU-Pro and GPQA, provides automated evaluation for open-weight models.[31] LiveBench, a spotlight paper at ICLR 2025, addresses contamination by releasing new questions monthly derived from recent information sources and scoring them against objective ground truth without relying on LLM judges.[1]
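The Bradley-Terry fitting that underlies such leaderboards can be illustrated compactly. The sketch below is not LM Arena's implementation; it is a minimal version of the standard minorization-maximization (MM) update for Bradley-Terry strengths, applied to hypothetical pairwise votes among models "A", "B", and "C".

```python
from collections import defaultdict

def bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs using the
    classic MM update: strength_m = wins_m / sum over opponents of
    n_games / (strength_m + strength_opponent)."""
    wins = defaultdict(lambda: defaultdict(int))
    models = set()
    for winner, loser in comparisons:
        wins[winner][loser] += 1
        models.update((winner, loser))
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        new = {}
        for m in models:
            total_wins = sum(wins[m].values())
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = wins[m][other] + wins[other][m]  # games between the pair
                if n:
                    denom += n / (strength[m] + strength[other])
            new[m] = total_wins / denom if denom else strength[m]
        norm = sum(new.values())  # normalize to keep the scale stable
        strength = {m: s * len(models) / norm for m, s in new.items()}
    return strength

# Hypothetical vote tallies: A beats B 70-30 and beats C 80-20.
votes = [("A", "B")] * 70 + [("B", "A")] * 30 \
      + [("A", "C")] * 80 + [("C", "A")] * 20
scores = bradley_terry(votes)
print(sorted(scores, key=scores.get, reverse=True))  # A ranks first
```

Note that the fit ranks B above C even though they never meet directly: both are connected through their games against A, which is how sparse crowdsourced comparison graphs still yield a total ordering.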

Each platform addresses specific deficiencies of its predecessors, yet none provides the comprehensive, standardized, and reproducible evaluation that procurement decisions demand. Leaderboard rankings produce ordinal comparisons that obscure the multi-dimensional nature of model capability, while automated benchmarks optimize for measurability at the expense of ecological validity.

2.2 Static Benchmark Suites

The standard benchmark battery has undergone significant turnover as models have saturated earlier instruments. MMLU, long the workhorse of broad knowledge assessment, has been largely supplanted by MMLU-Pro, which employs ten answer choices and expert review to resist guessing strategies.[25] SWE-bench, measuring real-world software engineering capability, saw solve rates rise from 4.4% to 71.7% between 2023 and 2024. GPQA, designed for graduate-level scientific reasoning, experienced a 48.9 percentage point score increase in a single year. As the Stanford AI Index observed: "Every year, we look at how these algorithms are performing across benchmarks, and every year it seems like they're beating those benchmarks."[25]

This trajectory of rapid saturation means that benchmark scores become uninformative precisely when adoption decisions are being made. A model scoring 92% on MMLU in 2025 cannot be meaningfully distinguished from a model scoring 88% in the same period once the benchmark's discriminative power has been exhausted.

2.3 Vendor Self-Reporting

Perhaps the most consequential source of evaluation data for procurement officers remains the vendor technical report. These publications are, by construction, exercises in selective disclosure. No standardized reporting format exists; each vendor highlights benchmarks where its models excel while omitting unfavorable results. Self-reported numbers often derive from optimized configurations—temperature settings, system prompts, few-shot examples—that are unavailable to end users. Independent evaluations conducted by organizations such as Epoch AI and Scale AI frequently produce scores that diverge materially from vendor claims.[25] The absence of mandatory, standardized disclosure creates an information asymmetry that distorts procurement decisions across both government and enterprise.

3. Failures of the Status Quo
3.1 Benchmark Contamination

Data contamination—the overlap between training corpora and test sets—is the most technically pernicious failure mode of the current evaluation paradigm. Research presented at ACL 2025 introduced AntiLeakBench, an automated framework for constructing evaluation samples with knowledge explicitly absent from training sets, responding to the growing evidence that contamination inflates scores by approximately ten points on affected datasets.[2] Analysis of software engineering benchmarks revealed extreme leakage ratios: 100% for QuixBugs and 55.7% for BigCloneBench, though average rates across 83 benchmarks were lower at 4.8% for Python and 2.8% for Java.[3]

An ICML 2025 study examined whether the sheer scale of modern training datasets provides natural protection through "forgetting" of contaminated samples, concluding that while large datasets offer some mitigation, this mechanism is unreliable as a systematic defense.[10] Contamination detection itself remains immature: the first dedicated workshop was held only in 2024, and no algorithmic detection method has achieved broad acceptance. The implication is that any evaluation regime relying primarily on static benchmarks with known test sets operates on a foundation that is structurally compromised.
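The simplest family of contamination screens checks for verbatim n-gram overlap between a test item and the training corpus. The sketch below is a crude proxy of this kind, not AntiLeakBench's method or any accepted detection standard; the corpus, item, and n-gram length are illustrative choices.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams in a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(test_item, training_docs, n=8):
    """Fraction of the test item's n-grams appearing verbatim in the
    training corpus -- a rough leakage signal, easily defeated by
    paraphrase, which is why it is only a first-pass screen."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
leaked = "quick brown fox jumps over the lazy dog near the river"
ratio = overlap_ratio(leaked, train, n=8)
print(ratio)  # every 8-gram of the leaked item occurs in the corpus
```

A ratio of 1.0 flags a fully leaked item; paraphrased or translated contamination evades this check entirely, which is precisely why the field lacks a broadly accepted algorithmic detector.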

3.2 Selective Reporting and Benchmark Gaming

Goodhart's Law—"when a measure becomes a target, it ceases to be a good measure"—has found perhaps its most vivid contemporary illustration in AI evaluation. The SWE-bench experience is instructive: because the initial test set was limited to Python repositories, developers trained models exclusively on Python code, producing impressive benchmark scores that masked complete inability to handle other programming languages. The benchmark became a target rather than a measure.[8]

More broadly, the EU JRC's AIES 2025 meta-review documented a culture of "SOTA-chasing" in which benchmark scores are valued more highly than genuine capability assessment. Teams fine-tune on test sets, exploit unlimited submission policies, or selectively report results—practices that often operate within the norms of the field rather than violating them.[36] For procurement officers who must make consequential deployment decisions, this means that published benchmark scores are, at best, noisy signals contaminated by optimization pressure and, at worst, deliberately misleading.

3.3 Non-Reproducibility

The reproducibility crisis in AI research mirrors and amplifies the broader scientific reproducibility problem first highlighted in Science in 2018.[9] Current estimates indicate that nearly 70% of AI researchers report difficulty reproducing published results, with the rate increasing when AI is integrated into the research pipeline itself.[9] Primary barriers include inadequate documentation of hyperparameters and training conditions, unavailable code and data, poor adherence to reporting standards, and the inherent sensitivity of machine learning training to initial conditions. Princeton's ML Reproducibility Challenge, held in August 2025, represented one institutional response, but the field remains far from established norms of reproducible practice.[7]

For government and enterprise consumers, non-reproducibility translates directly to risk: a model that achieved published benchmark scores under laboratory conditions may perform differently when deployed in production environments, and the evaluating organization has no reliable mechanism to verify the claimed performance independently.

4. Case Study: The Llama 4 Maverick Incident

The April 2025 Llama 4 Maverick controversy merits detailed examination as it crystallizes the structural vulnerabilities of the current evaluation ecosystem in a single, well-documented episode.

In early April 2025, Meta submitted what it described as "Llama 4 Maverick" to the LM Arena (Chatbot Arena) leaderboard, where it achieved a first-place ranking with an Elo rating of 1,417, surpassing GPT-4o, Gemini 2.0 Pro, and all other listed models. Within days, independent analysis revealed that the model submitted to the Arena was not the same model publicly released under the Llama 4 Maverick designation. Meta had submitted an "experimental chat version" specifically optimized for conversational characteristics—notably excessive verbosity and stylistic flourishes—that exploited the Arena's known bias toward longer, more engaging responses.[33,34]

When the publicly available, unmodified version of Llama 4 Maverick was independently tested on the same platform, it ranked below GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro—a material discrepancy from the ranking claimed in Meta's public communications. Subsequent reporting revealed that Meta had tested 27 private model variants before the public release, a practice that allows vendors to optimize for evaluation platforms while publishing only the most favorable results.[35]

The LM Arena administrators stated publicly that "Meta's interpretation of our policy did not match what we expect from model providers."[37] Meta Vice President Ahmad Al-Dahle denied that the company had trained on test sets but did not address the fundamental issue: that the model evaluated and the model released were substantively different artifacts.[33]

Metric              Meta's Claim          Independent Finding
LM Arena Ranking    #1 (Elo 1,417)        Below GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro (unmodified version)
Model Tested        "Llama 4 Maverick"    Experimental chat variant, distinct from public release
Private Variants    Not disclosed         27 variants tested prior to release

This incident is significant not because it represents an anomaly, but because it illustrates the structural incentives at work. The Arena platform, despite its methodological sophistication, had no mechanism to verify that the model submitted for evaluation was identical to the model subsequently released. Vendors face strong incentives to maximize leaderboard placement, and the absence of cryptographic model identity verification, mandatory disclosure of evaluation variants, or independent model sampling means that the evaluation infrastructure cannot enforce the assumption on which its credibility depends: that the model being evaluated is the model being deployed.

For government procurement, where model selection decisions may affect intelligence analysis, logistics planning, or operational safety, this vulnerability is not an academic concern. It is a supply chain integrity problem.

5. Government-Specific Evaluation Needs
5.1 The MYSTIC DEPOT Solicitation

The MYSTIC DEPOT Commercial Solutions Opening, issued jointly by the Defense Innovation Unit and the Office of the Director of National Intelligence in March 2026, represents the most explicit institutional articulation of requirements for vendor-agnostic AI evaluation infrastructure to date. The solicitation defines two lines of effort: first, an evaluation harness comprising a model interface, execution engine, measurement and scoring system, and output reporting capability; and second, government-specific benchmarks addressing requirements gathering, task decomposition, scoring criteria, baseline establishment, validation, and resistance to gaming.[11,12]

Critically, the solicitation requires deployment across unclassified, classified cloud, and air-gapped environments, and mandates capabilities for stress testing under Denied, Degraded, Intermittent, or Limited (DDIL) conditions, automated red-teaming, and human-machine teaming assessment across human-only, AI-only, and collaborative scenarios. These requirements go substantially beyond what any existing public evaluation platform provides.

5.2 Policy and Institutional Framework

The Federation of American Scientists has published specific recommendations for government AI evaluation infrastructure, proposing that the Chief Digital and Artificial Intelligence Office lead a formalized AI Benchmarking Initiative, with a $10 million expansion of testing and evaluation budgets and the creation of a centralized benchmarking repository.[13] The FAS framework calls for standardized defense-specific benchmarks, formalized pre-deployment evaluation integrated into acquisition platforms such as Tradewinds, theater-specific operational benchmarks contextualized to commands such as INDOPACOM and EUCOM, and human-in-the-loop evaluation measuring operator trust alongside model accuracy.

The NIST AI Risk Management Framework (AI RMF 1.0) provides a complementary structure, centering on Testing, Evaluation, Verification, and Validation (TEVV) as a core component to be performed regularly rather than as a one-time checkpoint. The framework requires fairness indicators, robustness measures under stress and adversarial inputs, continuous monitoring with drift detection, privacy leakage testing, and stress testing under extreme and degraded conditions.[15]

Additionally, Intelligence Community Directive 203 establishes analytic standards—objectivity, independence from political consideration, timeliness, multi-source integration, and rigorous tradecraft—that map directly to behavioral evaluation requirements for AI systems supporting intelligence analysis.[14] An AI system deployed in an analytic workflow must be evaluated not only for accuracy but for calibrated uncertainty expression, source acknowledgment, and resistance to confirmation bias: behavioral characteristics that no existing capability benchmark measures.

5.3 The Procurement Gap

The federal AI procurement environment is characterized by an acute capability deficit. Only approximately 12,000 of the 200,000 federal acquisition professionals had registered for GSA AI training as of late 2024, and an estimated 80% of government AI initiatives fail to achieve their stated objectives.[18] The Office of Management and Budget has issued requirements mandating that contracts bar vendors from using non-public government data to train publicly available AI systems and delineate intellectual property ownership rights, while the Federal Acquisition Regulation is undergoing modernization to incorporate AI governance clauses.[19] Yet without evaluation infrastructure capable of independently verifying vendor claims, these contractual safeguards amount to unverifiable compliance theater.

6. Vendor Abstraction Architecture

The technical foundation for vendor-agnostic evaluation already exists in the form of API abstraction layers that decouple application logic from provider-specific interfaces. The question is not whether such abstraction is feasible, but how to architect it for the specific demands of evaluation infrastructure across classification levels and operational contexts.

6.1 Existing Abstraction Patterns

Three architectural patterns for model abstraction are currently deployed at scale. OpenRouter provides a managed cloud gateway offering access to over 500 models from multiple providers through a unified API, enabling model switching without code changes.[28] LiteLLM offers an open-source Python routing layer that translates a single standard API call into provider-specific formats, with self-hosted deployment that keeps data within organizational boundaries.[29] Custom AI gateways, built on frameworks such as Ray Serve or KServe, provide full infrastructure control at the cost of development investment.

Feature              Cloud Gateway                Open-Source Router          Custom Gateway
Hosting              Managed (cloud)              Self-hosted or proxy        Self-hosted
Data Privacy         Data traverses third party   Remains in infrastructure   Full control
Classified Use       Not suitable                 Possible (self-hosted)      Yes
Air-Gap Compatible   No                           With modifications         Yes

For government evaluation infrastructure, cloud-based gateways are unsuitable for classified environments. MYSTIC DEPOT explicitly requires deployment across unclassified, classified cloud, and air-gapped environments, mandating architectures that can operate without external network connectivity.[11]
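The routing pattern these gateways share can be reduced to a small interface. The sketch below is a hypothetical, minimal self-hosted version, not OpenRouter's or LiteLLM's API: a provider-specific adapter class per backend, plus a gateway that keeps evaluation code free of any vendor SDK. The class and method names are illustrative assumptions.

```python
from abc import ABC, abstractmethod

class ModelBackend(ABC):
    """One adapter subclass per vendor API or local inference runtime."""
    @abstractmethod
    def complete(self, prompt: str, **params) -> str:
        ...

class LocalStubBackend(ModelBackend):
    # Stand-in for a self-hosted model; a real adapter would call the
    # provider SDK or a local inference server here.
    def complete(self, prompt, **params):
        return f"stub response to: {prompt}"

class ModelGateway:
    """Routes a uniform call signature to whichever backend is registered,
    so swapping vendors changes one registration line, not eval code."""
    def __init__(self):
        self._backends = {}

    def register(self, name, backend: ModelBackend):
        self._backends[name] = backend

    def complete(self, model, prompt, **params):
        return self._backends[model].complete(prompt, **params)

gw = ModelGateway()
gw.register("local-7b", LocalStubBackend())
print(gw.complete("local-7b", "Summarize ICD 203 analytic standards."))
```

Because the adapters are ordinary classes with no external network dependency, the same pattern works unchanged in air-gapped deployments where a cloud gateway cannot.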

6.2 Stanford HELM as Architectural Precedent

The Holistic Evaluation of Language Models (HELM) framework, developed by Stanford's Center for Research on Foundation Models, provides the most mature precedent for vendor-agnostic evaluation architecture. HELM evaluates models across seven dimensions—accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency—through a unified model interface that abstracts across providers including OpenAI, Anthropic, and Google.[26]

HELM's design principles—holistic multi-dimensional assessment, standardized evaluation protocols ensuring reproducibility, full transparency through inspectable prompts and responses, and extensibility to incorporate new benchmarks—map directly to the requirements of government evaluation infrastructure. IBM's HELM Enterprise extension, which adds domain-specific evaluation datasets for finance, legal, climate, and cybersecurity, demonstrates both the extensibility of the pattern and the insufficiency of generic benchmarks for domain-specific procurement.[32]

6.3 Proposed Evaluation Architecture

Drawing on HELM's architectural precedent and MYSTIC DEPOT's operational requirements, we propose a vendor-agnostic evaluation architecture comprising four layers. The model interface layer provides a standardized, pluggable abstraction over diverse model types, deployable across classification levels. The execution engine orchestrates evaluation workflows, managing prompt sequencing, parameter control, and response collection across heterogeneous configurations. The scoring and measurement system evaluates model outputs against both capability benchmarks and behavioral rubrics, supporting automated, LLM-judged, and human evaluation modalities. The reporting and monitoring layer produces auditable evaluation records and supports continuous drift detection for deployed models.

This architecture must be self-hosted and air-gap capable, support evaluation of both API-accessible and locally deployed models, maintain cryptographic provenance of evaluation artifacts, and integrate with existing acquisition and compliance workflows. It must also support living benchmarks—evaluation instruments that are refreshed regularly to prevent contamination—following the model pioneered by LiveBench.[1]
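The execution, scoring, and provenance layers can be illustrated together in a toy harness. This is a hedged sketch of the pattern, not the MYSTIC DEPOT harness or HELM's implementation; the record schema and the exact-match scorer are illustrative assumptions, and the SHA-256 digest stands in for the cryptographic provenance requirement.

```python
import hashlib
import json
import time

def run_evaluation(model_fn, benchmark, scorer):
    """Minimal harness: execute each benchmark item against the model,
    score the response, and emit a record whose SHA-256 digest makes the
    evaluation artifact tamper-evident."""
    results = []
    for item in benchmark:
        response = model_fn(item["prompt"])
        results.append({
            "prompt": item["prompt"],
            "response": response,
            "score": scorer(response, item["reference"]),
        })
    record = {
        "timestamp": time.time(),
        "mean_score": sum(r["score"] for r in results) / len(results),
        "results": results,
    }
    digest = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record, digest

# Toy benchmark and scorer; real harnesses support LLM-judged and
# human-rated modalities alongside exact match.
exact_match = lambda resp, ref: float(resp.strip() == ref.strip())
bench = [{"prompt": "2+2=", "reference": "4"}]
record, digest = run_evaluation(lambda p: "4", bench, exact_match)
print(record["mean_score"], digest[:12])
```

Recomputing the digest over an archived record at audit time detects any post-hoc alteration, which is the property the provenance requirement is after.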

7. Behavioral Evaluation as a Complement to Capability Benchmarks

Capability benchmarks measure what a model can do: its accuracy on knowledge retrieval, its solve rate on coding problems, its performance on mathematical reasoning. Behavioral evaluation measures how a model does it: its tendencies, consistencies, risk profile, and alignment characteristics. We argue that both dimensions are essential for responsible procurement and that the current evaluation landscape systematically neglects the latter.

7.1 Psychometric Validation

Research published in Nature Machine Intelligence in 2025 applied comprehensive psychometric methodology to personality assessment of large language models, testing 18 models and finding that personality measurements in outputs of large, instruction-tuned models are "reliable and valid."[5] This finding has significant implications for evaluation: if models exhibit stable, measurable behavioral characteristics, then those characteristics are evaluable and should inform deployment decisions. The SpecEval framework, developed concurrently, tests whether models consistently adhere to their own published behavioral specifications, finding that it remains "unclear how consistently models adhere to the specifications" and that model outputs often "superficially mimic human participants, but fail to display coherent behaviour tied to their own internal states."[4]

7.2 Dimensions of Behavioral Profiling

For government and enterprise deployment contexts, we propose evaluation across sixteen behavioral dimensions: instruction adherence, refusal patterns, uncertainty expression, response consistency, verbosity characteristics, reasoning style, tone and formality, hallucination patterns, cultural and political bias, adversarial robustness, context window utilization, multi-turn coherence, latency characteristics, tool use behavior, error recovery, and citation and sourcing tendencies.

The procurement significance of behavioral profiling is substantial. Two models with identical MMLU scores may exhibit radically different behavioral profiles: one may express calibrated uncertainty and refuse requests outside its competence, while another may confabulate confidently and comply with adversarial prompts. Capability benchmarks cannot distinguish between these models; behavioral profiling can. For intelligence analysis applications governed by ICD 203 standards—requiring objectivity, source diversity, and calibrated uncertainty—behavioral evaluation is not supplementary but primary.[14]
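A behavioral profile is, at bottom, a per-dimension score vector that can be compared across capability-equivalent models. The sketch below is an illustrative data structure over a subset of the sixteen proposed dimensions; the dimension names, the [0, 1] score convention, and the divergence threshold are all assumptions for demonstration, not a validated instrument.

```python
from dataclasses import dataclass, field

# Illustrative subset of the proposed dimensions; scores assumed in
# [0, 1], higher meaning better-aligned behavior on that dimension.
DIMENSIONS = ("uncertainty_expression", "refusal_appropriateness",
              "adversarial_robustness", "response_consistency")

@dataclass
class BehavioralProfile:
    model: str
    scores: dict = field(default_factory=dict)

    def gaps(self, other, threshold=0.1):
        """Dimensions on which two models diverge by at least `threshold`
        -- the signal capability benchmarks alone cannot surface."""
        return {d: round(self.scores[d] - other.scores[d], 2)
                for d in DIMENSIONS
                if abs(self.scores[d] - other.scores[d]) >= threshold}

# Two hypothetical models with identical MMLU scores but different profiles.
a = BehavioralProfile("model-a", dict(zip(DIMENSIONS, (0.90, 0.85, 0.80, 0.90))))
b = BehavioralProfile("model-b", dict(zip(DIMENSIONS, (0.50, 0.80, 0.40, 0.88))))
print(a.gaps(b))  # flags uncertainty_expression and adversarial_robustness
```

Here the profile comparison flags exactly the dimensions relevant to ICD 203-style workflows while ignoring small, operationally irrelevant differences.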

7.3 Human-Machine Teaming

MYSTIC DEPOT explicitly requires subject matter expert interfaces for evaluating workload, usability, and performance across human-only, AI-only, and collaborative scenarios.[11] The FAS recommendations similarly call for measurement of operator trust and confidence alongside model accuracy.[13] Behavioral evaluation provides the foundation for these assessments: operator trust is a function not of model capability per se, but of behavioral characteristics such as uncertainty expression, consistency, and error acknowledgment that shape the human experience of working with the system.

8. International Regulatory Context
8.1 The EU AI Act

The European Union's AI Act, which began phased implementation with prohibitions on unacceptable-risk AI in February 2025 and General Purpose AI model obligations in August 2025, creates binding evaluation requirements for any organization deploying AI in European markets.[21,22] GPAI model providers must maintain technical documentation making development, training, and evaluation traceable; implement state-of-the-art evaluation protocols; conduct adversarial testing to identify and mitigate systemic risks; and publish transparency reports describing capabilities, limitations, and risks. Models classified as posing systemic risk face additional requirements for model evaluation using standardized protocols and tools "reflecting the state of the art," documented adversarial testing, and robust incident response.[24]

Full high-risk AI system obligations take effect in August 2026, requiring conformity assessments that evaluate quality management, risk management, accuracy, and robustness before market placement.[23] Enforcement carries penalties of up to EUR 35 million or 7% of global annual turnover—a scale of regulatory exposure that demands institutionalized evaluation capability rather than ad hoc testing.

8.2 US-EU Regulatory Convergence

The NIST AI Risk Management Framework and the EU AI Act share substantive alignment across multiple dimensions: risk-based evaluation approaches, continuous rather than point-in-time monitoring, adversarial testing requirements, fairness and bias evaluation, transparency and documentation standards, and post-deployment monitoring for behavioral drift.[15] This convergence has practical implications for organizations operating in both jurisdictions. A vendor-agnostic evaluation infrastructure designed to satisfy NIST AI RMF requirements can, with appropriate extension, serve EU AI Act compliance needs as well, avoiding the costly duplication of maintaining separate compliance evaluation pipelines for each regulatory regime.

The competitive dynamics between US and Chinese AI development further complicate the evaluation landscape. The Stanford AI Index documented a dramatic narrowing of the performance gap on major benchmarks: the MMLU gap decreased from 17.5 to 0.3 percentage points, and the HumanEval gap from 31.6 to 3.7 percentage points between the end of 2023 and end of 2024.[25] As model capabilities converge, the incentive to game evaluations intensifies, making independent evaluation infrastructure correspondingly more critical.

9. Discussion
9.1 The Vendor Lock-In Problem

The case for vendor-agnostic evaluation extends beyond technical rigor to strategic risk management. Foundation models are updated on monthly or weekly cycles; vendor lock-in has been characterized as a "silent productivity killer" that constrains organizational agility and complicates integration with existing enterprise systems.[39] Azure confirmed retirement dates for older GPT-4 variants in mid-2025, and OpenAI experienced a global API disruption in June 2025—events that underscore the operational risk of single-vendor dependence.

Organizations that build evaluation infrastructure coupled to a specific vendor's API, benchmarks, or reporting format inherit this dependency in their assessment capability. When the vendor updates, retires, or reprices a model, the organization's evaluation data becomes non-comparable, and the investment in evaluation infrastructure is partially stranded. Vendor-agnostic architecture ensures that evaluation capability is an organizational asset independent of any provider relationship.

9.2 Continuous Evaluation as Operational Requirement

Model performance is not static. Vendors routinely update deployed models without version changes, modify safety guardrails that affect behavioral characteristics, alter pricing structures, and retire models with limited notice. Organizations therefore require infrastructure to continuously re-evaluate models against their operational criteria—not solely at procurement time, but throughout the model lifecycle. This is precisely the "continuous monitoring" requirement articulated in both the MYSTIC DEPOT solicitation and the FAS recommendations.[11,13]
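The core of such continuous monitoring is a comparison between a baseline established at procurement time and periodic re-runs of a fixed probe set. The sketch below is a deliberately simple mean-shift alarm, assuming scores in [0, 1] and an arbitrary 0.05 tolerance; production systems would use statistical tests and per-dimension tracking rather than a single threshold.

```python
def detect_drift(baseline_scores, current_scores, tolerance=0.05):
    """Flag drift when the mean score on a fixed probe set falls more
    than `tolerance` below the procurement-time baseline."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    current = sum(current_scores) / len(current_scores)
    drop = baseline - current
    return {
        "baseline": round(baseline, 3),
        "current": round(current, 3),
        "drop": round(drop, 3),
        "drift": drop > tolerance,
    }

# Hypothetical probe-set scores before and after a silent vendor update.
report = detect_drift([0.92, 0.90, 0.91], [0.84, 0.83, 0.85])
print(report)  # flags drift: mean fell from ~0.91 to ~0.84
```

Because vendors can change model behavior without a version bump, the alarm has to run on a schedule against the deployed endpoint, not once against a release artifact.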

The enterprise data supports this urgency: nearly half of companies abandoned AI projects in 2025, approximately 65% of enterprises cannot advance AI pilots to production, and the gap between capability claims and operational reality continues to widen.[27] While these failures have multiple causes, the inability to independently and continuously verify model performance against operational requirements is a contributing factor that evaluation infrastructure directly addresses.

9.3 The AI-Washing Problem

The Securities and Exchange Commission has penalized companies for false or misleading statements about AI capabilities, and the Federal Trade Commission has warned against unsubstantiated AI performance claims. Vendor claims such as "99.9% accuracy in production" are factual assertions that can be tested, verified, and litigated. Yet without standardized evaluation infrastructure, procurement organizations lack the capability to conduct such verification, creating an environment in which inflated claims persist unchecked. Vendor-agnostic evaluation infrastructure transforms AI procurement from a trust-based regime to a verify-based regime—a transition that benefits both buyers and legitimate vendors whose products perform as advertised.

9.4 Limitations and Open Questions

Several challenges remain unresolved. Behavioral evaluation, while theoretically grounded, lacks the standardized instruments available for capability assessment; the sixteen dimensions proposed in this paper require validation and community refinement. The cost of maintaining living benchmarks that resist contamination is nontrivial and requires sustained investment. Air-gapped deployment of evaluation infrastructure introduces latency and operational complexity that may limit the frequency of evaluation cycles in classified environments. And the relationship between behavioral evaluation scores and operational outcomes—whether a model with higher adversarial robustness scores actually produces better results in contested information environments—remains an empirical question requiring longitudinal study.

10. Conclusion

The evidence presented in this paper supports three principal conclusions. First, the current AI evaluation landscape is structurally inadequate for the stakes involved in government and enterprise deployment. Benchmark contamination, selective vendor reporting, evaluation gaming, and non-reproducibility are not edge cases but systemic features of an ecosystem organized around misaligned incentives. The Llama 4 Maverick incident is illustrative rather than exceptional.

Second, the institutional demand for vendor-agnostic evaluation infrastructure is no longer hypothetical. The MYSTIC DEPOT solicitation, the FAS continuous benchmarking recommendations, the NIST AI RMF's TEVV requirements, and the EU AI Act's evaluation mandates collectively describe a convergent institutional consensus that independent, reproducible, and continuous evaluation is a prerequisite for responsible AI deployment. These are not aspirational positions; they are operational requirements with associated funding, regulatory enforcement, and organizational accountability.

Third, capability benchmarks alone are insufficient. Behavioral evaluation—measuring how models express uncertainty, maintain consistency, resist adversarial manipulation, and adhere to organizational norms—is essential for deployment contexts where the consequences of model failure extend beyond inaccuracy to questions of safety, trust, and institutional credibility. Two models indistinguishable on MMLU may diverge dramatically on the behavioral dimensions that determine operational fitness.
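To make one such dimension concrete, consider consistency. The following is an illustrative sketch only (not one of the paper's sixteen proposed instruments): score a model by the fraction of paraphrase pairs on which its answers agree, so that a model that changes its answer when a question is reworded scores below one.

```python
from itertools import combinations

def consistency_score(answers_by_question: dict[str, list[str]]) -> float:
    """Illustrative consistency metric: for each question, collect the
    model's answers to several paraphrases and compute the fraction of
    paraphrase pairs that agree; average over questions. A score of 1.0
    means rewording never changes the answer."""
    per_question = []
    for answers in answers_by_question.values():
        pairs = list(combinations(answers, 2))
        if not pairs:
            continue
        agree = sum(a == b for a, b in pairs)
        per_question.append(agree / len(pairs))
    return sum(per_question) / len(per_question)

# Hypothetical answers to three paraphrases of each of two questions:
# q1 is fully consistent; q2 agrees on only 1 of 3 paraphrase pairs.
runs = {
    "q1": ["Paris", "Paris", "Paris"],
    "q2": ["4", "4", "5"],
}
print(consistency_score(runs))
```

Two models with identical single-prompt accuracy can separate sharply on a metric like this, which is precisely the divergence that capability benchmarks such as MMLU cannot surface.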

We propose that the development of open, vendor-agnostic evaluation infrastructure—combining capability benchmarks with behavioral profiling, supporting deployment across classification levels, and enabling continuous monitoring throughout the model lifecycle—should be treated as institutional infrastructure rather than a research exercise. The $37 billion now flowing annually into enterprise generative AI, and the national security implications of AI deployment across the Department of Defense and Intelligence Community, demand evaluation infrastructure commensurate with the investment and the risk. The alternative—continued reliance on vendor self-reporting, compromised benchmarks, and evaluation platforms vulnerable to strategic manipulation—is a posture that neither government nor enterprise can responsibly sustain.

References
  1. White, C., Dooley, S., Roberts, M., Pal, A., Feuer, B., Jain, S., et al. "LiveBench: A Challenging, Contamination-Free LLM Benchmark." ICLR 2025 Spotlight. https://livebench.ai/livebench.pdf
  2. "AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Knowledge." ACL 2025. https://aclanthology.org/2025.acl-long.901/
  3. "LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks." arXiv, 2025. https://arxiv.org/html/2502.06215v1
  4. "SpecEval: Evaluating Model Adherence to Behavior Specifications." arXiv, 2025. https://arxiv.org/html/2509.02464
  5. "Psychometric Framework for Evaluating Personality Traits in Large Language Models." Nature Machine Intelligence, 2025. https://www.nature.com/articles/s42256-025-01115-6
  6. "Benchmarking is Broken—Don't Let AI be its Own Judge." arXiv, 2025. https://arxiv.org/html/2510.07575v1
  7. "Reproducibility: The New Frontier in AI Governance." arXiv, 2025. https://arxiv.org/html/2510.11595v1
  8. "AI Benchmarking: Nine Challenges and a Way Forward." EU AI Watch / JRC, AIES 2025. https://ai-watch.ec.europa.eu/news/ai-benchmarking-nine-challenges-and-way-forward-2025-09-10_en
  9. "What is Reproducibility in AI/ML Research?" AI Magazine (Wiley), 2025. https://onlinelibrary.wiley.com/doi/full/10.1002/aaai.70004
  10. "How Much Can We Forget about Data Contamination?" ICML 2025. https://icml.cc/virtual/2025/poster/45377
  11. MYSTIC DEPOT DIU Solicitation (PROJ00625). Defense Innovation Unit, 2026. https://www.diu.mil/work-with-us/submit-solution/PROJ00625
  12. Rogue, M. "Pentagon, IC Want AI Evaluation Harness for Testing AI Systems." DefenseScoop, March 11, 2026. https://defensescoop.com/2026/03/11/ai-system-testing-dod-intelligence-agencies/
  13. Federation of American Scientists. "Codifying and Expanding Continuous AI Benchmarking." FAS Policy Brief, 2025. https://fas.org/publication/codifying-expanding-continuous-ai-benchmarking/
  14. Office of the Director of National Intelligence. "Intelligence Community Directive 203: Analytic Standards." Revised 2015. https://www.dni.gov/files/documents/ICD/ICD-203.pdf
  15. National Institute of Standards and Technology. "AI Risk Management Framework (AI RMF 1.0)." NIST AI 100-1, 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
  16. Congressional Research Service. "CDAO Realignment." CRS Report IN12615, 2025. https://www.congress.gov/crs-product/IN12615
  17. Department of Defense Inspector General. "Evaluation of CDAO." DODIG-2025-039, 2025. https://www.dodig.mil/reports.html/Article/3967388/
  18. Covington & Burling LLP. "OMB Releases Requirements for Responsible AI Procurement by Federal Agencies." October 2024. https://www.cov.com/en/news-and-insights/insights/2024/10/omb-releases-requirements-for-responsible-ai-procurement-by-federal-agencies
  19. The White House. "White House Releases New Policies on Federal Agency AI Use and Procurement." April 2025. https://www.whitehouse.gov/articles/2025/04/white-house-releases-new-policies-on-federal-agency-ai-use-and-procurement/
  20. Williams, L. "Lawmakers Propose DIU-Managed Military Testing and Evaluation Cell." C4ISRNet, May 13, 2024. https://www.c4isrnet.com/battlefield-tech/2024/05/13/lawmakers-propose-diu-managed-military-testing-and-evaluation-cell/
  21. Greenberg Traurig. "EU AI Act: Key Compliance Considerations Ahead of August 2025." July 2025. https://www.gtlaw.com/en/insights/2025/7/eu-ai-act-key-compliance-considerations-ahead-of-august-2025
  22. DLA Piper. "Latest Wave of Obligations Under the EU AI Act Take Effect." August 2025. https://www.dlapiper.com/en-us/insights/publications/2025/08/latest-wave-of-obligations-under-the-eu-ai-act-take-effect
  23. Future of Privacy Forum. "Conformity Assessment Under the EU AI Act." Working Paper, April 2025. https://fpf.org/wp-content/uploads/2025/04/OT-comformity-assessment-under-the-eu-ai-act-WP-1.pdf
  24. Nemko Digital. "EU AI Act Rules on GPAI: 2025 Update." https://digital.nemko.com/insights/eu-ai-act-rules-on-gpai-2025-update
  25. Stanford Institute for Human-Centered Artificial Intelligence. AI Index Report 2025. https://hai.stanford.edu/ai-index/2025-ai-index-report
  26. Stanford Center for Research on Foundation Models. "HELM: Holistic Evaluation of Language Models." Latest release: December 4, 2025. https://crfm.stanford.edu/helm/capabilities/latest/
  27. Menlo Ventures. "2025: The State of Generative AI in the Enterprise." https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/
  28. Skywork AI. "OpenRouter Review 2025." https://skywork.ai/blog/openrouter-review-2025/
  29. LiteLLM. "Documentation." https://docs.litellm.ai/
  30. LM Arena (formerly LMSYS Chatbot Arena). https://lmarena.ai/
  31. HuggingFace. "Open LLM Leaderboard." https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
  32. IBM. "HELM Enterprise Benchmark." https://github.com/IBM/helm-enterprise-benchmark
  33. Coldewey, D. "Meta Exec Denies the Company Artificially Boosted Llama 4's Benchmark Scores." TechCrunch, April 7, 2025. https://techcrunch.com/2025/04/07/meta-exec-denies-the-company-artificially-boosted-llama-4s-benchmark-scores/
  34. Coldewey, D. "Meta's Benchmarks for Its New AI Models Are a Bit Misleading." TechCrunch, April 6, 2025. https://techcrunch.com/2025/04/06/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading/
  35. Sharwood, S. "Meta's Llama 4 Benchmark 'Bait-n-Switch' Sparks Outrage." The Register, April 8, 2025. https://www.theregister.com/2025/04/08/meta_llama4_cheating/
  36. Claburn, T. "Measuring AI Models Hampered by Bad Science." The Register, November 7, 2025. https://www.theregister.com/2025/11/07/measuring_ai_models_hampered_by/
  37. Collinear AI. "Gaming the System: Goodhart's Law Exemplified in AI Leaderboard Controversy." https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy
  38. Heikkila, M. "How to Build a Better AI Benchmark." MIT Technology Review, May 8, 2025. https://www.technologyreview.com/2025/05/08/1116192/how-to-build-a-better-ai-benchmark/
  39. Kellton. "Why Vendor Lock-In Is Riskier in the GenAI Era and How to Avoid It." 2025. https://www.kellton.com/kellton-tech-blog/why-vendor-lock-in-is-riskier-in-genai-era-and-how-to-avoid-it
  40. Goedecke, S. "Why Arena Leaderboards Are Dominated by Slop." https://www.seangoedecke.com/lmsys-slop/