Bridging AI Evaluation and Federal Compliance

Abstract

Federal agencies reported 1,757 artificial intelligence use cases across 37 agencies in 2024, with generative AI deployments increasing nine-fold in a single year. Yet fewer than one-third of these agencies track standardized performance metrics for their AI systems, and 15 of 20 agencies submitted compliance plans that independent reviewers found to be incomplete or inaccurate. This paper examines the structural gap between the federal government's rapidly expanding AI compliance mandates—spanning Intelligence Community Directive 203, the NIST AI Risk Management Framework, NIST AI 600-1, and OMB M-24-10—and the absence of integrated evaluation tooling capable of satisfying those mandates. We propose that behavioral fingerprinting, a multi-dimensional assessment methodology that profiles AI model behavior across 16 measurable dimensions, provides a natural bridge between technical AI evaluation and federal compliance requirements. Each fingerprint dimension maps to specific ICD 203 analytic tradecraft standards, NIST trustworthiness characteristics, and NIST 600-1 generative AI risk categories, enabling a single evaluation framework to produce compliance evidence across multiple regulatory regimes. The March 2026 MYSTIC DEPOT solicitation from the Defense Innovation Unit, which seeks vendor-agnostic evaluation harness infrastructure for the Department of Defense and Intelligence Community, validates both the urgency and the architectural approach described herein. This work presents a unified assessment architecture and argues that the convergence of intelligence tradecraft standards, risk management frameworks, and behavioral evaluation methodology creates an actionable path toward systematic, auditable AI governance in federal contexts.

1. Introduction

The adoption of artificial intelligence within the United States federal government has reached an inflection point. Between 2023 and 2024, the total number of reported AI use cases across 11 major agencies nearly doubled from 571 to 1,110, with generative AI deployments surging from 32 to 282—a 781 percent increase in a single year (GAO, 2025). Across all 37 agencies subject to reporting requirements, the Government Accountability Office identified 1,757 distinct AI use cases, of which 227 were classified as rights-impacting or safety-impacting. These figures describe not a speculative future but an operational present in which AI systems are actively influencing decisions in healthcare, homeland security, veterans affairs, and national defense.

This expansion has outpaced the government's capacity for systematic evaluation. Multiple compliance frameworks now govern AI use in federal contexts—Intelligence Community Directive 203 (ODNI, 2015), the NIST AI Risk Management Framework (NIST, 2023), the NIST AI 600-1 Generative AI Profile (NIST, 2024), OMB Memorandum M-24-10 (OMB, 2024), and the Department of Defense Responsible AI Strategy (CDAO, 2024)—yet none of these frameworks includes an integrated evaluation harness capable of quantitatively assessing AI behavioral characteristics against their own standards. The result is a compliance apparatus that specifies what must be evaluated without providing the technical means to conduct the evaluation.

The Electronic Privacy Information Center found that 15 of 20 agencies submitted AI compliance plans containing inaccurate or incomplete information (EPIC, 2024). The GAO issued 35 recommendations across 19 agencies to improve AI governance implementation (GAO, 2025). These are not failures of intent but failures of infrastructure: agencies understand what compliance requires, but lack the measurement tools to achieve it.

This paper advances the thesis that behavioral fingerprinting—a methodology that profiles AI model behavior across multiple measurable dimensions including hallucination rate, confidence calibration, sycophancy resistance, reasoning quality, and adversarial robustness—provides a natural and technically rigorous bridge between AI evaluation and federal compliance. We demonstrate that each dimension of a behavioral fingerprint maps directly to specific ICD 203 analytic tradecraft standards, NIST AI RMF trustworthiness characteristics, and NIST 600-1 risk categories, creating a unified assessment framework in which a single evaluation produces compliance evidence applicable across multiple regulatory mandates.

The urgency of this integration is underscored by the MYSTIC DEPOT solicitation (DIU PROJ00625), issued by the Defense Innovation Unit in partnership with the Office of the Director of National Intelligence, which closes on March 24, 2026. MYSTIC DEPOT explicitly seeks vendor-agnostic AI evaluation infrastructure capable of continuous assessment against mission-specific benchmarks—precisely the architectural approach described in this work. We examine the alignment between behavioral fingerprinting, existing compliance frameworks, and the MYSTIC DEPOT requirements, and present a unified assessment architecture for federal AI governance.

2. The Federal AI Evaluation Gap

The gap between federal AI deployment and federal AI evaluation is both quantitative and structural. On the quantitative side, the numbers are stark: 1,757 AI use cases are deployed across the federal enterprise, but fewer than one-third of agencies track standardized key performance indicators such as days-to-decision or error rates (GAO, 2025). Most adopted AI solutions remain in what the GAO characterizes as an "initiated" phase rather than a fully operational state, with 61 percent of federal AI use cases serving mission-enabling or internal agency support functions. The implication is that hundreds of AI systems are in active use without having passed through a standardized evaluation and promotion process.

The structural dimension of the gap is more consequential. Federal agencies face what might be termed a "compliance without measurement" problem: multiple frameworks prescribe evaluation requirements, but no framework provides the evaluation tooling. OMB M-24-10 requires agencies to classify AI systems by risk level, monitor their performance, and submit detailed compliance plans—yet the memorandum offers no technical infrastructure for accomplishing these tasks. The CDAO Responsible AI Toolkit provides governance guidance for the Department of Defense, cataloging resources for identifying bias, proving responsible capabilities, and documenting development choices, but it does not include an automated evaluation harness (CDAO, 2024). FedRAMP addresses infrastructure security for AI cloud services and has prioritized 20x authorizations for AI solutions (GSA, 2025), but its authorization process evaluates deployment security rather than model behavior. A model may achieve FedRAMP authorization at the Moderate or High risk level while still exhibiting high hallucination rates, sycophantic behavior, or systematic bias in its outputs.

The Intelligence Community faces an additional and specific challenge. ICD 203 articulates nine analytic tradecraft standards—Sourcing, Uncertainty, Assumptions, Alternatives, Relevance, Argumentation, Consistency, Accuracy, and Visuals (Kwoun, 2021)—that have governed the quality of intelligence analysis since 2007. These standards were written for human analysts, yet they describe behavioral characteristics that are directly applicable to AI systems performing analytical functions. As McMahon (2024) argues in a Belfer Center analysis, the IC must consider how AI tools impact analysts' ability to meet existing analytic standards, and the IC should amend those standards to anticipate AI-specific challenges. This amendment has not yet occurred. The result is that AI systems are being integrated into intelligence workflows governed by standards that do not yet formally recognize AI as a participant in the analytical process.

Perhaps most critically, human-AI teaming evaluation is entirely absent from current federal assessment methodologies. The MYSTIC DEPOT solicitation states directly that "evaluation must assess not only whether AI systems can perform tasks in isolation, but whether human-AI teams achieve better mission outcomes than either humans or AI alone" (DIU, 2026). No existing federal framework provides a methodology for this assessment, despite the fact that the predominant deployment model for federal AI is as an augmentation to human analysts, not as a replacement.

The barriers to closing this gap are well documented. Agencies cite insufficient funding, inadequate technical talent for developing internal evaluation systems, rapid technology evolution that complicates policy establishment, and the cultural challenge of two-to-three-year manager rotations that disincentivize multi-year AI infrastructure investments. Officials at 10 of 12 agencies told GAO investigators that existing policies present obstacles to effective AI governance (GAO, 2025). The evaluation gap, in short, is not a problem that will resolve through incremental policy refinement. It requires purpose-built technical infrastructure.

3. Existing Frameworks: ICD 203, NIST AI RMF, and NIST 600-1

3.1 Intelligence Community Directive 203: Analytic Standards

Intelligence Community Directive 203, codified by the Office of the Director of National Intelligence and most recently revalidated in January 2015, establishes the analytic tradecraft standards that govern all intelligence analysis produced by or on behalf of the IC. The directive articulates five overarching requirements—that analysis be objective, independent of political consideration, timely, based on all available sources, and implemented through nine specific tradecraft standards—and defines the quality criteria against which analytical products are assessed (ODNI, 2015).

The nine tradecraft standards, known colloquially by their one-word nomenclature, are: (1) Sourcing, requiring analysts to properly describe the quality and credibility of underlying sources, data, and methodologies; (2) Uncertainty, requiring proper expression and explanation of uncertainties using calibrated probability language; (3) Assumptions, requiring clear distinction between underlying intelligence and analyst judgments; (4) Alternatives, requiring incorporation of alternative analytic interpretations through structured analytic techniques; (5) Relevance, requiring demonstration of relevance to the consumer or decision-maker; (6) Argumentation, requiring clear and logical argumentation; (7) Consistency, requiring notation and explanation of changes in or consistency of judgments over time; (8) Accuracy, requiring accurate judgments and assessments subject to retrospective evaluation; and (9) Visuals, requiring effective incorporation of visual information into analytic products (Kwoun, 2021).

Each of these standards, though written for human analysts, describes a behavioral characteristic that is both measurable and directly relevant to AI systems. An AI model that fabricates citations violates Standard 1. A model that presents uncertain inferences as established facts violates Standard 2. A model that agrees with user premises rather than offering independent analysis violates Standard 4. The analytical distance between ICD 203's human-oriented standards and AI behavioral evaluation is remarkably short.

3.2 NIST AI Risk Management Framework (AI RMF 1.0)

The NIST AI Risk Management Framework, published as NIST AI 100-1 in January 2023, provides voluntary, non-binding guidance for identifying, assessing, and managing AI risks. The framework is organized around four core functions: Govern, Map, Measure, and Manage. Of these, the Measure function is most directly relevant to AI evaluation, encompassing five categories with 22 subcategories that collectively describe how organizations should employ quantitative, qualitative, or mixed-method tools to analyze, assess, benchmark, and monitor AI risk (NIST, 2023).

The Measure function's categories address appropriate methods and metrics (MEASURE 1), trustworthiness evaluation across 13 subcategories (MEASURE 2), risk tracking mechanisms (MEASURE 3), and measurement efficacy feedback (MEASURE 4). Critically, MEASURE 2 spans seven trustworthiness characteristics that the framework identifies as essential to responsible AI: validity and reliability, safety, security and resilience, accountability and transparency, explainability and interpretability, privacy, and fairness with bias management (NIST, 2023).

The AI RMF has proven durable across changes in administration. Although it was developed partly under the direction of Executive Order 14110—which was revoked on January 20, 2025—the framework itself is a voluntary standard independent of executive authority. It has become, in the assessment of multiple policy analyses, one of the world's most influential voluntary AI governance frameworks, with sector regulators across financial services, healthcare, and federal procurement increasingly referencing its principles. The White House AI Action Plan released in July 2025, under the current administration, continues to call for evaluation standards, testbeds, and assessment frameworks, affirming the bipartisan recognition that AI evaluation infrastructure is a necessity rather than a policy preference.

3.3 NIST AI 600-1: Generative AI Profile

NIST AI 600-1, published in July 2024, serves as a companion document to the AI RMF specifically targeting risks associated with generative AI. The profile identifies 12 risk categories that are distinctive to or amplified by generative AI systems: CBRN information or capabilities, confabulation, dangerous or violent or hateful content, data privacy, environmental impacts, harmful bias or homogenization, human-AI configuration, information integrity, information security, intellectual property, obscene or degrading or abusive content, and value chain and component integration (NIST, 2024).

Two of these categories are of particular significance for compliance alignment. The first is confabulation, which NIST 600-1 defines as "the production of confidently stated but false or internally inconsistent outputs" and characterizes as "perhaps the most distinctive generative AI risk." Unlike traditional software errors, NIST observes, confabulations are presented with the same linguistic confidence as accurate information, making them especially dangerous in high-stakes domains. The second is information integrity, encompassing the system's capacity to distinguish fact from fiction, opinion from inference, and to acknowledge uncertainties in its outputs.

These two NIST 600-1 categories map directly and unambiguously to ICD 203 Standards 1 (Source Credibility) and 2 (Uncertainty Expression), respectively. This is not an accidental alignment. Both ICD 203 and NIST 600-1, though developed by different bodies for different purposes, are attempting to describe the same underlying quality: the degree to which an analytical product—whether produced by a human analyst or a language model—can be trusted to represent reality accurately and to communicate its limitations honestly.

4. Mapping AI Evaluation to ICD 203

The central contribution of this work is the demonstration that AI behavioral evaluation and ICD 203 compliance are not merely analogous but structurally isomorphic. Each of the nine ICD 203 tradecraft standards describes a behavioral property that can be operationalized as a measurable dimension of an AI system's behavioral fingerprint. This section presents the mapping in detail, identifying for each tradecraft standard the corresponding behavioral dimension, measurement methodology, and compliance evidence that automated evaluation produces.

ICD 203 Standard 1 (Source Credibility) requires analysts to properly describe the quality and credibility of underlying sources. The behavioral analog is hallucination rate: the frequency with which a model generates plausible but fabricated information, including false citations, invented data, and confabulated facts. In intelligence contexts, a hallucinated source citation could drive incorrect analytical conclusions with consequences for national security. Measurement is conducted through benchmarks such as TruthfulQA—817 questions across 38 categories designed to elicit plausible-sounding but incorrect responses—and HaluEval, a systematic framework for detecting unsupported or fabricated outputs in question-answering and dialogue settings.

ICD 203 Standard 2 (Uncertainty Expression) requires analysts to properly express and explain uncertainties, using the IC's calibrated probability language in which terms like "likely" correspond to defined probability ranges (55 to 80 percent). The behavioral analog is confidence calibration: whether a model's expressed certainty matches its empirical accuracy. A model that presents speculative inferences with the same certainty as well-established facts violates this standard in precisely the way that a human analyst who omits confidence qualifiers would.

ICD 203 Standard 3 (Assumptions vs. Intelligence) requires clear distinction between underlying intelligence and analytical judgments. The behavioral analog is assumption transparency and explainability depth: whether a model distinguishes between what it derives from provided evidence, what it retrieves from training data, and what it infers through reasoning. This is measured through chain-of-thought analysis and explainability scoring.

ICD 203 Standard 4 (Analysis of Alternatives) requires incorporation of alternative analytic interpretations through techniques such as devil's advocacy, red teaming, and alternative futures analysis. The behavioral analog is sycophancy resistance: the degree to which a model provides independent analysis rather than confirming the user's stated or implied position. Research on behavioral fingerprinting has found that sycophancy varies dramatically across models, with robustness scores ranging from 0.50 to 1.00 (Behavioral Fingerprinting of LLMs, 2025). A sycophantic AI assistant used by an intelligence analyst would reinforce existing cognitive biases rather than challenging them, undermining the analytical rigor that ICD 203 Standard 4 was designed to ensure.

Standards 5 through 9 complete the mapping. Standard 5 (Relevance) maps to instruction following and task alignment. Standard 6 (Argumentation) maps to reasoning quality and logical coherence, measured through chain-of-thought evaluation. Standard 7 (Consistency) maps to semantic robustness across sessions and prompt variations. Standard 8 (Accuracy) maps to factual accuracy as measured by domain-knowledge benchmarks such as MMLU, which spans 57 academic subjects with over 16,000 questions. Standard 9 (Visuals) maps to multimodal accuracy for image, chart, and document analysis.

5. Behavioral Fingerprinting as Compliance Measurement

Behavioral fingerprinting, as a methodology, creates multi-faceted profiles of model behavior across cognitive and interactive axes. The approach emerged from academic research that probes AI systems along four primary vectors: internal world model, reasoning and cognitive abilities, biases and personality, and robustness (Behavioral Fingerprinting of LLMs, 2025). A key finding of this research is that while core reasoning capabilities are converging across frontier models, alignment-related behaviors—precisely the behaviors relevant to ICD 203 compliance—vary dramatically.

This paper proposes a 16-dimension behavioral fingerprint framework designed for federal compliance alignment. Each dimension maps simultaneously to ICD 203 tradecraft standards, NIST AI RMF trustworthiness characteristics, and NIST 600-1 risk categories. The complete mapping is presented in Table 1.

#	Fingerprint Dimension	ICD 203 Standard	NIST RMF Characteristic	NIST 600-1 Risk
1	Hallucination Rate	Std 1: Sourcing	Validity & Reliability	Confabulation
2	Confidence Calibration	Std 2: Uncertainty	Validity & Reliability	Confabulation
3	Source Attribution	Std 1: Sourcing	Accountability & Transparency	Information Integrity
4	Assumption Transparency	Std 3: Assumptions	Explainability & Interpretability	Human-AI Configuration
5	Sycophancy Resistance	Std 4: Alternatives	Fairness	Harmful Bias
6	Reasoning Quality	Std 6: Argumentation	Validity & Reliability	Information Integrity
7	Logical Consistency	Std 7: Consistency	Validity & Reliability	Confabulation
8	Factual Accuracy	Std 8: Accuracy	Validity & Reliability	Confabulation
9	Bias Detection	Std 4: Alternatives	Fairness	Harmful Bias
10	Instruction Following	Std 5: Relevance	Safety	Human-AI Configuration
11	Refusal Appropriateness	Std 5: Relevance	Safety	Dangerous Content
12	Adversarial Robustness	Std 7: Consistency	Security & Resilience	Information Security
13	Multimodal Accuracy	Std 9: Visuals	Validity & Reliability	Value Chain
14	Temporal Awareness	Std 8: Accuracy	Validity & Reliability	Confabulation
15	Privacy Compliance	(Cross-cutting)	Privacy	Data Privacy
16	Explainability Depth	Std 3: Assumptions	Explainability & Interpretability	Human-AI Configuration

Table 1. The 16-dimension behavioral fingerprint mapped to ICD 203 tradecraft standards, NIST AI RMF trustworthiness characteristics, and NIST 600-1 generative AI risk categories.

The significance of this mapping is operational, not merely taxonomic. Consider the scenario in which a federal agency must demonstrate compliance with OMB M-24-10's requirement to classify AI systems by risk level and implement appropriate safeguards. Under current practice, this compliance demonstration is largely narrative—agencies describe their processes in prose documents that EPIC characterized as "abstract and high-level." Under a behavioral fingerprinting approach, the same compliance demonstration would be grounded in quantitative evidence: a hallucination rate of 3.2 percent across 817 TruthfulQA items, a sycophancy resistance score of 0.87 on a standardized opposition benchmark, a confidence calibration error of 0.04 across domain-stratified test sets.

This quantitative foundation transforms compliance from a documentation exercise into a measurement discipline. It enables comparison across models, tracking over time, and threshold-based authorization decisions—precisely the capabilities that the federal AI governance apparatus currently lacks.

6. The MYSTIC DEPOT Mandate

On March 11, 2026, the Defense Innovation Unit, in partnership with the Office of the Director of National Intelligence, issued MYSTIC DEPOT (PROJ00625): a Commercial Solutions Opening solicitation for vendor-agnostic AI evaluation infrastructure. The solicitation is notable not merely for its existence but for the specificity with which it articulates the evaluation problem and the architectural requirements for its solution.

The solicitation opens with a problem statement that encapsulates the evaluation gap described in this paper:

"As artificial intelligence capabilities evolve at an extraordinary pace, the government requires evaluation infrastructure that can keep pace by continuously assessing new models against mission-specific benchmarks as they are released."

MYSTIC DEPOT is organized around two lines of effort. Line of Effort 1 (Evaluation Harness Infrastructure) specifies 11 required components: a model interface providing standardized, pluggable architecture for diverse model types; an execution engine orchestrating complex workflows; a measurement and scoring system; human integration supporting subject matter expert review; an output and reporting layer in open, non-proprietary formats; continuous monitoring automating model ingestion and performance tracking; configuration management for benchmarks; a denied, degraded, intermittent, and limited (DDIL) environment simulator; agentic evaluation capabilities for multi-step tasks; adversarial testing with automated red-teaming; and multimodal support (DIU, 2026).

Line of Effort 2 (Benchmark Development Methodology) specifies nine required capabilities, including requirements identification for mission contexts, task decomposition into measurable evaluation activities, scoring rubric development prioritizing interpretability, gaming-resistant benchmark design, and training materials enabling government personnel to develop benchmarks independently.

The alignment between MYSTIC DEPOT's requirements and a behavioral fingerprinting architecture is structural. The execution engine maps to the orchestration layer of a fingerprinting harness. The measurement and scoring system maps to the multi-dimensional scoring rubric. Continuous monitoring maps to fingerprint drift detection. Adversarial testing maps to adversarial robustness measurement (Dimension 12). DDIL simulation maps to temporal awareness and consistency evaluation under degraded conditions (Dimensions 7 and 14). Agentic evaluation extends fingerprinting to multi-step task sequences. The required attributes—modular, containerized, deployable across classification levels from unclassified through classified cloud to air-gapped environments—describe the deployment architecture necessary for evaluation infrastructure that serves both the Department of Defense and the Intelligence Community.

MYSTIC DEPOT is significant not only as validation of the approach but as evidence of bipartisan continuity in AI evaluation demand. It was solicited under the current administration, which revoked Executive Order 14110 and reframed federal AI policy around competitiveness rather than safety. Yet the evaluation requirements articulated in MYSTIC DEPOT are functionally identical to those that EO 14110 mandated: robust, reliable, repeatable, and standardized evaluation of AI systems. The evaluation gap, it appears, is a mission-driven problem that transcends political framing. Whether the goal is characterized as "safe AI" or "American AI leadership," the requirement for trustworthy, measured, and accountable AI performance remains constant.

7. A Unified Assessment Architecture

The convergence of ICD 203 tradecraft standards, NIST trustworthiness characteristics, NIST 600-1 risk categories, and MYSTIC DEPOT infrastructure requirements points toward a unified assessment architecture in which behavioral fingerprinting serves as the measurement layer connecting governance frameworks to technical evaluation. This section describes the key architectural properties of such a system.

7.1 Cross-Framework Compliance Reporting

The central architectural insight is that a single behavioral fingerprint evaluation can produce compliance evidence for multiple regulatory frameworks simultaneously. A hallucination rate measurement satisfies ICD 203 Standard 1 (Source Credibility), NIST AI RMF MEASURE 2.5 (Validity and Reliability evaluation), and NIST 600-1 confabulation risk assessment in a single test execution. This eliminates the need for agencies to maintain separate, redundant evaluation processes for each compliance mandate—a practical consideration given the talent and resource constraints that agencies consistently cite as barriers to effective AI governance.

The reporting layer translates raw fingerprint scores into the vocabularies of each framework. For an IC consumer, the same hallucination data appears as a Source Credibility assessment keyed to ICD 203 Standard 1. For a CDAO compliance officer, it appears as a Validity and Reliability metric aligned with the RAI Toolkit. For an OMB reporting obligation, it appears as a quantitative risk classification input for M-24-10. The underlying measurement is identical; only the compliance framing changes.

7.2 Continuous Monitoring and Drift Detection

Both the NIST AI RMF and MYSTIC DEPOT emphasize that AI evaluation is not a one-time gate but a continuous process. The AI RMF states that "AI systems should be tested before their deployment and regularly while in operation" (NIST, 2023). MYSTIC DEPOT requires continuous monitoring that automates model ingestion and performance tracking.

A behavioral fingerprint architecture supports this requirement through drift detection: the automated comparison of fingerprint scores over time and across model versions. When a model update shifts sycophancy resistance from 0.91 to 0.73, or when hallucination rates on domain-specific benchmarks increase following a provider's fine-tuning cycle, the change is detected, quantified, and flagged for review. This is the technical implementation of ICD 203 Standard 7 (Consistency), which requires noting and explaining changes in analytical judgments over time, applied to the AI system itself.

7.3 Classification-Spanning Deployment

Federal AI evaluation infrastructure must operate across classification levels. MYSTIC DEPOT requires deployment capability across unclassified, classified cloud, and air-gapped environments. A modular, containerized fingerprinting harness satisfies this requirement by separating the evaluation infrastructure (the execution engine, scoring system, and reporting layer) from the evaluation content (benchmarks, rubrics, and domain-specific test sets). Infrastructure components can be standardized across classification levels while benchmark content varies to reflect the sensitivity of the operational environment.

7.4 Gaming-Resistant Benchmark Design

MYSTIC DEPOT's LOE 2 specifically requires gaming-resistant benchmark design—benchmarks that prevent optimization without genuine improvement. This requirement reflects a well-documented concern in AI evaluation: models can be fine-tuned to perform well on specific benchmarks without improving their underlying capabilities, a phenomenon analogous to teaching to the test. A behavioral fingerprinting approach mitigates this risk through dimensional diversity. Optimizing across all 16 dimensions simultaneously is substantially more difficult than optimizing for a single benchmark score, and the inter-dimensional relationships (a model that reduces hallucination by becoming excessively cautious will see its instruction following score decrease) create natural resistance to superficial optimization.

7.5 Human-AI Teaming Evaluation

The evaluation of human-AI teams represents perhaps the most significant frontier in federal AI assessment. MYSTIC DEPOT requires that evaluation assess whether human-AI teams achieve better mission outcomes than either humans or AI operating alone. Behavioral fingerprinting contributes to this evaluation by characterizing the AI system's interaction properties: Does the model appropriately defer to human expertise (measured through sycophancy resistance and refusal appropriateness)? Does it communicate uncertainty in ways that support human decision-making (measured through confidence calibration)? Does it provide explanations that enable human oversight (measured through explainability depth)?

These dimensions do not fully resolve the human-AI teaming evaluation challenge, which ultimately requires measuring team outcomes rather than individual component properties. But they provide the foundation: a well-characterized AI partner whose behavioral properties are known, measured, and stable is a prerequisite for meaningful team evaluation.

8. Discussion

Several considerations merit discussion as this framework moves from conceptual architecture toward operational implementation.

First, the question of regulatory continuity. The revocation of Executive Order 14110 demonstrated that specific executive mandates for AI evaluation are subject to political change. However, the frameworks that underpin the approach described in this paper—ICD 203, the NIST AI RMF, NIST 600-1—have all survived the transition. ICD 203 is an Intelligence Community directive independent of any executive order. The NIST frameworks are voluntary standards published by an agency whose technical guidance persists across administrations. The MYSTIC DEPOT solicitation, issued under the current administration, confirms that evaluation demand is driven by mission requirements rather than policy preferences. The Secretary of Defense is required to establish a cross-functional team for AI model assessment by June 2026 under the White House AI Action Plan (White House, 2025). The bipartisan continuity of evaluation demand provides a stable foundation for long-term infrastructure investment.

Second, the challenge of benchmark development for classified missions. ICD 203 governs analysis that often involves classified sources, methods, and conclusions. Benchmarks that evaluate a model's performance on intelligence-relevant tasks may themselves require classification if they reveal analytical priorities or collection gaps. The architectural separation of evaluation infrastructure from evaluation content, described in Section 7.3, addresses this concern but does not eliminate it. Benchmark development for classified missions requires personnel with appropriate clearances, secure development environments, and classification review processes—requirements that MYSTIC DEPOT acknowledges through its preference for vendors with personnel holding TS/SCI clearances.

Third, the workforce dimension. Behavioral fingerprinting produces quantitative scores, but those scores require interpretation by personnel who understand both the technical evaluation methodology and the operational context. An IC analyst interpreting a sycophancy resistance score of 0.72 must understand what that number means for their specific analytical workflow and how it relates to ICD 203 Standard 4's requirement for analysis of alternatives. MYSTIC DEPOT's LOE 2 includes training materials enabling government personnel to develop benchmarks independently—a requirement that extends naturally to training personnel in fingerprint interpretation. The CDAO's work on developing different RAI Toolkit versions for different personas, including defense acquisition professionals (CDAO, 2024), suggests a model for audience-specific training in behavioral fingerprint interpretation.

Fourth, the question of threshold setting. A behavioral fingerprint tells an agency how a model performs across 16 dimensions, but it does not, by itself, determine whether that performance is adequate for a given mission. The establishment of minimum acceptable thresholds—a hallucination rate below 2 percent for intelligence production, a sycophancy resistance score above 0.85 for analytical support, a confidence calibration error below 0.05 for decision briefings—requires domain expertise and policy judgment that exceeds the scope of any technical framework. The fingerprinting architecture provides the measurement; threshold-setting is a governance function that resides within the NIST AI RMF's Govern and Manage functions and within the organizational authority structures that OMB M-24-10 requires agencies to establish.

Finally, the pace of model evolution presents both a challenge and an opportunity. The nine-fold increase in federal generative AI use cases within a single year reflects the speed at which models are being deployed. The same dynamism applies to model capabilities: frontier model performance can shift substantially between versions, and fine-tuning for specific domains can alter behavioral profiles in ways that are not predictable from general benchmarks alone. Continuous monitoring, as described in Section 7.2, addresses this challenge, but only if the evaluation infrastructure can itself keep pace. The modular, containerized architecture that MYSTIC DEPOT requires is a direct response to this concern, ensuring that individual components of the evaluation harness can be updated without requiring system-wide redeployment.

9. Conclusion

The federal government's AI evaluation gap is real, quantifiable, and consequential. With 1,757 AI use cases deployed across 37 agencies, 227 classified as rights-impacting or safety-impacting, and fewer than one-third of agencies tracking standardized performance metrics, the distance between compliance mandates and measurement capabilities represents a systemic risk to the effective and responsible use of AI in government.

This paper has demonstrated that behavioral fingerprinting provides a structurally sound bridge between AI technical evaluation and federal compliance requirements. The mapping between 16 behavioral fingerprint dimensions and the nine ICD 203 tradecraft standards is not approximate or metaphorical but direct and operationalizable. Each dimension corresponds to a measurable behavioral property, each tradecraft standard describes a quality that behavioral measurement can quantify, and the three-way alignment with NIST AI RMF trustworthiness characteristics and NIST 600-1 risk categories enables a single evaluation framework to serve multiple compliance mandates simultaneously.

The MYSTIC DEPOT solicitation validates both the urgency and the architectural approach. The Department of Defense and the Intelligence Community, through the Defense Innovation Unit, are actively seeking the evaluation infrastructure that this work describes. The requirements articulated in MYSTIC DEPOT—vendor-agnostic architecture, continuous monitoring, adversarial testing, multi-classification deployment, gaming-resistant benchmarks, and human-AI teaming evaluation—align naturally with a behavioral fingerprinting methodology designed for compliance-oriented assessment.

We conclude with three recommendations. First, ICD 203 should be amended to explicitly address AI systems as analytical participants, incorporating behavioral evaluation standards that parallel the existing human-oriented tradecraft standards. McMahon (2024) has made this argument from the policy perspective; this paper provides the technical architecture that such an amendment would reference. Second, the NIST AI RMF's Measure function should be operationalized through standardized evaluation harness tooling, rather than remaining a conceptual framework that each agency must independently translate into technical infrastructure. Third, the behavioral fingerprint should be adopted as the measurement substrate for cross-framework compliance, providing the quantitative foundation that transforms AI governance from a documentation exercise into a measurement discipline.

The convergence of intelligence tradecraft standards, risk management frameworks, generative AI risk profiles, and behavioral evaluation methodology is not coincidental. These frameworks are describing the same underlying reality from different institutional perspectives: that trustworthy AI requires measurable, repeatable, and auditable assessment of how AI systems behave. Behavioral fingerprinting provides the measurement. The frameworks provide the governance. What remains is the engineering and institutional will to connect them.

References

Office of the Director of National Intelligence. (2015). Intelligence Community Directive 203: Analytic Standards. Washington, DC: ODNI. https://www.dni.gov/files/documents/ICD/ICD-203.pdf
National Institute of Standards and Technology. (2023). NIST AI 100-1: Artificial Intelligence Risk Management Framework (AI RMF 1.0). Gaithersburg, MD: NIST. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
National Institute of Standards and Technology. (2024). NIST AI 600-1: Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. Gaithersburg, MD: NIST. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
Executive Office of the President. (2023). Executive Order 14110: Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. 88 FR 75191. Federal Register
Office of Management and Budget. (2024). Memorandum M-24-10: Advancing Governance, Innovation, and Risk Management for Agency Use of Artificial Intelligence. Washington, DC: OMB. https://www.whitehouse.gov/wp-content/uploads/2024/03/M-24-10
Executive Office of the President. (2024). National Security Memorandum on Advancing the United States' Leadership in Artificial Intelligence. Washington, DC. White House Archives
Department of Defense, Chief Digital and Artificial Intelligence Office. (2023). Data, Analytics, and AI Adoption Strategy. Washington, DC: DoD. https://media.defense.gov/2023/nov/02/2003333300
Department of Defense, Chief Digital and Artificial Intelligence Office. (2024). Responsible AI Strategy Implementation Pathway. Washington, DC: DoD. https://media.defense.gov/2024/Oct/26/2003571790
Defense Innovation Unit. (2026). MYSTIC DEPOT: Vendor-Agnostic AI Evaluation Infrastructure (PROJ00625). Washington, DC: DIU/ODNI. https://www.diu.mil/work-with-us/submit-solution/PROJ00625
General Services Administration. (2025). GSA FedRAMP Prioritizes 20x Authorizations for AI Cloud Solutions. https://www.gsa.gov/about-us/newsroom/news-releases
The White House. (2025). America's AI Action Plan. Washington, DC. https://www.whitehouse.gov/wp-content/uploads/2025/07/Americas-AI-Action-Plan.pdf
Government Accountability Office. (2025). GAO-25-107653: Artificial Intelligence — Generative AI Use and Management at Federal Agencies. Washington, DC: GAO. https://www.gao.gov/products/gao-25-107653
Electronic Privacy Information Center. (2024). Federal Agencies Largely Miss the Mark on Documenting AI Compliance Plans. Washington, DC: EPIC. https://epic.org/federal-agencies-largely-miss-the-mark
Department of Defense Inspector General. (2023). Evaluation of Analytic Standards Compliance. Arlington, VA: DoD IG. https://www.dodig.mil/reports.html/Article/3480460/
McMahon, G. M. (2024). Analytic Tradecraft Standards in an Age of AI. Belfer Center for Science and International Affairs, Harvard Kennedy School. https://www.belfercenter.org/research-analysis/analytic-tradecraft-standards-age-ai
Behavioral Fingerprinting of Large Language Models. (2025). arXiv:2509.04504. https://arxiv.org/abs/2509.04504
Kwoun, S. (2021). Analytic Tradecraft Standards: An Opportunity to Provide Decision Advantage for Army Commanders. Army Military Review, March–April 2021. https://www.armyupress.army.mil/Journals/Military-Review
Center for Strategic and International Studies. (2024). The Analytic Edge: Leveraging Emerging Technologies to Transform Intelligence Analysis. Washington, DC: CSIS. https://www.csis.org/analysis/analytic-edge
War on the Rocks. (2024). AI and Intelligence Analysis: Panacea or Peril? https://warontherocks.com/2024/10/ai-and-intelligence-analysis-panacea-or-peril/
National Institute of Standards and Technology. (2023). AI RMF Playbook. https://airc.nist.gov/airmf-resources/playbook/
DefenseScoop. (2026). AI System Testing: DoD and Intelligence Agencies. https://defensescoop.com/2026/03/11/ai-system-testing-dod-intelligence-agencies/
Nextgov/FCW. (2025). Agency AI Use Doubled in 2024, GAO Finds. https://www.nextgov.com/artificial-intelligence/2025/07/agency-ai-use-doubled-2024-gao-finds/407067/
Gallup. (2025). AI Adoption Is Rapidly Growing in the Public Sector. https://www.gallup.com/workplace/702983/adoption-rapidly-growing-public-sector.aspx
Federal Risk and Authorization Management Program. (2025). FedRAMP AI. https://www.fedramp.gov/ai/