MAP-2.3: Scientific integrity and TEVV considerations are identified and documented, including those related to experimental design, data collection and selection (e.g., availability, representativeness, suitability), system trustworthiness

To meet MAP-2.3, you must explicitly identify and document how you will preserve scientific integrity and perform TEVV (test, evaluation, verification, and validation) across experimental design, data sourcing/selection, and trustworthiness, including construct validation. Operationally, this means creating a repeatable TEVV plan and evidence trail that ties datasets, model goals, tests, and decisions together in an auditable record. (NIST AI RMF Core)

Key takeaways:

  • Treat MAP-2.3 as a documentation-and-evidence control, not a one-time “model validation” task. (NIST AI RMF Core)
  • Build a TEVV plan that covers experimental design, data representativeness/suitability, and trustworthiness claims you make about the system. (NIST AI RMF Core)
  • Retain artifacts that show what you tested, what you found, what you changed, and who approved release. (NIST AI RMF Core)

MAP-2.3 sits in the “Map” function of the NIST AI Risk Management Framework and forces a discipline many programs skip: you must be able to show, on paper, that your AI development and deployment choices are scientifically defensible and that TEVV activities were planned and executed to support system trustworthiness. The requirement is not satisfied by a single model card, an informal data scientist checklist, or a generic QA process. It expects you to identify the scientific integrity risks in your specific use case and document the tests, evaluation methods, verification checks, and validation criteria that address them. (NIST AI RMF Core)

For a Compliance Officer, CCO, or GRC lead, the fastest path is to operationalize MAP-2.3 as a control with: (1) a standard TEVV template, (2) clear ownership across product, data science, and risk, (3) gated approvals tied to release management, and (4) recurring evidence collection. Done well, MAP-2.3 becomes your defensible narrative when asked: “Why should we trust this system, given the data and design choices you made?” (NIST AI RMF Core; NIST AI RMF program page)

Regulatory text

Requirement (excerpt): “Scientific integrity and TEVV considerations are identified and documented, including those related to experimental design, data collection and selection (e.g., availability, representativeness, suitability), system trustworthiness, and construct validation.” (NIST AI RMF Core)

What the operator must do: Maintain written, reviewable documentation showing (1) how experiments were designed to avoid biased or misleading conclusions, (2) how data was collected/selected and why it is appropriate for the intended use (including representativeness and suitability), and (3) what TEVV activities were planned and executed to support the trustworthiness of the system, including evidence that your constructs are valid (you measured what you intended to measure). (NIST AI RMF Core)

Plain-English interpretation (what MAP-2.3 really demands)

MAP-2.3 requires an auditable chain from intent → construct → data → experiment → TEVV → trustworthiness claims → release decision.

In practice, examiners, internal audit, and second-line reviewers will probe four failure modes:

  1. Unstated constructs: Teams can’t explain what the model is truly predicting (construct validation gap). (NIST AI RMF Core)
  2. Data mismatch: Training or evaluation data does not reflect the deployment population or conditions (representativeness/suitability gap). (NIST AI RMF Core)
  3. Weak experimental design: Tests are ad hoc, cherry-picked, or not reproducible (scientific integrity gap). (NIST AI RMF Core)
  4. Trustworthiness claims without proof: “Accurate,” “fair,” “robust,” or “safe” is asserted without TEVV evidence and acceptance criteria. (NIST AI RMF Core)

Who it applies to (entity and operational context)

Applies to: Any organization developing or deploying AI systems under the NIST AI RMF. (NIST AI RMF Core)

Operational contexts where MAP-2.3 becomes high-friction:

  • Regulated decisioning (credit, hiring, healthcare triage, insurance, fraud) where representativeness and construct validity are frequently challenged.
  • LLM/RAG systems where data provenance, evaluation design, and safety testing are often immature.
  • Third-party AI components (hosted models, APIs, data brokers) where you still need TEVV evidence to justify trustworthiness claims you operationalize.

Control ownership (typical):

  • First line: Product + ML engineering + data science (design and execution of TEVV).
  • Second line: GRC / Model risk / Compliance (standards, challenge, evidence expectations).
  • Third line: Internal audit (independent testing of control design and operating effectiveness).

What you actually need to do (step-by-step)

Step 1: Define the “trustworthiness claim set” for the system

Document which trustworthiness characteristics you claim are necessary for the use case (for example: accuracy, robustness, reliability, safety, explainability, privacy alignment, and resilience), and tie each to a risk statement and a measurable acceptance criterion. Keep this short and system-specific. (NIST AI RMF Core)

Output: Trustworthiness Claims & Acceptance Criteria sheet.

Step 2: Document construct validation (what is the model measuring?)

Write a one-page “construct brief”:

  • Target label definition (what counts, what doesn’t).
  • Known proxies and why they are acceptable (or prohibited).
  • Stakeholder review notes (domain owner sign-off).
  • Known limitations and potential unintended measurement.

This is where many teams fail: they validate metrics but never validate the construct. (NIST AI RMF Core)

Output: Construct brief with approvals.

Step 3: Formalize experimental design for TEVV

Create a TEVV protocol that makes tests reproducible:

  • Hypotheses (what you expect, what would falsify it).
  • Dataset splits and rationale (time-based split, random split, stratification).
  • Baselines and comparators.
  • Primary metrics and secondary “harm” metrics.
  • Statistical considerations and sensitivity checks (qualitative if you can’t support quant detail).
  • Reproducibility requirements (seed control, version pinning, environment capture).

Output: TEVV protocol (version controlled). (NIST AI RMF Core)

Step 4: Data collection and selection documentation (availability, representativeness, suitability)

Maintain a dataset record that addresses:

  • Source and provenance (internal system, third party, synthetic generation, user feedback loops).
  • Data availability constraints and missingness patterns.
  • Representativeness analysis: how the data compares to the deployment population and edge cases you care about.
  • Suitability: why this data supports the construct and the intended use, and where it does not.
  • Selection/exclusion rules and rationale (including de-duplication, filtering, labeling guidelines).

If you rely on third-party data or a third-party model provider, document what you can verify independently and what you must accept contractually, then reflect that in residual risk. (NIST AI RMF Core)

Output: Dataset dossier(s) + third-party due diligence addendum where relevant.

Step 5: Execute TEVV with traceable results

Run tests and preserve the evidence:

  • Test runs mapped to acceptance criteria.
  • Failure triage notes (what failed, why, severity).
  • Remediation actions (data changes, training changes, guardrails, prompt changes).
  • Regression testing results after fixes.
  • Release recommendation with sign-offs.

Output: TEVV results package tied to a specific release candidate. (NIST AI RMF Core)

Step 6: Put MAP-2.3 on rails (governance + recurring evidence)

Operationalize as a control:

  • Assign a control owner.
  • Define when TEVV is required (new system, major model update, material data change, new use case, new deployment region).
  • Add gating to SDLC/MLops (no promotion to production without required artifacts).
  • Set a recurring cadence to refresh TEVV when drift or upstream changes occur.

If you need a pragmatic system to track owners, collect recurring evidence, and prove control operation across many AI use cases and third parties, Daydream is a natural place to map MAP-2.3 to a policy, procedure, owner, and evidence workflow without rebuilding the same checklists in spreadsheets. (NIST AI RMF Core)

Required evidence and artifacts to retain

Use this as your MAP-2.3 evidence checklist:

Artifact What it proves Owner
Trustworthiness claims + acceptance criteria You defined what “trustworthy” means for this system Product/Risk
Construct brief (construct validation) You validated what is being measured/predicted Domain owner + Data science
TEVV protocol (experimental design) Tests were pre-defined and reproducible Data science/ML eng
Dataset dossier (availability/representativeness/suitability) Data selection was reasoned and documented Data engineering/Data governance
TEVV run logs + results Tests were executed and results recorded ML eng
Issue log + remediation notes Failures were managed and fixed Product/Eng
Release sign-offs and residual risk acceptance Decision accountability Product + Risk/Compliance

All artifacts should be version-controlled and linked to the system version and data version they cover. (NIST AI RMF Core)

Common exam/audit questions and hangups

Expect variants of:

  • “Show me how you decided which trustworthiness attributes matter for this use case.” (NIST AI RMF Core)
  • “Where is construct validation documented, and who approved the label/proxy choices?” (NIST AI RMF Core)
  • “How do you know the evaluation data represents production conditions?” (NIST AI RMF Core)
  • “Which TEVV tests are required before release, and what are the pass/fail thresholds?” (NIST AI RMF Core)
  • “What changed since the last validation, and how did you decide whether re-testing was needed?” (NIST AI RMF Core)

Hangups that slow reviews:

  • TEVV artifacts exist but are scattered across notebooks, tickets, and chat.
  • Tests were run, but acceptance criteria were written after the fact.
  • Dataset provenance is incomplete, especially where third parties are involved.

Frequent implementation mistakes and how to avoid them

  1. Mistake: Treating TEVV as only accuracy testing.
    Avoidance: Map each trustworthiness claim to at least one verification/validation activity and retain evidence. (NIST AI RMF Core)

  2. Mistake: No construct validation, only metric validation.
    Avoidance: Require a construct brief and domain-owner sign-off before model training begins. (NIST AI RMF Core)

  3. Mistake: Representativeness discussed qualitatively with no structured comparison.
    Avoidance: Standardize a representativeness section in the dataset dossier (population, sampling frame, known skews, expected deployment differences). (NIST AI RMF Core)

  4. Mistake: “We can’t test because it’s a third-party model.”
    Avoidance: Document what you tested at the system level (prompting, guardrails, business rules, monitoring) and what you rely on contractually; capture residual risk acceptance. (NIST AI RMF Core)

Enforcement context and risk implications (practical, non-speculative)

NIST AI RMF is a framework, not a regulator, so this requirement is typically enforced through internal governance, customer diligence, procurement requirements, and sector-specific regulators using documentation and testing expectations as a benchmark. Your real risk is not “MAP-2.3 fines”; it is failed model governance reviews, procurement blocks, contractual non-compliance, and inability to defend your system’s trustworthiness claims under scrutiny. (NIST AI RMF Core; NIST AI RMF program page)

Practical execution plan (30/60/90 days)

First 30 days (foundation)

  • Name a MAP-2.3 control owner and approvers (product, data science, risk).
  • Publish templates: Trustworthiness claims sheet, TEVV protocol, dataset dossier, construct brief.
  • Pilot on one high-visibility AI system and collect artifacts end-to-end. (NIST AI RMF Core)

By 60 days (operationalize and gate)

  • Embed TEVV and data documentation into SDLC/MLops release gates.
  • Build an evidence register that ties artifacts to system versions and change events.
  • Train reviewers on what “good” looks like and how to challenge construct and representativeness claims. (NIST AI RMF Core)

By 90 days (scale and sustain)

  • Expand to remaining in-scope systems and third-party AI dependencies.
  • Add recurring triggers for re-TEVV (material data changes, model updates, new deployment context).
  • Run an internal audit-style tabletop: pick a system, request the MAP-2.3 package, and test retrieval speed and completeness. (NIST AI RMF Core)

Frequently Asked Questions

What counts as TEVV evidence for an LLM feature where classic accuracy metrics don’t fit?

TEVV can include structured red-teaming results, safety evaluations tied to acceptance criteria, and regression tests for known failure modes. MAP-2.3 cares that you identified and documented the approach and can reproduce it. (NIST AI RMF Core)

Do we need to document representativeness if the system is internal-only?

Yes, if the system impacts people, operations, or decisions, you still need to show the data matches the operational population and conditions. Internal-only systems still fail when training data differs from real workflows. (NIST AI RMF Core)

How detailed should construct validation be?

One page is often enough if it clearly defines the target, identifies proxies, and includes domain-owner approval and known limitations. The key is proving you measured what you intended to measure. (NIST AI RMF Core)

What if we can’t access a third party model’s training data?

Document the limitation, test what you can at the system boundary, and record what assurances you rely on from the third party. Capture residual risk acceptance tied to the trustworthiness claims you still make. (NIST AI RMF Core)

Who should approve MAP-2.3 documentation?

Assign approvals to the people accountable for the construct (business/domain), the technical validity (data science/engineering), and risk acceptance (risk/compliance). Approvals should align to your release governance. (NIST AI RMF Core)

Can we satisfy MAP-2.3 with a single model card?

A model card can be part of the evidence set, but MAP-2.3 expects documented experimental design, data selection rationale, TEVV execution evidence, and construct validation. Most model cards don’t cover all of that with sign-offs. (NIST AI RMF Core)

Frequently Asked Questions

What counts as TEVV evidence for an LLM feature where classic accuracy metrics don’t fit?

TEVV can include structured red-teaming results, safety evaluations tied to acceptance criteria, and regression tests for known failure modes. MAP-2.3 cares that you identified and documented the approach and can reproduce it. (NIST AI RMF Core)

Do we need to document representativeness if the system is internal-only?

Yes, if the system impacts people, operations, or decisions, you still need to show the data matches the operational population and conditions. Internal-only systems still fail when training data differs from real workflows. (NIST AI RMF Core)

How detailed should construct validation be?

One page is often enough if it clearly defines the target, identifies proxies, and includes domain-owner approval and known limitations. The key is proving you measured what you intended to measure. (NIST AI RMF Core)

What if we can’t access a third party model’s training data?

Document the limitation, test what you can at the system boundary, and record what assurances you rely on from the third party. Capture residual risk acceptance tied to the trustworthiness claims you still make. (NIST AI RMF Core)

Who should approve MAP-2.3 documentation?

Assign approvals to the people accountable for the construct (business/domain), the technical validity (data science/engineering), and risk acceptance (risk/compliance). Approvals should align to your release governance. (NIST AI RMF Core)

Can we satisfy MAP-2.3 with a single model card?

A model card can be part of the evidence set, but MAP-2.3 expects documented experimental design, data selection rationale, TEVV execution evidence, and construct validation. Most model cards don’t cover all of that with sign-offs. (NIST AI RMF Core)

Operationalize this requirement

Map requirement text to controls, owners, evidence, and review workflows inside Daydream.

See Daydream