MEASURE-2.11: Fairness and bias – as identified in the MAP function – are evaluated and results are documented.

MEASURE-2.11 requires you to test for fairness and bias risks you already identified during the MAP function, then document the methods, results, decisions, and remediation actions. Operationally, this means defining protected groups and fairness metrics, running repeatable evaluations across the AI lifecycle, escalating failures, and retaining audit-ready evidence. 1

Key takeaways:

  • Test the specific fairness/bias risks you mapped, not generic “bias checks.” 1
  • Make results decision-grade: thresholds, exceptions, remediation, and sign-offs. 1
  • Treat documentation as a control: versioned artifacts tied to model releases and monitoring. 1

Footnotes

  1. NIST AI RMF Core

Compliance leaders usually fail MEASURE-2.11 in one of two ways: they either do nothing beyond a high-level “bias statement,” or they run a one-time fairness analysis that can’t be repeated, explained, or tied to real decisions. MEASURE-2.11 is narrower and more operational: fairness and bias risks are first identified in MAP, then evaluated in MEASURE, and the results are documented so they can be governed, audited, and improved over time. 1

To operationalize this requirement quickly, you need three things: (1) a clear scope (which AI systems, which decisions, which impacted populations), (2) a repeatable evaluation approach (datasets, metrics, and testing cadence tied to releases and drift), and (3) documentation that stands on its own (what you tested, why you chose those tests, what you found, what you changed, and who accepted residual risk). 1

This page gives requirement-level implementation guidance for a CCO, Compliance Officer, or GRC lead who needs to stand up defensible fairness/bias evaluation controls without turning the program into a research project. It also shows how to structure evidence so an examiner can trace MAP risks through to MEASURE results and governance actions. 2

Regulatory text

Requirement: “Fairness and bias – as identified in the MAP function – are evaluated and results are documented.” 1

What the operator must do:

  1. Take the fairness and bias risks you documented in MAP (your AI risk and impact understanding) and convert them into concrete, testable hypotheses and evaluation plans. 1
  2. Perform evaluations using appropriate data, metrics, and qualitative review where needed. 1
  3. Document results in a way that supports decisions: release/no-release, mitigations, monitoring requirements, and formal risk acceptance if gaps remain. 1

Plain-English interpretation (what MEASURE-2.11 is really asking)

You must be able to prove, with records, that you checked whether the AI system behaves unfairly across relevant groups and contexts you already identified as at-risk. MEASURE-2.11 is not satisfied by a policy statement, a single slide, or an unscoped metric. It is satisfied when a reviewer can follow a line from: mapped bias risk → evaluation method → results → action taken → residual risk decision. 1

Who it applies to (entity and operational context)

Applies to: organizations developing, integrating, procuring, or deploying AI systems where outputs influence decisions, opportunities, access, or treatment of individuals or groups. 2

Most relevant operational contexts:

  • AI used in customer eligibility, pricing, underwriting, fraud, collections, hiring, screening, content moderation, student evaluation, healthcare prioritization, identity verification, or any workflow with disparate impact potential.
  • AI sourced from a third party (model/API/provider) where you still bear governance responsibilities for deployment decisions and monitoring in your environment.

Control owners (typical RACI):

  • Accountable: GRC/Compliance or Model Risk Management (MRM), depending on your operating model.
  • Responsible: Data Science/ML Engineering for tests; Product/Business Owner for decisioning context; Legal/Privacy for protected class definitions and constraints.
  • Consulted: HR, DEI, Customer Ops, Ethics committee (if you have one).
  • Informed: Internal Audit, Risk committee.

What you actually need to do (step-by-step)

1) Convert MAP outputs into a fairness test plan

Start with what your MAP function already identified: impacted populations, plausible bias mechanisms, and where harm could occur. Then create a Fairness & Bias Evaluation Plan per AI system/model release with:

  • Decision/use case statement (what the model is used for, and what it is not used for).
  • Impacted parties and environments (production channels, geographies, languages, accessibility constraints).
  • Protected or sensitive attributes you will evaluate, or a documented rationale if you cannot collect them directly and must use proxies or alternative analyses.
  • Failure modes mapped to tests (examples below).

Keep this plan versioned and tied to a specific model/version so it’s auditable. 1
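One concrete option is to keep the plan as a versioned, machine-readable record that ships with the model release. A minimal Python sketch; the field names and values are illustrative assumptions, not anything prescribed by MEASURE-2.11:

```python
from dataclasses import dataclass

# Minimal sketch of a versioned evaluation-plan record. Field names and
# values are illustrative, not prescribed by the requirement.
@dataclass(frozen=True)
class FairnessEvalPlan:
    system: str              # AI system / use case the plan covers
    model_version: str       # pins the plan to a specific release for auditability
    decision: str            # what the model is used for (and, implicitly, what it is not)
    impacted_groups: tuple   # cohorts identified in MAP
    attributes: tuple        # protected/sensitive attributes, or documented proxies
    failure_modes: dict      # MAP failure mode -> planned test, for traceability

plan = FairnessEvalPlan(
    system="loan-screening",
    model_version="2.4.0",
    decision="approve or refer applications for manual review",
    impacted_groups=("group_a", "group_b"),
    attributes=("geography", "language"),  # proxies, per a documented rationale
    failure_modes={"higher false referrals for group_a": "FPR comparison by group"},
)
```

Because the record is frozen and pinned to a model version, any change to scope forces a new plan version rather than a silent edit.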

2) Define groups, labels, and “fairness hypotheses” you can test

You need a testable structure, even if the final analysis includes qualitative judgment:

  • Groups: define the cohorts you will compare (for example, by demographic attribute, geography, language, disability status, or other risk-relevant segmentation identified in MAP).
  • Outcomes: define the outcome being evaluated (approval, ranking position, false positive rate, content removal, etc.).
  • Hypotheses: write plain statements such as “False positives are higher for Group A than Group B in scenario X.”

If you cannot collect protected attributes, document what you did instead (proxy analysis, stratified error analysis by geography/language, or targeted user testing) and why. The requirement is evaluation plus documentation, not perfect data access. 1
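A hypothesis like “false positives are higher for Group A than Group B” can be made directly executable. A minimal sketch with invented toy data; real evaluations would run over your holdout sets and the cohorts defined in MAP:

```python
# Sketch: per-group false positive rates and the gap between them.
# The labels below are toy data, invented for illustration.

def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN) over binary labels (1 = positive prediction)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0

def fpr_gap(groups):
    """groups: {name: (y_true, y_pred)} -> (per-group FPRs, max pairwise gap)."""
    rates = {g: false_positive_rate(t, p) for g, (t, p) in groups.items()}
    return rates, max(rates.values()) - min(rates.values())

rates, gap = fpr_gap({
    "group_a": ([0, 0, 0, 1], [1, 0, 1, 1]),  # FPR = 2/3
    "group_b": ([0, 0, 0, 1], [0, 0, 1, 1]),  # FPR = 1/3
})
```

The same structure works for other outcome metrics (approval rate, false negative rate, ranking position); the point is that each mapped hypothesis has one repeatable computation behind it.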

3) Select evaluation methods that match the risk

Use a mix of methods. A typical set includes:

  • Pre-deployment testing: benchmark performance and error profiles across groups.
  • Scenario testing: stress test edge cases that reflect MAP-identified harms (dialect variation, low-light images, non-standard names, assistive tech).
  • Human review sampling: where fairness cannot be reduced to a numeric metric (for example, content moderation consistency).
  • Post-deployment monitoring: detect drift and emergent bias after release.

Tie each method to the MAP risk register entry so you can show traceability. 1

4) Set decision rules: thresholds, exceptions, and escalation

Evaluations only matter if they drive decisions. Define:

  • Pass/fail or “requires review” criteria for fairness checks.
  • Exception process (who can approve release with known fairness gaps, under what constraints, and what compensating controls are required).
  • Escalation path to Risk/Compliance when outcomes suggest potential harm.

Avoid “we’ll look at it” language. Document a decision rule that forces action. 1
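A decision rule that forces action can be as simple as a function your release pipeline calls. A sketch; the thresholds below are placeholders, not recommended values, and should come from your own risk appetite:

```python
# Sketch: a pass / review / escalate decision rule for a measured fairness gap.
# Thresholds are placeholders, not recommended values.

def release_decision(fpr_gap, pass_threshold=0.02, review_threshold=0.05):
    """Return 'pass', 'requires_review', or 'fail_escalate' for a measured gap."""
    if fpr_gap <= pass_threshold:
        return "pass"
    if fpr_gap <= review_threshold:
        return "requires_review"   # routes into the exception process with sign-off
    return "fail_escalate"         # escalates to Risk/Compliance
```

Every outcome maps to a named path, so there is no result that leaves the team with nothing to do.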

5) Run the evaluation at the right moments in the lifecycle

Trigger fairness/bias evaluations:

  • Before first production deployment.
  • At each material model update (new training data, new features, changed objective).
  • When the use case changes (new population, new geography, new channel).
  • When monitoring indicates drift or complaints indicate disparate impact.

You are building a control, not a one-off assessment. 1
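The triggers above can be encoded as an explicit check that release and monitoring tooling calls, rather than living as tribal knowledge. Event names here are illustrative:

```python
# Sketch: lifecycle events that require a fresh fairness/bias evaluation.
# Event names are invented for illustration.

TRIGGERS = {
    "first_deployment",
    "model_update",            # new training data, features, or objective
    "use_case_change",         # new population, geography, or channel
    "drift_alert",
    "disparate_impact_complaint",
}

def needs_fairness_eval(events):
    """True when any observed lifecycle event matches a re-evaluation trigger."""
    return bool(TRIGGERS & set(events))
```

Wiring this into the deployment gate makes “evaluate at the right moments” a property of the pipeline instead of a calendar reminder.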

6) Document results in a structured, reviewer-friendly way

Your results package should include:

  • What version was tested (model, data, code commit where possible).
  • What datasets were used, including known limitations and representativeness notes.
  • Metrics/analyses performed and why they were chosen for the mapped risks.
  • Findings, including any gaps or uncertainty.
  • Remediation actions (data changes, feature changes, thresholding, post-processing, UI/UX changes, human-in-the-loop adjustments).
  • Residual risk and sign-off (business owner and compliance/risk as appropriate).

This documentation is what makes MEASURE-2.11 auditable. 1
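One way to keep the results package structured and diff-able is a versioned JSON record stored alongside the release. The keys and values below are illustrative, not a mandated schema:

```python
import json

# Sketch of a results packet as a versioned JSON record; all keys and
# values are illustrative, not a mandated schema.
packet = {
    "model_version": "2.4.0",
    "code_commit": "abc1234",
    "datasets": [{"name": "eval_holdout", "limitations": "underrepresents rural users"}],
    "metrics": {"fpr_gap": 0.03, "rationale": "selected for mapped risk of unequal false referrals"},
    "findings": ["FPR gap above pass threshold, below review threshold"],
    "remediation": ["adjusted decision threshold for low-volume segments"],
    "residual_risk": {"accepted_by": ["business_owner", "compliance"], "date": "2024-07-01"},
}
record = json.dumps(packet, sort_keys=True)  # stable serialization for versioning/diffing
```

A stable serialization means two reviewers (or two releases) can diff packets line by line.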

7) Operationalize recurring evidence collection (so it doesn’t rot)

Assign an owner and a cadence for evidence capture:

  • Each release generates a fairness/bias evaluation report.
  • Monitoring produces periodic fairness signals and incident tickets when thresholds are breached.
  • Exceptions and sign-offs are logged.

If you track this in Daydream, the practical setup is a control mapped to MEASURE-2.11 with an owner, recurrence, and required attachments so evidence is collected the same way each cycle and survives team turnover. 1

Required evidence and artifacts to retain (audit-ready)

Maintain a minimum “traceability set” per AI system/version:

  • MAP outputs (risk register entries on fairness/bias) – proves the risks were identified and scoped. Owner: GRC/MRM
  • Fairness & Bias Evaluation Plan – proves you translated mapped risks into tests. Owner: DS + GRC
  • Data documentation (sources, time window, sampling, limitations) – proves inputs are understood and caveats are recorded. Owner: DS/Data Eng
  • Evaluation notebooks/scripts or test harness outputs – proves tests are repeatable. Owner: DS/ML Eng
  • Results report (with interpretation) – proves outcomes were evaluated and documented. Owner: DS + Product
  • Remediation log / change record – proves findings drove action. Owner: Product/Eng
  • Exception approvals & residual risk acceptance – proves governance decisions were made. Owner: Business + Compliance
  • Monitoring dashboards/alerts + incident tickets – proves evaluation continues after deployment. Owner: MLOps/GRC

Keep artifacts versioned and linked to deployment approvals. 1

Common exam/audit questions and hangups

Expect reviewers to probe:

  • “Show me the fairness/bias risks you identified in MAP and the specific tests you ran for each.” 1
  • “Who approved release when results showed differences across groups?”
  • “How do you know results are still valid after data drift or a model update?” 1
  • “What do you do when you cannot collect protected class attributes?”
  • “Where are the records for prior versions, not just the current model?”

Hangup pattern: teams present a generic fairness dashboard with no linkage to mapped risks or business decisions. MEASURE-2.11 expects traceability and governance records. 1

Frequent implementation mistakes (and how to avoid them)

  1. Testing only overall accuracy and calling it “fairness.”
    Fix: require group-based error analysis aligned to mapped harms, plus qualitative review for subjective domains. 1

  2. Running ad hoc analyses with no version control.
    Fix: standardize a test harness and require results packages tied to model versions and releases. 1

  3. No decision thresholds or exception governance.
    Fix: publish decision rules and a documented risk acceptance workflow. 1

  4. Ignoring downstream human and process bias.
    Fix: include workflow testing (review queues, overrides, explanations, UI) where bias can enter even if the model is stable. 1

  5. Treating third-party models as “not our problem.”
    Fix: contract for disclosures where possible, but still evaluate in your context with your data and monitor outcomes post-deployment. 2

Enforcement context and risk implications

No public enforcement cases are tied to this requirement in the cited sources. Your practical risk is examination failure, governance findings, reputational harm, and increased litigation/regulatory exposure if fairness issues affect protected groups. MEASURE-2.11 reduces that risk by making fairness evaluation repeatable and provable, not informal. 1

A practical 30/60/90-day execution plan

First 30 days (establish control and scope)

  • Inventory in-scope AI systems and identify which have MAP-documented fairness/bias risks. 1
  • Assign control owner(s) and publish a Fairness & Bias Evaluation Plan template with required fields and sign-offs. 1
  • Pick one high-impact system as a pilot and run an end-to-end evaluation with documentation.

By 60 days (make it repeatable)

  • Build a standardized evaluation workflow (datasets, scripts, review checklist, results report format).
  • Define decision rules and an exception process; route exceptions to an established risk forum.
  • Stand up monitoring requirements for deployed systems and define alert triage ownership. 1

By 90 days (scale and audit-proof)

  • Expand evaluations across remaining in-scope systems based on risk ranking from MAP.
  • Implement recurring evidence collection and linkage to release management (deployment gates).
  • Internal audit-style review: trace a sample from MAP risk → evaluation → action → sign-off, and fix documentation gaps. 1

Frequently Asked Questions

Do we need to prove the model is “unbiased” to satisfy MEASURE-2.11?

No. You need to evaluate the fairness and bias risks identified in MAP and document results, decisions, and remediation. If residual risk remains, document acceptance and controls. 1

What if we cannot collect protected class data?

Document the constraint, then evaluate using feasible approaches like proxy segmentation, geography/language stratification, scenario testing, and targeted user studies tied to mapped harms. Record the limitations and any compensating controls. 1

How often should we run fairness/bias evaluations?

Run them at minimum before initial deployment and at material changes to model, data, or use case, and repeat when monitoring or incidents indicate drift or harm. Tie the triggers to your release and monitoring processes. 1

Does this requirement apply to third-party AI systems we buy?

Yes in practice, because you still deploy and govern outcomes. Ask third parties for relevant documentation, but also evaluate performance and fairness in your context and retain your own records. 2

What documentation format works best for auditors?

A versioned results packet that ties MAP risks to specific tests, datasets, outputs, mitigations, and sign-offs. Auditors prioritize traceability over presentation polish. 1

How do we operationalize this without slowing releases?

Create standardized test harnesses and templates, define clear pass/review thresholds, and integrate the fairness evaluation into release gates. Track exceptions and evidence in a system of record so teams don’t rebuild documentation each cycle. 1

Footnotes

  1. NIST AI RMF Core

  2. NIST AI RMF program page


Operationalize this requirement

Map requirement text to controls, owners, evidence, and review workflows inside Daydream.

See Daydream