MEASURE-2.2: Evaluations involving human subjects meet applicable requirements (including human subject protection) and are representative of the relevant population.

To operationalize the MEASURE-2.2 requirement, you must (1) route any AI evaluation involving people through an appropriate human-subjects compliance workflow, and (2) design sampling and recruitment so participants reflect the real-world population affected by the AI system, then retain proof of both. This is an execution-and-evidence requirement, not a policy statement. 1

Key takeaways:

  • Treat “human subjects” broadly: user studies, red-teaming with participants, A/B tests, and monitoring that collects identifiable feedback can all qualify.
  • “Representative” means your participant mix matches the intended users and impacted groups, not just whoever is easiest to recruit.
  • Auditors will ask for artifacts: review/approval record, consent language, recruitment/sampling plan, and demographic/coverage analysis tied to system purpose.

MEASURE-2.2 sits in the NIST AI Risk Management Framework’s MEASURE function and is aimed at making AI evaluations credible, safe for participants, and decision-useful. The requirement is short, but operationally it touches legal, privacy, security, research practices, product experimentation, and model risk management. If your teams run usability tests, bias/fairness evaluations with people, human-in-the-loop labeling, clinical-style studies, employee pilots, or customer A/B experiments that affect outcomes, you need a single control that gates these activities and produces consistent evidence. 1

Two failure modes show up in real programs. First: teams run “research” without human subject protections because it looks like product testing. Second: teams run evaluations that are statistically neat but operationally meaningless because the sample is not representative of who will actually be impacted, especially for high-impact decisions or safety-sensitive contexts. MEASURE-2.2 forces you to solve both: compliance with applicable human subject requirements and defensible representativeness. 1

This page gives requirement-level implementation guidance a CCO, GRC lead, or compliance officer can put into a control owner’s hands and audit quickly.

Regulatory text

Requirement (excerpt): “Evaluations involving human subjects meet applicable requirements (including human subject protection) and are representative of the relevant population.” 1

Operator meaning: If an evaluation uses people as participants or collects data about them for evaluation purposes, you must:

  1. identify and meet the applicable human-subject protection requirements for your context (internal policy, contractual terms, sector rules, and any required review/approval pathway), and
  2. design the evaluation so participants reflect the population that will use or be affected by the AI system. 1

Plain-English interpretation (what “counts”)

What is an “evaluation involving human subjects” in AI work?

Treat these as in-scope unless you have a documented rationale:

  • Usability testing of AI-enabled features with employees, customers, or the public
  • Red-team exercises using participants, especially if collecting sensitive feedback
  • A/B tests or live experiments that change model outputs for subsets of users
  • Human-in-the-loop decisioning pilots (e.g., human reviewers approving or overriding AI recommendations)
  • Post-deployment monitoring that asks users for outcome feedback tied to identity

If people are exposed to risk (psychological, reputational, economic), if personal data is collected, or if outcomes change for a group during testing, you should assume MEASURE-2.2 applies and route the activity through your defined review channel. 1

What “representative of the relevant population” means operationally

“Relevant population” is the set of people who will:

  • use the system (end users), and/or
  • be evaluated or affected by the system’s outputs (impacted individuals)

Representativeness is not “demographics only.” It can include language, disability/access needs, domain expertise, device constraints, geography, and other factors that change performance or harm risk. Document what “representative” means for your system and why. 1

Who it applies to

Entities: Any organization developing, deploying, or operating AI systems that evaluate those systems using people. This includes regulated and non-regulated businesses, and it includes evaluations run by third parties on your behalf. 1

Operational contexts where it triggers most often:

  • Product teams running experiments and feature rollouts
  • Data science teams running human evaluation of model outputs
  • Trust & safety or responsible AI teams running fairness or harm studies
  • UX research and customer research functions
  • Third parties conducting testing, labeling, or user research

What you actually need to do (step-by-step)

Below is a control workflow you can implement as a lightweight gate with strong evidence.

Step 1: Define the “Human-Subjects Evaluation Intake” trigger

Create a short intake form (ticket, GRC workflow, or experiment template) that asks:

  • Are people participating directly (study, interview, test) or indirectly (live experiment affecting outputs)?
  • Are you collecting personal data or sensitive attributes?
  • Is this a high-impact use case (employment, housing, credit, education, healthcare, public services) as defined internally?
  • Who is the relevant population (users and impacted individuals)?

Output: a documented in-scope / out-of-scope determination, plus rationale. 1
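The scoping rule above can be sketched as a small decision helper. The field names and trigger wording below are illustrative assumptions, not part of the NIST text; map them to your own intake form:

```python
from dataclasses import dataclass

@dataclass
class IntakeResponse:
    direct_participants: bool     # study, interview, moderated test
    live_experiment: bool         # A/B test affecting real outputs
    collects_personal_data: bool  # includes sensitive attributes
    high_impact_use_case: bool    # per your internal definition

def scope_determination(r: IntakeResponse) -> tuple[bool, str]:
    """Return (in_scope, rationale). Defaults to in-scope when any trigger fires."""
    triggers = []
    if r.direct_participants:
        triggers.append("direct participation")
    if r.live_experiment:
        triggers.append("live experiment changes outcomes")
    if r.collects_personal_data:
        triggers.append("personal data collected")
    if r.high_impact_use_case:
        triggers.append("high-impact use case")
    if triggers:
        return True, "In scope: " + "; ".join(triggers)
    return False, "Out of scope: no human-subject trigger; retain this rationale"
```

The point of returning the rationale string is that the determination itself becomes the retained evidence, whichever way it goes.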

Step 2: Assign a control owner and review path

Name a single accountable owner (commonly: Responsible AI lead, Research governance, or Compliance) and define reviewers by risk:

  • Privacy (data collection, retention, notices)
  • Security (study tooling, data handling, access control)
  • Legal/Compliance (contract, notices, sector constraints)
  • Research governance / ethics review body (where established)

If you do not have an IRB-equivalent body, you still need a defined internal review and approval record for human subject protection decisions. 1

Step 3: Human subject protection package (minimum viable set)

For each in-scope evaluation, require a “protection package” that includes:

  • Purpose and study protocol (what happens to participants)
  • Risk assessment (what could go wrong for participants)
  • Consent/notice language and participant communications
  • Data handling plan (collection, minimization, retention, deletion)
  • Adverse event process (how participants can report issues; how you respond)
  • Eligibility criteria and exclusion criteria with justification

Tie each item to the evaluation’s risk. Keep it practical; auditors want consistency and completeness, not academic formatting. 1

Step 4: Representativeness plan (define, recruit, measure, adjust)

Create a standard template with four required fields:

  1. Population definition: Who will use and who will be affected, in plain terms.
  2. Representation dimensions: List the factors that could change performance or harm (examples: language, age bracket, disability/access, domain expertise, region, device type).
  3. Recruitment and sampling method: How you will recruit and avoid convenience bias. If you cannot recruit certain groups, document constraints and mitigations.
  4. Coverage analysis: After recruitment, document what you achieved versus what you intended, plus implications for interpreting results.

This is where most teams fail: they do not pre-commit to what “representative” means, then cannot defend the study when results are challenged. 1
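The coverage-analysis field above (planned vs achieved) can be made mechanical. The dimensions, target shares, and tolerance below are example assumptions, not prescribed values:

```python
def coverage_gaps(planned: dict, achieved: dict, tolerance: float = 0.05) -> dict:
    """Flag dimensions whose achieved share falls short of plan by more than tolerance."""
    gaps = {}
    for dim, target in planned.items():
        actual = achieved.get(dim, 0.0)
        if target - actual > tolerance:
            gaps[dim] = {
                "planned": target,
                "achieved": actual,
                "shortfall": round(target - actual, 3),
            }
    return gaps

# Example representation dimensions (illustrative only)
planned = {"language:es": 0.30, "screen_reader_users": 0.10, "mobile_only": 0.40}
achieved = {"language:es": 0.12, "screen_reader_users": 0.09, "mobile_only": 0.41}
```

Running `coverage_gaps(planned, achieved)` here flags only the Spanish-language shortfall; the gap record is exactly what you document alongside mitigations and limitations.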

Step 5: Run the evaluation with controlled changes and traceability

Operational controls to require:

  • Version control for model, prompts, feature flags, and evaluation scripts
  • Traceability from participant to condition (without exposing identity broadly)
  • Documentation of deviations from protocol and why they occurred
  • Clear stopping criteria for elevated participant risk (pause/escalate)

Output: a final evaluation report that ties results to the population definition and limitations. 1
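One lightweight way to make participant-to-condition traceability concrete is an immutable run record per exposure. This schema is a hypothetical sketch, assuming pseudonymous participant IDs, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalRunRecord:
    participant_pseudonym: str   # never a direct identifier
    condition: str               # e.g., "treatment" / "control"
    model_version: str
    prompt_version: str
    feature_flags: tuple         # sorted (flag, value) pairs, hashable
    protocol_deviation: str = "" # empty when the protocol was followed
```

Freezing the record (and keeping flags as a tuple) makes each exposure hashable and tamper-evident in logs, which is what lets the final report tie results back to exact system versions.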

Step 6: Closeout and governance actions

Require a closeout checklist:

  • Were participants debriefed if needed?
  • Was data handled per plan (retention/deletion executed)?
  • What decision did the evaluation support (ship/no-ship/mitigations)?
  • What residual risk remains due to representativeness gaps?

Feed outcomes into your AI risk register and model/system documentation. 1

Required evidence and artifacts to retain

Keep these in a single “evaluation evidence packet” per study:

  • Intake record with in-scope determination and approvers
  • Study protocol, risk assessment, and consent/notice artifacts
  • Data handling plan and proof of execution (retention/deletion confirmation)
  • Recruitment/sampling plan, eligibility criteria, and screeners
  • Representativeness definition and coverage analysis (planned vs achieved)
  • Tooling and access control evidence for study data
  • Final evaluation report, including limitations and decisions taken
  • Third-party contracts/SOWs and data processing terms if outsourced

If you use Daydream to track third-party risk and due diligence, store the evidence packet link inside the third-party record, map the control to the study owner, and set a recurring evidence-collection cadence so audits do not turn into scavenger hunts. 1

Common exam/audit questions and hangups

Expect reviewers to test three things:

  1. Scope discipline: “Show me how you decide whether an evaluation involves human subjects.”
  2. Protection completeness: “Where is consent documented, and what participant risks did you assess?”
  3. Representativeness justification: “How do you know the sample reflects the relevant population, and what are the limitations?”

Hangups that stall audits:

  • Teams call it “product telemetry” and skip review even though the test changes outcomes for real users.
  • “Representative” is asserted with no written definition tied to system context.
  • Third parties recruit participants with opaque panels and no auditable sampling detail. 1

Frequent implementation mistakes (and how to avoid them)

  • Mistake: No single gate. Teams run studies in multiple tools with no consistent approvals.
    Fix: One intake workflow required for any human-subject evaluation, including third-party run testing. 1

  • Mistake: Consent text copied from marketing templates. It fails to describe actual evaluation risks and data use.
    Fix: Maintain a consent/notice library reviewed by Legal/Privacy; require study-specific addenda. 1

  • Mistake: Representativeness treated as demographics-only. The model fails for accents, devices, literacy levels, or domain expertise.
    Fix: Define representation dimensions based on how the system can fail in context, then measure coverage. 1

  • Mistake: Third-party studies are “out of sight, out of mind.”
    Fix: Contractually require the same protection package and representation reporting, and collect it as deliverables. 1

Enforcement context and risk implications

No public enforcement cases were provided in the source catalog for this requirement, so you should treat MEASURE-2.2 as a defensibility expectation rather than a standalone penalty hook. The practical risk is indirect but serious: if an AI-related incident occurs, inability to show human subject protections and population representativeness weakens your position with regulators, customers, and litigants, and can force product rollbacks or delayed launches due to untrusted results. 1

Practical 30/60/90-day execution plan

First 30 days (Immediate foundation)

  • Appoint a control owner and define the approval path for in-scope evaluations.
  • Publish a one-page scoping rule and intake form (include examples of in-scope activities).
  • Create minimum templates: protocol, consent/notice, data handling plan, representativeness plan, closeout checklist. 1

By 60 days (Operationalize across teams and third parties)

  • Embed the intake gate into experiment tooling or the product release process for AI features.
  • Train UX research, data science, and product leads on scoping and required artifacts.
  • Update third-party SOW language to require deliverables: recruitment method, consent approach, and representativeness reporting. 1

By 90 days (Audit readiness and continuous operation)

  • Run a tabletop review of a recent evaluation: can you produce the full evidence packet quickly?
  • Add a periodic control test: sample completed evaluations and verify approvals, consent, data handling execution, and coverage analysis.
  • Tie findings to your AI risk register and track remediation items to closure. 1
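The periodic control test above is easiest to defend when the sample is reproducible. A minimal sketch, assuming evaluation IDs exported from your GRC tooling (function name and seed handling are hypothetical):

```python
import random

def sample_for_control_test(eval_ids: list[str], k: int, seed: int) -> list[str]:
    """Draw a reproducible sample of completed evaluations for artifact testing.

    A fixed seed lets auditors re-draw the identical sample and confirm
    the population was not cherry-picked.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(eval_ids, min(k, len(eval_ids))))
```

Record the seed, the population snapshot date, and the sample in the control-test workpaper so the draw itself is auditable.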

Frequently Asked Questions

Does MEASURE-2.2 apply to internal employee pilots?

Yes, if employees participate in an evaluation, or if the pilot changes outcomes for them in a way that creates risk and you collect data about them for evaluation purposes. Treat it as in-scope unless you document a rationale for exclusion. 1

We only do A/B tests in production. Is that “human subjects” work?

If the experiment affects what real users experience or the decisions they receive, it is an evaluation involving people in the loop. Route it through the intake, document notices/consent approach, and justify representativeness of the exposed cohort. 1

What if we cannot recruit a fully representative sample?

Document the constraint, the resulting limitation on findings, and mitigation steps (extra monitoring, conservative launch criteria, targeted follow-up studies). Auditors mainly want to see you recognized and managed the gap. 1

How do we handle third parties running studies for us?

Put requirements in the contract and make the evidence packet a deliverable: protocol, consent/notice, recruitment method, and coverage analysis. Store it with your third-party due diligence record so you can produce it on request. 1

Is collecting sensitive attributes required for representativeness?

Not always. Define the representation dimensions needed to evaluate performance and harm, then use the least sensitive data possible, with appropriate protections and documentation. 1

What artifact usually convinces auditors fastest?

A complete evaluation packet that shows the approvals, consent/notice, data handling plan, and a clear planned-vs-achieved representativeness analysis tied to the relevant population definition. 1

Footnotes

  1. NIST AI RMF Core
