Data preparation

ISO/IEC 42001 Annex A.7.6 requires you to define and run repeatable processes to prepare data before it is used in any AI system, including cleaning, labeling, transformation, augmentation, and bias mitigation (ISO/IEC 42001:2023 Artificial intelligence — Management system). To operationalize it, stand up a controlled “data prep pipeline” with documented steps, quality checks, bias tests, approvals, and retained evidence for every dataset and model release.

Key takeaways:

  • Document a standard, repeatable data preparation workflow and make teams follow it for each AI use case (ISO/IEC 42001:2023 Artificial intelligence — Management system).
  • Treat bias mitigation and labeling quality as controlled activities with defined acceptance criteria and sign-off (ISO/IEC 42001:2023 Artificial intelligence — Management system).
  • Retain artifacts that prove what changed, why it changed, and who approved it, per dataset version and model release.

“Data preparation” is where many AI risks are created or removed. Bad labels, silent transformations, contaminated training corpora, or untracked augmentation can degrade model performance, introduce unfair outcomes, or make results non-reproducible. Control A.7.6 in ISO/IEC 42001 is plain: you must define and implement processes for preparing data for use in AI systems (ISO/IEC 42001:2023 Artificial intelligence — Management system).

For a Compliance Officer, CCO, or GRC lead, the operational goal is not to “improve data quality” in the abstract. The goal is to ensure that every AI dataset entering training, fine-tuning, evaluation, or production inference has (1) an approved purpose, (2) known provenance, (3) documented preparation steps, (4) objective quality gates, and (5) retained evidence to support audits and incident response.

This page translates Annex A.7.6 into a requirement-level playbook: who it applies to, the minimum viable processes to implement, what auditors ask, common failure modes, and a practical execution plan. It also flags where third parties affect compliance, because many organizations source, label, host, or enrich AI data externally.

Regulatory text

Requirement (verbatim): “The organization shall define and implement processes for preparing data for use in AI systems.” (ISO/IEC 42001:2023 Artificial intelligence — Management system)

Operator interpretation: You need a documented, repeatable, enforced set of steps that governs how raw data becomes “AI-ready” data. That process must cover the real work teams do: cleaning, labeling, transformations, augmentation, and bias mitigation. It also needs controls that prove it happened as designed, not just a policy on paper (ISO/IEC 42001:2023 Artificial intelligence — Management system).

Plain-English interpretation (what the requirement demands)

You are expected to:

  • Define a standard data preparation workflow (who does what, in what order, with what tools, and under what acceptance criteria).
  • Implement it in daily operations so teams cannot bypass it for “quick experiments” that later become production.
  • Control high-risk steps (labeling, augmentation, bias mitigation) with explicit criteria, review, and approval.
  • Keep evidence so you can reproduce a dataset and explain how it was prepared for a given model release (ISO/IEC 42001:2023 Artificial intelligence — Management system).

A practical way to think about A.7.6: auditors will ask, “Show me how Dataset X was prepared for Model Release Y, and prove the process was followed.”

Who it applies to

Entity scope: The control applies to any organization that prepares data used in AI systems, including AI providers and AI users (ISO/IEC 42001:2023 Artificial intelligence — Management system).

Operational contexts that are in scope:

  • Training or fine-tuning (internal or external models).
  • Evaluation and benchmarking datasets used for go/no-go decisions.
  • Retrieval-augmented generation (RAG) corpora: document ingestion, chunking, embeddings, metadata tagging.
  • Production inference inputs when you pre-process, normalize, enrich, or filter user data before it hits an AI model.
  • Any pipeline where a third party performs labeling, enrichment, data collection, scraping, or data hosting that changes the final dataset your AI uses.

Common “surprise” in audits: a proof-of-concept dataset that “temporarily” became production. If it feeds a live AI system, it is in scope.

What you actually need to do (step-by-step)

1) Establish a single standard: the Data Preparation Standard (DPS)

Create a short, enforceable standard that defines:

  • Dataset types (training, fine-tuning, eval, RAG corpus, inference pre-processing).
  • Mandatory steps per type (cleaning, labeling, transformation, augmentation, bias checks).
  • Required approvals (data owner, model owner, and compliance or risk sign-off for higher-risk use cases).
  • Minimum documentation and evidence to retain (see “Artifacts” section).
  • Exception process (who can approve deviations, with time limits and compensating controls) (ISO/IEC 42001:2023 Artificial intelligence — Management system).

Keep it operational. Teams should be able to run it without interpretation debates.
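One way to keep the DPS operational is to encode its minimum requirements as machine-readable configuration that pipelines can validate against, instead of relying on teams to interpret a PDF. A minimal sketch in Python (the dataset types, step names, and approver roles below are illustrative assumptions, not terms from the standard):

```python
# Hypothetical machine-readable Data Preparation Standard (DPS).
# Dataset types, mandatory steps, and approver roles are placeholders;
# replace them with the categories your own DPS defines.
DPS = {
    "training":   {"steps": ["cleaning", "labeling", "bias_check"],
                   "approvers": ["data_owner", "model_owner"]},
    "evaluation": {"steps": ["cleaning", "labeling", "bias_check"],
                   "approvers": ["data_owner", "model_owner", "risk"]},
    "rag_corpus": {"steps": ["cleaning", "chunking", "embedding"],
                   "approvers": ["data_owner"]},
}

def missing_steps(dataset_type: str, completed: set[str]) -> set[str]:
    """Return the mandatory DPS steps not yet recorded for this dataset type."""
    required = set(DPS[dataset_type]["steps"])
    return required - completed
```

A CI gate that calls `missing_steps` before promotion turns the standard from a document into an enforced control.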

2) Map and inventory your “data prep pipeline” end-to-end

Document the actual pipeline, not the ideal one:

  • Where raw data comes from (internal systems, third parties, public sources).
  • Where it lands (data lake, labeling platform, feature store, vector DB).
  • What transforms happen (normalization, deduping, PII removal, tokenization, feature engineering, embedding).
  • Who touches it (engineers, data scientists, labelers, third parties).
  • What gets promoted to “approved for AI use” and how that promotion happens (ISO/IEC 42001:2023 Artificial intelligence — Management system).

Deliverable: a simple flow diagram plus a RACI. Auditors want clarity on control ownership.

3) Implement dataset versioning and lineage as non-negotiable gates

Define:

  • Dataset identifiers (unique ID), versioning rules, and immutable snapshots for anything used in model decisions.
  • Lineage fields: source systems, extraction date/time, preparation code version, label schema version, and approver identities.
  • “Promotion” states: draft → validated → approved for AI use → deprecated.

If your pipeline lacks versioning, you will struggle to prove reproducibility and control.

4) Define quality checks and acceptance criteria per dataset type

Make checks measurable and reviewable, for example:

  • Cleaning: missing values rules, dedup thresholds, outlier handling approach, forbidden fields list.
  • Labeling: label taxonomy, instructions, edge-case handling, label QA sampling approach, dispute resolution workflow.
  • Transformations: documented feature definitions, encoding schemes, text normalization rules, chunking parameters for RAG, embedding model used.
  • Augmentation: allowed augmentation techniques, prohibited synthetic data scenarios, documentation of what was generated and why.
  • Bias mitigation: required bias testing approach and mitigation actions when issues are found (ISO/IEC 42001:2023 Artificial intelligence — Management system).

You do not need perfection. You need defined gates that match risk and are consistently executed.
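To make such gates measurable rather than judgment calls, each check can return an explicit pass/fail with a reason. The sketch below assumes tabular rows as dictionaries and uses placeholder thresholds; your DPS would set the real ones per dataset type:

```python
# Hypothetical acceptance gates; the 5% missing-value threshold and the
# gate selection are placeholders to be set per dataset type in the DPS.
def run_quality_gates(rows: list[dict], forbidden_fields: set[str],
                      max_missing_ratio: float = 0.05) -> list[str]:
    """Return a list of gate failures (an empty list means all gates passed)."""
    if not rows:
        return ["empty dataset"]
    failures = []
    present = {key for row in rows for key in row}
    # Gate 1: fields on the forbidden list must not appear at all.
    leaked = present & forbidden_fields
    if leaked:
        failures.append(f"forbidden fields present: {sorted(leaked)}")
    # Gate 2: missing-value ratio per field must stay under the threshold.
    for fld in sorted(present - forbidden_fields):
        missing = sum(1 for row in rows if row.get(fld) in (None, ""))
        ratio = missing / len(rows)
        if ratio > max_missing_ratio:
            failures.append(f"{fld}: missing ratio {ratio:.2f} exceeds {max_missing_ratio}")
    # Gate 3: exact-duplicate rows (a simple dedup check).
    if len({tuple(sorted(r.items())) for r in rows}) < len(rows):
        failures.append("exact duplicate rows found")
    return failures
```

The returned failure list doubles as evidence: store it with the preparation run record so the audit trail shows which gates ran and what they found.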

5) Control third-party contributions (labeling, enrichment, data sourcing)

Where third parties contribute:

  • Contractually require adherence to your labeling and preparation specs.
  • Require evidence deliveries (label guidelines, QA results, change logs, personnel access controls where relevant).
  • Perform intake QA before accepting third-party outputs into your approved dataset inventory.
  • Record provenance: third party name, dataset batch IDs, and acceptance sign-off.

If a third party changes your training data, they change your model risk profile. Treat them as part of the control boundary.
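Intake QA for third-party labels is often implemented as sample-based agreement against an internally labeled “gold” subset. A minimal sketch under that assumption (the sample size, 95% agreement threshold, and fixed seed are illustrative choices, not requirements):

```python
import random

def intake_qa(delivered: dict[str, str], gold: dict[str, str],
              sample_size: int = 100, min_agreement: float = 0.95,
              seed: int = 0) -> bool:
    """Accept a third-party label batch only if its agreement with an
    internally labeled gold sample meets the threshold."""
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    ids = rng.sample(sorted(gold), min(sample_size, len(gold)))
    agree = sum(delivered.get(item_id) == gold[item_id] for item_id in ids)
    return agree / len(ids) >= min_agreement
```

Record the sampled IDs, the agreement score, and the acceptance decision as the provenance evidence this step calls for.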

6) Formalize approvals tied to model releases (not just dataset creation)

Add a release gate: “This model version was trained/evaluated on these dataset versions, and those datasets were prepared under the DPS.” Capture:

  • dataset IDs and versions
  • summary of preparation steps performed
  • exceptions granted
  • approval signatures (or system approvals) (ISO/IEC 42001:2023 Artificial intelligence — Management system)

This is what makes the control auditable.
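The release gate above can be reduced to a single automated check: no model version ships unless every dataset it used is in the approved state with a recorded approver. A sketch, assuming dataset records are plain dictionaries with the fields listed earlier (the field and state names are illustrative):

```python
# Hypothetical release-gate check: every dataset feeding a model release
# must be "approved_for_ai_use" and carry a recorded approver.
def check_release(model_version: str, datasets: list[dict]) -> dict:
    """Build a go/no-go release record tying model and dataset versions."""
    blockers = [
        f"{d['dataset_id']}@{d['version']}: state={d['state']}"
        for d in datasets
        if d["state"] != "approved_for_ai_use" or not d.get("approver")
    ]
    return {
        "model_version": model_version,
        "datasets": [f"{d['dataset_id']}@{d['version']}" for d in datasets],
        "go": not blockers,
        "blockers": blockers,
    }
```

Archiving the returned record per release gives you the dataset-to-model traceability an auditor will ask for.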

7) Monitor drift and re-preparation triggers

Define triggers for re-running preparation steps:

  • material source changes (new upstream system, schema changes)
  • new population segments
  • new labeling taxonomy
  • significant performance or fairness issues found in monitoring

Operationally, this becomes a change management hook: changes in data inputs require data prep re-validation before the next release.
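The “material source changes” trigger in particular lends itself to automation: fingerprint the upstream schema at preparation time, and flag re-validation whenever the fingerprint changes. A minimal sketch, assuming the schema is available as a column-name-to-type mapping:

```python
import hashlib
import json

def schema_fingerprint(columns: dict[str, str]) -> str:
    """Stable hash of column names and types for an upstream source.
    Sorting makes the fingerprint independent of column order."""
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

def needs_reprep(recorded_fp: str, current_columns: dict[str, str]) -> bool:
    """True when the upstream schema drifted since the dataset was prepared,
    meaning preparation must be re-run and re-validated before release."""
    return schema_fingerprint(current_columns) != recorded_fp
```

Store the fingerprint with the dataset version; a scheduled job comparing it against the live source turns this trigger into a standing control rather than a manual review.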

Required evidence and artifacts to retain

Retain evidence per dataset version and per model release:

Core documents

  • Data Preparation Standard (DPS) and exception procedure (ISO/IEC 42001:2023 Artificial intelligence — Management system).
  • Data prep pipeline diagram and RACI.
  • Dataset inventory/register with owners and approved-use status.

Per dataset version (minimum)

  • Data source log (systems/third parties, extraction method, scope).
  • Preparation run record: scripts/notebooks or pipeline job IDs, code version, parameters.
  • Transformation and feature definitions (or RAG ingestion parameters).
  • Labeling package: label schema, labeling instructions, QA approach, issue log.
  • Bias mitigation record: tests run, findings, mitigations, residual risk acceptance.
  • Approval record: validated/approved sign-off, exceptions granted.

Per model release

  • Dataset-to-model traceability matrix (dataset IDs/versions used for train/eval).
  • Go/no-go checklist including confirmation that A.7.6 artifacts are complete.
  • Change log summary (what changed in data preparation since last release).

If you want to streamline collection, tools like Daydream can centralize dataset evidence requests, third-party deliverables, and approval workflows so teams do not chase artifacts across notebooks, tickets, and shared drives.

Common exam/audit questions and hangups

Expect these:

  • “Show me your defined process for preparing data for AI systems.” (A gap between the written policy and the real workflow is a common hangup.)
  • “Pick one model in production. Provide dataset versions, preparation steps, and approvals.”
  • “How do you ensure labeling quality and consistency across labelers and time?”
  • “Where do you test for bias, and what do you do when you find it?”
  • “What happens when upstream data changes?”
  • “Which third parties influence training or evaluation data, and how do you validate their outputs?” (ISO/IEC 42001:2023 Artificial intelligence — Management system)

Frequent implementation mistakes (and how to avoid them)

  1. A policy with no pipeline controls.
    Fix: implement technical gates (versioning, promotion states, required metadata) so bypass is hard.

  2. No dataset versioning for “temporary” experiments.
    Fix: define a rule that any dataset used in a decision, demo to customers, or production must be versioned and archived.

  3. Labeling treated as clerical work.
    Fix: controlled label taxonomy, QA sampling, and escalation routes. Require sign-off for label schema changes.

  4. Augmentation without documentation.
    Fix: maintain an augmentation log with technique, rationale, and linkage to the dataset version.

  5. Bias mitigation as a one-time check.
    Fix: define bias checks at dataset preparation and re-run when triggers occur (source changes, new segments).

  6. Third-party data accepted “as is.”
    Fix: intake QA and contractual evidence requirements for third-party labeling/enrichment.

Enforcement context and risk implications

No public enforcement cases were provided in the source material for this requirement. Practically, weak data preparation controls increase the risk of:

  • non-reproducible model behavior (hard to investigate incidents)
  • quality regressions after data refreshes
  • fairness and bias issues from skewed or poorly labeled data
  • security and privacy exposure from mishandled sensitive fields during preprocessing

From a governance standpoint, A.7.6 is a foundation control: if you cannot prove how data was prepared, downstream controls (testing, monitoring, incident response) become harder to execute credibly (ISO/IEC 42001:2023 Artificial intelligence — Management system).

Practical execution plan

First 30 days (Immediate)

  • Appoint owners: data prep process owner, dataset owners for in-scope AI systems.
  • Draft and approve the Data Preparation Standard (DPS) with minimum required steps and artifacts (ISO/IEC 42001:2023 Artificial intelligence — Management system).
  • Inventory in-scope AI systems and identify the datasets that feed them (train/eval/RAG/inference preprocessing).
  • Pick one high-impact AI use case and run the full evidence trail as a pilot.

Next 60 days (Near-term)

  • Implement dataset versioning and an “approved for AI use” promotion gate.
  • Build templates: preparation run record, labeling pack, bias mitigation record, approval form.
  • Add third-party intake requirements for any external labeling/enrichment/data sourcing.
  • Train data science and engineering teams on the DPS with real examples and “what to store where.”

Next 90 days (Operationalize)

  • Expand from the pilot to all in-scope AI systems.
  • Add change management triggers: upstream schema changes and dataset refreshes require re-validation.
  • Tie model release approvals to dataset versions and completed artifacts.
  • Start lightweight internal audits: sample a dataset per quarter and verify traceability, approvals, and completeness of evidence (ISO/IEC 42001:2023 Artificial intelligence — Management system).

Frequently Asked Questions

Does A.7.6 apply if we only use third-party foundation models and don’t train anything?

Yes if you prepare data used by the AI system, such as building a RAG corpus, transforming inputs, or curating evaluation sets (ISO/IEC 42001:2023 Artificial intelligence — Management system). Your process must cover those preparation steps and retain evidence.

What counts as “data preparation” for RAG?

Ingestion, cleaning, deduplication, chunking, metadata tagging, embedding generation, and filtering rules all qualify as preparation because they shape what the model can retrieve. Treat these as controlled transformations with versioning and approvals (ISO/IEC 42001:2023 Artificial intelligence — Management system).

Do we need bias mitigation for every dataset?

The requirement expects processes that include bias mitigation as relevant to AI use (ISO/IEC 42001:2023 Artificial intelligence — Management system). Define when bias checks are mandatory based on the use case and data type, then document and enforce that decision rule.

How do we handle fast-moving experimentation without slowing teams down?

Set a clear line: experimentation can be lighter, but anything used for production, customer outputs, or formal performance claims must pass the full DPS gates. Provide templates and automation so evidence capture is part of the workflow, not extra admin.

What evidence is most often missing in audits?

Dataset lineage, the exact transformation parameters used, and labeling QA records are common gaps. Fix this with a required “preparation run record” tied to dataset version IDs and enforced in your promotion gate.

Can we outsource labeling and still meet A.7.6?

Yes, but you still own the control. Require the third party to follow your label taxonomy and QA rules, and perform intake checks before approving labeled data for AI use (ISO/IEC 42001:2023 Artificial intelligence — Management system).
