Acquisition of data

To meet the ISO/IEC 42001 “acquisition of data” requirement, you need a defined, repeatable process for sourcing AI data, with documentation showing the data was obtained lawfully, ethically, and with appropriate consent or authorization, plus controls for third-party data purchases and internal data reuse. Your goal is audit-ready traceability from “where data came from” to “why you’re allowed to use it.” 1

Key takeaways:

  • Build a single intake-to-approval workflow for all AI data sources (internal, third party, open/public).
  • Require documented rights/consent, purpose alignment, and provenance before data enters training, fine-tuning, or evaluation.
  • Retain evidence that decisions were made, risks were assessed, and restrictions flow into engineering controls.

“Acquisition of data” sounds narrow, but for ISO/IEC 42001 it becomes the front door to your AI risk posture: if you cannot show you had the right to obtain and use data, everything downstream (model performance, privacy, security, fairness, and IP posture) inherits that defect. Annex A Control A.7.3 sets a simple expectation: establish processes for acquiring data for AI systems. 1

For a Compliance Officer, CCO, or GRC lead, the fastest way to operationalize this is to treat “data acquisition” like third-party onboarding plus privacy/IP gating, then bind it to technical enforcement. A process that lives only in policy will fail under audit if engineering teams can still pull datasets from ad hoc sources, scrape without review, or accept third-party datasets without rights, restrictions, and provenance captured.

This page gives you requirement-level guidance you can implement quickly: who needs to follow it, what decisions must be made before data is ingested, what artifacts to retain, and how auditors typically test it. It also includes a practical execution plan and common mistakes seen in real AI programs.

Regulatory text

Control requirement: “The organization shall establish processes for acquiring data for AI systems.” 1

Operator interpretation (what you must do):

  • Define and run a documented process that governs how data is sourced for AI use cases.
  • Ensure acquisition aligns with legal and ethical expectations, including consent and authorization where relevant. 1
  • Make the process repeatable: it must work for internal data reuse, third-party datasets, and data collected directly from individuals or customers.
  • Produce evidence. Auditors will look for proof that the process is followed, not just that it exists.

Plain-English interpretation

You need a controlled “data intake” pipeline for AI. Before any dataset is used for training, fine-tuning, retrieval, evaluation, or monitoring, someone must answer and document:

  1. Where did this data come from?
  2. Do we have the right to use it for this AI purpose?
  3. What restrictions apply (use limits, retention, onward sharing, geography, confidentiality)?
  4. Who approved it, based on what review?
  5. How do those restrictions get enforced in tooling and workflows?

If you can’t answer those questions quickly, you do not have a compliant acquisition process under this control. 1

Who it applies to

Entity scope: Organizations that build, deploy, or use AI systems, including AI providers and AI users. 1

Operational scope (where this bites in practice):

  • Product/ML teams sourcing training or fine-tuning corpora.
  • Data engineering ingesting external datasets into lakes/warehouses used by AI.
  • Procurement / third-party risk buying datasets, labeling services, or data enrichment.
  • Privacy, legal, and security approving collection methods, rights, and restrictions.
  • Business owners requesting new data sources for AI features, analytics, or automation.

If AI teams can acquire data without a gating step (ticket, review, or contract check), your control is effectively absent even if a policy exists.

What you actually need to do (step-by-step)

1) Define “AI data acquisition” for your organization

Create a short scoping statement that covers:

  • Data obtained from third parties (purchase, license, exchange, partnership)
  • Data collected directly (apps, devices, forms, customer interactions)
  • Internal data reuse (repurposing operational data for AI training/evaluation)
  • Public/open datasets and web data (including scraping, if relevant)

Make the scope explicit so teams don’t treat it as “only purchased datasets.”

2) Stand up a single intake workflow (one front door)

Implement one request path for any new dataset intended for AI:

  • A form or ticket with required fields (source, owner, purpose, dataset description, sensitivity, geography, planned use)
  • Automatic routing to the right approvers (legal/privacy/security/data governance)
  • A clear “stop/go” decision and conditions of use

Practical tip: keep the form short but non-negotiable. Teams will bypass a long questionnaire.
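
To make the required fields concrete, here is a minimal sketch of an intake record modeled as a Python dataclass. The field names and example values are illustrative assumptions, not anything ISO/IEC 42001 prescribes; map them onto whatever your ticketing system actually captures.

    from __future__ import annotations

    from dataclasses import dataclass, field
    from datetime import date

    # Illustrative intake record; field names are assumptions, not standard-mandated.
    @dataclass
    class DatasetIntakeRequest:
        dataset_name: str
        source: str                  # e.g., "Acme Data Ltd." or "internal CRM export"
        owner: str                   # accountable business/data owner
        purpose: str                 # intended AI use, e.g., "fine-tune support chatbot"
        description: str
        sensitivity: str             # e.g., "public", "internal", "personal", "regulated"
        geography: str               # where the data originates / may be stored
        planned_uses: list[str] = field(default_factory=list)  # "train", "evaluate", "rag"
        approvers: list[str] = field(default_factory=list)     # legal/privacy/security sign-offs
        decision: str = "pending"    # "approved", "rejected", "approved-with-conditions"
        conditions: list[str] = field(default_factory=list)    # e.g., "no-train", "no-export"
        approved_on: date | None = None

The design point: decision and conditions are first-class fields, so an approval can never be recorded without its conditions of use.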

3) Require provenance and rights checks before ingestion

Set minimum checks that must be completed and documented:

  • Provenance: origin, supplier, collection method, chain of custody, and whether data has been modified/aggregated.
  • Rights and authorization: license terms, consent/notice basis (if applicable), contractual permissions, and restrictions on AI training or derivative works.
  • Purpose alignment: documented statement that the intended AI use matches allowed use.
  • Third-party risk linkage: if a third party supplies data or labeling, connect the dataset approval to third-party due diligence and contract review.

This is where many programs fail: they record “vendor name” but not “what we are allowed to do with the data.”
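
A pre-ingestion gate can enforce these minimum checks mechanically rather than by convention. A minimal sketch, assuming each intake record carries the four checks above as fields (the field names are assumptions):

    # Illustrative pre-ingestion gate; check names are assumptions.
    REQUIRED_CHECKS = ("provenance", "rights_basis", "purpose_alignment", "third_party_dd")

    def may_ingest(record: dict) -> tuple[bool, list[str]]:
        """Return (ok, missing) for a dataset intake record."""
        missing = [c for c in REQUIRED_CHECKS if not record.get(c)]
        return (len(missing) == 0, missing)

    ok, missing = may_ingest({
        "provenance": "purchased from Acme Data Ltd., collected via opt-in survey",
        "rights_basis": "license v2.1, AI training permitted, no resale",
        "purpose_alignment": "intended use (fine-tuning) is within licensed scope",
        "third_party_dd": None,  # due diligence not yet complete -> blocks ingestion
    })
    assert not ok and missing == ["third_party_dd"]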

4) Classify the dataset and set handling requirements

Define a lightweight classification for AI datasets (example categories):

  • Public / open
  • Internal confidential
  • Personal data / sensitive personal data
  • Regulated data (if applicable to your business context)
  • High-risk datasets (data about minors, biometrics, precise location, or other high-impact elements as defined internally)

Then map each category to handling rules:

  • Allowed storage zones
  • Encryption and access controls
  • Retention and deletion requirements
  • Restrictions on use for training vs. evaluation vs. RAG retrieval
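
One lightweight way to keep this category-to-rules mapping unambiguous is a single shared table that both reviewers and pipeline tooling read. The sketch below covers three of the categories above; the rule values are placeholder assumptions to adapt to your own standard.

    # Illustrative mapping from dataset classification to handling rules.
    # Categories and rule values are assumptions; align them with your standard.
    HANDLING_RULES = {
        "public": {
            "storage_zones": ["shared-lake"],
            "encryption": "at-rest",
            "retention_days": None,           # no mandated deletion
            "allowed_uses": {"train", "evaluate", "rag"},
        },
        "personal": {
            "storage_zones": ["restricted-lake"],
            "encryption": "at-rest+in-transit",
            "retention_days": 730,
            "allowed_uses": {"evaluate"},     # e.g., no training without extra approval
        },
        "high_risk": {
            "storage_zones": ["isolated-project"],
            "encryption": "at-rest+in-transit",
            "retention_days": 365,
            "allowed_uses": set(),            # blocked pending case-by-case review
        },
    }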

5) Bind approval conditions to technical controls

Write approvals so they can be enforced:

  • Data location (approved bucket/project/workspace)
  • Access group and least privilege roles
  • Tagging/labels in catalog (e.g., “no-train”, “evaluation-only”, “no-export”)
  • Logging requirements (who accessed, when exported)
  • Deletion triggers when rights expire or purpose ends

If your approval says “evaluation-only” but engineers can still feed it into training pipelines, auditors will treat the control as weak.
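
In practice, “no tag, no ingest” can be a short guardrail at the top of the training pipeline. A minimal sketch, assuming datasets carry catalog tags like those listed above (the tag names are illustrative):

    # Illustrative training-pipeline guardrail; tag names are assumptions.
    BLOCKING_TAGS = {"no-train", "evaluation-only"}

    def assert_trainable(dataset_id: str, tags: set[str]) -> None:
        """Raise before training if a dataset is untagged or restricted."""
        if not tags:
            raise PermissionError(f"{dataset_id}: untagged dataset - no tag, no ingest")
        blocked = tags & BLOCKING_TAGS
        if blocked:
            raise PermissionError(f"{dataset_id}: blocked by tags {sorted(blocked)}")

    assert_trainable("support-tickets-2024", {"personal", "approved"})   # passes
    # assert_trainable("vendor-eval-set", {"evaluation-only"})           # raises

Failing closed on untagged datasets is the important design choice: it turns a missing approval into a pipeline error instead of a silent policy gap.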

6) Establish third-party dataset contract minimums

For data acquired from third parties, set standard contract clauses or review points:

  • Clear grant of rights for intended AI uses
  • Restrictions on onward sharing, model training, and derivative outputs (where relevant)
  • Data quality representations and disclosure of known limitations (where offered)
  • Termination, return, and deletion obligations
  • Audit rights or reporting expectations where feasible
  • Indemnities appropriate to your risk appetite (coordinate with counsel)

Your process should force a legal review before procurement can sign.

7) Operationalize with a dataset register (inventory)

Maintain an AI dataset register that records:

  • Dataset name, owner, and system(s) using it
  • Source and acquisition method
  • Rights/consent basis reference
  • Classification and restrictions
  • Approval date, approvers, and expiry/renewal triggers

This register becomes your audit index. Without it, you will scramble to reconstruct provenance across emails and notebooks.
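
Because the register is your audit index, it should answer the auditor’s opening question (“pick a model; where did its data come from?”) in one lookup. A minimal sketch, assuming the register is queryable and uses fields like those listed above (all names are assumptions):

    # Illustrative audit query over a dataset register; field names are assumptions.
    register = [
        {"dataset": "support-tickets-2024", "systems": ["support-bot-v2"],
         "source": "internal CRM", "rights_ref": "DPA-113", "classification": "personal",
         "restrictions": ["evaluation-only"], "approved": "2024-05-02",
         "approvers": ["privacy", "legal"], "renewal": "2025-05-02"},
    ]

    def trace(system: str) -> list[dict]:
        """Return every register entry feeding a given AI system."""
        return [row for row in register if system in row["systems"]]

    for row in trace("support-bot-v2"):
        print(row["dataset"], row["source"], row["rights_ref"], row["restrictions"])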

8) Monitor and enforce ongoing compliance

Add recurring controls:

  • Periodic access reviews for restricted datasets
  • Renewal checks for expiring licenses
  • Drift checks: confirm the dataset isn’t being used outside the approved purpose
  • Exception management: documented, time-bound exceptions with compensating controls and approvals
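
The renewal check in particular is easy to automate as a scheduled sweep over the same register. A minimal sketch, assuming each entry records its renewal date as an ISO date string (a field-name assumption):

    from datetime import date, timedelta

    # Illustrative renewal sweep over the dataset register; "renewal" field is an assumption.
    def expiring_soon(register: list[dict], within_days: int = 30) -> list[dict]:
        """Return register entries whose license/renewal date falls within the window."""
        cutoff = date.today() + timedelta(days=within_days)
        return [
            row for row in register
            if row.get("renewal") and date.fromisoformat(row["renewal"]) <= cutoff
        ]

    # Feed the result into your ticketing system to open renewal reviews automatically.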

Where Daydream fits

Most teams stall on evidence collection and traceability: dataset intake tickets, contract artifacts, approvals, and linkage to technical enforcement live in different systems. Daydream can act as the control spine by centralizing dataset intake, mapping each dataset to AI systems and third parties, and keeping audit-ready evidence tied to each approval decision.

Required evidence and artifacts to retain

Auditors typically want objective proof that acquisition is controlled and repeatable. Retain:

  • Data acquisition policy/standard (scope, roles, minimum checks) 1
  • Data intake workflow artifacts: completed request forms/tickets, approvals, conditions of use
  • Contracts and license terms for third-party datasets; amendments and renewals
  • Consent/authorization records or documented basis for collection/processing where relevant 1
  • Dataset register/inventory tied to AI systems
  • Data classification results and handling requirements
  • Access control evidence: group membership, approvals, access review outputs
  • Technical enforcement evidence: tags/labels, pipeline guardrails, logging configurations
  • Exceptions: justification, risk acceptance, compensating controls, expiry

Common exam/audit questions and hangups

Expect auditors to test both governance and engineering reality:

  • “Show me your process for acquiring AI training data.” (They will look for a workflow, not a PDF.) 1
  • “Pick a model in production. Where did its training/evaluation data come from?”
  • “How do you confirm you had rights to use this dataset for training?”
  • “What prevents a developer from importing an unapproved dataset?”
  • “How do you handle open/public datasets and web-collected data?”
  • “Where are data acquisition approvals stored, and how long do you retain them?”

Hangup: teams can describe the process verbally but cannot produce a single place where datasets, approvals, and restrictions are recorded.

Frequent implementation mistakes (and how to avoid them)

  1. Process exists only for procurement-purchased data.
    Fix: include internal data reuse and “found data” (public/open, scraped, partner-shared) in scope.

  2. Approvals are generic (“approved”) without conditions.
    Fix: require explicit allowed uses (train, fine-tune, evaluate, RAG only), storage zone, and expiry/renewal triggers.

  3. No linkage from rights to technical enforcement.
    Fix: require dataset tags and pipeline guardrails. Make “no tag, no ingest” the engineering rule.

  4. Dataset inventory is missing or stale.
    Fix: make registration automatic from the intake ticket, or block production use until the dataset is in the register.

  5. Third-party risk is disconnected from dataset acquisition.
    Fix: tie each dataset to the supplying third party and require due diligence completion before final approval.

Enforcement context and risk implications

No public enforcement cases were provided in the source catalog for this requirement, so this page does not list specific actions.

Operationally, weak data acquisition controls create predictable failure modes:

  • Using data outside license/authorization constraints
  • Inability to prove consent or permitted use for AI
  • Surprise restrictions discovered late (after training costs and product commitments)
  • Uncontrolled inclusion of sensitive data in training or evaluation sets

Treat this as a root-cause control. If it fails, downstream controls become harder to defend.

Practical 30/60/90-day execution plan

First 30 days (Immediate)

  • Assign an owner for the data acquisition process (GRC or data governance) and define RACI with legal, privacy, security, and ML leads.
  • Publish a one-page standard: what counts as AI data acquisition, what must be reviewed, and what evidence must be retained. 1
  • Implement a single intake form/ticket and require it for any new AI dataset.
  • Start a minimal dataset register for current production AI systems (even if incomplete).

By 60 days (Near-term)

  • Add approval routing and minimum checks: provenance, rights/consent basis, purpose alignment, classification, and restrictions.
  • Standardize third-party dataset contract review checkpoints with legal.
  • Put basic technical enforcement in place: approved storage zones, access groups, required tagging.
  • Run a retrofit exercise for top-priority AI systems: document sources and rights for existing datasets.

By 90 days (Operationalize and scale)

  • Expand the dataset register to cover all AI systems (training, evaluation, RAG, monitoring).
  • Add recurring controls: license renewal reminders, access reviews, exception review cadence.
  • Conduct an internal audit-style walkthrough: pick an AI system, trace all datasets from source to enforcement evidence.
  • If tooling sprawl is blocking evidence, consolidate intake, inventory, and evidence capture in a platform such as Daydream.

Frequently Asked Questions

Do we need this process if we only buy AI models and don’t train them?

Yes, if you acquire data for AI use cases like retrieval (RAG), evaluation, monitoring, or fine-tuning. If your organization supplies no data to any AI system, document that boundary and keep evidence of the decision. 1

Does “acquisition of data” include internal data pulled from our own databases?

Yes. Internal reuse still requires purpose alignment, authorization, and restrictions because the original collection context may not match AI training or evaluation needs. Record the source system, owner approval, and allowed uses. 1

How should we handle open/public datasets?

Treat them as a distinct acquisition path with provenance, license/terms review, and documentation of permitted uses. Keep a copy or snapshot of the applicable terms and record where the dataset was obtained.

What evidence is “enough” for an auditor?

You need a trace from dataset source to approval to enforcement: intake record, rights/authorization basis, classification and restrictions, and proof those restrictions are applied in access controls or pipelines. A dataset register makes this tractable.

What if engineering already trained models on data with unclear rights?

Open an exception or remediation record, stop further use until reviewed, and perform a retroactive provenance and rights assessment. Document the decision and apply constraints or replacement datasets as needed. 1

Can we decentralize approvals to product teams?

Yes, if you standardize the checks, require evidence capture, and keep independent oversight (privacy/legal/security) for higher-risk data sources. Decentralization without consistent artifacts will fail audit sampling.

Footnotes

  1. ISO/IEC 42001:2023 Information technology — Artificial intelligence — Management system

Operationalize this requirement

Map requirement text to controls, owners, evidence, and review workflows inside Daydream.

See Daydream