Data resources
To meet the ISO 42001 data resources requirement, you must maintain an accurate, reviewable inventory of every data resource your AI systems use across the lifecycle, including training, validation/testing, and operational (production) data. The inventory must show where data came from, how it is accessed, and how it is governed so you can manage risk, changes, and accountability.
Key takeaways:
- Build a data resource inventory per AI system that covers training, validation/testing, and operational data.
- Tie each dataset to ownership, source, purpose, access path, and change controls so it survives audits and model updates.
- Keep evidence: dataset “datasheets,” lineage, access logs, approvals, and refresh/change history.
“Data resources” sounds simple until an auditor asks: “Which exact datasets shaped this model’s behavior, and who approved their use?” Annex A Control A.4.3 in ISO/IEC 42001 requires you to identify and document the data resources used by AI systems. That requirement is less about documentation for its own sake and more about operational control: you cannot manage data quality, privacy, IP rights, bias risk, retention, or incident response if you cannot name the data, locate it, and show how it flows into the AI system.
For most organizations, the fastest path is to treat this as an inventory-and-lineage control anchored to each AI system, not a generic enterprise data catalog effort. You need a practical register that a compliance officer, model owner, security team, and auditor can all read. It should answer: what data is used, where it came from, where it sits, who owns it, how it is accessed, what transformations occur, and what happens when the dataset changes. This page gives requirement-level steps, audit-ready artifacts, and a phased execution plan you can implement without waiting on a full data governance program.
Regulatory text
Requirement (excerpt): “The organization shall identify and document the data resources utilized by AI systems.” 1
Operator meaning: For each AI system in scope, you must be able to (1) list the data resources the system uses and (2) produce documentation that a reviewer can trace from dataset source to AI use. Documentation needs to cover the major data phases called out in the plain-language summary: training data, validation/testing data, and operational data. 1
Plain-English interpretation
You need a living, version-aware inventory of datasets and other data inputs that influence your AI system’s outputs. If data changes, access changes, or a new dataset is introduced, your documentation must change too. The goal is repeatability: another competent person should be able to reconstruct “what data was used” for a given model version and timeframe without guessing.
Who this applies to
Entity types: AI providers, AI users, and organizations operating AI systems. 1
Operational contexts where this becomes “real”:
- You build models (first-party development), including fine-tuning or retraining.
- You deploy third-party models but provide your own prompts, retrieval data, tools, or feedback data (you still have operational data resources).
- You use RAG (retrieval augmented generation) where internal documents, knowledge bases, or ticket histories shape answers.
- You run monitoring/feedback loops where user interactions are captured and later used for improvements.
Important scoping point: “Data resources” is broader than “datasets in the training pipeline.” It includes production inputs (prompts, documents retrieved, sensor streams), reference data (lookup tables, taxonomies), labels/ground truth, evaluation sets, and feedback data that drives updates.
What you actually need to do (step-by-step)
Step 1: Define “AI system” boundaries and name owners
Create or reuse an AI system register, then assign:
- System owner (business accountability)
- Model owner (technical accountability)
- Data owner/steward for each major dataset
- Control owner for the data resources inventory itself
If ownership is unclear, the inventory will rot. Auditors spot stale registers quickly.
Step 2: Map data resource categories per AI system
For each AI system, enumerate data resources across the lifecycle:
- Training data (raw and curated)
- Validation/testing data (holdout sets, red-team corpora, eval prompts)
- Operational data (live inputs, retrieved documents, user prompts, telemetry, logs)
This matches the intent described in the requirement summary. 1
Practical technique: run a 60–90 minute workshop with engineering + product + security and whiteboard the system’s data flow. Then convert the outputs into the inventory fields below.
Step 3: Document minimum required fields in a “Data Resources Register”
Keep it simple but complete. For each data resource, capture:
| Field | What “good” looks like | Why auditors care |
|---|---|---|
| Data resource name + unique ID | Stable ID, human-readable name | Prevents confusion across similarly named tables/buckets |
| AI system(s) using it | Explicit linkage | Proves completeness per system |
| Lifecycle role | Training / validation-testing / operational | Ensures you didn’t only document training |
| Source type | Internal / third party / public / customer-provided | Drives privacy, IP, and contract analysis |
| System of record + location | Database, bucket, SaaS app; region if relevant | Supports security and residency controls |
| Collection method | API, batch export, manual upload, web scrape, sensors | Affects legality, consent, and integrity |
| Transformations | Cleaning, labeling, embedding generation, feature extraction | Enables lineage and reproducibility |
| Access path + auth | Service accounts, roles, network path | Links to least privilege and monitoring |
| Data owner/steward | Named accountable role | Enables approvals and change control |
| Refresh/change pattern | On-demand, periodic, event-driven; versioning approach | Supports impact analysis |
| Retention/deletion rule | What is kept, where, and for how long (policy link) | Aligns with governance requirements |
| Known constraints | License limits, consent limits, prohibited uses | Prevents misuse over time |
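The register fields above can be captured in a machine-readable record so entries are consistent and queryable. A minimal sketch is below; the field names mirror the table but are illustrative, not mandated by ISO 42001, and the example values (IDs, paths, role names) are hypothetical.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DataResource:
    """One entry in a Data Resources Register (illustrative schema)."""
    resource_id: str                  # stable unique ID
    name: str                         # human-readable name
    ai_systems: list                  # AI system(s) using this resource
    lifecycle_role: str               # "training" | "validation-testing" | "operational"
    source_type: str                  # "internal" | "third-party" | "public" | "customer"
    location: str                     # system of record + region if relevant
    collection_method: str            # API, batch export, manual upload, scrape, sensors
    transformations: list = field(default_factory=list)  # cleaning, labeling, embeddings
    access_path: str = ""             # service account / role / network path
    owner: str = ""                   # named data owner or steward
    refresh_pattern: str = ""         # on-demand, periodic, event-driven
    retention_rule: str = ""          # link to retention/deletion policy
    constraints: list = field(default_factory=list)      # license/consent limits

entry = DataResource(
    resource_id="DR-0042",                               # hypothetical ID
    name="support-ticket-corpus",
    ai_systems=["helpdesk-assistant"],
    lifecycle_role="operational",
    source_type="internal",
    location="warehouse.support.tickets (eu-west-1)",    # hypothetical location
    collection_method="batch export",
    owner="support-data-steward",
)
print(asdict(entry)["lifecycle_role"])  # -> operational
```

A dataclass (or an equivalent schema in your GRC tool) makes missing fields visible at entry time instead of at audit time.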
Step 4: Add lineage and version control (the part most teams miss)
Your register should answer two audit questions:
- “Which model version used which dataset version?”
- “Can you reproduce the evaluation inputs that drove go/no-go?”
Operationalize this by:
- Storing dataset version identifiers (table snapshot ID, object version, git/DVC hash, export timestamp).
- Logging pipeline runs and linking them to dataset versions.
- Recording evaluation set versions and the test harness used.
If you cannot do full reproducibility, document the limitation and implement compensating controls (for example, change approvals plus enhanced monitoring after data updates).
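One lightweight way to make dataset versions traceable is to derive a content hash per dataset snapshot and record it alongside the model version in pipeline-run metadata. The sketch below assumes datasets can be serialized to JSON-comparable records; names and versions are hypothetical.

```python
import hashlib
import json

def dataset_version_id(records):
    """Content hash over a dataset snapshot (order-insensitive for this sketch)."""
    canonical = json.dumps(sorted(records), sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

def record_lineage(model_version, datasets):
    """Link a model release to the exact dataset versions it used."""
    return {
        "model_version": model_version,
        "datasets": {name: dataset_version_id(rows) for name, rows in datasets.items()},
    }

run = record_lineage("helpdesk-assistant-v3", {   # hypothetical release tag
    "train": ["ticket-1", "ticket-2"],
    "eval": ["holdout-1"],
})
# Identical data yields an identical version ID, so historical state
# ("which model version used which dataset version?") is reconstructable.
assert run["datasets"]["train"] == dataset_version_id(["ticket-2", "ticket-1"])
print(run["model_version"])
```

Tools like DVC or object-store versioning give you these identifiers for free; the point is that the register stores an identifier you can query, not a prose description.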
Step 5: Put a change gate in front of new or changed data resources
Implement a lightweight workflow:
- New dataset request (or material change) triggers review by data owner + security/privacy + AI system owner.
- Review checks: permitted purpose, access controls, sensitivity, third-party terms, and whether model documentation needs updates.
- Approval is recorded and linked in the register.
This is where a system like Daydream fits naturally: it can route dataset onboarding through third-party due diligence (for external sources), track approvals, and keep evidence tied to the AI system record so audits don’t turn into inbox archaeology.
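The review gate itself can start as a simple check that refuses to register a resource without the required sign-offs. A sketch under the assumption that the three reviewer roles above are the minimum set (role names are illustrative):

```python
# Minimum sign-offs for a new or materially changed data resource
# (illustrative; adapt to your own approval matrix).
REQUIRED_APPROVERS = {"data_owner", "security_privacy", "ai_system_owner"}

def gate_dataset_change(request):
    """Return (approved, missing_roles) for a data resource change request."""
    granted = {a["role"] for a in request.get("approvals", []) if a.get("signed")}
    missing = REQUIRED_APPROVERS - granted
    return (not missing, sorted(missing))

ok, missing = gate_dataset_change({
    "resource_id": "DR-0042",   # hypothetical ID
    "approvals": [
        {"role": "data_owner", "signed": True},
        {"role": "security_privacy", "signed": True},
    ],
})
print(ok, missing)  # -> False ['ai_system_owner']
```

Wiring this check into a ticketing workflow or CI pipeline makes the approval record itself the evidence the register links to.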
Step 6: Monitor for drift in the inventory (ongoing control)
Documentation breaks when engineering moves fast. Add:
- A recurring attestation from AI system owners that the inventory is current.
- Automated detection where possible (cloud asset inventory for buckets, data warehouse schema change alerts, pipeline job changes).
- A KPI that is qualitative but operational, such as “all production AI endpoints have documented operational data inputs,” and treat exceptions as issues to remediate.
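Automated drift detection can begin as a diff between the register and what discovery tooling reports. The sketch below assumes you can already enumerate live data sources (via cloud asset inventory or warehouse metadata); the source names are hypothetical.

```python
def inventory_drift(registered, discovered):
    """Compare declared register entries against discovered data sources.

    Returns sources seen in production but absent from the register
    (undocumented) and register entries no discovery tool can find (stale).
    """
    registered, discovered = set(registered), set(discovered)
    return {
        "undocumented": sorted(discovered - registered),
        "stale": sorted(registered - discovered),
    }

drift = inventory_drift(
    registered={"warehouse.support.tickets", "warehouse.feedback"},
    discovered={"warehouse.support.tickets", "warehouse.feedback",
                "warehouse.analyst_export"},   # shadow copy nobody registered
)
print(drift["undocumented"])  # -> ['warehouse.analyst_export']
```

Each item in `undocumented` becomes a remediation issue with an owner, which is exactly the exception handling the KPI above calls for.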
Required evidence and artifacts to retain
Retain artifacts that prove both identification and documentation:
Core artifacts
- Data Resources Register 2
- Data flow diagram(s) showing data sources → pipelines → model/service endpoints
- Dataset documentation (“datasheet” style) for high-risk datasets: provenance, fields, sensitivity, allowed uses
- Access control evidence: role mappings, service account list, approvals
- Change control records: tickets/PRs, approvals, release notes tied to data changes
- Data processing agreements and third-party terms for external datasets (where applicable)
- Logs/records linking model versions to dataset versions (pipeline run metadata)
Retention tip: Keep “point-in-time” snapshots for significant model releases. Audits often focus on what was true at release time, not what is true today.
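A point-in-time snapshot can be produced by serializing the register at release time and fingerprinting it, so you can later prove what the register said when a model shipped. A minimal sketch (the release tag and dates are hypothetical):

```python
import hashlib
import json

def release_snapshot(register, model_version, release_date):
    """Freeze the register state for a model release as a verifiable artifact."""
    payload = json.dumps(register, sort_keys=True)
    return {
        "model_version": model_version,
        "release_date": release_date,
        "register": register,
        "digest": hashlib.sha256(payload.encode()).hexdigest(),
    }

snap = release_snapshot(
    register=[{"resource_id": "DR-0042", "lifecycle_role": "operational"}],
    model_version="helpdesk-assistant-v3",
    release_date="2025-01-15",
)
# Any later edit to the register yields a different digest, so the
# snapshot proves what was true at release time.
assert len(snap["digest"]) == 64
print(snap["model_version"])
```

Storing these snapshots next to release notes answers "what was true at release time" without archaeology.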
Common exam/audit questions and hangups
Expect questions like:
- “Show me every data source used in production by this AI feature, including retrieval sources and logging.”
- “How do you know engineers didn’t add a new dataset without review?”
- “Who owns this dataset, and what are the permitted uses?”
- “Can you tie a customer incident to the exact data resources involved?”
- “How do you separate training data from evaluation data to avoid contamination?”
Hangups that slow exams:
- Inventory lists systems but not datasets, or lists datasets but not where they are used.
- “Operational data” is omitted entirely (prompts, retrieved docs, feedback).
- No versioning evidence, so the organization cannot reconstruct historical state.
Frequent implementation mistakes (and how to avoid them)
- Treating this as a one-time spreadsheet exercise. Fix: make the register a controlled record with an owner, change workflow, and periodic review.
- Only documenting training data. Fix: explicitly require operational inputs (RAG corpora, prompts, logs, feedback) in the template.
- No third-party data governance. Fix: tag each data resource as internal vs third party, then link third-party data to contract terms and due diligence artifacts.
- Lineage described in prose, not traceable identifiers. Fix: store dataset version IDs and pipeline run IDs that can be queried and reproduced.
- Shadow data copies (exports sitting in buckets, analyst extracts). Fix: add "known replicas/derivatives" to each data resource entry and control who can create them.
Enforcement context and risk implications
No public enforcement cases were provided in the source catalog for this requirement. Practically, weak data resource documentation increases your risk in three ways:
- Incident response risk: you cannot quickly scope which data contributed to a harmful output.
- Regulatory and contractual risk: you may breach third-party data terms or internal use limitations without noticing.
- Model risk: you may be unable to explain or reproduce behavior changes after data updates, which undermines governance and stakeholder trust.
Practical execution plan (30/60/90)
Use a phased rollout without hard timing claims. Treat these as maturity steps you can map to your internal calendar.
First 30 days (Immediate)
- Assign ownership for the data resources register and pick a system of record (GRC tool, controlled wiki, or ticketed database).
- Select the initial AI systems in scope (start with customer-facing and high-impact systems).
- Run data flow workshops for the first set of AI systems.
- Publish the register template and required fields, including operational data.
By 60 days (Near-term)
- Complete inventory coverage for the initial AI systems, including dataset owners and access paths.
- Implement a change gate for new/changed data resources (ticket + approvals).
- Attach third-party agreements and usage constraints to external data resources.
- Add dataset version identifiers and pipeline run metadata capture for new model releases.
By 90 days (Operationalize)
- Expand coverage to remaining AI systems.
- Add periodic attestations and exception handling (document gaps as issues with owners and due dates).
- Introduce automated signals where feasible (schema change alerts, cloud bucket discovery, pipeline job changes).
- Test the control: pick one AI system and simulate an audit request, then measure how long it takes to produce dataset lineage and approvals.
Frequently Asked Questions
Do we need to document prompts and user inputs as “data resources”?
If prompts, conversation history, or user inputs are stored and later used by the AI system (for retrieval, fine-tuning, evaluation, or analytics), treat them as operational data resources and document where they are stored, who can access them, and how they are governed.
We use a third-party model. Does A.4.3 still apply?
Yes, if your AI system uses your data resources with that model (RAG sources, logs, feedback, tool outputs). Document the data you provide to the model and, where available, the third party's data-related terms that affect your permitted use.
How detailed does “identify and document” need to be?
Detailed enough that a reviewer can trace from the AI system to specific data sources and understand provenance, ownership, access, and change history. If you cannot reconstruct what data was used for a release, auditors will treat the documentation as incomplete.
Do we need a full enterprise data catalog to comply?
No. Start with an AI-system-centric register that covers the datasets and inputs that materially affect AI behavior. You can later integrate it with a broader data catalog if your organization has one.
How do we handle rapidly changing operational data like logs or streaming data?
Document the stream/source, schemas, storage locations, retention rules, and the controls around access and downstream use. Add a note on what aspects are not reproducible and what compensating controls you apply (approvals, monitoring, and release testing).
What evidence is most persuasive in an audit?
A current register tied to each AI system, plus lineage/version records that link model releases to dataset versions and approvals. Auditors also value clear ownership and a visible change workflow over long narrative documents.
Footnotes
Authoritative Sources
Operationalize this requirement
Map requirement text to controls, owners, evidence, and review workflows inside Daydream.
See Daydream