Filling out the AI Office's Art. 53(1)(d) training-content summary template — field by field

Published: 4 May 2026

Article 53(1)(d) of the EU AI Act (Regulation (EU) 2024/1689) obliges every general-purpose AI (GPAI) model provider to publish a "sufficiently detailed summary about the content used for training" using a template the European AI Office (DG-CNECT) released on 24 July 2025[^1] ✅. The template is mandatory for new GPAI models placed on the EU market from 2 August 2025, and for legacy models from 2 August 2027[^2] ✅. As of this writing (3 May 2026) there is no publicly released, field-by-field worked example of a filled template anywhere on the open internet; providers writing their first summary are doing so with the AI Office's nine-section explanatory notice as their only reference.

This piece fills that gap. The worked example throughout is Aurelius-70B, a fictional 70-billion-parameter Mistral-class GPAI model with a reasonably realistic data composition (web-scrape majority, licensed publishers, code, books, synthetic). Aurelius-70B is the reference example we will return to in later pieces in this series. The teaser block below is what its filled Section 2 (data sources by category) looks like; the body of this guide walks every section.

Aurelius-70B — Section 2 (extract): Total tokens used in training: 2.1 trillion. Composition: (a) Public web crawl — 1.42T tokens; CommonCrawl 2018–2024 snapshots filtered against an ai.txt + robots.txt opt-out register of 412,318 distinct second-level domains as of crawl-stop date 2025-11-30; (b) Licensed publisher feeds — 187B tokens; 14 contracts, each enumerated in §10 of the model's Annex XI technical documentation; (c) Open-licensed code corpora — 219B tokens; The Stack v2 (2024-12 snapshot) restricted to permissive-license filter; (d) Books — 86B tokens; Project Gutenberg + Internet Archive Open Library public-domain subset only (no LibGen / PiLiMi / Books3 — explicit exclusion at compile time, see §3); (e) Synthetic instruction data — 45B tokens generated by Aurelius-70B-base-checkpoint-9 over a 12B-token seed of human-written prompts; (f) Other — 141B tokens (Wikipedia 2024-09 dump, StackExchange archive Q3 2024, arXiv abstracts pre-2025).

The rest of this guide explains, section by section, what the AI Office is asking for, what a defensible answer looks like, and what kind of evidence-emission tooling makes the answer falsifiable rather than asserted.

Why this template is unusually load-bearing

The Italian Garante's €15M provisional fine against OpenAI ✅ (issued 2 November 2024, annulled by the Court of Rome on 18 March 2026 on procedural grounds[^3]) is currently the only contested European data point on training-content adequacy. The annulment turned on procedural defects, not on the substantive lawful-basis question, which means the obligation is widely commented on but not yet substantively litigated. The AI Office has explicitly said it will not perform content-level audits; it will act on third-party complaints and the GPAI scientific panel's "qualified alerts"[^4]. That places the burden on rights-holders, civil society, and peer scrutiny to convert a narrative summary into a falsifiable claim.

Two framing points before the walkthrough. First, Article 53(2) carves free-and-open-source GPAI providers out of the technical-documentation obligations of Article 53(1)(a) and (b), but explicitly does not carve them out of (c) copyright policy or (d) training-content summary[^5]; the obligation is agnostic to the model's release licence. Second, fines for GPAI providers that fail to comply with Articles 53–55 reach up to €15M or 3% of worldwide annual turnover, whichever is higher[^6] ✅; for GPAI providers the ceiling sits in Article 101(1), which mirrors the Article 99(4) ceiling applicable to other operators.

Section 1 — General information about the model

The AI Office's first section asks for unambiguous identification of the model and the legal entity behind it. The mandatory fields are:

  • Provider legal name + EU representative (where the provider is established outside the Union)
  • Model name + version + release date
  • Model architecture family (transformer / mixture-of-experts / state-space / diffusion / etc.)
  • Modality coverage (text / image / audio / multi-modal)
  • Whether the model is placed on the market as part of a service (chat assistant, API) or distributed as weights
  • Whether the provider considers the model to fall under Article 51's systemic-risk presumption (more than 10²⁵ FLOPs of compute used for training)[^7]

Aurelius-70B's Section 1 reads as follows. Provider: Aurelius AI SAS (fictional, Paris); EU representative: n/a (already EU-established); model: Aurelius-70B v1.0, released 2026-01-15; architecture: dense decoder-only transformer, 70B parameters; modality: text only; placement: both API service and downloadable weights under Apache 2.0; systemic-risk presumption: yes (training compute estimated at 2.4 × 10²⁵ FLOPs[^8] ⚠; recorded in the Article 55 systemic-risk assessment). Being precise here is defensive: misstating the compute figure to dodge the Article 55 obligations is exactly the surface a regulator would interrogate first.

What veric emits here: the model identifier, version, and release date are configuration, not derived facts. veric's contribution is the provenance certificate (D1) field that pins this section to a specific git SHA and dbt manifest hash, so a regulator can cryptographically verify that the published summary corresponds to the build that produced the released weights. The relevant tag types are T10 source_system (carrying the build identifier through the pipeline) and the certificate-level crate_name field that the Custom Root layer emits per build (per the product-surface doc §"Layer 1 — Substrate primitives" P4).

Section 2 — Data sources by category

This is the section that most providers will spend the most time on, and the one that most third-party scrutiny will land on. The AI Office's explanatory notice asks for a categorical breakdown of training-data sources with token counts per category, temporal coverage of each source, and acquisition route (publicly-scraped / licensed / proprietary / synthetic / user-contributed). Sub-fields to populate per category:

| Sub-field | What goes here |
| --- | --- |
| Source type | "public web crawl" / "licensed feed" / "open dataset" / "synthetic" / "user-contributed" / "proprietary" |
| Volume | tokens (text), images, hours (audio), or other native unit |
| Temporal range | earliest → latest item ingestion date |
| Top-N source identifiers | for web crawl: the snapshot identifiers + cutoff date; for licensed: counterparty name (or category if confidentiality applies); for open: the dataset name + version pin |
| Acquisition route | scrape / contracted / public download / generated |

Aurelius-70B's filled Section 2 was the teaser at the top of this piece. The discipline points worth flagging:

  1. "Web crawl" is not a sufficient identifier, even though it is the path of least resistance. The Aurelius-70B entry names the CommonCrawl snapshot range (2018–2024) and the crawl-stop date (2025-11-30) precisely so that a complainant can correlate against publicly archived ai.txt / robots.txt entries at that date. Without the date, the opt-out claim in Section 3 is unverifiable.
  2. Licensed publisher feeds should be enumerated even where individual contract terms are confidential. The AI Office notice permits aggregating to category, but the count of contracts is not commercially sensitive and is the field a rights-holder will scrutinise.
  3. Synthetic data must declare its generator's lineage under Article 53(1)(d), a point also explicitly called out by California AB 2013 §22757(b)(8) (effective 1 January 2026)[^9] ✅. Aurelius-70B's synthetic block declares the producing checkpoint and names the human-written seed prompts; recursion without a base case is a red flag.
  4. Excluded sources are evidence too. Aurelius-70B enumerates the book corpora it did not use (LibGen, PiLiMi, Books3), the same corpora at the heart of the Bartz v. Anthropic settlement (Sep 2025, $1.5B; ~500K pirated works at ~$3K per work)[^10] ✅. "We did not train on X" carries weight only if verifiable; Section 3 explains the mechanism.
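
The per-category figures in a Section 2 entry should reconcile with the headline total, and that check is easy to automate. The following is a minimal sketch using the Aurelius-70B teaser figures; the `CategoryEntry` record and its field names are illustrative assumptions, not the AI Office's schema and not veric output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryEntry:
    """One Section 2 category row (hypothetical record shape)."""
    label: str
    tokens_billions: int


# Figures copied from the Aurelius-70B Section 2 extract.
ENTRIES = [
    CategoryEntry("Public web crawl", 1420),
    CategoryEntry("Licensed publisher feeds", 187),
    CategoryEntry("Open-licensed code corpora", 219),
    CategoryEntry("Books", 86),
    CategoryEntry("Synthetic instruction data", 45),
    CategoryEntry("Other", 141),
]


def total_tokens_billions(entries):
    """Sum of all category volumes, in billions of tokens."""
    return sum(e.tokens_billions for e in entries)


def matches_headline(entries, headline_trillions, tolerance_trillions=0.05):
    """True if the category sum agrees with the headline figure after rounding."""
    total_trillions = total_tokens_billions(entries) / 1000
    return abs(total_trillions - headline_trillions) <= tolerance_trillions


def share_percent(entries, label):
    """Share of one category in the corpus, rounded to one decimal place."""
    total = total_tokens_billions(entries)
    part = sum(e.tokens_billions for e in entries if e.label == label)
    return round(100 * part / total, 1)


print(total_tokens_billions(ENTRIES))   # 2098
print(matches_headline(ENTRIES, 2.1))   # True
```

The tolerance is a choice made for this sketch: the headline is published to one decimal place, so a category sum of 2.098T is treated as consistent with "2.1 trillion".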

What veric emits here: Section 2 is the canonical use case for the provenance certificate (D1) — shipped today via veric attest — plus the datasheet / Croissant manifest (D4) (Roadmap, ships W3 per the GTM shipping plan §2). The token-count and temporal-range entries are compile-time-derivable summaries — pulled from the dbt manifest hash + source-of-record register without human transcription. The relevant tag types are T2 license, T5 jurisdiction, T10 source_system, and T7 synthetic_origin. The veric v0.1 wedge for Section 2 is the per-build D1 binding the per-category totals to a git SHA; once D4 lands, the structured manifest is mechanically derivable from the same build, so a regulator can in principle re-execute the manifest against source snapshots and verify the totals (regulatory deep-dive §3.2 GAP-2).

Section 3 — Scraping behaviour and TDM opt-out respect

Article 53(1)(c) of the AI Act requires GPAI providers to put in place a policy to comply with Union copyright law, including the text-and-data-mining reservation of rights under Article 4(3) of Directive (EU) 2019/790 (the DSM Directive)[^11] ✅. The AI Office template's Section 3 operationalises this by asking providers to describe:

  • Which crawler(s) were used (custom / GPTBot-style / third-party such as CommonCrawl)
  • How robots.txt directives were respected at fetch time
  • How TDM opt-out signals (X-Robots-Tag: noai, the proposed ai.txt convention, schema.org noai/noimageai, the TDM Reservation Protocol HTTP header) were detected and honoured
  • The effective opt-out register: count of distinct domains where opt-out was respected, with the cut-off date
  • The contact point published by the provider for rights-holders to assert opt-outs post-hoc

Aurelius-70B's Section 3: in-house crawler aurelius-bot/3.2; robots.txt honoured at fetch with a 24-hour TTL refresh; ai.txt + X-Robots-Tag: noai + TDM Reservation Protocol header all consulted; opt-out register of 412,318 distinct second-level domains as of 2025-11-30; rights-holder contact at [email protected] with a 30-day response SLA per GPAI Code of Practice Copyright Measure 4[^12]. The section also notes that the compile-time exclusion list extended beyond the opt-out register to include the LibGen / PiLiMi / Books3 corpora, the Sci-Hub mirror set, and 27 publisher domains under active licensing negotiation as of the crawl-stop date.
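
A fetch-time opt-out check of the kind described above might look like the sketch below. It uses Python's standard `urllib.robotparser` for robots.txt. The header handling is an assumption made for illustration: it treats an `X-Robots-Tag` value containing `noai` and a `tdm-reservation: 1` header as reservations of rights. The ai.txt convention and the 24-hour TTL refresh are left out, and none of this describes the fictional crawler's real implementation.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "aurelius-bot/3.2"  # the fictional crawler named in Section 3


def robots_allows(robots_txt: str, url: str, agent: str = USER_AGENT) -> bool:
    """Evaluate a robots.txt body (already fetched) against a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def headers_reserve_rights(headers: dict) -> bool:
    """True if the response headers carry an opt-out signal this sketch knows."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    directives = [d.strip() for d in lowered.get("x-robots-tag", "").split(",")]
    if "noai" in directives:
        return True
    # Assumed header name and value for the TDM Reservation Protocol.
    return lowered.get("tdm-reservation", "").strip() == "1"


def may_ingest(robots_txt: str, url: str, headers: dict) -> bool:
    """A page is ingestible only if no consulted signal reserves rights."""
    return robots_allows(robots_txt, url) and not headers_reserve_rights(headers)


robots = "User-agent: aurelius-bot\nDisallow: /private/\n"
print(may_ingest(robots, "https://example.org/articles/1", {}))   # True
print(may_ingest(robots, "https://example.org/private/1", {}))    # False
```

Recording the outcome of each check alongside the fetch timestamp is what lets the opt-out register count in Section 3 be tied to a cut-off date.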

This last point is the load-bearing one. A claim like "we excluded LibGen" is forensically meaningful only if the exclusion was enforced at the level where the corpus was assembled, not asserted in a model card afterwards. The mechanism that makes the claim falsifiable is a forbidden-flow attestation (D2) at compile time: a machine-checkable certificate that the URL prefix libgen.* (or whatever identifier) does not appear as the source of any row that reached the training table, evaluated over every execution path of the ingestion pipeline.
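
The idea of checking every execution path can be illustrated with a plain reachability test over a pipeline graph. This is a toy sketch under stated assumptions: the node names, the tag dictionary, and the `forbidden_flows` helper are all hypothetical, and the check works at dataset-node granularity rather than per row. It is not veric's D2 implementation.

```python
from collections import deque


def reachable_from(graph, start):
    """All nodes reachable from `start` by following edges (breadth-first)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def forbidden_flows(graph, source_tags, is_forbidden, sink):
    """Sources matching the predicate that can still reach the sink."""
    return sorted(
        src for src, tags in source_tags.items()
        if is_forbidden(tags) and sink in reachable_from(graph, src)
    )


# Hypothetical ingestion pipeline: node -> downstream nodes.
PIPELINE = {
    "commoncrawl_raw": ["optout_filter"],
    "optout_filter": ["dedup"],
    "gutenberg": ["dedup"],
    "libgen_mirror": ["quarantine"],
    "dedup": ["training_table"],
}
SOURCE_TAGS = {
    "commoncrawl_raw": {"license": "mixed", "tdm_optout_signal": False},
    "gutenberg": {"license": "public_domain", "tdm_optout_signal": False},
    "libgen_mirror": {"license": "copyright_restricted", "tdm_optout_signal": False},
}


def restricted(tags):
    """Predicate: license=copyright_restricted OR tdm_optout_signal=true."""
    return tags["license"] == "copyright_restricted" or tags["tdm_optout_signal"]


print(forbidden_flows(PIPELINE, SOURCE_TAGS, restricted, "training_table"))  # []
```

An empty result corresponds to "no row matching the predicate reaches the training table" for this graph; adding an edge from `quarantine` to `dedup` would make `libgen_mirror` appear in the output.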

What veric will emit here: the forbidden-flow attestation (D2) (Roadmap, ships W4–5 per GTM §2) is the canonical artefact for Section 3's exclusion claims. Composition: the substrate primitives for tag-flow propagation through transforms (P2) and forbidden-flow refutation (P3), over T2 license and T4 tdm_optout_signal. The modelled D2 output for Aurelius-70B is a per-exclusion certificate of the form (license=copyright_restricted ∨ tdm_optout_signal=true, sink=training_table, ⊥) — readable as "no row matching the predicate reaches the training table." Until D2 lands, the v0.1 substrate for Section 3 is the licence-tag schema seeded by veric init plus the D8 PR-time CI diff (shipped via garrick0/veric-action@v1) flagging any pipeline change that would alter what reaches the training table. D2 is the artefact that will convert a Section 3 narrative claim into a regulator-receivable proof.

Section 4 — Copyright opt-out respect (Art 4(3) DSM Directive)

Section 4 overlaps with Section 3 but is nominally distinct: where Section 3 covers the behaviour of the crawler, Section 4 covers the policy posture of the provider towards opt-outs raised after training. The AI Office notice expects the provider to describe:

  • The opt-out reception channel (the contact, the form, the SLA)
  • The decision procedure when an opt-out is raised against material already incorporated in the training corpus
  • Whether the provider commits to retraining triggers, fine-tuning unlearning, or other remedial steps when opt-outs accumulate above a defined threshold
  • Cross-reference to the Article 53(1)(c) copyright-policy document required separately

Aurelius-70B's Section 4 references the published copyright policy at aurelius.eu/copyright-policy-v2 (a separate Article 53(1)(c) deliverable) and states: opt-outs received at [email protected] are logged within 24 hours; the source domain is added to the next crawl's opt-out register; the affected URLs join the build-time exclusion list enforced by the same compile-time mechanism described in Section 3; if cumulative excluded volume exceeds 0.5% of corpus tokens between major training runs, an interim retraining trigger is evaluated by the model-risk committee per Article 55. The 0.5% threshold is 🔬 modelled, derived from the volume below which an unlearning fine-tune is empirically more cost-effective than a full retrain (documented in the model's internal retraining-trigger-policy-v1.md).
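
To make the 0.5% trigger concrete: against a 2.1T-token corpus it corresponds to 10.5 billion excluded tokens. A small sketch of the arithmetic follows; the function names are hypothetical and the comparison uses integer arithmetic to avoid floating-point edge cases at the boundary.

```python
CORPUS_TOKENS = 2_100_000_000_000  # 2.1T, the Section 2 headline total
THRESHOLD_PER_MILLE = 5            # 0.5% expressed in thousandths


def threshold_tokens(corpus_tokens: int = CORPUS_TOKENS) -> int:
    """Excluded-token volume at which the 0.5% threshold is reached."""
    return corpus_tokens * THRESHOLD_PER_MILLE // 1000


def retraining_review_due(excluded_tokens: int,
                          corpus_tokens: int = CORPUS_TOKENS) -> bool:
    """True once cumulative excluded volume exceeds 0.5% of corpus tokens."""
    return excluded_tokens * 1000 > corpus_tokens * THRESHOLD_PER_MILLE


print(threshold_tokens())                     # 10500000000
print(retraining_review_due(4_000_000_000))   # False
print(retraining_review_due(11_000_000_000))  # True
```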

The European Data Protection Board (EDPB) Opinion 28/2024, issued 18 December 2024 ✅, makes a parallel point for personal data: unlawfully processed data in the development phase contaminates downstream deployment unless the model is genuinely anonymised[^13]. The Article 4(3) DSM opt-out and the GDPR Article 17 erasure right share an operational requirement: the provider must trace a given source from its ingestion through every derivative artefact and prove that downstream artefacts no longer carry its information.

What veric emits here: Section 4's opt-out handling description is policy text that veric does not generate. The evidence backing the policy commitments is on the roadmap: the per-opt-out forbidden-flow attestation (D2) for the specific URL set (Roadmap, ships W4–5 per GTM §2) and, after the opt-out is propagated through the next build, the erasure-completeness certificate (D7) that proves the previously-ingested rows do not survive in any retained derivative dataset (Roadmap, ships W12). D7 is the deliverable that will distinguish graph-attested erasure from workflow-attested erasure and is the modelled substrate for satisfying the EDPB Opinion 28/2024 contamination concern in a falsifiable form. Tag types in play: T2 license, T4 tdm_optout_signal, and (when the opt-out is also a personal-data erasure request) T3 consent_status. The v0.1 contribution today is the licence- and opt-out-tagged build, with D2 and D7 closing the evidence loop on the W4–5 / W12 ship dates.

Section 5 — Licence-tag handling

Section 5 asks the provider to describe how source-licence information is captured, propagated, and respected through the training pipeline. The fields the AI Office notice flags:

  • The licence-vocabulary used (SPDX identifiers, RAIL variants, custom)
  • The granularity at which licences are tracked (per dataset / per record / per URL)
  • How licences for derivative datasets are computed (the union, the most-restrictive, manual override)
  • The handling of "unknown licence" sources
  • Whether licence terms are enforced at training time (e.g. exclusion of non-commercial-only sources from a commercial-deployment training run)

For Aurelius-70B: SPDX identifiers as the canonical vocabulary; RAIL variants captured as a parallel attribute where applicable; granularity is per-URL for web sources, per-dataset for licensed feeds and open corpora; derivative licences computed as the strictest applicable basis through every join (the contributor-most-restrictive rule); unknown-licence sources are tagged license=unknown and excluded by the compile-time policy from the training table. One explicit derogation: the synthetic-data block, generated under the aurelius-internal licence by the provider's own checkpoint, is permitted to enter training subject to the upstream-seed-prompt licence chain in Section 2 entry (e).
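
The contributor-most-restrictive rule can be sketched as a maximum over a restrictiveness ordering. The identifiers below are real SPDX identifiers, but the numeric ordering is an assumption made for illustration and is not a legal judgement; the helper names are hypothetical.

```python
# Assumed ordering: higher number = more restrictive. Illustrative only.
RESTRICTIVENESS = {
    "CC0-1.0": 0,
    "MIT": 1,
    "Apache-2.0": 1,
    "CC-BY-4.0": 2,
    "CC-BY-SA-4.0": 3,
    "CC-BY-NC-4.0": 4,
    "unknown": 5,
}


def normalise(spdx_id: str) -> str:
    """Map any identifier outside the known vocabulary to 'unknown'."""
    return spdx_id if spdx_id in RESTRICTIVENESS else "unknown"


def derived_license(input_licenses) -> str:
    """Licence of a derivative row: the most restrictive contributing licence."""
    normalised = [normalise(lic) for lic in input_licenses]
    if not normalised:
        raise ValueError("a derived row needs at least one contributing licence")
    # Ties on rank are broken alphabetically so the result is deterministic.
    return max(normalised, key=lambda lic: (RESTRICTIVENESS[lic], lic))


def admitted_to_training(input_licenses) -> bool:
    """Rows whose derived licence is unknown are excluded from training."""
    return derived_license(input_licenses) != "unknown"


print(derived_license(["MIT", "CC-BY-SA-4.0"]))        # CC-BY-SA-4.0
print(admitted_to_training(["MIT", "Proprietary-X"]))  # False
```

Applying `derived_license` at every join is what turns the rule from a statement in the summary into a property of the build.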

The granularity question (per-record vs per-dataset) is load-bearing. The regulatory deep-dive §3.2 GAP-4 notes that today's licence-tagging is overwhelmingly dataset-level, while reality is that one column may rely on consent and another on contract; one URL in a CommonCrawl snapshot may have an opt-out signal and another may not. A substrate that carries licences at row-or-column granularity through every transform makes the contributor-most-restrictive rule mechanically enforceable rather than aspirational.

What veric emits here: Section 5 is almost entirely composed from veric's licence-tag handling. The relevant tag is T2 license (with values from the SPDX vocabulary plus the RAIL extensions) — seeded today by veric init and threaded through the v0.1 D1 provenance certificate. The substrate primitives in play are P1 (schema-level semantic tagging) and P2 (tag-flow propagation through transforms — the contributor-most-restrictive rule is computed by the substrate, not asserted). The output that will back Section 5 in structured form is the datasheet / Croissant manifest (D4) with the licence-vocabulary, granularity, and derivative-rule fields populated mechanically from the build (Roadmap, ships W3 per GTM §2).

Section 6 — Data-curation methodology

Annex IV §2(d) of the AI Act asks for "data cleaning methodologies" as part of the technical documentation pack[^14]; Section 6 of the AI Office's GPAI summary is the public-facing analogue. The AI Office expects:

  • Deduplication strategy (exact-match, near-duplicate / MinHash, semantic)
  • Quality filtering (perplexity thresholds, classifier-based filtering, language identification, toxicity filtering)
  • Personal-data redaction approach (Section 7 covers the obligation; Section 6 covers the technique)
  • Annotation procedures and annotator population characteristics where annotation was used (this overlaps with AI Act Article 10(2)(c) for high-risk systems but is also expected for GPAI)
  • Aggregation, normalisation, and any other transformations that materially shape what enters training

For Aurelius-70B: deduplication via MinHash-LSH (n=128 hash functions, Jaccard threshold 0.85) yielding a 17.4% reduction from raw to deduplicated corpus ⚠ (plausibly representative for a CommonCrawl-heavy corpus; flagged inferred because Aurelius-70B is fictional); quality filtering via a Cl-XX-style classifier trained on Wikipedia-vs-CommonCrawl as the high-quality reference; language identification via fastText restricted to 27 languages; toxicity filtering with a Perspective-API-style classifier at threshold 0.85; annotation used only for the 12B-token human-written seed prompts feeding the synthetic block, with annotator demographics in the separate Datasheet (D4).
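
For readers unfamiliar with MinHash, the sketch below shows how a 128-function signature yields a Jaccard estimate that can be compared against the 0.85 threshold. It is a pure-Python illustration under stated assumptions: word 5-gram shingles, SHA-1 as the base hash, and random linear permutations are all choices made here, and the LSH banding step that makes the method scale is omitted.

```python
import hashlib
import random

NUM_PERM = 128      # number of hash functions, as in the Section 6 entry
THRESHOLD = 0.85    # Jaccard threshold, as in the Section 6 entry
_PRIME = (1 << 61) - 1

# Fixed seed so signatures are reproducible across runs.
_rng = random.Random(0)
_COEFFS = [(_rng.randrange(1, _PRIME), _rng.randrange(0, _PRIME))
           for _ in range(NUM_PERM)]


def shingles(text: str, k: int = 5) -> set:
    """Set of overlapping word k-grams (assumed shingling scheme)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _base_hash(shingle: str) -> int:
    """Stable 64-bit integer hash of one shingle."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str) -> list:
    """One minimum per permutation over the document's shingle hashes."""
    hashes = [_base_hash(s) for s in shingles(text)]
    return [min((a * x + b) % _PRIME for x in hashes) for a, b in _COEFFS]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature slots on which the two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a: str, text_b: str) -> bool:
    sig_a, sig_b = minhash_signature(text_a), minhash_signature(text_b)
    return estimated_jaccard(sig_a, sig_b) >= THRESHOLD
```

Publishing the shingle size alongside the hash count and threshold matters, because the same threshold removes very different amounts of text under different shingling schemes.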

The discipline point: curation methodology is where memorisation risk gets bounded or amplified. The Carlini et al. memorisation work (USENIX Security 2021; quantification 2022) is the empirical anchor[^15]: duplicated training text is disproportionately memorised, and verbatim regurgitation is the testable surface on which copyright disputes (NYT v. OpenAI; the surviving direct-infringement claims in Andersen v. Stability AI) increasingly turn[^16]. A Section 6 entry that names a specific deduplication method with a specific threshold is a falsifiable claim; "we deduplicated" is not.

What veric emits here: Section 6 has both a configuration-text portion (the choice of deduplication method, the choice of classifier) and a derivable-summary portion (the row-count reductions through each filter). The derivable portion will land naturally as a datasheet / Croissant manifest (D4) field: Croissant 1.1's preprocessing-step manifest covers the row-count diff per transform[^17] (D4 ships W3 per GTM §2). The relevant tag types are T6 pii / special_category and T10 source_system. The v0.1 contribution today is the per-build D1 binding the chosen thresholds to a git SHA; once D4 lands, the row-count diff per transform is mechanically published. veric does not opine on whether the chosen thresholds are correct; it ensures the published thresholds match the build.

Section 7 — Personal-data processing summary

Where personal data is incorporated in the training corpus, GDPR Articles 5, 6, 9, 17, 30, and 35 all apply concurrently with the AI Act obligation. Section 7 of the AI Office summary asks the provider to describe:

  • Whether personal data was knowingly incorporated
  • The lawful basis relied on (consent / contract / legal obligation / vital interests / public task / legitimate interests; Article 6(1)(f) requires a documented Legitimate Interests Assessment per EDPB Opinion 28/2024[^18])
  • Whether any GDPR Article 9 special-category data was incorporated, and on what basis (the AI Act Article 10(5) carve-out applies to bias-correction in high-risk systems and is not a general training-data carve-out)
  • The technical and organisational measures applied (pseudonymisation, anonymisation, access controls, deletion-on-no-longer-needed)
  • The data-subject-rights workflow — specifically the GDPR Article 17 erasure path

Aurelius-70B's Section 7: personal data was incorporated in the public-web-crawl portion; the legitimate-interests basis under Article 6(1)(f) is documented in the LIA appended as Annex C of the Article 53(1)(a) technical documentation; no Article 9 special-category data was knowingly retained — pre-training filtering targeted face-image detectors, name-entity detectors, and a special-category-keyword filter, with filtered output set aside in a non-training-bound staging area; pseudonymisation applied to author-identifier fields where present; the data-subject-rights workflow channels Article 17 requests through [email protected].

EDPB Opinion 28/2024 §3 holds that unlawfully processed data in the development phase contaminates downstream deployment unless the model is genuinely anonymised, and the standard for "genuinely anonymised" is contested (the Hamburg DPA's August 2024 discussion paper on Llama anonymity is one read; the EDPB Opinion is the more binding one)[^19]. A Section 7 entry that asserts anonymisation without naming the specific technique and the residual re-identification risk assessment will not survive scrutiny.

What veric emits here: Section 7 is the section that most directly motivates the erasure-completeness certificate (D7) (per the product-surface doc §D7). The lawful-basis text is policy; the falsifiable substrate is the column-level T1 lawful_basis tag (with values consent / legitimate_interest / contract / legal_obligation / vital_interest / public_task / art_9_exception / none) and the T6 pii / special_category tag. The substrate primitive that converts the Article 17 erasure path from workflow-attested to graph-attested is P7 (erasure-completeness, i.e. reverse-direction reachability): given a row deletion, P7 proves no path remains from the removed row to any pinned downstream artefact. D7 is the deliverable that the DPO produces in response to a data-subject erasure request, replacing the "we ran the delete script and trust me" workflow with a graph proof. (P7 ships in veric v0.2 in week 9 of the GTM-doc shipping plan, with D7 in v0.3 at week 12. In the meantime, the Section 7 evidence at v0.1 is the lawful-basis tagging plus the forbidden-flow attestation (D2) for the special-category exclusion claim.)
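
The difference between workflow-attested and graph-attested erasure can be shown with a small lineage check: starting from the erased row, collect everything derived from it and intersect that set with the artefacts still retained. The sketch below is an assumption-laden toy; the edge list, artefact identifiers, and `surviving_derivatives` helper are hypothetical and do not represent veric's P7 primitive.

```python
def surviving_derivatives(lineage_edges, erased_row, pinned_artefacts):
    """Pinned artefacts that still descend from the erased row.

    lineage_edges: iterable of (parent, child) pairs recorded by the pipeline.
    An empty result means no retained artefact derives from the row.
    """
    children = {}
    for parent, child in lineage_edges:
        children.setdefault(parent, set()).add(child)

    # Depth-first walk over everything derived from the erased row.
    descendants, stack = set(), [erased_row]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in descendants:
                descendants.add(child)
                stack.append(child)
    return sorted(descendants & set(pinned_artefacts))


# Hypothetical lineage recorded for the build that ingested the row.
EDGES = [
    ("row:42", "shard:7"),
    ("shard:7", "training_table:v1"),
    ("training_table:v1", "weights:v1.0"),
]

# After a rebuild that dropped the row, only v2 artefacts are retained.
print(surviving_derivatives(EDGES, "row:42", {"training_table:v2", "weights:v1.1"}))  # []

# If the old table is still retained, the check names it.
print(surviving_derivatives(EDGES, "row:42", {"training_table:v1", "weights:v1.1"}))  # ['training_table:v1']
```

A delete script can report success while `training_table:v1` still sits in a backup; the graph check is what surfaces that case.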

Section 8 — Representative samples and corpus size

Section 8 asks the provider to publish a representative sample of the training corpus alongside summary statistics. The AI Office notice is explicit that the sample is intended to enable third-party scrutiny — it is not a confidentiality-preserving aggregate. The fields:

  • Total corpus size (tokens / images / hours / native unit)
  • Per-category composition (echoing Section 2 numerically)
  • A representative sample (the AI Office indicates "in the order of" several thousand items, structured as a downloadable archive linked from the summary)
  • Documentation of the sampling methodology (random / stratified / time-stratified)
  • Statistical-property summary per category where feasible (length distributions, language distribution, topic distribution)

Aurelius-70B's Section 8: total corpus 2.1T tokens; per-category breakdown matches Section 2 verbatim; representative sample of 8,000 documents stratified by category and source-domain published as a tar.gz at aurelius.eu/training-content-summary/sample-v1.tar.gz; sampling methodology — uniform random within each category, with per-document deduplication so no document appears twice; statistical-property summary — mean document length 2,847 tokens, median 1,210 tokens, language distribution 89.4% English, 4.1% French, 1.9% German, 1.4% Spanish, 3.2% other ⚠ (figures inferred from comparable open models; derived from the actual corpus at publish time for a real Aurelius-70B summary).
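
A reproducible version of the sampling methodology described above could be sketched as follows. The per-category quotas and document identifiers are hypothetical (the summary only states the 8,000-document total), and the fixed seed is an assumption introduced so that a third party can re-derive the same sample.

```python
import random


def stratified_sample(documents, per_category, seed=0):
    """Uniform random sample within each category, with no document repeated.

    documents: iterable of (doc_id, category) pairs; duplicates are collapsed.
    per_category: mapping of category -> number of documents to draw.
    """
    by_category = {}
    for doc_id, category in documents:
        by_category.setdefault(category, set()).add(doc_id)

    rng = random.Random(seed)
    sample = {}
    # Sorted iteration keeps the draw order, and so the result, deterministic.
    for category in sorted(by_category):
        pool = sorted(by_category[category])
        k = min(per_category.get(category, 0), len(pool))
        sample[category] = rng.sample(pool, k)
    return sample


# Toy corpus index with one duplicated entry.
DOCS = [(f"web-{i}", "web") for i in range(50)]
DOCS += [(f"code-{i}", "code") for i in range(20)]
DOCS += [("web-0", "web")]

picked = stratified_sample(DOCS, {"web": 5, "code": 3}, seed=42)
print({category: len(ids) for category, ids in picked.items()})  # {'code': 3, 'web': 5}
```

Publishing the seed and the quotas next to the archive lets a reviewer confirm that the sample was drawn by the stated method and not hand-picked.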

The act of publishing the representative sample is itself a copyright surface. The AI Office notice anticipates this — the sample is to be drawn from sources where redistribution is compatible with the source licence; for licensed publisher feeds the sample may need to be a structural extract (token-level statistics, redacted titles) rather than full text. Aurelius-70B's sample draws predominantly from public-domain books, openly-licensed code, and Wikipedia, with licensed-publisher and post-2024 web-crawl content represented through structural metadata only.

What veric emits here: Section 8's summary statistics are derived facts — they fall out of the build. The shipped artefact is the provenance certificate (D1) (for the per-category totals tied to a specific git SHA, available today via veric attest); the structured per-category statistical properties land with the datasheet / Croissant manifest (D4) (Roadmap, ships W3 per GTM §2). The representative-sample selection is a separate emission step that veric will scope via the T2 license tag — only sources where the licence allows redistribution feed the sample-emitter; the rest are summarised structurally. veric does not generate the sample's documents itself; it pins the selection logic to the build so that the published sample is reproducibly derived from the same corpus that produced the model.

Section 9 — Further information

The final section is the open-ended one. The AI Office notice expects providers to use Section 9 to disclose anything that bears on the "sufficient detail" standard but does not fit elsewhere — most commonly:

  • Cross-references to the Article 53(1)(c) copyright policy and the Article 53(1)(a) technical documentation
  • Cross-references to model cards and datasheets published independently
  • Material differences between training-corpus composition for the base model and any fine-tuned variants
  • Known limitations of the summary itself (e.g. "data sources before 2018-01-01 are best-effort reconstructed from incomplete provenance records")
  • Update history of the summary (revision dates and the changes made)

Aurelius-70B Section 9 cross-references: the copyright policy at aurelius.eu/copyright-policy-v2; the Annex XI technical documentation lodged with the AI Office (reference AIO-GPAI-2026-0017); the model card at huggingface.co/aurelius-ai/Aurelius-70B/blob/main/README.md; the Croissant 1.1 datasheets per training-data category at aurelius.eu/training-content-summary/datasheets/. It notes that Aurelius-70B-Chat (the chat-tuned variant) uses an additional 8B tokens of human preference data not represented in this summary and points to the Chat-specific Article 53(1)(d) summary. Version history: v1.0 published 2026-01-15 with the model release; v1.1 anticipated on completion of the Q2 2026 retraining trigger evaluation.

The discipline point: resist the temptation to use Section 9 as a marketing surface. The AI Office's standard is sufficient detail, not sufficient narrative. Material that does not bear on the falsifiability of Sections 1–8 does not belong in Section 9.

What veric emits here: Section 9 is the index of the rest of the GPAI documentation pack. The cross-references it carries are stable artefact identifiers — git SHAs, hash-chained certificates, AI Office filing numbers, dataset version pins. The most veric-specific Section 9 entry is the continuous provenance ledger (D9) reference (Roadmap, ships W12+ per GTM §2) — a modelled append-only store of every D1, D2, and D7 certificate issued for the model over time, indexed by model version. Once D9 lands, the DPO will publish the ledger URL in Section 9 so that a regulator under an Article 17 / OCR / FTC investigation can self-serve the historical provenance state. Until then, the v0.1 Section 9 entry can cross-reference the per-build D1 hashes individually.

Putting it together — what a compile-time-derivable summary changes

The pattern across all nine sections is consistent. Roughly half of every section is configuration text — the lawful-basis choice, the deduplication method, the publisher-contract structure — that the provider authors and that no compiler can derive. The other half is derived from the build: token counts, per-category breakdowns, exclusion claims, opt-out registers, statistical-property summaries, cross-references. A compile-time-derivable summary is one in which the second half of every section is mechanically reproduced from the same build that produced the model weights, signed against the dbt manifest hash and pinned to a git SHA.
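
The binding between the derived half of a summary and the build can be pictured as a hash over the derived fields, the manifest hash, and the git SHA. The sketch below illustrates that idea only: the record layout, the function names, and the placeholder manifest and SHA are assumptions, and a plain SHA-256 digest stands in for a real signature, which would also need a signing key.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Deterministic JSON encoding so equal content always hashes equally."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def bind_summary(derived_fields: dict, manifest_bytes: bytes, git_sha: str) -> dict:
    """Tie derived summary fields to one build's manifest hash and git SHA."""
    payload = {
        "derived_fields": derived_fields,
        "manifest_sha256": hashlib.sha256(manifest_bytes).hexdigest(),
        "git_sha": git_sha,
    }
    digest = hashlib.sha256(canonical_bytes(payload)).hexdigest()
    return {**payload, "summary_sha256": digest}


def verify_summary(record: dict, manifest_bytes: bytes) -> bool:
    """Recompute the binding from the manifest and compare with the record."""
    expected = bind_summary(record["derived_fields"], manifest_bytes, record["git_sha"])
    return expected == record


# Placeholder inputs for illustration only.
manifest = b'{"nodes": {"model.training_table": {}}}'
record = bind_summary({"total_tokens_billions": 2098}, manifest, "0123abc")

print(verify_summary(record, manifest))                    # True
tampered = {**record, "derived_fields": {"total_tokens_billions": 2500}}
print(verify_summary(tampered, manifest))                  # False
```

Editing a published number without rebuilding changes the recomputed digest, which is the property that makes the derived half checkable by someone other than the author.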

The wedge that this opens is the one the regulatory deep-dive §3.2 GAP-2 names: today's training-content summary is author-attested with no independent re-derivation channel. A compile-time summary lets the AI Office's scientific panel — or a rights-holder, or a civil-society auditor — re-execute the build manifest against the source-of-record snapshots and verify the published numbers. That is the difference between a summary that asserts a property and a summary that proves one.

The Italian Garante's annulled fine, the Bartz settlement, EDPB Opinion 28/2024, and the active NYT and Authors Guild dockets converge on the same point: substantive training-content adequacy will be tested in 2026–2027, and the provider whose summary is verifiable against the build will have a materially better defensive posture than the provider whose summary is asserted alongside it.


Take it further

  • Free artefact: Aurelius-70B's filled Article 53(1)(d) summary as PDF + JSON at samples.veric.dev/template-2025-07-24/. Use it as a structural baseline.
  • CLI today: pip install veric, then veric init to seed the T1–T10 tag schema, and veric attest <model_name> to emit a first cut of the provenance certificate (D1) against your current build (the D8 PR-time CI diff is shipped via the GitHub Action at garrick0/veric-action@v1).
  • CLI roadmap: D2 (forbidden-flow attestation, week 4–5), D4 (Croissant 1.1 datasheet, week 3), D6 (Art. 53(1)(d) emitter pinned to the 24 Jul 2025 template, week 5–8), D7 (erasure-completeness certificate, week 12), D9 (continuous provenance ledger, week 12+). See the GTM shipping plan at docs/g2m/vertical-opportunities/ai-provenance-gtm-motion-2026-05-02.md §2 for the week-by-week roadmap.
  • Newsletter: AI Act Weekly curates news from the regulators, courts, and practitioners moving the GPAI compliance frontier. Mondays, ten-minute read.
  • Read next: [BACK-REF: R-1] for the Article 53 reference subpage; [FORWARD-REF: R-3] for the Annex IV walkthrough (the high-risk-system documentation pack, distinct from the GPAI summary).

Sources

Alphabetised by issuing body. Footnote anchors retained in the body point at the same [^N] IDs; the listing order below is for reader navigation, not anchor resolution.

California Legislature

Carlini et al. (arXiv preprints)

European AI Office (DG-CNECT)

European Data Protection Board

European Union (Official Journal)

Hamburg Commissioner for Data Protection (HmbBfDI)

Hoffmann et al. (arXiv preprint)

Italian Garante per la protezione dei dati personali

MLCommons

US courts (Bartz / NYT / Andersen dockets)

Footnotes

  1. European AI Office (DG-CNECT). Commission presents template for general-purpose AI model providers to summarise data used to train their models. 24 July 2025. https://digital-strategy.ec.europa.eu/en/news/commission-presents-template-general-purpose-ai-model-providers-summarise-data-used-train-their

  2. European Union. Regulation (EU) 2024/1689 (the AI Act), Articles 53(1)(d) and 113. Official Journal of the European Union, 12 July 2024. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  3. Italian Garante per la protezione dei dati personali. Sanction order against OpenAI (provvedimento n. 9870832), 2 November 2024. https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/10085455. The Court of Rome reportedly annulled the order on 18 March 2026 on procedural grounds, leaving the substantive lawful-basis question undecided. ⚠ Cite with care given the annulment. Secondary coverage: https://www.dataprotectionreport.com/2025/01/the-edpb-opinion-on-training-ai-models-using-personal-data-and-recent-garante-fine-lawful-deployment-of-llms/.

  4. European AI Office (DG-CNECT). AI Act FAQ — Scope and supervision of GPAI models. https://digital-strategy.ec.europa.eu/en/faqs/

  5. European Union. Regulation (EU) 2024/1689 (the AI Act), Article 53(2). Official Journal of the European Union, 12 July 2024. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  6. European Union. Regulation (EU) 2024/1689 (the AI Act), Article 99(4). Official Journal of the European Union, 12 July 2024. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  7. European Union. Regulation (EU) 2024/1689 (the AI Act), Article 51 (presumption of systemic risk for GPAI models trained with cumulative compute ≥ 10²⁵ FLOPs). https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  8. 2.4 × 10²⁵ FLOPs ⚠ is an illustrative figure, not a measurement. The 2.1T-token corpus reported in Section 2 of this summary gives a token-to-parameter ratio of 30, above the roughly 20 tokens per parameter that Hoffmann et al. report as compute-optimal (Training Compute-Optimal Large Language Models, NeurIPS 2022, https://arxiv.org/abs/2203.15556). One pass over the corpus gives C ≈ 6 × N × D = 6 × (7 × 10¹⁰) × (2.1 × 10¹²) ≈ 8.8 × 10²³ FLOPs for the dense-decoder forward+backward pass; the Aurelius-70B figure therefore assumes roughly 27× that baseline (2.4 × 10²⁵ ÷ 8.8 × 10²³), attributed to over-training plus inference-time RL-from-AI-feedback cycles. A real provider would report the figure from the actual training cluster's wall-clock × FLOP/sec measurement.

  9. California Legislature. AB 2013 — Generative artificial intelligence: training data transparency, §22757(b)(8). Signed 28 September 2024; effective 1 January 2026. https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240AB2013

  10. Bartz et al. v. Anthropic PBC, No. 3:24-cv-05417 (N.D. Cal., Alsup J.). Settlement preliminarily approved 25 September 2025; payments through September 2027. https://www.anthropiccopyrightsettlement.com/faq — supplementary counsel narrative at https://www.susmangodfrey.com/wins/susman-godfrey-secures-1-5-billion-settlement-in-landmark-ai-piracy-case/.

  11. European Union. Directive (EU) 2019/790 on copyright in the Digital Single Market, Article 4(3). Official Journal of the European Union. https://eur-lex.europa.eu/eli/dir/2019/790/oj

  12. European AI Office and chairs of the GPAI Code of Practice. General-Purpose AI Code of Practice — Copyright Chapter, Measure 4 (rights-holder complaint contact and response procedure). 10 July 2025. https://code-of-practice.ai/

  13. European Data Protection Board. Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models, §3. 18 December 2024. https://www.edpb.europa.eu/system/files/2024-12/edpb_opinion_202428_ai-models_en.pdf

  14. European Union. Regulation (EU) 2024/1689 (the AI Act), Annex IV §2(d). https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  15. Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., et al. Extracting Training Data from Large Language Models. USENIX Security 2021. https://arxiv.org/abs/2012.07805 — and Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramèr, F., Zhang, C. Quantifying Memorization Across Neural Language Models. 2022. https://arxiv.org/abs/2202.07646

  16. The New York Times Co. v. Microsoft Corp. and OpenAI, S.D.N.Y. — docket history at https://www.courtlistener.com/docket/68117049/the-new-york-times-company-v-microsoft-corporation/. Andersen v. Stability AI Ltd., N.D. Cal. — surviving direct-infringement and induced-infringement claims as of August 2024; class certification pending. https://www.courtlistener.com/docket/66732129/andersen-v-stability-ai-ltd/

  17. MLCommons. Croissant 1.1 — RAI Extension and Preprocessing Steps Manifest. February 2026. https://mlcommons.org/2026/02/croissant-1-1-standard/

  18. European Data Protection Board. Opinion 28/2024, §2 (three-step legitimate-interests test for AI development and deployment). 18 December 2024. https://www.edpb.europa.eu/system/files/2024-12/edpb_opinion_202428_ai-models_en.pdf

  19. Hamburg Commissioner for Data Protection and Freedom of Information (HmbBfDI). Diskussionspapier: Große Sprachmodelle und personenbezogene Daten. July 2024. https://datenschutz-hamburg.de/news/diskussionspapier-grosse-sprachmodelle-und-personenbezogene-daten — partial endorsement in EDPB Opinion 28/2024 (case-by-case).
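
The compute arithmetic in footnote 8 can be checked in a few lines. This is a sketch that takes the example's own figures at face value (70B parameters, 2.1T tokens, 2.4 × 10²⁵ FLOPs reported) and uses the common C ≈ 6ND approximation; the constant names are illustrative, and a real provider would report measured compute instead.

```python
# Figures taken from footnote 8 and Section 2 of the Aurelius-70B example.
PARAMETERS = 70 * 10**9            # N: 70 billion parameters
TOKENS = 21 * 10**11               # D: 2.1 trillion training tokens
REPORTED_FLOPS = 24 * 10**24       # 2.4e25, the illustrative figure in the summary
SYSTEMIC_RISK_THRESHOLD = 10**25   # Art. 51 presumption threshold


def single_pass_flops(n_params: int, n_tokens: int) -> int:
    """Approximate forward+backward training compute as 6 * N * D."""
    return 6 * n_params * n_tokens


baseline = single_pass_flops(PARAMETERS, TOKENS)   # 8.82e23
tokens_per_param = TOKENS // PARAMETERS            # 30
multiplier = REPORTED_FLOPS / baseline             # about 27.2

if __name__ == "__main__":
    print(f"6ND baseline:        {baseline:.2e} FLOPs")
    print(f"tokens per param:    {tokens_per_param}")
    print(f"reported / baseline: {multiplier:.1f}x")
    print(f"baseline over threshold: {baseline >= SYSTEMIC_RISK_THRESHOLD}")
    print(f"reported over threshold: {REPORTED_FLOPS >= SYSTEMIC_RISK_THRESHOLD}")
```

On these numbers the 6ND baseline alone sits below the 10²⁵ threshold, so whether the example model falls under the systemic-risk presumption depends entirely on the assumed multiplier.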