The setup
The document was a structured summary of EU AI Act compliance obligations: 37 text chunks covering prohibited AI practices, high-risk system requirements, provider and deployer duties, transparency rules, and general-purpose AI model obligations. We ran it through three models using Sovaign's ingestion pipeline, keeping the chunking identical across all three runs.
The models: phi3:mini (small, fast, runs locally at no cost), Mistral Nemo 12B (a larger, multilingual research model; also local, also free to run on your own hardware), and Claude Sonnet 4.5 (a frontier API model: not free, but more capable and faster per chunk).
The task: extract the compliance concepts from each chunk. No hints, no templates. Read the text, tell us what matters.
What each model produced
| Metric | phi3:mini | Mistral Nemo 12B | Claude Sonnet 4.5 |
|---|---|---|---|
| Chunks with entities | 46% (17/37) | 95% (35/37) | 73% (27/37) |
| Total entities | 30 | 145 | 156 |
| Avg per successful chunk | 1.8 | 4.1 | 5.8 |
| Best single chunk | 3 | 20 | 32 |
| Time per chunk | ~12s | ~29s | ~6s |
| Cost | €0 | €0 | $0.28 |
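The percentage and average rows follow directly from the raw counts in the table. A minimal sketch of that arithmetic, in case you want to recompute it for your own runs; the `RunResult` class and `summarize` helper are illustrative names, not part of Sovaign's pipeline:

```python
from dataclasses import dataclass

TOTAL_CHUNKS = 37  # chunks in the EU AI Act summary used for all three runs


@dataclass(frozen=True)
class RunResult:
    """Raw counts for one model run (hypothetical container, for illustration)."""
    model: str
    chunks_with_entities: int
    total_entities: int

    @property
    def coverage(self) -> float:
        # Share of chunks that produced at least one entity.
        return self.chunks_with_entities / TOTAL_CHUNKS

    @property
    def avg_per_successful_chunk(self) -> float:
        # Average over chunks that produced entities, not over all chunks.
        return self.total_entities / self.chunks_with_entities


# Counts copied from the table above.
RUNS = [
    RunResult("phi3:mini", 17, 30),
    RunResult("Mistral Nemo 12B", 35, 145),
    RunResult("Claude Sonnet 4.5", 27, 156),
]


def summarize(runs):
    """Map model name -> (coverage in whole percent, average entities per successful chunk)."""
    return {
        r.model: (round(r.coverage * 100), round(r.avg_per_successful_chunk, 1))
        for r in runs
    }


if __name__ == "__main__":
    for model, (cov, avg) in summarize(RUNS).items():
        print(f"{model}: {cov}% coverage, {avg} entities per successful chunk")
```

Note that the average is taken over successful chunks only, which is why Sonnet's lower coverage does not drag its average down.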
Phi3:mini is fast and free but extracts very generic labels: "Artificial Intelligence System," "Manufacturer Entity." Useful for a rough pass, not useful for building a knowledge graph you can actually query.
Mistral Nemo takes longer but extracts with real specificity. From a single chunk on deployer obligations it pulled 20 entities that map directly to compliance concepts, among them Post-Remote Biometric Identification, Annex III Systems, Worker Representatives, Human Oversight, Suspension, and Monitoring. Its output is sometimes a bit quirky, and it also misses some important concepts, but it produced entities for 95% of chunks.
Claude Sonnet extracted the most granular entities — 32 from a single chunk on prohibited practices, including every protected characteristic named in the regulation (Race, Political opinions, Union membership, Religion/beliefs, Sex life, Sexual orientation). When it extracted, it went deep. But it only extracted from 73% of chunks — lower than Mistral.
Why Sonnet's lower success rate is not a failure
The 10 chunks Sonnet skipped were things like `#### Importers`, `### 3) High-risk AI systems — obligations by role`, and `Deployers must, in particular: *(Art. 26)*`. Section headers, structural fragments, article pointers with no content behind them.
Mistral extracted from all of these. `#### Importers` became the entity Importers. `Deployers must, in particular:` yielded Deployers and Art. 26.
Sonnet skipped them on the grounds that there is nothing to extract — a label is not a concept. This turns out to matter when you ask the resulting graph questions. A graph populated with structural labels will appear complete while actually being shallow. Sonnet's approach produces fewer nodes but each node carries real information.
The two philosophies produce different results:
| | Sonnet | Mistral Nemo |
|---|---|---|
| Headers | Skips (no content) | Extracts labels as entities |
| Best chunk | 32 entities | 20 entities |
| Weakest chunks | 0 (refuses to force it) | 2–3 (extracts labels) |
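If you run a local model that extracts from everything, one option is to filter out structural fragments before extraction. The following is a rough heuristic sketch, not what Sonnet does internally and not part of Sovaign's pipeline; the function name, the regular expressions, and the `min_words` threshold are all assumptions chosen to match the three example chunks above:

```python
import re

# Markdown heading line such as "#### Importers".
HEADING = re.compile(r"^\s*#{1,6}\s+\S")
# Article pointer such as "*(Art. 26)*" or "(Art. 26)".
ARTICLE_POINTER = re.compile(r"\*?\(Art\.\s*\d+[^)]*\)\*?")


def is_structural_fragment(chunk: str, min_words: int = 8) -> bool:
    """Return True if a chunk looks like a header or list stem with no content.

    min_words is an arbitrary threshold (an assumption); tune it on your own data.
    """
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    # Keep only lines that are not markdown headings.
    body = [ln for ln in lines if not HEADING.match(ln)]
    if not body:
        return True  # empty chunk, or headings only
    # Drop article pointers, then count the remaining words.
    text = ARTICLE_POINTER.sub("", " ".join(body))
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    return len(words) < min_words


if __name__ == "__main__":
    for sample in [
        "#### Importers",
        "Deployers must, in particular: *(Art. 26)*",
        "Deployers must assign human oversight to natural persons who "
        "have the necessary competence, training and authority.",
    ]:
        print(is_structural_fragment(sample), "|", sample[:50])
```

A word-count heuristic like this will misclassify short but meaningful chunks, so treat it as a starting point rather than a replacement for a model that can judge content.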
The practical takeaway — and why this experiment happened
We deliberately keep auto-reload disabled on our API token budget. One day a feature test set consumed the remaining balance far faster than expected, with costs about 10x higher than anticipated, and the frontier lab API became unavailable to us. We had always wanted local processing as an option, so the question became: which local model is good enough to continue development without the API?
That turned out to be Mistral Nemo, at least at the time; we wanted to test an EU model since that's where we are. Not perfect, but specific enough to build a meaningful knowledge graph, at no per-query cost, running entirely on our own hardware. The answer is now documented. When the API is available, it is faster and extracts deeper; when it isn't, local processing continues.
The Sovaign gateway handles this routing. The graph stores what the models produce. The choice of model for a given run is a trade-off between cost, depth, speed, and whether you need data to stay on-premises — and that trade-off should be a conscious decision, not something dictated by which API happens to have remaining credit.
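To make the trade-off concrete, here is a minimal sketch of what an explicit routing decision could look like. This is not the Sovaign gateway's actual API: `RunPolicy`, `choose_model`, and the two model identifiers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your setup uses.
LOCAL_MODEL = "mistral-nemo-12b-local"
FRONTIER_MODEL = "claude-sonnet-4.5-api"


@dataclass(frozen=True)
class RunPolicy:
    """Inputs to the routing decision (illustrative, not a real config schema)."""
    require_on_premises: bool = False   # data must not leave our hardware
    api_budget_remaining: float = 0.0   # remaining prepaid API credit
    estimated_run_cost: float = 0.0     # expected API cost of this run


def choose_model(policy: RunPolicy) -> str:
    """Pick a model deliberately instead of failing when credit runs out."""
    if policy.require_on_premises:
        return LOCAL_MODEL
    # No credit, or not enough for this run: continue locally.
    if (policy.api_budget_remaining <= 0
            or policy.api_budget_remaining < policy.estimated_run_cost):
        return LOCAL_MODEL
    return FRONTIER_MODEL


if __name__ == "__main__":
    print(choose_model(RunPolicy(api_budget_remaining=10.0, estimated_run_cost=0.28)))
    print(choose_model(RunPolicy(api_budget_remaining=0.10, estimated_run_cost=0.28)))
```

The on-premises check comes first on purpose: a data-residency requirement should override budget, not the other way around.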