Three models, investigated after a budget ran out

In January 2026 we ran the same EU AI Act compliance document through three AI models and measured what each extracted. The numbers are interesting. But the more useful finding is what the differences say about how to think about model choice — and why having a local fallback matters more than it might seem.

The setup

The document was a structured summary of EU AI Act compliance obligations: 37 text chunks covering prohibited AI practices, high-risk system requirements, provider and deployer duties, transparency rules, and general-purpose AI model obligations. We ran it through three models using Sovaign's ingestion pipeline, keeping the chunking identical across all three runs.

The models: phi3:mini (small and fast, runs locally at no cost), Mistral Nemo 12B (a larger multilingual model, also local and free to run on your own hardware), and Claude Sonnet 4.5 (a frontier API model: not free, but more capable and faster).

The task: extract the compliance concepts from each chunk. No hints, no templates. Read the text, tell us what matters.

What each model produced

| Metric | phi3:mini | Mistral Nemo 12B | Claude Sonnet 4.5 |
| --- | --- | --- | --- |
| Chunks with entities | 46% (17/37) | 95% (35/37) | 73% (27/37) |
| Total entities | 30 | 145 | 156 |
| Avg per successful chunk | 1.8 | 4.1 | 5.8 |
| Best single chunk | 3 | 20 | 32 |
| Time per chunk | ~12s | ~29s | ~6s |
| Cost | €0 | €0 | $0.28 |
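
The coverage and per-chunk averages in the table follow directly from the raw counts. As a minimal sketch of that arithmetic (the `RunSummary` class and `summarize` helper are hypothetical names for illustration, not part of Sovaign's pipeline):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    chunks_total: int
    chunks_with_entities: int
    total_entities: int

    @property
    def coverage(self) -> float:
        """Share of chunks that yielded at least one entity."""
        if self.chunks_total == 0:
            return 0.0
        return self.chunks_with_entities / self.chunks_total

    @property
    def avg_per_successful_chunk(self) -> float:
        """Mean entity count over the chunks that yielded anything at all."""
        if self.chunks_with_entities == 0:
            return 0.0
        return self.total_entities / self.chunks_with_entities


def summarize(entity_counts: list[int]) -> RunSummary:
    """Build a summary from per-chunk entity counts (assumed input shape)."""
    successful = [n for n in entity_counts if n > 0]
    return RunSummary(len(entity_counts), len(successful), sum(successful))


# Aggregate figures as reported in the table above.
reported = {
    "phi3:mini": RunSummary(37, 17, 30),
    "Mistral Nemo 12B": RunSummary(37, 35, 145),
    "Claude Sonnet 4.5": RunSummary(37, 27, 156),
}

for model, run in reported.items():
    print(f"{model}: {run.coverage:.0%} coverage, "
          f"{run.avg_per_successful_chunk:.1f} entities per successful chunk")
```

Note that the average is taken over successful chunks only, which is why Sonnet can lead on depth while trailing Mistral Nemo on coverage.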

phi3:mini is free and quicker than Mistral Nemo, but it extracts very generic labels: "Artificial Intelligence System," "Manufacturer Entity." Useful for a rough pass, not useful for building a knowledge graph you can actually query.

Mistral Nemo takes longer but extracts with real specificity. From a single chunk on deployer obligations it pulled 20 entities that map directly to compliance concepts, among them Post-Remote Biometric Identification, Annex III Systems, Worker Representatives, Human Oversight, Suspension, and Monitoring. Its output is sometimes a bit quirky, and it misses some important concepts, but it extracted entities from 95% of chunks.

Claude Sonnet extracted the most granular entities — 32 from a single chunk on prohibited practices, including every protected characteristic named in the regulation (Race, Political opinions, Union membership, Religion/beliefs, Sex life, Sexual orientation). When it extracted, it went deep. But it only extracted from 73% of chunks — lower than Mistral.

Why Sonnet's lower success rate is not a failure

The 10 chunks Sonnet skipped were things like "#### Importers", "### 3) High-risk AI systems — obligations by role", and "Deployers must, in particular: *(Art. 26)*": section headers, structural fragments, and article pointers with no content behind them.

Mistral extracted from all of these. "#### Importers" became the entity Importers. "Deployers must, in particular:" yielded Deployers and Art. 26.

Sonnet skipped them on the grounds that there is nothing to extract — a label is not a concept. This turns out to matter when you ask the resulting graph questions. A graph populated with structural labels will appear complete while actually being shallow. Sonnet's approach produces fewer nodes but each node carries real information.
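
If you run a model that does extract labels from headers, one option is to filter structural chunks out before extraction. The sketch below is an assumed heuristic of our own for illustration, not a description of how Sonnet decides what to skip: it treats a chunk as structural when every non-empty line is a markdown heading or a lead-in ending in a colon once article pointers such as `*(Art. 26)*` are removed.

```python
import re

# Markdown-style heading: one to six '#' characters followed by text.
HEADING = re.compile(r"^\s*#{1,6}\s+\S")
# Bare article pointer such as "(Art. 26)" or "*(Art. 26)*" (assumed format).
POINTER = re.compile(r"\*?\((?:Arts?|Annex)\.?\s[^)]*\)\*?")


def is_structural(chunk: str) -> bool:
    """Return True if the chunk holds only headings, lead-ins, or pointers."""
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    for line in lines:
        if HEADING.match(line):
            continue
        # Drop article pointers, then see whether any real sentence remains.
        body = POINTER.sub("", line).strip()
        if not body or body.endswith(":"):
            continue
        return False  # found a line with actual content
    return True  # empty chunk, or nothing but structure


for sample in [
    "#### Importers",
    "Deployers must, in particular: *(Art. 26)*",
    "Providers must establish a risk management system.",
]:
    print(repr(sample), "->", is_structural(sample))
```

A heuristic like this is deliberately conservative: a chunk with even one ordinary sentence is passed through to the model.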

The two philosophies produce different results:

| | Sonnet | Mistral Nemo |
| --- | --- | --- |
| Headers | Skips (no content) | Extracts labels as entities |
| Best chunk | 32 entities | 20 entities |
| Weakest chunks | 0 (refuses to force it) | 2–3 (extracts labels) |

The practical takeaway — and why this experiment happened

We deliberately disable auto-reload on our API token budget. One day a feature test set consumed the remaining balance far more quickly than expected, with costs about 10x higher than anticipated, and the frontier lab API became unavailable. We had always wanted local processing as an option, so the question became: which local model is good enough to continue development without the API?

At the time, that turned out to be Mistral Nemo; we also wanted to test an EU model, since that's where we are. It is not perfect, but it is specific enough to build a meaningful knowledge graph, at no per-query cost, running entirely on our own hardware. The answer is now documented. When the API is available, it's faster and extracts deeper; when it isn't, local processing continues.

The Sovaign gateway handles this routing. The graph stores what the models produce. The choice of model for a given run is a trade-off between cost, depth, speed, and whether you need data to stay on-premises — and that trade-off should be a conscious decision, not something dictated by which API happens to have remaining credit.
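
To make the fallback idea concrete, here is a minimal sketch of priority-ordered routing. It is not the Sovaign gateway's implementation; `Route`, `BudgetExhausted`, `extract_with_fallback`, and the two stand-in extractors are all hypothetical names, and a real gateway would call actual model backends.

```python
from dataclasses import dataclass
from typing import Callable


class BudgetExhausted(Exception):
    """Raised by a route whose API credit has run out (hypothetical signal)."""


@dataclass
class Route:
    name: str
    extract: Callable[[str], list[str]]
    on_premises: bool


def extract_with_fallback(
    chunk: str, routes: list[Route], require_on_premises: bool = False
) -> tuple[str, list[str]]:
    """Try routes in priority order; return (route name, entities)."""
    for route in routes:
        if require_on_premises and not route.on_premises:
            continue  # data must not leave our hardware
        try:
            return route.name, route.extract(chunk)
        except BudgetExhausted:
            continue  # out of credit: fall through to the next route
    raise RuntimeError("no eligible route could process the chunk")


# Stand-in extractors for illustration only.
def frontier_api(chunk: str) -> list[str]:
    raise BudgetExhausted("balance is zero")


def local_model(chunk: str) -> list[str]:
    return ["Deployers", "Human Oversight"]


routes = [
    Route("frontier-api", frontier_api, on_premises=False),
    Route("local", local_model, on_premises=True),
]

used, entities = extract_with_fallback(
    "Deployers must ensure human oversight.", routes
)
print(used, entities)
```

Making the route order and the `require_on_premises` flag explicit inputs keeps the model choice a stated decision for each run, with the fallback covering only the case where the preferred route has no credit left.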