Three models, investigated after a budget ran out

The setup

The document was a structured summary of EU AI Act compliance obligations: 37 text chunks covering prohibited AI practices, high-risk system requirements, provider and deployer duties, transparency rules, and general-purpose AI model obligations. We ran it through three models using Sovaign's ingestion pipeline, keeping the chunking identical across all three runs.

The models: phi3:mini (small, fast, runs locally at no cost), Mistral Nemo 12B (a larger, multilingual research model, also local and free to run on your own hardware), and Claude Sonnet 4.5 (a frontier API model: not free, but more capable and faster).

The task: extract the compliance concepts from each chunk. No hints, no templates. Read the text, tell us what matters.

What each model produced

Metric	phi3:mini	Mistral Nemo 12B	Claude Sonnet 4.5
Chunks with entities	46% (17/37)	95% (35/37)	73% (27/37)
Total entities	30	145	156
Avg per successful chunk	1.8	4.1	5.8
Best single chunk	3	20	32
Time per chunk	~12s	~29s	~6s
Cost	€0	€0	$0.28
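
The percentage and average rows follow directly from the raw counts in the table. As a minimal sketch (the class and function names here are illustrative, not part of the Sovaign pipeline), the arithmetic looks like this:

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Raw counts for one model run, copied from the table above."""
    model: str
    chunks_total: int
    chunks_with_entities: int
    entities_total: int

    @property
    def coverage(self) -> float:
        # Share of chunks for which at least one entity was extracted.
        return self.chunks_with_entities / self.chunks_total

    @property
    def avg_per_successful_chunk(self) -> float:
        # Average over chunks that produced entities, not over all chunks.
        return self.entities_total / self.chunks_with_entities


RUNS = [
    RunStats("phi3:mini", 37, 17, 30),
    RunStats("Mistral Nemo 12B", 37, 35, 145),
    RunStats("Claude Sonnet 4.5", 37, 27, 156),
]


def summarize(runs):
    """Map model name -> (coverage in whole percent, average rounded to 0.1)."""
    return {
        r.model: (round(r.coverage * 100), round(r.avg_per_successful_chunk, 1))
        for r in runs
    }


for model, (pct, avg) in summarize(RUNS).items():
    print(f"{model}: {pct}% of chunks, {avg} entities per successful chunk")
```

Note that the average is taken over successful chunks only, so a model that skips many chunks is not penalised in that row.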

Phi3:mini is fast and free but extracts very generic labels: "Artificial Intelligence System," "Manufacturer Entity." Useful for a rough pass, not useful for building a knowledge graph you can actually query.

Mistral Nemo takes longer but extracts with real specificity. From a single chunk on deployer obligations it pulled 20 entities that map directly to compliance concepts, among them Post-Remote Biometric Identification, Annex III Systems, Worker Representatives, Human Oversight, Suspension, and Monitoring. Its output is sometimes quirky and it misses some important concepts, but it produced entities for 95% of chunks.

Claude Sonnet extracted the most granular entities — 32 from a single chunk on prohibited practices, including every protected characteristic named in the regulation (Race, Political opinions, Union membership, Religion/beliefs, Sex life, Sexual orientation). When it extracted, it went deep. But it only extracted from 73% of chunks — lower than Mistral.

Why Sonnet's lower success rate is not a failure

The 10 chunks Sonnet skipped were things like "#### Importers", "### 3) High-risk AI systems — obligations by role", and "Deployers must, in particular: *(Art. 26)*". These are section headers, structural fragments, and article pointers with no content behind them.

Mistral extracted from all of these. "#### Importers" became the entity Importers. "Deployers must, in particular:" yielded Deployers and Art. 26.

Sonnet skipped them on the grounds that there is nothing to extract — a label is not a concept. This turns out to matter when you ask the resulting graph questions. A graph populated with structural labels will appear complete while actually being shallow. Sonnet's approach produces fewer nodes but each node carries real information.

The two philosophies produce different results:

	Sonnet	Mistral Nemo
Headers	Skips (no content)	Extracts labels as entities
Best chunk	32 entities	20 entities
Weakest chunks	0 (refuses to force it)	2–3 (extracts labels)
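
If you run a model that extracts from everything, one option is to flag structural chunks yourself before they reach the graph. The following is a rough heuristic sketch, not a description of how Sonnet decides or of anything in the Sovaign pipeline; the function name, the regular expressions, and the `min_words` threshold are all assumptions chosen to match the three kinds of fragment described above.

```python
import re

# Markdown-style heading: one to six '#' characters followed by text.
HEADING = re.compile(r"^\s*#{1,6}\s+\S")
# Article pointer such as "(Art. 26)" or "*(Article 26)*".
ARTICLE_POINTER = re.compile(r"\*?\((?:Art\.|Article)\s*\d+[^)]*\)\*?")


def is_structural(chunk: str, min_words: int = 6) -> bool:
    """Return True if a chunk holds only headings or a dangling lead-in."""
    lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
    body = []
    for ln in lines:
        if HEADING.match(ln):
            continue  # headings carry a label, not content
        body.append(ARTICLE_POINTER.sub("", ln).strip())
    text = " ".join(part for part in body if part)
    if not text:
        return True  # nothing left once headings and pointers are removed
    # A short line ending in a colon introduces a list that is not in this chunk.
    return text.endswith(":") and len(text.split()) < min_words


print(is_structural("#### Importers"))
print(is_structural("Deployers must, in particular: *(Art. 26)*"))
print(is_structural("Deployers must assign human oversight to competent persons. *(Art. 26)*"))
```

A filter like this only approximates the judgement a stronger model makes on its own, and the threshold would need tuning against real chunks.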

The practical takeaway — and why this experiment happened

We deliberately disable auto-reload on our API token budget. One day a feature test set consumed the remaining balance unexpectedly quickly, with costs 10x higher than anticipated, and the frontier lab API became unavailable. We had always wanted local processing as an option, so the question became: which local model is good enough to continue development without the API?

At the time, that turned out to be Mistral Nemo; we wanted to test an EU model, since that's where we are. It is not perfect, but it is specific enough to build a meaningful knowledge graph, at no per-query cost, running entirely on our own hardware. The answer is now documented. When the API is available, it is faster and extracts deeper; when it isn't, local processing continues.

The Sovaign gateway handles this routing. The graph stores what the models produce. The choice of model for a given run is a trade-off between cost, depth, speed, and whether you need data to stay on-premises — and that trade-off should be a conscious decision, not something dictated by which API happens to have remaining credit.
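
To make the routing idea concrete, here is a minimal fallback sketch. It does not show the Sovaign gateway's actual API; `Backend`, `BackendUnavailable`, `extract_with_fallback`, and the two stub extractors are hypothetical names invented for illustration. The caller lists backends in order of preference and can require on-premises processing explicitly.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class BackendUnavailable(Exception):
    """Raised when a backend cannot serve requests, e.g. budget exhausted."""


@dataclass
class Backend:
    name: str
    extract: Callable[[str], List[str]]
    on_premises: bool


def extract_with_fallback(
    chunk: str,
    backends: Sequence[Backend],
    require_on_premises: bool = False,
) -> Tuple[str, List[str]]:
    """Try backends in preference order; return (backend name, entities)."""
    for backend in backends:
        if require_on_premises and not backend.on_premises:
            continue  # data must stay local, so skip remote backends
        try:
            return backend.name, backend.extract(chunk)
        except BackendUnavailable:
            continue  # fall through to the next backend
    raise BackendUnavailable("no eligible backend could process the chunk")


# Stub extractors standing in for real model calls (hypothetical).
def api_extract(chunk: str) -> List[str]:
    raise BackendUnavailable("API budget exhausted")


def local_extract(chunk: str) -> List[str]:
    return ["placeholder entity"]


backends = [
    Backend("frontier-api", api_extract, on_premises=False),
    Backend("local-model", local_extract, on_premises=True),
]
used, entities = extract_with_fallback("some chunk text", backends)
print(used, entities)
```

Returning the backend name alongside the entities lets each run record which model produced a given node, so the trade-off stays visible after the fact.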