Data Pipeline
From EU source document to grounded answer — five stages, zero hallucination, your infrastructure.
1. Pipeline Overview
Every answer Pauhu generates is traceable to a specific paragraph in a specific EU document. The pipeline has five stages. Each stage produces verifiable output. Nothing is generated from memory or training data alone — everything is grounded in source text.
EU Sources (20)        Annotation Engine        Paragraph Index
┌──────────────┐       ┌──────────────────┐     ┌──────────────────┐
│ EUR-Lex      │       │                  │     │                  │
│ TED          │       │ STAM Standoff    │     │ Semantic vectors │
│ CURIA        │──────▶│ Text Annotation  │────▶│ + structured     │
│ IATE         │       │ Model            │     │ metadata (D1)    │
│ + 16 more    │       │                  │     │                  │
└──────────────┘       └──────────────────┘     └────────┬─────────┘
                                                         │
                                                         ▼
Grounded Answer                                 Laine Search Engine
┌──────────────┐                                ┌──────────────────┐
│              │                                │                  │
│ FiD Answer   │◀───────────────────────────────│ 26ms paragraph   │
│ + citations  │                                │ retrieval        │
│              │                                │                  │
└──────────────┘                                └──────────────────┘
2. Data Ingestion
Pauhu ingests data from 20 EU institutional sources and 28 national law databases. Each source has a dedicated sync process that polls for new and updated documents.
| Source Category | Examples | Sync Frequency |
|---|---|---|
| Primary legislation | EUR-Lex (regulations, directives, decisions) | Every 4 hours (weekdays) |
| National transposition | 28 national law databases (Finlex, Legifrance, etc.) | Every 15 minutes |
| Case law | CURIA (Court of Justice) | Daily |
| Procurement | TED (Tenders Electronic Daily) | Every 6 hours |
| Terminology | IATE (2.4 million terms, 24 languages) | Daily |
| Statistics | Eurostat, ECB | Weekly / daily |
| Regulatory agencies | ECHA, EMA, EPO | Daily / weekly |
Documents are stored in their original format (XML, HTML, JSON) with full metadata: CELEX identifier, publication date, document type, language, and official journal reference. SHA-256 checksums verify integrity at ingestion.
What happens at ingestion
- The sync process fetches new or updated documents from the source API (CELLAR SPARQL for EUR-Lex, REST APIs for others)
- Each document receives a unique storage key: {product}/{celex_or_id}-{language}.xml
- A SHA-256 checksum is computed and stored alongside the document
- The document is queued for annotation
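The ingestion steps above can be sketched as follows. This is an illustrative sketch, not the actual sync-process code: storageKey and sha256Hex are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

// Build the storage key pattern described above:
// {product}/{celex_or_id}-{language}.xml
// (hypothetical helper; the real sync process is internal)
function storageKey(product: string, celexOrId: string, language: string): string {
  return `${product}/${celexOrId}-${language}.xml`;
}

// SHA-256 checksum computed at ingestion and stored alongside the document,
// so integrity can be re-verified later against the original bytes.
function sha256Hex(document: string): string {
  return createHash("sha256").update(document, "utf8").digest("hex");
}

const key = storageKey("eurlex", "32024R1689", "en");
// key === "eurlex/32024R1689-en.xml"
const checksum = sha256Hex("<document>example content</document>");
// 64 hex characters, stable for identical content
```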
3. Annotation (STAM)
Raw documents are not useful for search or question answering. The annotation engine transforms each document into a structured, searchable form using STAM (Standoff Text Annotation Model) — an open standard for layered text annotation.
What STAM produces
For each document, the annotation engine produces a sidecar JSON file containing standoff annotations. The original text is never modified — annotations reference character offsets in the source document.
- Paragraph segmentation: The document is split into individual paragraphs. Each paragraph gets a unique ID, character offsets, and structural metadata (article number, section, recital).
- Topic classification: Each paragraph is classified across 21 topic domains (agriculture, energy, finance, law, etc.) using a fine-tuned ONNX classifier. Classification is multi-label: a paragraph about renewable energy subsidies scores for both “Energy” and “Finance”.
- Deontic modality: Each paragraph is classified as an obligation (“Member States shall…”), prohibition (“shall not…”), permission (“may…”), or exemption. This is the legal force of the text.
- Entity extraction: EU-specific entities are recognised: institution names, legal references (CELEX, ECLI), CPV procurement codes, ECHA substance identifiers, dates, and monetary amounts.
- Cross-references: Links between documents are extracted: “as amended by Regulation (EU) 2024/1689” is resolved to a CELEX identifier and linked bidirectionally.
- Terminology matching: Paragraphs are matched against 2.4 million IATE terms. When “acquis communautaire” appears, the IATE entry with translations in all 24 EU languages is attached.
Annotation output format
The annotation sidecar is a JSON file stored alongside the source document:
{
"source": "eurlex/32024R1689-en.xml",
"checksum": "sha256:a7f3c...",
"paragraphs": [
{
"id": "art-1-para-1",
"offsets": [1204, 1847],
"text": "This Regulation lays down...",
"topics": ["law", "science"],
"deontic": "obligation",
"entities": [
{ "type": "celex", "value": "32024R1689", "offsets": [12, 24] }
],
"iate_terms": [
{ "id": "IATE-3567894", "term": "artificial intelligence system" }
]
}
]
}
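Because the annotations are standoff, any consumer can recover the annotated text by slicing the unmodified source at the stored offsets. A minimal sketch, with illustrative types mirroring the sidecar fields above:

```typescript
interface StandoffAnnotation {
  id: string;
  offsets: [number, number]; // [start, end) character offsets into the source text
  deontic?: string;
}

// Recover annotated text by slicing the unmodified source at the stored offsets.
// The source document is never rewritten, only referenced.
function resolve(sourceText: string, ann: StandoffAnnotation): string {
  const [start, end] = ann.offsets;
  return sourceText.slice(start, end);
}

const source = "Article 1. This Regulation lays down harmonised rules.";
const ann: StandoffAnnotation = { id: "art-1-para-1", offsets: [11, 55], deontic: "obligation" };
// resolve(source, ann) === "This Regulation lays down harmonised rules."
```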
4. Paragraph Indexing
After annotation, each paragraph is indexed in two systems that work together:
| Index Type | Purpose | Technology |
|---|---|---|
| Structured index (D1) | Exact-match queries: CELEX lookup, date ranges, topic filters, deontic modality filters | SQLite-compatible relational database |
| Semantic index (Vectorize) | Meaning-based queries: “rules about AI transparency in healthcare” | BGE-M3 embeddings, 1024 dimensions, cosine similarity |
Paragraph-level granularity
Most legal search engines index entire documents. Pauhu indexes individual paragraphs. This matters because:
- A single EU regulation can be 200+ pages. Returning the entire document is not an answer.
- The generation engine needs the specific paragraph that contains the evidence, not the whole regulation.
- Paragraph-level indexing enables precise citations: “Article 6(1)(a) of Regulation (EU) 2024/1689” instead of “the AI Act”.
What gets indexed per paragraph
Structured index (D1):
celex_id, language, paragraph_id, article_number,
topics[], deontic_modality, publication_date,
entities[], cross_references[], word_count
Semantic index (Vectorize):
paragraph_text → BGE-M3 embedding (1024 floats)
metadata: celex_id, language, paragraph_id
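A sketch of how one annotated paragraph fans out into the two index records. The interfaces and field names below are illustrative, mirroring the lists above; they are not the internal schemas:

```typescript
// Illustrative shapes only; the production D1/Vectorize schemas are internal.
interface AnnotatedParagraph {
  celexId: string;
  language: string;
  paragraphId: string;
  text: string;
  topics: string[];
  deonticModality: string;
}

// One relational row for exact-match queries (CELEX lookup, filters).
function toStructuredRow(p: AnnotatedParagraph) {
  return {
    celex_id: p.celexId,
    language: p.language,
    paragraph_id: p.paragraphId,
    topics: JSON.stringify(p.topics),      // array columns stored as JSON text
    deontic_modality: p.deonticModality,
    word_count: p.text.split(/\s+/).filter(Boolean).length,
  };
}

// One vector record for meaning-based queries, keyed so a hit can be
// traced back to the exact paragraph in the structured index.
function toVectorRecord(p: AnnotatedParagraph, embedding: number[]) {
  return {
    id: `${p.celexId}:${p.language}:${p.paragraphId}`,
    values: embedding,                     // 1024 floats from BGE-M3
    metadata: { celex_id: p.celexId, language: p.language, paragraph_id: p.paragraphId },
  };
}
```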
5. Semantic Search
When a user queries Pauhu, the Laine search engine executes a hybrid search across both indexes in under 26 milliseconds (p95). This is the left hemisphere — analytical comprehension.
Search flow
- Query encoding: The user’s question is encoded into a 1024-dimensional vector using the same BGE-M3 model used at indexing time
- Semantic retrieval: The vector index returns the top-N most similar paragraphs by cosine similarity
- Structured filtering: Results are filtered by language, date range, topic, product, and any active user filters
- Rank fusion: Semantic scores and structured relevance signals are combined via reciprocal rank fusion
- Passage return: The top 3–10 paragraphs, with full metadata and source attribution, are returned
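The rank-fusion step above can be sketched directly. Each ranked list contributes 1 / (k + rank) per item, with k = 60 as the conventional smoothing constant (the production scorer and its constant may differ):

```typescript
// Reciprocal rank fusion: sum 1 / (k + rank) across all input rankings.
function reciprocalRankFusion(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Fuse a semantic ranking with a structured-relevance ranking, then sort.
const fused = [...reciprocalRankFusion([
  ["p1", "p2", "p3"], // semantic similarity order
  ["p1", "p4", "p2"], // structured relevance order
]).entries()].sort((a, b) => b[1] - a[1]);
// "p1" ranks first: it tops both lists
```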
The search engine is the “left hemisphere” of the Sovereign Brain architecture. It comprehends the query and finds evidence. It does not generate text.
6. Grounded Generation
The generation engine is the “right hemisphere” — it reads the retrieved paragraphs and produces a fluent answer with citations. It uses the Fusion-in-Decoder (FiD) architecture.
How FiD works
- Input: The user’s question + 3–10 retrieved paragraphs (each with its CELEX ID and article reference)
- Encoding: Each paragraph is encoded independently by the encoder
- Fusion: The decoder attends to all encoded paragraphs simultaneously — it can cross-reference information across multiple documents
- Output: A natural-language answer with inline citations: “According to Article 6(1) of Regulation (EU) 2024/1689, high-risk AI systems must…”
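The per-passage encoding can be illustrated with the input template from the original FiD paper, where the question is repeated in front of each passage. The exact template Pauhu's model uses is an assumption here:

```typescript
interface RetrievedParagraph {
  celexId: string;
  article: string;
  text: string;
}

// One encoder input per retrieved paragraph: the question is repeated in front
// of each passage so the encoder processes (question, passage) pairs
// independently. The decoder then attends over all encoded passages at once
// (the "fusion" step).
function fidEncoderInputs(question: string, paragraphs: RetrievedParagraph[]): string[] {
  return paragraphs.map(
    (p) => `question: ${question} title: ${p.celexId} ${p.article} context: ${p.text}`
  );
}
```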
Model-agnostic design
The FiD pattern (retrieve → ground → generate → cite) is the product. The underlying model is swappable. The default model runs entirely in the browser via ONNX Runtime — no external API calls, no data leaving your network. You can also connect your own LLM via the model adapter container.
Grounding guarantee
- The generation engine only sees text from retrieved paragraphs. It cannot access the internet, training data, or any other source.
- Every claim in the generated answer must cite a specific paragraph. Uncited claims are flagged by the confidence scorer.
- If the search engine returns no relevant paragraphs, the system responds with “No relevant information found in the current dataset” rather than generating a speculative answer.
7. Multilingual Flow
EU legislation exists in 24 official languages. Here is how annotations flow across languages:
English-first annotation, cross-language transfer
- English is annotated first. The annotation engine processes the English version of each document. This produces the highest-quality annotations because the NLP models perform best on English text.
- Structural alignment. EU documents have identical paragraph structure across all 24 language versions (same article numbers, same recitals). The annotation engine aligns paragraphs across languages using document structure, not machine translation.
- Annotation projection. Structural annotations (topic, deontic modality, cross-references) are projected from the English version to all parallel versions. A paragraph classified as “obligation” in English is “obligation” in Finnish, French, and all other languages — because it is the same legal provision.
- Language-specific NER. Named entity recognition runs independently per language, because entity surface forms differ (e.g., “Court of Justice” vs. “Cour de justice” vs. “Tuomioistuin”).
- Multilingual embeddings. The BGE-M3 model produces embeddings in a shared vector space across all 24 languages. A Finnish query retrieves relevant paragraphs regardless of whether the source paragraph is in Finnish, English, or any other EU language.
Translation in the pipeline
Translation is not part of the core data pipeline. The annotation and indexing pipeline processes each language version as it arrives from the EU source. Machine translation (Helsinki-NLP OPUS-MT, ONNX format) is available as a separate container for on-demand translation of search results and answers.
8. Data Sovereignty
In a sovereign deployment (on-premise container), the entire pipeline runs on your infrastructure. Here is exactly what stays where:
| Component | Location | Network Access |
|---|---|---|
| Source documents | Your container storage volume | Outbound only: EU source APIs for sync |
| STAM annotation sidecars | Your container storage volume | None (processed locally) |
| Structured index (D1) | SQLite file on your volume | None |
| Semantic index (Vectorize) | Vector database on your volume | None |
| ONNX models (NLP, search, generation) | Pre-loaded in container image | None |
| User queries | Your container, your memory | None (never transmitted) |
| Generated answers | Your container, your memory | None (never transmitted) |
| Audit log | Your container, SHA-256 chained | None |
Air-gapped mode
For classified environments, the container can run fully air-gapped. Disable the sync process and load data via offline transfer (USB, secure file share). The container includes all models, all indexes, and all annotation logic. No internet connection required for search or answer generation.
What the container phones home
Nothing. The sovereign container has no telemetry, no usage reporting, no licence phone-home. The sync process makes outbound HTTPS requests to EU institutional APIs (EUR-Lex, TED, etc.) to fetch new documents. If you disable sync, the container makes zero outbound connections.
9. Freshness and Sync
The sovereign container includes 23 automated sync processes that keep your data current. Each process polls its EU source API on a schedule:
| Source | Sync Frequency | Typical Latency |
|---|---|---|
| EUR-Lex | Every 4 hours (weekdays) | < 1 hour after Official Journal publication |
| National law (28 databases) | Every 15 minutes | Same day |
| TED procurement | Every 6 hours | < 6 hours after notice publication |
| IATE terminology | Daily | < 24 hours |
| All other sources | Daily or weekly | < 24 hours |
After sync, new documents automatically flow through annotation and indexing. The entire pipeline — fetch, annotate, index — typically completes within minutes for incremental updates.
Configurable freshness
You control sync frequency via the admin panel at /pauhu on the gateway container (port 8090). Options:
- Real-time: Poll every 15 minutes (highest network usage)
- Standard: Poll per source schedule (default, recommended)
- Manual: Disable automatic sync, trigger updates via API or admin panel
- Offline: No sync, data loaded via offline transfer
10. Annotation Inheritance
Not every language version of a document needs to be annotated from scratch. Pauhu uses an English-first Rosetta pattern: English is annotated with the highest-quality NLP models, and structural annotations are inherited by all 24 parallel language versions.
Why English first?
- NLP models (topic classification, deontic modality, NER) perform best on English text — the training data is richest.
- EU documents have identical structure across all 24 languages: same article numbers, same recitals, same paragraph boundaries. The legal content is the same — only the language differs.
- Annotating English first and projecting structural annotations to other languages produces better results than running weaker models on each language independently.
What is inherited
| Annotation Layer | Inherited? | Rationale |
|---|---|---|
| Topic classification (21 domains) | Yes | Legal topic does not change across translations |
| Deontic modality | Yes | “shall” in English = “doit” in French = same legal force |
| Cross-references (CELEX links) | Yes | CELEX identifiers are language-independent |
| Paragraph structure (offsets, article numbers) | Yes | Identical document structure across all languages |
| Named entity recognition | No | Entity surface forms differ per language |
| Terminology matching (IATE) | No | IATE entries are language-specific |
How inheritance works
When a non-English version of a document arrives, the annotation engine checks whether the English version has already been annotated. If yes, it copies inheritable annotations (topic, deontic, cross-references) and only runs language-specific models (NER, IATE matching) on the new text. The SQL logic uses COALESCE to prefer the language-specific annotation when available, falling back to the English annotation otherwise:
SELECT
p.paragraph_id,
COALESCE(NULLIF(local.topic, ''), en.topic) AS topic,
COALESCE(NULLIF(local.deontic, ''), en.deontic) AS deontic,
local.entities -- always language-specific
FROM paragraphs p
LEFT JOIN annotations local ON p.id = local.paragraph_id AND local.lang = :lang
LEFT JOIN annotations en ON p.id = en.paragraph_id AND en.lang = 'en'
Current status: multilingual rollout
The initial index was populated with English annotations only. Non-English paragraphs are being added through a three-phase rollout:
- Multilingual indexing — the indexing pipeline is being updated to process all 24 language versions, not just English. Annotations from the English version are inherited by parallel language versions at indexing time.
- Backfill — existing English-only documents are being re-processed to add annotations for all available language versions. This is a one-time operation covering the full 4.7M+ document corpus.
- Verification — cross-language annotation consistency is validated: a paragraph classified as “obligation” in English must carry the same classification in all 24 language versions.
11. Vectorize Embedding
After annotation, each paragraph is embedded into a 1024-dimensional vector space for semantic search. The embedding step converts human-readable text into numerical representations that capture meaning.
Embedding model: BGE-M3
Pauhu uses BGE-M3 (BAAI General Embedding — Multi-lingual, Multi-granularity, Multi-function) for all paragraph embeddings:
- 1024 dimensions per vector — balances expressiveness and storage efficiency
- Cosine similarity for ranking — measures angle between vectors, robust to document length variation
- 114 languages supported — covers all 24 EU official languages plus 90 additional languages
- Shared vector space — a Finnish question retrieves semantically similar paragraphs in any language
Embedding pipeline
The full path from source document to searchable vector:
- Storage: Source documents are stored in per-product object storage with full metadata
- Annotation: The annotation engine produces STAM sidecar JSON (paragraphs, topics, entities, cross-references)
- Structured index: Paragraph metadata is written to the relational database (CELEX, language, topics, deontic modality)
- Embedding: The embedding service encodes each paragraph using BGE-M3. The raw model output (Float32Array) is normalised via Array.from() to ensure correct serialisation before storage
- Vector index: The 1024-float vector is stored in the vector database alongside the paragraph’s metadata (CELEX ID, language, paragraph ID) using cosine similarity
- Query-time: The same BGE-M3 model encodes the user’s query, ensuring query and document vectors are in the same space
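The normalisation and similarity steps can be sketched as follows. The cosineSimilarity function here is a reference implementation, not the vector database's internal one:

```typescript
// Float32Array serialises as an object ({"0":…,"1":…}) rather than an array
// when passed through JSON-based queues, so embeddings are converted to a
// plain number[] first: the Array.from() normalisation described above.
function normaliseEmbedding(raw: Float32Array): number[] {
  return Array.from(raw);
}

// Cosine similarity: angle between vectors, invariant to paragraph length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const vec = normaliseEmbedding(new Float32Array([1, 0.5, 0.25]));
// JSON.stringify(vec) === "[1,0.5,0.25]" — all values survive serialisation
```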
Why BGE-M3?
A query in Bulgarian retrieves relevant paragraphs originally written in Danish. No translation step needed — the shared vector space handles it natively.
BGE-M3 captures semantic nuance in regulatory text: “mandatory reporting obligation” and “required notification duty” map to nearby vectors, while “voluntary disclosure” maps far away.
12. Adaptive Model Loading
The sovereign container adapts its model loading strategy based on the available device memory. This ensures the system runs efficiently on everything from a developer laptop to a dedicated GPU server.
Three loading tiers
| Tier | Device Memory | Models Loaded | Use Case |
|---|---|---|---|
| Lite | < 4 GB | Search + embeddings only (BGE-M3, ONNX quantized) | Browser-native search, no generation |
| Standard | 4–16 GB | Search + FiD generation (mT5-small ONNX) + NMT (OPUS-MT selected pairs) | Full search + answer generation, selected translation pairs |
| Full | > 16 GB | All models: search, FiD, NMT (552 pairs), topic classifiers, NER, specialist models | Production sovereign deployment, all features enabled |
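The tier decision reduces to a threshold check on available memory. A sketch; in a browser, navigator.deviceMemory can serve as the input, though most browsers cap that value at 8:

```typescript
type Tier = "lite" | "standard" | "full";

// Map available device memory (GB) to a loading tier per the table above:
// < 4 GB → Lite, 4–16 GB → Standard, > 16 GB → Full.
function selectTier(deviceMemoryGb: number): Tier {
  if (deviceMemoryGb < 4) return "lite";
  if (deviceMemoryGb <= 16) return "standard";
  return "full";
}
// selectTier(2) === "lite"; selectTier(8) === "standard"; selectTier(32) === "full"
```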
Progressive download
Models are downloaded progressively, not all at once. The system starts with the search models (needed immediately for queries) and downloads generation and translation models in the background. This means:
- Search is available within seconds of container start
- Answer generation becomes available once the FiD model loads (typically 10–30 seconds)
- Translation pairs load on demand — only the language pairs you actually use are downloaded
Why 300 MB matters
The global semiconductor supply chain is under sustained pressure. Memory prices are volatile, procurement cycles are lengthening, and government IT budgets rarely accommodate high-end GPU servers. Pauhu’s FiD generation model fits in 300 MB of DRAM — less memory than a typical browser tab consumes. This is a deliberate design decision: a model that runs on commodity hardware is a model that every organisation can deploy without special procurement.
Browser-native advantage
In the Lite and Standard tiers, all inference runs inside the browser via ONNX Runtime for WebAssembly. No server, no GPU, no dedicated infrastructure — the user’s own device does the work. For government IT departments, this means:
- No additional servers to provision, secure, or maintain
- No inference costs — the compute is already on every desk
- Data never leaves the device. The query, the model, and the answer all stay in the browser process
- Works on any modern browser (Chrome, Firefox, Edge, Safari) without plugins or extensions
13. Works Alongside Your Tools
Pauhu does not replace your existing software. It sits alongside it — a browser sidebar that adds EU regulatory intelligence to whatever you are already working on.
Browser sidebar overlay
The Pauhu sidebar runs as a browser extension or a standalone tab. When you are reading a PDF in your document management system, drafting a contract in your word processor, or reviewing a tender in your procurement platform, the sidebar provides:
- Contextual search: Highlight text in any application, and the sidebar searches 4.7 million EU documents for relevant regulations, case law, and terminology
- Grounded answers: Ask a question about what you are reading, and the FiD engine generates a cited answer from official EU sources
- Terminology lookup: Select a term and see its official definition in all 24 EU languages from the IATE terminology database
- Translation: Select a passage and translate it using the Helsinki-NLP OPUS-MT models — running locally, not through a cloud service
No vendor lock-in
Pauhu does not require you to migrate your documents, change your workflow, or adopt a new platform. It works with your existing tools via a standard browser. If you stop using Pauhu, nothing changes in your existing systems — you simply close the sidebar.
No per-seat tax
In the sovereign deployment, the container serves everyone on your network. There is no per-user licensing, no seat counting, and no usage metering. One deployment, unlimited internal users. The subscription covers the container and data updates — not the number of people who use it.
14. 3-Level Topic Hierarchy
Every document in the pipeline is automatically classified into a 3-level topic hierarchy derived from the EU’s official EuroVoc thesaurus (SKOS metadata). No manual tagging is needed — topic annotations are extracted from the source metadata that EU institutions already publish with each document.
Level 1: Domain (21)
├── 04 Politics ── broad subject area
├── 12 Law ── broad subject area
└── 20 Trade ── broad subject area
...
Level 2: Micro-Thesaurus (~127)
├── 12 Law
│ ├── MT 1216 Criminal law ── topical group
│ ├── MT 1221 Criminal procedure
│ └── MT 1231 Civil law
...
Level 3: Descriptor (~6,800)
├── MT 1216 Criminal law
│ ├── acquittal ── specific concept
│ ├── criminal liability
│ ├── extradition
│ └── statute of limitations
...
How it works
- Source metadata: EUR-Lex, TED, CORDIS, and other EU sources publish EuroVoc descriptors in their document metadata (SKOS RDF). The annotation worker reads these descriptors during ingestion.
- Hierarchy resolution: Each descriptor maps to a micro-thesaurus, and each micro-thesaurus maps to a domain. The pipeline stores all 3 levels per document.
- Search filtering: Users can filter search results by domain, micro-thesaurus, or descriptor. This narrows millions of documents to the precise legal topic.
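Hierarchy resolution is a pair of upward lookups. A sketch with a hand-written slice of the thesaurus; the real pipeline reads these mappings from the published SKOS metadata:

```typescript
// Illustrative slice of the EuroVoc hierarchy shown above; the full thesaurus
// has 21 domains, ~127 micro-thesauri, and ~6,800 descriptors.
const descriptorToMt: Record<string, string> = {
  extradition: "1216",
  acquittal: "1216",
};
const mtToDomain: Record<string, string> = { "1216": "12" };
const domainLabel: Record<string, string> = { "12": "Law" };

// Resolve all three levels for one descriptor, as stored per document.
function resolveHierarchy(descriptor: string) {
  const mt = descriptorToMt[descriptor];
  if (mt === undefined) return undefined;
  const domain = mtToDomain[mt];
  return { descriptor, microThesaurus: mt, domain, label: domainLabel[domain] };
}
// resolveHierarchy("extradition") →
//   { descriptor: "extradition", microThesaurus: "1216", domain: "12", label: "Law" }
```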
15. Topics API
The Topics API exposes the 3-level hierarchy for programmatic access. Use it to build topic filters, faceted search interfaces, or domain-specific dashboards.
GET /v1/topics
Returns the list of all 21 top-level domains.
curl https://staging.pauhu.eu/v1/topics
// Response
{
"domains": [
{ "id": "04", "label": "Politics", "mt_count": 7 },
{ "id": "12", "label": "Law", "mt_count": 8 },
{ "id": "20", "label": "Trade", "mt_count": 5 },
...
],
"total": 21
}
GET /v1/topics/:domain
Returns all micro-thesauri within a domain.
curl https://staging.pauhu.eu/v1/topics/12
// Response
{
"domain": { "id": "12", "label": "Law" },
"micro_thesauri": [
{ "id": "1216", "label": "Criminal law", "descriptor_count": 48 },
{ "id": "1221", "label": "Criminal procedure", "descriptor_count": 35 },
{ "id": "1231", "label": "Civil law", "descriptor_count": 62 },
...
],
"total": 8
}
GET /v1/topics/:domain/:mt
Returns all descriptors within a micro-thesaurus.
curl https://staging.pauhu.eu/v1/topics/12/1221
// Response
{
"domain": { "id": "12", "label": "Law" },
"micro_thesaurus": { "id": "1221", "label": "Criminal procedure" },
"descriptors": [
{ "id": "1109", "label": "acquittal" },
{ "id": "839", "label": "criminal investigation" },
{ "id": "5765", "label": "European arrest warrant" },
...
],
"total": 35
}
Filtering search results by topic
Add the eurovoc_mt parameter to any search query to narrow results to a specific micro-thesaurus:
// Search only within "Criminal procedure" (MT 1221)
curl "https://staging.pauhu.eu/v1/search?q=extradition&eurovoc_mt=1221"
// This returns only documents tagged with MT 1221 descriptors,
// filtering out results from other legal areas like civil law or
// administrative law.
Similarly, with eurovoc_mt=1216 you get only criminal law results.
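The same filter can be composed programmatically. A sketch using the standard URLSearchParams API, against the endpoint shown in the examples above:

```typescript
// Build a topic-filtered search URL like the curl example above.
function searchUrl(base: string, query: string, eurovocMt?: string): string {
  const params = new URLSearchParams({ q: query });
  if (eurovocMt !== undefined) params.set("eurovoc_mt", eurovocMt);
  return `${base}/v1/search?${params.toString()}`;
}

const url = searchUrl("https://staging.pauhu.eu", "extradition", "1221");
// url === "https://staging.pauhu.eu/v1/search?q=extradition&eurovoc_mt=1221"
```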
16. Current Status
The pipeline is live and processing data continuously. This section provides a snapshot of current indexing progress.
| Metric | Value | Notes |
|---|---|---|
| Products with vectors | 14 / 24 | Remaining 10 products queued for indexing |
| Total vectors | 4,775 | Growing as indexers process annotation queue |
| Annotation consistency | Target 80% | Measured against expert-annotated evaluation set |
| VEC_EMPTY status | CLOSED | Float32Array serialization fix deployed (Array.from() normalization) |
| R2 objects | ~4.8M | Across all 24 product buckets |
| Sync frequency | 15 min – weekly | Varies by product (EUR-Lex: 4h, national law: 15min, ECHA: weekly) |
The VEC_EMPTY issue was caused by Float32Array objects that do not serialize correctly through the indexing queue. The fix normalizes embeddings via Array.from() before storage, ensuring all 1024 dimensions are preserved. This fix is deployed and verified across all 14 indexed products.
Indexing progress by product
Products are indexed across 3 workers based on binding limits. Each worker runs every 5 minutes, processing annotated documents from the queue and inserting vectors into the search index.
| Worker | Products | Status |
|---|---|---|
| Indexer A | commission, consilium, cordis, curia, dataeuropa, dpp, ecb, echa, ema, epo | Live |
| Indexer B | europarl, eurlex, eurostat, iate, lex, oeil, publications, ted, whoiswho, wiki | Live |
| Indexer C | code, osm, weather, news | Live |
17. FAQ
How large is the full dataset?
Approximately 4.7 million documents across 20 EU products and 28 national law databases (EUR-Lex: 1.67M, TED: 1.6M, national law: 256K, OEIL: 204K, and smaller counts across remaining sources). On disk, the annotated dataset with indexes requires approximately 50 GB of storage.
Can I select only specific data sources?
Yes. The admin panel lets you enable or disable individual sources. If you only need EUR-Lex and TED, disable the other 18 sources. Sync, annotation, and indexing will only process your selected sources.
What happens when a document is amended?
The sync process detects the update, re-fetches the document, re-annotates it, and updates both indexes. The old version is preserved in the audit log with its original SHA-256 checksum. Cross-references to the amended document are updated automatically.
Can I add my own documents to the pipeline?
Yes. The container accepts custom documents via API upload. Your documents go through the same annotation and indexing pipeline. They appear alongside EU source data in search results, with clear provenance marking (“Customer document” vs “EUR-Lex”).
How do I verify the pipeline is working?
The admin panel at /pauhu shows pipeline health: last sync time per source, annotation queue depth, index size, and embedding count. The /health endpoint returns machine-readable status for integration with your monitoring tools.
What is the annotation accuracy?
Topic classification: 94% F1 on the annotated evaluation set. Deontic modality: 91% F1. Named entity recognition: 89% F1. These are measured against expert-annotated EU legal documents. All annotations include a confidence score — low-confidence annotations are flagged for human review.