Data Pipeline
From EU source document to grounded answer — five stages, zero hallucination, your infrastructure.
1. Pipeline Overview
Every answer Pauhu generates is traceable to a specific paragraph in a specific EU document. The pipeline has five stages. Each stage produces verifiable output. Nothing is generated from memory or training data alone — everything is grounded in source text.
EU Sources (20)        Annotation Engine        Paragraph Index
┌──────────────┐       ┌──────────────────┐     ┌──────────────────┐
│ EUR-Lex      │       │                  │     │                  │
│ TED          │       │ STAM Standoff    │     │ Semantic vectors │
│ CURIA        │──────▶│ Text Annotation  │────▶│ + structured     │
│ IATE         │       │ Model            │     │ metadata (D1)    │
│ + 16 more    │       │                  │     │                  │
└──────────────┘       └──────────────────┘     └────────┬─────────┘
                                                         │
                                                         ▼
Grounded Answer                                 Laine Search Engine
┌──────────────┐                                ┌──────────────────┐
│              │                                │                  │
│ FiD Answer   │◀───────────────────────────────│ 26ms paragraph   │
│ + citations  │                                │ retrieval        │
│              │                                │                  │
└──────────────┘                                └──────────────────┘
2. Data Ingestion
Pauhu ingests data from 20 EU institutional sources and 28 national law databases. Each source has a dedicated sync process that polls for new and updated documents.
| Source Category | Examples | Sync Frequency |
|---|---|---|
| Primary legislation | EUR-Lex (regulations, directives, decisions) | Every 4 hours (weekdays) |
| National transposition | 28 national law databases (Finlex, Legifrance, etc.) | Every 15 minutes |
| Case law | CURIA (Court of Justice) | Daily |
| Procurement | TED (Tenders Electronic Daily) | Every 6 hours |
| Terminology | IATE (2.4 million terms, 24 languages) | Daily |
| Statistics | Eurostat, ECB | Weekly / daily |
| Regulatory agencies | ECHA, EMA, EPO | Daily / weekly |
Documents are stored in their original format (XML, HTML, JSON) with full metadata: CELEX identifier, publication date, document type, language, and official journal reference. SHA-256 checksums verify integrity at ingestion.
What happens at ingestion
- The sync process fetches new or updated documents from the source API (CELLAR SPARQL for EUR-Lex, REST APIs for others)
- Each document receives a unique storage key: {product}/{celex_or_id}-{language}.xml
- A SHA-256 checksum is computed and stored alongside the document
- The document is queued for annotation
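The ingestion steps above can be sketched as follows. This is an illustrative sketch, not the actual sync-process code: storageKey and sha256Hex are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

// Build the storage key pattern described above:
// {product}/{celex_or_id}-{language}.xml
// (hypothetical helper; the real sync process is internal)
function storageKey(product: string, celexOrId: string, language: string): string {
  return `${product}/${celexOrId}-${language}.xml`;
}

// SHA-256 checksum computed at ingestion and stored alongside the document,
// so integrity can be re-verified later against the original bytes.
function sha256Hex(document: string): string {
  return createHash("sha256").update(document, "utf8").digest("hex");
}

const key = storageKey("eurlex", "32024R1689", "en");
// key === "eurlex/32024R1689-en.xml"
const checksum = sha256Hex("<document>example content</document>");
// 64 hex characters, stable for identical content
```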
3. Annotation (STAM)
Raw documents are not useful for search or question answering. The annotation engine transforms each document into a structured, searchable form using STAM (Standoff Text Annotation Model) — an open standard for layered text annotation.
What STAM produces
For each document, the annotation engine produces a sidecar JSON file containing standoff annotations. The original text is never modified — annotations reference character offsets in the source document.
- Paragraph segmentation: The document is split into individual paragraphs. Each paragraph gets a unique ID, character offsets, and structural metadata (article number, section, recital).
- Topic classification: Each paragraph is classified across 21 topic domains (agriculture, energy, finance, law, etc.) using a fine-tuned ONNX classifier. Classification is multi-label: a paragraph about renewable energy subsidies scores for both “Energy” and “Finance”.
- Deontic modality: Each paragraph is classified as an obligation (“Member States shall…”), prohibition (“shall not…”), permission (“may…”), or exemption. This is the legal force of the text.
- Entity extraction: EU-specific entities are recognised: institution names, legal references (CELEX, ECLI), CPV procurement codes, ECHA substance identifiers, dates, and monetary amounts.
- Cross-references: Links between documents are extracted: “as amended by Regulation (EU) 2024/1689” is resolved to a CELEX identifier and linked bidirectionally.
- Terminology matching: Paragraphs are matched against 2.4 million IATE terms. When “acquis communautaire” appears, the IATE entry with translations in all 24 EU languages is attached.
Annotation output format
The annotation sidecar is a JSON file stored alongside the source document:
{
"source": "eurlex/32024R1689-en.xml",
"checksum": "sha256:a7f3c...",
"paragraphs": [
{
"id": "art-1-para-1",
"offsets": [1204, 1847],
"text": "This Regulation lays down...",
"topics": ["law", "science"],
"deontic": "obligation",
"entities": [
{ "type": "celex", "value": "32024R1689", "offsets": [12, 24] }
],
"iate_terms": [
{ "id": "IATE-3567894", "term": "artificial intelligence system" }
]
}
]
}
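Because the annotations are standoff, any consumer can recover the annotated text by slicing the unmodified source at the stored offsets. A minimal sketch, with illustrative types mirroring the sidecar fields above:

```typescript
interface StandoffAnnotation {
  id: string;
  offsets: [number, number]; // [start, end) character offsets into the source text
  deontic?: string;
}

// Recover annotated text by slicing the unmodified source at the stored offsets.
// The source document is never rewritten, only referenced.
function resolve(sourceText: string, ann: StandoffAnnotation): string {
  const [start, end] = ann.offsets;
  return sourceText.slice(start, end);
}

const source = "Article 1. This Regulation lays down harmonised rules.";
const ann: StandoffAnnotation = { id: "art-1-para-1", offsets: [11, 55], deontic: "obligation" };
// resolve(source, ann) === "This Regulation lays down harmonised rules."
```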
4. Paragraph Indexing
After annotation, each paragraph is indexed in two systems that work together:
| Index Type | Purpose | Technology |
|---|---|---|
| Structured index (D1) | Exact-match queries: CELEX lookup, date ranges, topic filters, deontic modality filters | SQLite-compatible relational database |
| Semantic index (Vectorize) | Meaning-based queries: “rules about AI transparency in healthcare” | BGE-M3 embeddings, 1024 dimensions, cosine similarity |
Paragraph-level granularity
Most legal search engines index entire documents. Pauhu indexes individual paragraphs. This matters because:
- A single EU regulation can be 200+ pages. Returning the entire document is not an answer.
- The generation engine needs the specific paragraph that contains the evidence, not the whole regulation.
- Paragraph-level indexing enables precise citations: “Article 6(1)(a) of Regulation (EU) 2024/1689” instead of “the AI Act”.
What gets indexed per paragraph
Structured index (D1):
celex_id, language, paragraph_id, article_number,
topics[], deontic_modality, publication_date,
entities[], cross_references[], word_count
Semantic index (Vectorize):
paragraph_text → BGE-M3 embedding (1024 floats)
metadata: celex_id, language, paragraph_id
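A sketch of how one annotated paragraph fans out into the two index records. The interfaces and field names below are illustrative, mirroring the lists above; they are not the internal schemas:

```typescript
// Illustrative shapes only; the production D1/Vectorize schemas are internal.
interface AnnotatedParagraph {
  celexId: string;
  language: string;
  paragraphId: string;
  text: string;
  topics: string[];
  deonticModality: string;
}

// One relational row for exact-match queries (CELEX lookup, filters).
function toStructuredRow(p: AnnotatedParagraph) {
  return {
    celex_id: p.celexId,
    language: p.language,
    paragraph_id: p.paragraphId,
    topics: JSON.stringify(p.topics),      // array columns stored as JSON text
    deontic_modality: p.deonticModality,
    word_count: p.text.split(/\s+/).filter(Boolean).length,
  };
}

// One vector record for meaning-based queries, keyed so a hit can be
// traced back to the exact paragraph in the structured index.
function toVectorRecord(p: AnnotatedParagraph, embedding: number[]) {
  return {
    id: `${p.celexId}:${p.language}:${p.paragraphId}`,
    values: embedding,                     // 1024 floats from BGE-M3
    metadata: { celex_id: p.celexId, language: p.language, paragraph_id: p.paragraphId },
  };
}
```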
5. Semantic Search
When a user queries Pauhu, the Laine search engine executes a hybrid search across both indexes in under 26 milliseconds (p95). This is the left hemisphere — analytical comprehension.
Search flow
- Query encoding: The user’s question is encoded into a 1024-dimensional vector using the same BGE-M3 model used at indexing time
- Semantic retrieval: The vector index returns the top-N most similar paragraphs by cosine similarity
- Structured filtering: Results are filtered by language, date range, topic, product, and any active user filters
- Rank fusion: Semantic scores and structured relevance signals are combined via reciprocal rank fusion
- Passage return: The top 3–10 paragraphs, with full metadata and source attribution, are returned
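The rank-fusion step above can be sketched directly. Each ranked list contributes 1 / (k + rank) per item, with k = 60 as the conventional smoothing constant (the production scorer and its constant may differ):

```typescript
// Reciprocal rank fusion: sum 1 / (k + rank) across all input rankings.
function reciprocalRankFusion(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Fuse a semantic ranking with a structured-relevance ranking, then sort.
const fused = [...reciprocalRankFusion([
  ["p1", "p2", "p3"], // semantic similarity order
  ["p1", "p4", "p2"], // structured relevance order
]).entries()].sort((a, b) => b[1] - a[1]);
// "p1" ranks first: it tops both lists
```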
The search engine is the “left hemisphere” of the Sovereign Brain architecture. It comprehends the query and finds evidence. It does not generate text.
6. Grounded Generation
The generation engine is the “right hemisphere” — it reads the retrieved paragraphs and produces a fluent answer with citations. It uses the Fusion-in-Decoder (FiD) architecture.
How FiD works
- Input: The user’s question + 3–10 retrieved paragraphs (each with its CELEX ID and article reference)
- Encoding: Each paragraph is encoded independently by the encoder
- Fusion: The decoder attends to all encoded paragraphs simultaneously — it can cross-reference information across multiple documents
- Output: A natural-language answer with inline citations: “According to Article 6(1) of Regulation (EU) 2024/1689, high-risk AI systems must…”
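The per-passage encoding can be illustrated with the input template from the original FiD paper, where the question is repeated in front of each passage. The exact template Pauhu's model uses is an assumption here:

```typescript
interface RetrievedParagraph {
  celexId: string;
  article: string;
  text: string;
}

// One encoder input per retrieved paragraph: the question is repeated in front
// of each passage so the encoder processes (question, passage) pairs
// independently. The decoder then attends over all encoded passages at once
// (the "fusion" step).
function fidEncoderInputs(question: string, paragraphs: RetrievedParagraph[]): string[] {
  return paragraphs.map(
    (p) => `question: ${question} title: ${p.celexId} ${p.article} context: ${p.text}`
  );
}
```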
Model-agnostic design
The FiD pattern (retrieve → ground → generate → cite) is the product. The underlying model is swappable. The default model runs entirely in the browser via ONNX Runtime — no external API calls, no data leaving your network. You can also connect your own LLM via the model adapter container.
Grounding guarantee
- The generation engine only sees text from retrieved paragraphs. It cannot access the internet, training data, or any other source.
- Every claim in the generated answer must cite a specific paragraph. Uncited claims are flagged by the confidence scorer.
- If the search engine returns no relevant paragraphs, the system responds with “No relevant information found in the current dataset” rather than generating a speculative answer.
7. Multilingual Flow
EU legislation exists in 24 official languages. Here is how annotations flow across languages:
English-first annotation, cross-language transfer
- English is annotated first. The annotation engine processes the English version of each document. This produces the highest-quality annotations because the NLP models perform best on English text.
- Structural alignment. EU documents have identical paragraph structure across all 24 language versions (same article numbers, same recitals). The annotation engine aligns paragraphs across languages using document structure, not machine translation.
- Annotation projection. Structural annotations (topic, deontic modality, cross-references) are projected from the English version to all parallel versions. A paragraph classified as “obligation” in English is “obligation” in Finnish, French, and all other languages — because it is the same legal provision.
- Language-specific NER. Named entity recognition runs independently per language, because entity surface forms differ (e.g., “Court of Justice” vs. “Cour de justice” vs. “Tuomioistuin”).
- Multilingual embeddings. The BGE-M3 model produces embeddings in a shared vector space across all 24 languages. A Finnish query retrieves relevant paragraphs regardless of whether the source paragraph is in Finnish, English, or any other EU language.
Translation in the pipeline
Translation is not part of the core data pipeline. The annotation and indexing pipeline processes each language version as it arrives from the EU source. Machine translation (Helsinki-NLP OPUS-MT, ONNX format) is available as a separate container for on-demand translation of search results and answers.
8. Data Sovereignty
In a sovereign deployment (on-premise container), the entire pipeline runs on your infrastructure. Here is exactly what stays where:
| Component | Location | Network Access |
|---|---|---|
| Source documents | Your container storage volume | Outbound only: EU source APIs for sync |
| STAM annotation sidecars | Your container storage volume | None (processed locally) |
| Structured index (D1) | SQLite file on your volume | None |
| Semantic index (Vectorize) | Vector database on your volume | None |
| ONNX models (NLP, search, generation) | Pre-loaded in container image | None |
| User queries | Your container, your memory | None (never transmitted) |
| Generated answers | Your container, your memory | None (never transmitted) |
| Audit log | Your container, SHA-256 chained | None |
Air-gapped mode
For classified environments, the container can run fully air-gapped. Disable the sync process and load data via offline transfer (USB, secure file share). The container includes all models, all indexes, and all annotation logic. No internet connection required for search or answer generation.
What the container phones home
Nothing. The sovereign container has no telemetry, no usage reporting, no licence phone-home. The sync process makes outbound HTTPS requests to EU institutional APIs (EUR-Lex, TED, etc.) to fetch new documents. If you disable sync, the container makes zero outbound connections.
9. Freshness and Sync
The sovereign container includes 23 automated sync processes that keep your data current. Each process polls its EU source API on a schedule:
| Source | Sync Frequency | Typical Latency |
|---|---|---|
| EUR-Lex | Every 4 hours (weekdays) | < 1 hour after Official Journal publication |
| National law (28 databases) | Every 15 minutes | Same day |
| TED procurement | Every 6 hours | < 6 hours after notice publication |
| IATE terminology | Daily | < 24 hours |
| All other sources | Daily or weekly | < 24 hours |
After sync, new documents automatically flow through annotation and indexing. The entire pipeline — fetch, annotate, index — typically completes within minutes for incremental updates.
Configurable freshness
You control sync frequency via the admin panel at /pauhu on the gateway container (port 8090). Options:
- Real-time: Poll every 15 minutes (highest network usage)
- Standard: Poll per source schedule (default, recommended)
- Manual: Disable automatic sync, trigger updates via API or admin panel
- Offline: No sync, data loaded via offline transfer
10. Annotation Inheritance
Not every language version of a document needs to be annotated from scratch. Pauhu uses an English-first Rosetta pattern: English is annotated with the highest-quality NLP models, and structural annotations are inherited by all 24 parallel language versions.
Why English first?
- NLP models (topic classification, deontic modality, NER) perform best on English text — the training data is richest.
- EU documents have identical structure across all 24 languages: same article numbers, same recitals, same paragraph boundaries. The legal content is the same — only the language differs.
- Annotating English first and projecting structural annotations to other languages produces better results than running weaker models on each language independently.
What is inherited
| Annotation Layer | Inherited? | Rationale |
|---|---|---|
| Topic classification (21 domains) | Yes | Legal topic does not change across translations |
| Deontic modality | Yes | “shall” in English = “doit” in French = same legal force |
| Cross-references (CELEX links) | Yes | CELEX identifiers are language-independent |
| Paragraph structure (offsets, article numbers) | Yes | Identical document structure across all languages |
| Named entity recognition | No | Entity surface forms differ per language |
| Terminology matching (IATE) | No | IATE entries are language-specific |
How inheritance works
When a non-English version of a document arrives, the annotation engine checks whether the English version has already been annotated. If yes, it copies inheritable annotations (topic, deontic, cross-references) and only runs language-specific models (NER, IATE matching) on the new text. The SQL logic uses COALESCE to prefer the language-specific annotation when available, falling back to the English annotation otherwise:
SELECT
p.paragraph_id,
COALESCE(NULLIF(local.topic, ''), en.topic) AS topic,
COALESCE(NULLIF(local.deontic, ''), en.deontic) AS deontic,
local.entities -- always language-specific
FROM paragraphs p
LEFT JOIN annotations local ON p.id = local.paragraph_id AND local.lang = :lang
LEFT JOIN annotations en ON p.id = en.paragraph_id AND en.lang = 'en'
Current status: multilingual rollout
The initial index was populated with English annotations only. Non-English paragraphs are being added through a three-phase rollout:
- Multilingual indexing — the indexing pipeline is being updated to process all 24 language versions, not just English. Annotations from the English version are inherited by parallel language versions at indexing time.
- Backfill — existing English-only documents are being re-processed to add annotations for all available language versions. This is a one-time operation covering the full 4.7M+ document corpus.
- Verification — cross-language annotation consistency is validated: a paragraph classified as “obligation” in English must carry the same classification in all 24 language versions.
11. Vectorize Embedding
After annotation, each paragraph is embedded into a 1024-dimensional vector space for semantic search. The embedding step converts human-readable text into numerical representations that capture meaning.
Embedding model: BGE-M3
Pauhu uses BGE-M3 (BAAI General Embedding — Multi-lingual, Multi-granularity, Multi-function) for all paragraph embeddings:
- 1024 dimensions per vector — balances expressiveness and storage efficiency
- Cosine similarity for ranking — measures angle between vectors, robust to document length variation
- 114 languages supported — covers all 24 EU official languages plus 90 additional languages
- Shared vector space — a Finnish question retrieves semantically similar paragraphs in any language
Embedding pipeline
The full path from source document to searchable vector:
- Storage: Source documents are stored in per-product object storage with full metadata
- Annotation: The annotation engine produces STAM sidecar JSON (paragraphs, topics, entities, cross-references)
- Structured index: Paragraph metadata is written to the relational database (CELEX, language, topics, deontic modality)
- Embedding: The embedding service encodes each paragraph using BGE-M3. The raw model output (Float32Array) is normalised via Array.from() to ensure correct serialisation before storage
- Vector index: The 1024-float vector is stored in the vector database alongside the paragraph’s metadata (CELEX ID, language, paragraph ID) using cosine similarity
- Query-time: The same BGE-M3 model encodes the user’s query, ensuring query and document vectors are in the same space
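The normalisation and similarity steps can be sketched as follows. The cosineSimilarity function here is a reference implementation, not the vector database's internal one:

```typescript
// Float32Array serialises as an object ({"0":…,"1":…}) rather than an array
// when passed through JSON-based queues, so embeddings are converted to a
// plain number[] first: the Array.from() normalisation described above.
function normaliseEmbedding(raw: Float32Array): number[] {
  return Array.from(raw);
}

// Cosine similarity: angle between vectors, invariant to paragraph length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const vec = normaliseEmbedding(new Float32Array([1, 0.5, 0.25]));
// JSON.stringify(vec) === "[1,0.5,0.25]" — all values survive serialisation
```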
Why BGE-M3?
A query in Bulgarian retrieves relevant paragraphs originally written in Danish. No translation step needed — the shared vector space handles it natively.
BGE-M3 captures semantic nuance in regulatory text: “mandatory reporting obligation” and “required notification duty” map to nearby vectors, while “voluntary disclosure” maps far away.
12. Adaptive Model Loading
The sovereign container adapts its model loading strategy based on the available device memory. This ensures the system runs efficiently on everything from a developer laptop to a dedicated GPU server.
Three loading tiers
| Tier | Device Memory | Models Loaded | Use Case |
|---|---|---|---|
| Lite | < 4 GB | Search + embeddings only (BGE-M3, ONNX quantized) | Browser-native search, no generation |
| Standard | 4–16 GB | Search + FiD generation (mT5-small ONNX) + NMT (OPUS-MT selected pairs) | Full search + answer generation, selected translation pairs |
| Full | > 16 GB | All models: search, FiD, NMT (552 pairs), topic classifiers, NER, specialist models | Production sovereign deployment, all features enabled |
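The tier decision reduces to a threshold check on available memory. A sketch; in a browser, navigator.deviceMemory can serve as the input, though most browsers cap that value at 8:

```typescript
type Tier = "lite" | "standard" | "full";

// Map available device memory (GB) to a loading tier per the table above:
// < 4 GB → Lite, 4–16 GB → Standard, > 16 GB → Full.
function selectTier(deviceMemoryGb: number): Tier {
  if (deviceMemoryGb < 4) return "lite";
  if (deviceMemoryGb <= 16) return "standard";
  return "full";
}
// selectTier(2) === "lite"; selectTier(8) === "standard"; selectTier(32) === "full"
```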
Progressive download
Models are downloaded progressively, not all at once. The system starts with the search models (needed immediately for queries) and downloads generation and translation models in the background. This means:
- Search is available within seconds of container start
- Answer generation becomes available once the FiD model loads (typically 10–30 seconds)
- Translation pairs load on demand — only the language pairs you actually use are downloaded
Why 300 MB matters
The global semiconductor supply chain is under sustained pressure. Memory prices are volatile, procurement cycles are lengthening, and government IT budgets rarely accommodate high-end GPU servers. Pauhu’s FiD generation model fits in 300 MB of DRAM — less memory than a typical browser tab consumes. This is a deliberate design decision: a model that runs on commodity hardware is a model that every organisation can deploy without special procurement.
Browser-native advantage
In the Lite and Standard tiers, all inference runs inside the browser via ONNX Runtime for WebAssembly. No server, no GPU, no dedicated infrastructure — the user’s own device does the work. For government IT departments, this means:
- No additional servers to provision, secure, or maintain
- No inference costs — the compute is already on every desk
- Data never leaves the device. The query, the model, and the answer all stay in the browser process
- Works on any modern browser (Chrome, Firefox, Edge, Safari) without plugins or extensions
13. Works Alongside Your Tools
Pauhu does not replace your existing software. It sits alongside it — a browser sidebar that adds EU regulatory intelligence to whatever you are already working on.
Browser sidebar overlay
The Pauhu sidebar runs as a browser extension or a standalone tab. When you are reading a PDF in your document management system, drafting a contract in your word processor, or reviewing a tender in your procurement platform, the sidebar provides:
- Contextual search: Highlight text in any application, and the sidebar searches 4.7 million EU documents for relevant regulations, case law, and terminology
- Grounded answers: Ask a question about what you are reading, and the FiD engine generates a cited answer from official EU sources
- Terminology lookup: Select a term and see its official definition in all 24 EU languages from the IATE terminology database
- Translation: Select a passage and translate it using the Helsinki-NLP OPUS-MT models — running locally, not through a cloud service
No vendor lock-in
Pauhu does not require you to migrate your documents, change your workflow, or adopt a new platform. It works with your existing tools via a standard browser. If you stop using Pauhu, nothing changes in your existing systems — you simply close the sidebar.
No per-seat tax
In the sovereign deployment, the container serves everyone on your network. There is no per-user licensing, no seat counting, and no usage metering. One deployment, unlimited internal users. The subscription covers the container and data updates — not the number of people who use it.
14. 3-Level Topic Hierarchy
Every document in the pipeline is automatically classified into a 3-level topic hierarchy derived from the EU’s official EuroVoc thesaurus (SKOS metadata). No manual tagging is needed — topic annotations are extracted from the source metadata that EU institutions already publish with each document.
Level 1: Domain (21)
├── 04 Politics ── broad subject area
├── 12 Law ── broad subject area
└── 20 Trade ── broad subject area
...
Level 2: Micro-Thesaurus (~127)
├── 12 Law
│ ├── MT 1216 Criminal law ── topical group
│ ├── MT 1221 Criminal procedure
│ └── MT 1231 Civil law
...
Level 3: Descriptor (~6,800)
├── MT 1216 Criminal law
│ ├── acquittal ── specific concept
│ ├── criminal liability
│ ├── extradition
│ └── statute of limitations
...
How it works
- Source metadata: EUR-Lex, TED, CORDIS, and other EU sources publish EuroVoc descriptors in their document metadata (SKOS RDF). The annotation worker reads these descriptors during ingestion.
- Hierarchy resolution: Each descriptor maps to a micro-thesaurus, and each micro-thesaurus maps to a domain. The pipeline stores all 3 levels per document.
- Search filtering: Users can filter search results by domain, micro-thesaurus, or descriptor. This narrows millions of documents to the precise legal topic.
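Hierarchy resolution is a pair of upward lookups. A sketch with a hand-written slice of the thesaurus; the real pipeline reads these mappings from the published SKOS metadata:

```typescript
// Illustrative slice of the EuroVoc hierarchy shown above; the full thesaurus
// has 21 domains, ~127 micro-thesauri, and ~6,800 descriptors.
const descriptorToMt: Record<string, string> = {
  extradition: "1216",
  acquittal: "1216",
};
const mtToDomain: Record<string, string> = { "1216": "12" };
const domainLabel: Record<string, string> = { "12": "Law" };

// Resolve all three levels for one descriptor, as stored per document.
function resolveHierarchy(descriptor: string) {
  const mt = descriptorToMt[descriptor];
  if (mt === undefined) return undefined;
  const domain = mtToDomain[mt];
  return { descriptor, microThesaurus: mt, domain, label: domainLabel[domain] };
}
// resolveHierarchy("extradition") →
//   { descriptor: "extradition", microThesaurus: "1216", domain: "12", label: "Law" }
```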
15. Topics API
The Topics API exposes the 3-level hierarchy for programmatic access. Use it to build topic filters, faceted search interfaces, or domain-specific dashboards.
GET /v1/topics
Returns the list of all 21 top-level domains.
curl https://staging.pauhu.eu/v1/topics
// Response
{
"domains": [
{ "id": "04", "label": "Politics", "mt_count": 7 },
{ "id": "12", "label": "Law", "mt_count": 8 },
{ "id": "20", "label": "Trade", "mt_count": 5 },
...
],
"total": 21
}
GET /v1/topics/:domain
Returns all micro-thesauri within a domain.
curl https://staging.pauhu.eu/v1/topics/12
// Response
{
"domain": { "id": "12", "label": "Law" },
"micro_thesauri": [
{ "id": "1216", "label": "Criminal law", "descriptor_count": 48 },
{ "id": "1221", "label": "Criminal procedure", "descriptor_count": 35 },
{ "id": "1231", "label": "Civil law", "descriptor_count": 62 },
...
],
"total": 8
}
GET /v1/topics/:domain/:mt
Returns all descriptors within a micro-thesaurus.
curl https://staging.pauhu.eu/v1/topics/12/1221
// Response
{
"domain": { "id": "12", "label": "Law" },
"micro_thesaurus": { "id": "1221", "label": "Criminal procedure" },
"descriptors": [
{ "id": "1109", "label": "acquittal" },
{ "id": "839", "label": "criminal investigation" },
{ "id": "5765", "label": "European arrest warrant" },
...
],
"total": 35
}
Filtering search results by topic
Add the eurovoc_mt parameter to any search query to narrow results to a specific micro-thesaurus:
// Search only within "Criminal procedure" (MT 1221)
curl "https://staging.pauhu.eu/v1/search?q=extradition&eurovoc_mt=1221"
// This returns only documents tagged with MT 1221 descriptors,
// filtering out results from other legal areas like civil law or
// administrative law.
Similarly, with eurovoc_mt=1216 you get only criminal law results.
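The same filter can be composed programmatically. A sketch using the standard URLSearchParams API, against the endpoint shown in the examples above:

```typescript
// Build a topic-filtered search URL like the curl example above.
function searchUrl(base: string, query: string, eurovocMt?: string): string {
  const params = new URLSearchParams({ q: query });
  if (eurovocMt !== undefined) params.set("eurovoc_mt", eurovocMt);
  return `${base}/v1/search?${params.toString()}`;
}

const url = searchUrl("https://staging.pauhu.eu", "extradition", "1221");
// url === "https://staging.pauhu.eu/v1/search?q=extradition&eurovoc_mt=1221"
```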
16. Current Status
The pipeline is live and processing data continuously. This section provides a snapshot of current indexing progress.
| Metric | Value | Notes |
|---|---|---|
| Products with vectors | 14 / 24 | Remaining 10 products queued for indexing |
| Total vectors | 4,775 | Growing as indexers process annotation queue |
| Annotation consistency | Target 80% | Measured against expert-annotated evaluation set |
| VEC_EMPTY status | CLOSED | Float32Array serialization fix deployed (Array.from() normalization) |
| R2 objects | ~4.8M | Across all 24 product buckets |
| Sync frequency | 15 min – weekly | Varies by product (EUR-Lex: 4h, national law: 15min, ECHA: weekly) |
The VEC_EMPTY issue was caused by Float32Array objects that do not serialize correctly through the indexing queue. The fix normalizes embeddings via Array.from() before storage, ensuring all 1024 dimensions are preserved. This fix is deployed and verified across all 14 indexed products.
Indexing progress by product
Products are indexed across 3 workers based on binding limits. Each worker runs every 5 minutes, processing annotated documents from the queue and inserting vectors into the search index.
| Worker | Products | Status |
|---|---|---|
| Indexer A | commission, consilium, cordis, curia, dataeuropa, dpp, ecb, echa, ema, epo | Live |
| Indexer B | europarl, eurlex, eurostat, iate, lex, oeil, publications, ted, whoiswho, wiki | Live |
| Indexer C | code, osm, weather, news | Live |
17. FAQ
How large is the full dataset?
Approximately 4.7 million documents across 20 EU products and 28 national law databases (EUR-Lex: 1.67M, TED: 1.6M, national law: 256K, OEIL: 204K, and smaller counts across remaining sources). On disk, the annotated dataset with indexes requires approximately 50 GB of storage.
Can I select only specific data sources?
Yes. The admin panel lets you enable or disable individual sources. If you only need EUR-Lex and TED, disable the other 18 sources. Sync, annotation, and indexing will only process your selected sources.
What happens when a document is amended?
The sync process detects the update, re-fetches the document, re-annotates it, and updates both indexes. The old version is preserved in the audit log with its original SHA-256 checksum. Cross-references to the amended document are updated automatically.
Can I add my own documents to the pipeline?
Yes. The container accepts custom documents via API upload. Your documents go through the same annotation and indexing pipeline. They appear alongside EU source data in search results, with clear provenance marking (“Customer document” vs “EUR-Lex”).
How do I verify the pipeline is working?
The admin panel at /pauhu shows pipeline health: last sync time per source, annotation queue depth, index size, and embedding count. The /health endpoint returns machine-readable status for integration with your monitoring tools.
What is the annotation accuracy?
Topic classification: 94% F1 on the annotated evaluation set. Deontic modality: 91% F1. Named entity recognition: 89% F1. These are measured against expert-annotated EU legal documents. All annotations include a confidence score — low-confidence annotations are flagged for human review.