Grounded Generation Architecture
Two specialized engines - retrieval and generation - fused at the decoder layer for grounded, citation-backed EU intelligence.
Overview
Pauhu® uses a retrieval-augmented generation architecture adapted for EU regulatory intelligence. Instead of a single monolithic model, the system splits into two specialized engines that fuse at inference time:
- Search Engine - Queries fan out across 20 semantic indexes simultaneously; results are ranked by relevance scoring. This is the encoder half.
- Answer Engine - Specialist models running in the browser or on-premises. Domain-specific encoders plus chat models. This is the decoder half.
The key insight: the answer engine attends to all retrieved passages simultaneously, not sequentially. Each passage is independently encoded by the search engine, then all encoded representations are concatenated and fed to the decoder in a single forward pass. This is what distinguishes Pauhu from simple Retrieval-Augmented Generation (RAG).
Grounded Generation Architecture
=================================
Query: "NIS2 implementation deadlines for essential entities"
|
v
+---------------------------------------------------------------+
| SEARCH ENGINE |
| (Semantic Search) |
| |
| Query --+--> [eurlex index ] --+ |
| +--> [curia index ] --+ |
| +--> [ted index ] --+ |
| +--> [commission ] --+-- Relevance ranking ----+ |
| +--> [echa index ] --+ (semantic + keyword | |
| +--> [ema index ] --+ hybrid) | |
| +--> [epo index ] --+ | |
| +--> [ecb index ] --+ | |
| +--> [... 12 more ] --+ | |
| | |
| 20 multilingual indexes (similarity scoring) | |
| v |
| Top-K passages|
+------------------------------------------------------------+---+
|
+--------------------------------+
|
v
+---------------------------------------------------------------+
| ANSWER ENGINE |
| (Domain Specialists) |
| |
| Top-K passages ---> [Encode each passage independently] |
| | |
| v |
| [Concatenate all encodings] |
| | |
| v |
| +---------------------------+ |
| | Domain Specialist | |
| | (e.g., Law, Finance) | |
| | Cross-lingual, optimized | |
| +---------------------------+ |
| | |
| v |
| +---------------------------+ |
| | Chat Model (decoder) | |
| | Compact model (free) | |
| | Language model (pro) | |
| +---------------------------+ |
| | |
| v |
| Answer with inline citations |
| [source: CELEX 32022L2555, Art. 21(1)] |
+---------------------------------------------------------------+
Search Engine
The search engine is responsible for finding relevant passages across all 20 EU data sources. It runs on EU infrastructure.
Fan-out search
Every query is dispatched to all 20 vector indexes in parallel. Each index contains multilingual embeddings for one data product. The fan-out ensures that a query like “carbon border adjustment” finds results across EUR-Lex legislation, CURIA case law, TED procurement notices, and Eurostat data simultaneously.
| Parameter | Value |
|---|---|
| Embedding model | Multilingual embeddings |
| Similarity metric | Semantic similarity scoring |
| Index count | 20 (one per data product) |
| Languages | 24 EU official languages |
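The fan-out step can be sketched as follows, assuming a per-index `search(name, query_vector, k)` client; the function names and the index list here are illustrative, not the real Pauhu API:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative subset of the 20 indexes (one per data product).
INDEXES = ["eurlex", "curia", "ted", "commission", "echa", "ema", "epo", "ecb"]

def fan_out(search, query_vector, k=10):
    """Dispatch one query vector to every index in parallel.

    search: hypothetical client, search(index_name, query_vector, k)
            -> list of (doc_id, score) pairs.
    Returns a dict mapping index name to its top-k candidates.
    """
    with ThreadPoolExecutor(max_workers=len(INDEXES)) as pool:
        futures = {name: pool.submit(search, name, query_vector, k)
                   for name in INDEXES}
        return {name: f.result() for name, f in futures.items()}
```

Because each index is queried independently, latency is bounded by the slowest single index rather than the sum of all twenty.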
Semantic ranking
Raw vector similarity alone misses keyword-critical matches (e.g., CELEX numbers, article references). The relevance ranking combines two signals:
- Semantic similarity - vector similarity from multilingual embeddings captures meaning across languages
- Keyword matching - BM25-style term frequency / inverse document frequency scoring catches exact identifiers, legal references, and technical codes
Results from all 20 indexes are merged, deduplicated by document ID, and sorted by hybrid score. The top-K passages (default K=10) are forwarded to the answer engine.
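The merge, dedupe, and rank step can be sketched as below; the 0.7 / 0.3 weights mirror the `semantic_weight` and `keyword_weight` values exposed in the API's ranking metadata, while the candidate field names are assumptions:

```python
def rank_passages(candidates, semantic_weight=0.7, keyword_weight=0.3, top_k=10):
    """Merge per-index candidates, dedupe by document ID, rank by hybrid score.

    candidates: iterable of dicts with doc_id, semantic_score, keyword_score.
    """
    best = {}
    for c in candidates:
        score = (semantic_weight * c["semantic_score"]
                 + keyword_weight * c["keyword_score"])
        # Keep only the highest-scoring occurrence of each document.
        if c["doc_id"] not in best or score > best[c["doc_id"]][0]:
            best[c["doc_id"]] = (score, c)
    ranked = sorted(best.values(), key=lambda t: t[0], reverse=True)
    return [dict(c, combined_score=round(s, 4)) for s, c in ranked[:top_k]]
```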
Layer 2: Domain embeddings
Domain specialist models enrich retrieval quality via Layer 2 embeddings. When a query is classified as belonging to a specific topic (e.g., Domain 12: Law), the corresponding specialist generates domain-tuned embeddings that are blended with the base relevance score. This narrows the semantic gap for domain-specific vocabulary - for instance, “consideration” means something very different in contract law vs. general usage.
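A sketch of the Layer 2 blend, assuming the specialist exposes an encoder and that blending is a weighted average of the base score and a domain-specific similarity (the weight `alpha` and all names here are illustrative):

```python
import math

def layer2_rescore(passages, domain_encode, query_vec, alpha=0.3):
    """Blend the base hybrid score with a domain-specialist similarity.

    domain_encode: hypothetical specialist encoder, text -> vector.
    alpha: assumed blending weight for the domain signal.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    for p in passages:
        domain_sim = cos(domain_encode(p["text"]), query_vec)
        p["combined_score"] = (1 - alpha) * p["combined_score"] + alpha * domain_sim
    return sorted(passages, key=lambda p: p["combined_score"], reverse=True)
```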
Ranking transparency (DSA Article 27)
Every search result includes provenance metadata to comply with the Digital Services Act ranking transparency requirements:
- source - which data product the passage came from (e.g., eurlex, curia)
- semantic_score - the semantic similarity component
- keyword_score - the BM25 component
- combined_score - final relevance score
- provenance_tier - NATIVE (original text, 1.0), PARSED (extracted, 0.95), or KEYWORD (≤0.9)
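The tier values in parentheses suggest per-tier weights; a minimal sketch under the assumption that the tier discounts the combined score (the actual weighting scheme is not specified here, and the fixed 0.9 for KEYWORD is one reading of "≤0.9"):

```python
# Assumed tier-to-weight mapping derived from the values listed above.
PROVENANCE_TIERS = {"NATIVE": 1.0, "PARSED": 0.95, "KEYWORD": 0.9}

def apply_provenance(result):
    """Attach the tier weight, assuming it discounts the combined score."""
    weight = PROVENANCE_TIERS[result["provenance_tier"]]
    return {**result, "combined_score": round(result["combined_score"] * weight, 4)}
```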
Answer Engine
The answer engine runs entirely in the user’s browser (via an optimized model runtime) or on-premises in a containerised deployment. No document content leaves the user’s device during generation.
Domain specialists
21 fine-tuned specialist models cover all EuroVoc topic domains. Each specialist shares a cross-lingual model backbone but is fine-tuned on domain-specific EU corpora:
| Property | Value |
|---|---|
| Backbone | Cross-lingual multilingual model |
| Quantisation | Optimized (reduced precision) |
| Specialist count | 21 (one per topic domain) |
| Runtime | Optimized model runtime (browser or server) |
In the pipeline, the specialist serves two purposes:
- Passage re-encoding: Each retrieved passage is re-encoded through the domain specialist, producing richer representations than the base embedding model alone.
- Layer 2 scoring: The domain-specific encoding feeds back into passage ranking, allowing the system to promote passages that are more relevant within the identified domain.
Chat models (decoder)
After specialist encoding, a generative chat model synthesises the answer from all passage encodings:
| Model | Tier | Use case |
|---|---|---|
| Compact model | Free | Short answers, summaries, term definitions |
| Language model | Pro | Multi-paragraph analysis, cross-reference synthesis |
Both models run in the browser. The pro-tier model loads on demand - only downloaded when the user first triggers a pro-level query.
Browser-native execution
The answer engine uses an optimized model runtime with GPU acceleration and native execution backends:
- GPU acceleration - preferred backend on supported browsers (Chrome 113+, Edge 113+). Runs inference on the GPU for faster generation.
- Native execution - fallback for browsers without GPU acceleration. Runs on CPU threads via SharedArrayBuffer.
- No server round-trip - after the initial model download, all generation happens locally. Document passages stay on the user’s device.
Passage Fusion
The fusion step is what distinguishes this architecture from simple RAG. Here is how passages flow through the system:
Step-by-step
- Query encoding: The user’s query is encoded into a multilingual embedding vector.
- Fan-out retrieval: The query vector is dispatched to all 20 vector indexes in parallel. Each index returns its top matches.
- Relevance ranking: Results from all indexes are merged and ranked by hybrid semantic + keyword score. Top-K passages are selected.
- Independent encoding: Each of the K passages is independently encoded by the domain specialist. This produces K separate hidden-state tensors.
- Concatenation: All K encoded representations are concatenated along the sequence dimension into a single extended context.
- Decoder attention: The chat model attends to the entire concatenated context in one forward pass. Cross-attention layers see all passages simultaneously.
- Generation: The decoder generates an answer token by token, with attention weights distributed across all K passages. Citations are produced inline by tracking which passage each attention head focuses on.
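The steps above can be sketched with toy tensors - plain Python lists standing in for hidden-state matrices, with `encode` and `decode` as placeholders for the domain specialist and chat model:

```python
def fuse_and_decode(passages, encode, decode):
    """Encode each passage independently, concatenate along the sequence
    dimension, and hand the decoder one extended context in a single pass."""
    encodings = [encode(p) for p in passages]          # K tensors, shape (seq_i, hidden)
    fused = [row for enc in encodings for row in enc]  # shape (sum(seq_i), hidden)
    return decode(fused)                               # decoder sees all K at once
```

The point of the sketch is the data flow: each passage is encoded in isolation, so adding a passage adds rows to the fused context rather than competing for a fixed prompt budget.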
Standard RAG Pauhu Grounded Generation
============ =========================
Passage 1 --+ Passage 1 --> [Encode] --+
Passage 2 --+-- Concatenate Passage 2 --> [Encode] --+-- Concatenate
Passage 3 --+-- as text Passage 3 --> [Encode] --+-- as tensors
| |
v v
[Single prompt with [Decoder cross-attends
all passages as to ALL encoded passages
plain text context] simultaneously]
| |
v v
LLM generates answer Decoder generates answer
(context window limit (scales with K, not
constrains passage count) context window length)
Problem: passages compete Advantage: each passage
for context window space. is encoded independently.
Adding more passages Adding more passages
dilutes each one. does not dilute quality.
Why grounded generation matters for EU data
EU regulatory questions often require synthesising information from multiple legal instruments. A question about NIS2 implementation might need passages from the directive itself (EUR-Lex), national transposition measures (National Law), relevant CURIA case law, and ECHA guidance documents. Standard RAG would concatenate these as text, competing for a fixed context window. Pauhu encodes each independently, so the decoder can attend to all of them with equal fidelity.
Cloud vs Sovereign
Pauhu supports two deployment modes. The architecture is identical - only the infrastructure layer changes.
| Aspect | Cloud Mode | Sovereign Mode |
|---|---|---|
| Search engine | EU infrastructure (vector indexes) | Local SQLite + local vector index |
| Answer engine | User’s browser (optimized runtime) | On-premises server (optimized runtime) |
| Data residency | EU jurisdiction (edge) + user device (browser) | Entirely on-premises |
| Activation | Default | PAUHU_SOVEREIGN=true |
| Internet required | Yes (for retrieval) | No (fully air-gapped capable) |
| Model delivery | CDN → browser cache (IndexedDB) | Docker volume mount |
| IATE terminology | API lookup (EU edge) | Local SQLite database |
Cloud mode (default)
In cloud mode, the search engine runs on EU infrastructure. Query vectors are computed on the server, fan-out search hits all 20 indexes, and ranked passages are returned to the browser. The answer engine then runs entirely in the browser - no document content is sent to any server during generation.
Cloud Mode
==========
Browser EU Edge
+---------------------+ +---------------------+
| 1. User types query | | |
| | | HTTPS/TLS | |
| +--- query ------+------------->| 2. Encode query |
| | | | |
| | | v |
| | | 3. Fan-out to 20 |
| | | vector indexes |
| | | | |
| | | v |
| | | 4. Relevance rank |
| | passages | | |
| 5. Receive passages |<-------------+----+ |
| | | | |
| v | +---------------------+
| 6. Domain specialist|
| encodes each |
| | |
| v |
| 7. Fuse passages + |
| + decode |
| | |
| v |
| 8. Answer + cites |
+---------------------+
Sovereign mode
In sovereign mode, both engines run on-premises. The containerised deployment includes a gateway, context server, translation server, and an optional LLM adapter. Set PAUHU_SOVEREIGN=true in your environment to activate.
Sovereign Mode (air-gapped)
=========================
On-premises server
+-------------------------------------------------------+
| |
| Gateway (orchestrator) |
| | |
| +--- query --> Local Search Engine |
| | (SQLite FTS5 + local vector index) |
| | | |
| | ranked passages |
| | | |
| +--- passages --> Local Answer Engine |
| (optimized runtime, domain specialist)|
| | |
| +------+------+ |
| | | |
| Default or sovereign LLM |
| model (ALLaM, Mistral, etc.) |
| | | |
| +------+------+ |
| | |
| Answer + citations |
| |
| No external network access required |
+-------------------------------------------------------+
The sovereign LLM adapter supports multiple model providers:
| Provider | Config value | Example models |
|---|---|---|
| Local optimized runtime | local | Compact, language, and domain models (quantised) |
| Local Transformers | transformers-local | Any HuggingFace model |
| OpenAI-compatible API | openai-compatible | ALLaM, SwissGPT, vLLM, Ollama |
IATE Integration
IATE (Inter-Active Terminology for Europe) provides 2.4 million terms in 24 EU official languages. In the search and answer pipeline, IATE is injected into both engines:
Search engine: term expansion
When a query contains a term that exists in IATE, the search engine expands the query with equivalent terms in the same and related languages. For example, a query containing “data controller” is expanded with “Verantwortlicher” (DE), “responsable du traitement” (FR), and “rekisterinpitäjä” (FI). This expansion happens at the embedding level - the expanded terms are encoded and their vectors are averaged with the original query vector.
IATE Term Expansion (Search Engine)
====================================
Input query: "data controller obligations under GDPR"
|
v
IATE lookup: "data controller" --> IATE ID 1688230
|
+-- EN: data controller
+-- DE: Verantwortlicher
+-- FR: responsable du traitement
+-- FI: rekisterinpitäjä
+-- ... (24 languages)
|
v
Expanded query vector = avg(
embed("data controller obligations under GDPR"),
embed("Verantwortlicher obligations under GDPR"),
embed("responsable du traitement obligations under GDPR")
)
|
v
Fan-out with expanded vector --> finds multilingual passages
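The expansion shown above reduces to averaging embedding vectors; a sketch assuming a generic `embed()` placeholder for the multilingual embedding model:

```python
def expand_query_vector(query, term, variants, embed):
    """Average the embedding of the original query with embeddings of the
    query rewritten using each IATE language variant of the matched term."""
    texts = [query] + [query.replace(term, v) for v in variants]
    vectors = [embed(t) for t in texts]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```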
Answer engine: term constraints
During generation, IATE terms serve as output constraints. When the decoder generates text that includes domain-specific terminology, the IATE database provides the canonical term form for the target language. This prevents the model from paraphrasing standardised terms - “data controller” remains “data controller”, not “person responsible for data”.
- Term pinning: If a passage contains an IATE term, the decoder is constrained to use the IATE-preferred form in the generated output.
- Reliability scoring: IATE terms carry reliability scores (1–4). Only terms with reliability ≥3 are used as hard constraints; lower-reliability terms are treated as soft preferences.
- Domain scoping: Term constraints are scoped to the identified topic domain. A term that means different things in law vs. finance is pinned to the correct domain-specific definition.
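A sketch of how the reliability and domain rules above might partition terms into hard constraints and soft preferences (the field names are illustrative, not the IATE schema):

```python
def select_constraints(terms, domain):
    """Partition IATE terms into hard constraints and soft preferences,
    keeping only entries scoped to the identified topic domain.

    terms: dicts with 'domain', 'reliability' (1-4), 'preferred_form'.
    """
    hard, soft = [], []
    for t in terms:
        if t["domain"] != domain:
            continue  # domain scoping: ignore terms from other domains
        (hard if t["reliability"] >= 3 else soft).append(t["preferred_form"])
    return hard, soft
```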
API Endpoints
The search and answer pipeline is accessible via the Pauhu EU API. All endpoints run in EU jurisdiction and return DSA Article 27 ranking metadata.
| Endpoint | Method | Description |
|---|---|---|
| /v1/search | GET | Search only - returns ranked passages with provenance metadata |
| /v1/search/answer | POST | Full pipeline - search + grounded answer generation, returns answer with inline citations |
| /iate/lookup | GET | IATE term lookup - returns translations, definitions, reliability scores |
| /v1/classify | POST | Topic classification - identifies which of 21 topic domains a text belongs to |
Example: search with grounded answer
POST /v1/search/answer
Content-Type: application/json
Authorization: Bearer pk_your_api_key
{
"query": "What are the NIS2 incident reporting deadlines?",
"sources": ["eurlex", "lex", "commission"],
"lang": "en",
"top_k": 10,
"model": "default"
}
Response
{
"answer": "Under NIS2 (Directive 2022/2555), essential and important
entities must report significant incidents in three stages:
(1) early warning within 24 hours, (2) incident notification
within 72 hours, and (3) final report within one month.",
"citations": [
{
"source": "eurlex",
"celex": "32022L2555",
"article": "Art. 23(4)",
"snippet": "...shall submit an early warning within 24 hours...",
"semantic_score": 0.94,
"keyword_score": 0.88,
"combined_score": 0.92
}
],
"model": "default",
"passages_used": 10,
"ranking_transparency": {
"algorithm": "hybrid-v1",
"semantic_weight": 0.70,
"keyword_weight": 0.30,
"indexes_queried": 3,
"total_candidates": 847
}
}
Comparison with RAG
Pauhu’s architecture addresses several limitations of standard Retrieval-Augmented Generation:
| Aspect | Standard RAG | Pauhu |
|---|---|---|
| Passage handling | Passages concatenated as plain text in a single prompt | Each passage encoded independently, then fused at decoder layer |
| Scaling with K | More passages = longer prompt = diluted attention | More passages = more encodings = richer cross-attention (no dilution) |
| Context window | Limited by LLM context window (4K–128K tokens) | Limited by memory for encoded tensors (typically supports 50+ passages) |
| Domain adaptation | General-purpose embeddings for retrieval | Domain specialist re-encoding enriches passage representations |
| Citation tracking | Heuristic (search generated text for passage overlap) | Structural (attention weights directly indicate source passage) |
| Multilingual | Depends on LLM’s multilingual capability | Multilingual retrieval + cross-lingual encoding + IATE term expansion across 24 languages |
| Privacy | Passages typically sent to cloud LLM API | Generation runs in browser (Cloud mode) or on-premises (Sovereign mode) |
| Ranking transparency | Opaque - no standard for explaining why a passage was selected | DSA Article 27 compliant - semantic score, keyword score, provenance tier exposed per result |
When to use each mode
- Cloud mode - Default for most users. Retrieval runs on EU infrastructure, generation runs in the browser. Best balance of search quality and privacy.
- Sovereign mode - For organisations that require full air-gap capability or on-premises data residency. Set PAUHU_SOVEREIGN=true and deploy the container stack. Supports custom LLM adapters (ALLaM, SwissGPT, Mistral, Llama, or any OpenAI-compatible endpoint).
Security
- Encryption at rest: AES-256 for all stored data (post-quantum safe)
- Encryption in transit: Hybrid post-quantum TLS on edge (X25519Kyber768)
- EU jurisdiction: All retrieval infrastructure runs in EU data centres only
- Model Last: All security verification steps pass before any ML inference runs
- No training on queries: User queries are never used to train or fine-tune models