Grounded Generation Architecture
Two specialized engines - retrieval and generation - fused at the decoder layer for grounded, citation-backed EU intelligence.
Overview
Pauhu® uses a retrieval-augmented generation architecture adapted for EU regulatory intelligence. Instead of a single monolithic model, the system splits into two specialized engines that fuse at inference time:
- Search Engine - Queries fan out across 20 semantic indexes simultaneously; results are ranked by relevance scoring. This is the encoder half.
- Answer Engine - Specialist models running in the browser or on-premises. Domain-specific encoders plus chat models. This is the decoder half.
The key insight: the answer engine attends to all retrieved passages simultaneously, not sequentially. Each passage is independently encoded by the search engine, then all encoded representations are concatenated and fed to the decoder in a single forward pass. This is what distinguishes Pauhu from simple Retrieval-Augmented Generation (RAG).
Grounded Generation Architecture
=================================
Query: "NIS2 implementation deadlines for essential entities"
|
v
+---------------------------------------------------------------+
| SEARCH ENGINE |
| (Semantic Search) |
| |
| Query --+--> [eurlex index ] --+ |
| +--> [curia index ] --+ |
| +--> [ted index ] --+ |
| +--> [commission ] --+-- Relevance ranking ----+ |
| +--> [echa index ] --+ (semantic + keyword | |
| +--> [ema index ] --+ hybrid) | |
| +--> [epo index ] --+ | |
| +--> [ecb index ] --+ | |
| +--> [... 12 more ] --+ | |
| | |
| 20 multilingual indexes (similarity scoring) | |
| v |
| Top-K passages|
+------------------------------------------------------------+---+
|
+--------------------------------+
|
v
+---------------------------------------------------------------+
| ANSWER ENGINE |
| (Domain Specialists) |
| |
| Top-K passages ---> [Encode each passage independently] |
| | |
| v |
| [Concatenate all encodings] |
| | |
| v |
| +---------------------------+ |
| | Domain Specialist | |
| | (e.g., Law, Finance) | |
| | Cross-lingual, optimized | |
| +---------------------------+ |
| | |
| v |
| +---------------------------+ |
| | Chat Model (decoder) | |
| | Compact model (free) | |
| | Language model (pro) | |
| +---------------------------+ |
| | |
| v |
| Answer with inline citations |
| [source: CELEX 32022L2555, Art. 21(1)] |
+---------------------------------------------------------------+
Search Engine
The search engine is responsible for finding relevant passages across all 20 EU data sources. It runs on EU infrastructure.
Fan-out search
Every query is dispatched to all 20 vector indexes in parallel. Each index contains multilingual embeddings for one data product. The fan-out ensures that a query like “carbon border adjustment” finds results across EUR-Lex legislation, CURIA case law, TED procurement notices, and Eurostat data simultaneously.
| Parameter | Value |
|---|---|
| Embedding model | Multilingual embeddings |
| Similarity metric | Semantic similarity scoring |
| Index count | 20 (one per data product) |
| Languages | 24 EU official languages |
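The fan-out step can be sketched as follows, assuming a per-index `search(name, query_vector, k)` client; the function names and the index list here are illustrative, not the real Pauhu API:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative subset of the 20 indexes (one per data product).
INDEXES = ["eurlex", "curia", "ted", "commission", "echa", "ema", "epo", "ecb"]

def fan_out(search, query_vector, k=10):
    """Dispatch one query vector to every index in parallel.

    search: hypothetical client, search(index_name, query_vector, k)
            -> list of (doc_id, score) pairs.
    Returns a dict mapping index name to its top-k candidates.
    """
    with ThreadPoolExecutor(max_workers=len(INDEXES)) as pool:
        futures = {name: pool.submit(search, name, query_vector, k)
                   for name in INDEXES}
        return {name: f.result() for name, f in futures.items()}
```

Because each index is queried independently, latency is bounded by the slowest single index rather than the sum of all twenty.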
Semantic ranking
Raw vector similarity alone misses keyword-critical matches (e.g., CELEX numbers, article references). The relevance ranking combines two signals:
- Semantic similarity - vector similarity from multilingual embeddings captures meaning across languages
- Keyword matching - BM25-style term frequency / inverse document frequency scoring catches exact identifiers, legal references, and technical codes
Results from all 20 indexes are merged, deduplicated by document ID, and sorted by hybrid score. The top-K passages (default K=10) are forwarded to the answer engine.
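The merge, dedupe, and rank step can be sketched as below; the 0.7 / 0.3 weights mirror the `semantic_weight` and `keyword_weight` values exposed in the API's ranking metadata, while the candidate field names are assumptions:

```python
def rank_passages(candidates, semantic_weight=0.7, keyword_weight=0.3, top_k=10):
    """Merge per-index candidates, dedupe by document ID, rank by hybrid score.

    candidates: iterable of dicts with doc_id, semantic_score, keyword_score.
    """
    best = {}
    for c in candidates:
        score = (semantic_weight * c["semantic_score"]
                 + keyword_weight * c["keyword_score"])
        # Keep only the highest-scoring occurrence of each document.
        if c["doc_id"] not in best or score > best[c["doc_id"]][0]:
            best[c["doc_id"]] = (score, c)
    ranked = sorted(best.values(), key=lambda t: t[0], reverse=True)
    return [dict(c, combined_score=round(s, 4)) for s, c in ranked[:top_k]]
```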
Layer 2: Domain embeddings
Domain specialist models enrich retrieval quality via Layer 2 embeddings. When a query is classified as belonging to a specific topic (e.g., Domain 12: Law), the corresponding specialist generates domain-tuned embeddings that are blended with the base relevance score. This narrows the semantic gap for domain-specific vocabulary - for instance, “consideration” means something very different in contract law vs. general usage.
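A sketch of the Layer 2 blend, assuming the specialist exposes an encoder and that blending is a weighted average of the base score and a domain-specific similarity (the weight `alpha` and all names here are illustrative):

```python
import math

def layer2_rescore(passages, domain_encode, query_vec, alpha=0.3):
    """Blend the base hybrid score with a domain-specialist similarity.

    domain_encode: hypothetical specialist encoder, text -> vector.
    alpha: assumed blending weight for the domain signal.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    for p in passages:
        domain_sim = cos(domain_encode(p["text"]), query_vec)
        p["combined_score"] = (1 - alpha) * p["combined_score"] + alpha * domain_sim
    return sorted(passages, key=lambda p: p["combined_score"], reverse=True)
```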
Ranking transparency (DSA Article 27)
Every search result includes provenance metadata to comply with the Digital Services Act ranking transparency requirements:
- source - which data product the passage came from (e.g., eurlex, curia)
- semantic_score - the semantic similarity component
- keyword_score - the BM25 component
- combined_score - final relevance score
- provenance_tier - NATIVE (original text, 1.0), PARSED (extracted, 0.95), or KEYWORD (≤0.9)
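The tier values in parentheses suggest per-tier weights; a minimal sketch under the assumption that the tier discounts the combined score (the actual weighting scheme is not specified here, and the fixed 0.9 for KEYWORD is one reading of "≤0.9"):

```python
# Assumed tier-to-weight mapping derived from the values listed above.
PROVENANCE_TIERS = {"NATIVE": 1.0, "PARSED": 0.95, "KEYWORD": 0.9}

def apply_provenance(result):
    """Attach the tier weight, assuming it discounts the combined score."""
    weight = PROVENANCE_TIERS[result["provenance_tier"]]
    return {**result, "combined_score": round(result["combined_score"] * weight, 4)}
```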
Answer Engine
The answer engine runs entirely in the user’s browser (via an optimized model runtime) or on-premises in a containerised deployment. No document content leaves the user’s device during generation.
Domain specialists
21 fine-tuned specialist models cover all EuroVoc topic domains. Each specialist shares a cross-lingual model backbone but is fine-tuned on domain-specific EU corpora:
| Property | Value |
|---|---|
| Backbone | Cross-lingual multilingual model |
| Quantisation | Optimized (reduced precision) |
| Specialist count | 21 (one per topic domain) |
| Runtime | Optimized model runtime (browser or server) |
In the pipeline, the specialist serves two purposes:
- Passage re-encoding: Each retrieved passage is re-encoded through the domain specialist, producing richer representations than the base embedding model alone.
- Layer 2 scoring: The domain-specific encoding feeds back into passage ranking, allowing the system to promote passages that are more relevant within the identified domain.
Chat models (decoder)
After specialist encoding, a generative chat model synthesises the answer from all passage encodings:
| Model | Tier | Use case |
|---|---|---|
| Compact model | Free | Short answers, summaries, term definitions |
| Language model | Pro | Multi-paragraph analysis, cross-reference synthesis |
Both models run in the browser. The pro-tier model loads on demand - only downloaded when the user first triggers a pro-level query.
Browser-native execution
The answer engine uses an optimized model runtime with GPU acceleration and native execution backends:
- GPU acceleration - preferred backend on supported browsers (Chrome 113+, Edge 113+). Runs inference on the GPU for faster generation.
- Native execution - fallback for browsers without GPU acceleration. Runs on CPU threads via SharedArrayBuffer.
- No server round-trip - after the initial model download, all generation happens locally. Document passages stay on the user’s device.
Passage Fusion
The fusion step is what distinguishes this architecture from simple RAG. Here is how passages flow through the system:
Step-by-step
- Query encoding: The user’s query is encoded into a multilingual embedding vector.
- Fan-out retrieval: The query vector is dispatched to all 20 vector indexes in parallel. Each index returns its top matches.
- Relevance ranking: Results from all indexes are merged and ranked by hybrid semantic + keyword score. Top-K passages are selected.
- Independent encoding: Each of the K passages is independently encoded by the domain specialist. This produces K separate hidden-state tensors.
- Concatenation: All K encoded representations are concatenated along the sequence dimension into a single extended context.
- Decoder attention: The chat model attends to the entire concatenated context in one forward pass. Cross-attention layers see all passages simultaneously.
- Generation: The decoder generates an answer token by token, with attention weights distributed across all K passages. Citations are produced inline by tracking which passage each attention head focuses on.
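The steps above can be sketched with toy tensors - plain Python lists standing in for hidden-state matrices, with `encode` and `decode` as placeholders for the domain specialist and chat model:

```python
def fuse_and_decode(passages, encode, decode):
    """Encode each passage independently, concatenate along the sequence
    dimension, and hand the decoder one extended context in a single pass."""
    encodings = [encode(p) for p in passages]          # K tensors, shape (seq_i, hidden)
    fused = [row for enc in encodings for row in enc]  # shape (sum(seq_i), hidden)
    return decode(fused)                               # decoder sees all K at once
```

The point of the sketch is the data flow: each passage is encoded in isolation, so adding a passage adds rows to the fused context rather than competing for a fixed prompt budget.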
Standard RAG Pauhu Grounded Generation
============ =========================
Passage 1 --+ Passage 1 --> [Encode] --+
Passage 2 --+-- Concatenate Passage 2 --> [Encode] --+-- Concatenate
Passage 3 --+-- as text Passage 3 --> [Encode] --+-- as tensors
| |
v v
[Single prompt with [Decoder cross-attends
all passages as to ALL encoded passages
plain text context] simultaneously]
| |
v v
LLM generates answer Decoder generates answer
(context window limit (scales with K, not
constrains passage count) context window length)
Problem: passages compete Advantage: each passage
for context window space. is encoded independently.
Adding more passages Adding more passages
dilutes each one. does not dilute quality.
Why grounded generation matters for EU data
EU regulatory questions often require synthesising information from multiple legal instruments. A question about NIS2 implementation might need passages from the directive itself (EUR-Lex), national transposition measures (National Law), relevant CURIA case law, and ECHA guidance documents. Standard RAG would concatenate these as text, competing for a fixed context window. Pauhu encodes each independently, so the decoder can attend to all of them with equal fidelity.
Cloud vs Sovereign
Pauhu supports two deployment modes. The architecture is identical - only the infrastructure layer changes.
| Aspect | Cloud Mode | Sovereign Mode |
|---|---|---|
| Search engine | EU infrastructure (vector indexes) | Local SQLite + local vector index |
| Answer engine | User’s browser (optimized runtime) | On-premises server (optimized runtime) |
| Data residency | EU jurisdiction (edge) + user device (browser) | Entirely on-premises |
| Activation | Default | PAUHU_SOVEREIGN=true |
| Internet required | Yes (for retrieval) | No (fully air-gapped capable) |
| Model delivery | CDN → browser cache (IndexedDB) | Docker volume mount |
| IATE terminology | API lookup (EU edge) | Local SQLite database |
Cloud mode (default)
In cloud mode, the search engine runs on EU infrastructure. Query vectors are computed on the server, fan-out search hits all 20 indexes, and ranked passages are returned to the browser. The answer engine then runs entirely in the browser - no document content is sent to any server during generation.
Cloud Mode
==========
Browser EU Edge
+---------------------+ +---------------------+
| 1. User types query | | |
| | | HTTPS/TLS | |
| +--- query ------+------------->| 2. Encode query |
| | | | |
| | | v |
| | | 3. Fan-out to 20 |
| | | vector indexes |
| | | | |
| | | v |
| | | 4. Relevance rank |
| | passages | | |
| 5. Receive passages |<-------------+----+ |
| | | | |
| v | +---------------------+
| 6. Domain specialist|
| encodes each |
| | |
| v |
| 7. Fuse passages + |
| + decode |
| | |
| v |
| 8. Answer + cites |
+---------------------+
Sovereign mode
In sovereign mode, both engines run on-premises. The containerised deployment includes a gateway, context server, translation server, and an optional LLM adapter. Set PAUHU_SOVEREIGN=true in your environment to activate.
Sovereign Mode (air-gapped)
=========================
On-premises server
+-------------------------------------------------------+
| |
| Gateway (orchestrator) |
| | |
| +--- query --> Local Search Engine |
| | (SQLite FTS5 + local vector index) |
| | | |
| | ranked passages |
| | | |
| +--- passages --> Local Answer Engine |
| (optimized runtime, domain specialist)|
| | |
| +------+------+ |
| | | |
| Default or sovereign LLM |
| model (ALLaM, Mistral, etc.) |
| | | |
| +------+------+ |
| | |
| Answer + citations |
| |
| No external network access required |
+-------------------------------------------------------+
The sovereign LLM adapter supports multiple model providers:
| Provider | Config value | Example models |
|---|---|---|
| Local optimized runtime | local | Compact, language, and domain models (quantised) |
| Local Transformers | transformers-local | Any HuggingFace model |
| OpenAI-compatible API | openai-compatible | ALLaM, SwissGPT, vLLM, Ollama |
IATE Integration
IATE (Inter-Active Terminology for Europe) provides 2.4 million terms in 24 EU official languages. In the search and answer pipeline, IATE is injected into both engines:
Search engine: term expansion
When a query contains a term that exists in IATE, the search engine expands the query with equivalent terms in the same and related languages. For example, a query containing “data controller” is expanded with “Verantwortlicher” (DE), “responsable du traitement” (FR), and “rekisterinpitäjä” (FI). This expansion happens at the embedding level - the expanded terms are encoded and their vectors are averaged with the original query vector.
IATE Term Expansion (Search Engine)
====================================
Input query: "data controller obligations under GDPR"
|
v
IATE lookup: "data controller" --> IATE ID 1688230
|
+-- EN: data controller
+-- DE: Verantwortlicher
+-- FR: responsable du traitement
+-- FI: rekisterinpitäjä
+-- ... (24 languages)
|
v
Expanded query vector = avg(
embed("data controller obligations under GDPR"),
embed("Verantwortlicher obligations under GDPR"),
embed("responsable du traitement obligations under GDPR")
)
|
v
Fan-out with expanded vector --> finds multilingual passages
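The expansion shown above reduces to averaging embedding vectors; a sketch assuming a generic `embed()` placeholder for the multilingual embedding model:

```python
def expand_query_vector(query, term, variants, embed):
    """Average the embedding of the original query with embeddings of the
    query rewritten using each IATE language variant of the matched term."""
    texts = [query] + [query.replace(term, v) for v in variants]
    vectors = [embed(t) for t in texts]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```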
Answer engine: term constraints
During generation, IATE terms serve as output constraints. When the decoder generates text that includes domain-specific terminology, the IATE database provides the canonical term form for the target language. This prevents the model from paraphrasing standardised terms - “data controller” remains “data controller”, not “person responsible for data”.
- Term pinning: If a passage contains an IATE term, the decoder is constrained to use the IATE-preferred form in the generated output.
- Reliability scoring: IATE terms carry reliability scores (1–4). Only terms with reliability ≥3 are used as hard constraints; lower-reliability terms are treated as soft preferences.
- Domain scoping: Term constraints are scoped to the identified topic domain. A term that means different things in law vs. finance is pinned to the correct domain-specific definition.
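A sketch of how the reliability and domain rules above might partition terms into hard constraints and soft preferences (the field names are illustrative, not the IATE schema):

```python
def select_constraints(terms, domain):
    """Partition IATE terms into hard constraints and soft preferences,
    keeping only entries scoped to the identified topic domain.

    terms: dicts with 'domain', 'reliability' (1-4), 'preferred_form'.
    """
    hard, soft = [], []
    for t in terms:
        if t["domain"] != domain:
            continue  # domain scoping: ignore terms from other domains
        (hard if t["reliability"] >= 3 else soft).append(t["preferred_form"])
    return hard, soft
```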
API Endpoints
The search and answer pipeline is accessible via the Pauhu EU API. All endpoints run in EU jurisdiction and return DSA Article 27 ranking metadata.
| Endpoint | Method | Description |
|---|---|---|
| /v1/search | GET | Search only - returns ranked passages with provenance metadata |
| /v1/search/answer | POST | Full pipeline - search + grounded answer generation, returns answer with inline citations |
| /iate/lookup | GET | IATE term lookup - returns translations, definitions, reliability scores |
| /v1/classify | POST | Topic classification - identifies which of 21 topic domains a text belongs to |
Example: search with grounded answer
POST /v1/search/answer
Content-Type: application/json
Authorization: Bearer pk_your_api_key
{
"query": "What are the NIS2 incident reporting deadlines?",
"sources": ["eurlex", "lex", "commission"],
"lang": "en",
"top_k": 10,
"model": "default"
}
Response
{
"answer": "Under NIS2 (Directive 2022/2555), essential and important
entities must report significant incidents in three stages:
(1) early warning within 24 hours, (2) incident notification
within 72 hours, and (3) final report within one month.",
"citations": [
{
"source": "eurlex",
"celex": "32022L2555",
"article": "Art. 23(4)",
"snippet": "...shall submit an early warning within 24 hours...",
"semantic_score": 0.94,
"keyword_score": 0.88,
"combined_score": 0.92
}
],
"model": "default",
"passages_used": 10,
"ranking_transparency": {
"algorithm": "hybrid-v1",
"semantic_weight": 0.70,
"keyword_weight": 0.30,
"indexes_queried": 3,
"total_candidates": 847
}
}
Comparison with RAG
Pauhu’s architecture addresses several limitations of standard Retrieval-Augmented Generation:
| Aspect | Standard RAG | Pauhu |
|---|---|---|
| Passage handling | Passages concatenated as plain text in a single prompt | Each passage encoded independently, then fused at decoder layer |
| Scaling with K | More passages = longer prompt = diluted attention | More passages = more encodings = richer cross-attention (no dilution) |
| Context window | Limited by LLM context window (4K–128K tokens) | Limited by memory for encoded tensors (typically supports 50+ passages) |
| Domain adaptation | General-purpose embeddings for retrieval | Domain specialist re-encoding enriches passage representations |
| Citation tracking | Heuristic (search generated text for passage overlap) | Structural (attention weights directly indicate source passage) |
| Multilingual | Depends on LLM’s multilingual capability | Multilingual retrieval + cross-lingual encoding + IATE term expansion across 24 languages |
| Privacy | Passages typically sent to cloud LLM API | Generation runs in browser (Cloud mode) or on-premises (Sovereign mode) |
| Ranking transparency | Opaque - no standard for explaining why a passage was selected | DSA Article 27 compliant - semantic score, keyword score, provenance tier exposed per result |
When to use each mode
- Cloud mode - Default for most users. Retrieval runs on EU infrastructure, generation runs in the browser. Best balance of search quality and privacy.
- Sovereign mode - For organisations that require full air-gap capability or on-premises data residency. Set PAUHU_SOVEREIGN=true and deploy the container stack. Supports custom LLM adapters (ALLaM, SwissGPT, Mistral, Llama, or any OpenAI-compatible endpoint).
Security
- Encryption at rest: AES-256 for all stored data (post-quantum safe)
- Encryption in transit: Hybrid post-quantum TLS on edge (X25519Kyber768)
- EU jurisdiction: All retrieval infrastructure runs in EU data centres only
- Model Last: All security verification steps pass before any ML inference runs
- No training on queries: User queries are never used to train or fine-tune models