Grounded Generation Architecture

Two specialized engines - retrieval and generation - fused at the decoder layer for grounded, citation-backed EU intelligence.

Overview

Pauhu® uses a retrieval-augmented generation architecture adapted for EU regulatory intelligence. Instead of a single monolithic model, the system splits into two specialized engines that fuse at inference time:

  1. Search Engine - Queries fan out across 20 semantic indexes simultaneously, ranked by relevance scoring. This is the encoder half.
  2. Answer Engine - Specialist models running in the browser or on-premises. Domain-specific encoders plus chat models. This is the decoder half.

The key insight: the answer engine attends to all retrieved passages simultaneously, not sequentially. Each passage is independently encoded by the search engine, then all encoded representations are concatenated and fed to the decoder in a single forward pass. This is what distinguishes Pauhu from simple Retrieval-Augmented Generation (RAG).

  Grounded Generation Architecture
  =================================

  Query: "NIS2 implementation deadlines for essential entities"
    |
    v
  +---------------------------------------------------------------+
  |                     SEARCH ENGINE                              |
  |                     (Semantic Search)                          |
  |                                                                |
  |  Query --+--> [eurlex index  ] --+                             |
  |          +--> [curia index   ] --+                             |
  |          +--> [ted index     ] --+                             |
  |          +--> [commission    ] --+-- Relevance ranking ----+   |
  |          +--> [echa index    ] --+  (semantic + keyword    |   |
  |          +--> [ema index     ] --+   hybrid)               |   |
  |          +--> [epo index     ] --+                         |   |
  |          +--> [ecb index     ] --+                         |   |
  |          +--> [... 12 more   ] --+                         |   |
  |                                                            |   |
  |          20 multilingual indexes (similarity scoring)      |   |
  |                                                            v   |
  |                                                  Top-K passages|
  +------------------------------------------------------------+---+
                                                               |
                              +--------------------------------+
                              |
                              v
  +---------------------------------------------------------------+
  |                     ANSWER ENGINE                              |
  |                     (Domain Specialists)                       |
  |                                                                |
  |  Top-K passages ---> [Encode each passage independently]       |
  |                           |                                    |
  |                           v                                    |
  |                 [Concatenate all encodings]                    |
  |                           |                                    |
  |                           v                                    |
  |              +---------------------------+                     |
  |              |  Domain Specialist        |                     |
  |              |  (e.g., Law, Finance)     |                     |
  |              |  Cross-lingual, optimized |                     |
  |              +---------------------------+                     |
  |                           |                                    |
  |                           v                                    |
  |              +---------------------------+                     |
  |              |  Chat Model (decoder)     |                     |
  |              |  Compact model (free)     |                     |
  |              |  Language model (pro)     |                     |
  |              +---------------------------+                     |
  |                           |                                    |
  |                           v                                    |
  |              Answer with inline citations                      |
  |              [source: CELEX 32022L2555, Art. 21(1)]            |
  +---------------------------------------------------------------+

Search Engine

The search engine is responsible for finding relevant passages across all 20 EU data sources. It runs on EU infrastructure.

Fan-out search

Every query is dispatched to all 20 vector indexes in parallel. Each index contains multilingual embeddings for one data product. The fan-out ensures that a query like “carbon border adjustment” finds results across EUR-Lex legislation, CURIA case law, TED procurement notices, and Eurostat data simultaneously.

| Parameter | Value |
| --- | --- |
| Embedding model | Multilingual embeddings |
| Similarity metric | Semantic similarity scoring |
| Index count | 20 (one per data product) |
| Languages | 24 EU official languages |

Semantic ranking

Raw vector similarity alone misses keyword-critical matches (e.g., CELEX numbers, article references). The relevance ranking therefore combines two signals:

  1. Semantic score - vector similarity between the query embedding and the passage embedding.
  2. Keyword score - lexical matching for exact identifiers such as CELEX numbers and article references.

Results from all 20 indexes are merged, deduplicated by document ID, and sorted by hybrid score. The top-K passages (default K=10) are forwarded to the answer engine.
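As a concrete sketch, the merge, dedupe, and rank step might look like the following Python. The 0.70/0.30 weights mirror the hybrid-v1 weights reported in the API's ranking_transparency metadata; all class and function names here are illustrative, not the actual implementation.

```python
from dataclasses import dataclass

# Illustrative weights; the API's ranking_transparency metadata reports
# semantic_weight 0.70 and keyword_weight 0.30 for hybrid-v1.
SEMANTIC_WEIGHT = 0.70
KEYWORD_WEIGHT = 0.30

@dataclass
class Passage:
    doc_id: str
    semantic_score: float   # vector similarity
    keyword_score: float    # lexical match (CELEX numbers, article refs)

def combined(p: Passage) -> float:
    return SEMANTIC_WEIGHT * p.semantic_score + KEYWORD_WEIGHT * p.keyword_score

def hybrid_rank(results_per_index: list[list[Passage]], k: int = 10) -> list[Passage]:
    """Merge results from all indexes, dedupe by doc_id, sort by hybrid score."""
    best: dict[str, Passage] = {}
    for index_results in results_per_index:
        for p in index_results:
            # Keep only the highest-scoring occurrence of each document.
            if p.doc_id not in best or combined(p) > combined(best[p.doc_id]):
                best[p.doc_id] = p
    return sorted(best.values(), key=combined, reverse=True)[:k]
```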

Layer 2: Domain embeddings

Domain specialist models enrich retrieval quality via Layer 2 embeddings. When a query is classified as belonging to a specific topic (e.g., Domain 12: Law), the corresponding specialist generates domain-tuned embeddings that are blended with the base relevance score. This narrows the semantic gap for domain-specific vocabulary - for instance, “consideration” means something very different in contract law vs. general usage.
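One plausible shape for the blend, sketched in Python; the blend weight, the use of classifier confidence, and the function name are assumptions for illustration, not the production formula.

```python
def blend_scores(base_score: float, domain_score: float,
                 domain_confidence: float) -> float:
    """Blend the base hybrid score with a domain-specialist (Layer 2) score.

    domain_confidence is the topic classifier's confidence that the query
    belongs to the specialist's domain (e.g. Domain 12: Law). The 0.5 cap
    on the blend weight is an illustrative choice, not the production value.
    """
    weight = 0.5 * domain_confidence  # trust the specialist more when the
                                      # domain classification is confident
    return (1.0 - weight) * base_score + weight * domain_score
```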

Ranking transparency (DSA Article 27)

Every search result includes provenance metadata to comply with the Digital Services Act ranking transparency requirements: the per-passage semantic, keyword, and combined scores, plus the ranking algorithm identifier, its signal weights, the number of indexes queried, and the total candidate count.

Answer Engine

The answer engine runs entirely in the user’s browser (via an optimized model runtime) or on-premises in a containerised deployment. No document content leaves the user’s device during generation.

Domain specialists

21 fine-tuned specialist models cover all EuroVoc topic domains. Each specialist shares a cross-lingual model backbone but is fine-tuned on domain-specific EU corpora:

| Property | Value |
| --- | --- |
| Backbone | Cross-lingual multilingual model |
| Quantisation | Optimized (reduced precision) |
| Specialist count | 21 (one per topic domain) |
| Runtime | Optimized model runtime (browser or server) |

In the pipeline, the specialist serves two purposes:

  1. Passage re-encoding: Each retrieved passage is re-encoded through the domain specialist, producing richer representations than the base embedding model alone.
  2. Layer 2 scoring: The domain-specific encoding feeds back into passage ranking, allowing the system to promote passages that are more relevant within the identified domain.

Chat models (decoder)

After specialist encoding, a generative chat model synthesises the answer from all passage encodings:

| Model | Tier | Use case |
| --- | --- | --- |
| Compact model | Free | Short answers, summaries, term definitions |
| Language model | Pro | Multi-paragraph analysis, cross-reference synthesis |

Both models run in the browser. The pro-tier model loads on demand - only downloaded when the user first triggers a pro-level query.

Browser-native execution

The answer engine uses an optimized model runtime with GPU acceleration and native execution backends.

Passage Fusion

The fusion step is what distinguishes this architecture from simple RAG. Here is how passages flow through the system:

Step-by-step

  1. Query encoding: The user’s query is encoded into a multilingual embedding vector.
  2. Fan-out retrieval: The query vector is dispatched to all 20 vector indexes in parallel. Each index returns its top matches.
  3. Relevance ranking: Results from all indexes are merged and ranked by hybrid semantic + keyword score. Top-K passages are selected.
  4. Independent encoding: Each of the K passages is independently encoded by the domain specialist. This produces K separate hidden-state tensors.
  5. Concatenation: All K encoded representations are concatenated along the sequence dimension into a single extended context.
  6. Decoder attention: The chat model attends to the entire concatenated context in one forward pass. Cross-attention layers see all passages simultaneously.
  7. Generation: The decoder generates an answer token by token, with attention weights distributed across all K passages. Citations are produced inline by tracking which passage each attention head focuses on.

  Standard RAG                       Pauhu Grounded Generation
  ============                       =========================

  Passage 1 --+                      Passage 1 --> [Encode] --+
  Passage 2 --+-- Concatenate        Passage 2 --> [Encode] --+-- Concatenate
  Passage 3 --+-- as text            Passage 3 --> [Encode] --+-- as tensors
              |                                                |
              v                                                v
  [Single prompt with                 [Decoder cross-attends
   all passages as                    to ALL encoded passages
   plain text context]                simultaneously]
              |                                                |
              v                                                v
  LLM generates answer               Decoder generates answer
  (context window limit              (scales with K, not
   constrains passage count)          context window length)

  Problem: passages compete          Advantage: each passage
  for context window space.          is encoded independently.
  Adding more passages               Adding more passages
  dilutes each one.                  does not dilute quality.
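Steps 4-6 of the fusion pipeline can be sketched in a few lines of NumPy. The encoder here is a deterministic stand-in for the domain specialist, and single-head attention stands in for the decoder's cross-attention layers; only the data flow (independent encoding, concatenation along the sequence dimension, one attention pass over everything) mirrors the architecture.

```python
import numpy as np

HIDDEN = 64  # illustrative hidden size

def encode(passage_tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the domain specialist: (seq_len,) -> (seq_len, HIDDEN)."""
    rng = np.random.default_rng(int(passage_tokens.sum()))
    return rng.standard_normal((len(passage_tokens), HIDDEN))

def fuse(passages: list[np.ndarray]) -> np.ndarray:
    # Steps 4-5: encode each passage independently, then concatenate the
    # hidden states along the sequence dimension into one extended context.
    encoded = [encode(p) for p in passages]
    return np.concatenate(encoded, axis=0)  # (sum of seq_lens, HIDDEN)

def cross_attend(query_states: np.ndarray, context: np.ndarray) -> np.ndarray:
    # Step 6: a single attention pass over ALL passages at once.
    scores = query_states @ context.T / np.sqrt(HIDDEN)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each row of weights spans every passage position - the property that
    # makes structural citation tracking possible.
    return weights @ context
```

Note that the fused context length is the sum of the passage lengths, so adding a passage grows the context without re-encoding the others.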

Why grounded generation matters for EU data

EU regulatory questions often require synthesising information from multiple legal instruments. A question about NIS2 implementation might need passages from the directive itself (EUR-Lex), national transposition measures (National Law), relevant CURIA case law, and ECHA guidance documents. Standard RAG would concatenate these as text, competing for a fixed context window. Pauhu encodes each independently, so the decoder can attend to all of them with equal fidelity.

Cloud vs Sovereign

Pauhu supports two deployment modes. The architecture is identical - only the infrastructure layer changes.

| Aspect | Cloud Mode | Sovereign Mode |
| --- | --- | --- |
| Search engine | EU infrastructure (vector indexes) | Local SQLite + local vector index |
| Answer engine | User's browser (optimized runtime) | On-premises server (optimized runtime) |
| Data residency | EU jurisdiction (edge) + user device (browser) | Entirely on-premises |
| Activation | Default | PAUHU_SOVEREIGN=true |
| Internet required | Yes (for retrieval) | No (fully air-gapped capable) |
| Model delivery | CDN → browser cache (IndexedDB) | Docker volume mount |
| IATE terminology | API lookup (EU edge) | Local SQLite database |

Cloud mode (default)

In cloud mode, the search engine runs on EU infrastructure. Query vectors are computed on the server, fan-out search hits all 20 indexes, and ranked passages are returned to the browser. The answer engine then runs entirely in the browser - no document content is sent to any server during generation.

  Cloud Mode
  ==========

  Browser                              EU Edge
  +---------------------+              +---------------------+
  | 1. User types query |              |                     |
  |    |                |   HTTPS/TLS  |                     |
  |    +--- query ------+------------->| 2. Encode query     |
  |                     |              |    |                |
  |                     |              |    v                |
  |                     |              | 3. Fan-out to 20    |
  |                     |              |    vector indexes   |
  |                     |              |    |                |
  |                     |              |    v                |
  |                     |              | 4. Relevance rank   |
  |                     |   passages   |    |                |
  | 5. Receive passages |<-------------+----+                |
  |    |                |              |                     |
  |    v                |              +---------------------+
  | 6. Domain specialist|
  |    encodes each     |
  |    |                |
  |    v                |
  | 7. Fuse passages    |
  |    and decode       |
  |    |                |
  |    v                |
  | 8. Answer + cites   |
  +---------------------+

Sovereign mode

In sovereign mode, both engines run on-premises. The containerised deployment includes a gateway, context server, translation server, and an optional LLM adapter. Set PAUHU_SOVEREIGN=true in your environment to activate.

  Sovereign Mode (air-gapped)
  ===========================

  On-premises server
  +-------------------------------------------------------+
  |                                                       |
  |  Gateway (orchestrator)                               |
  |    |                                                  |
  |    +--- query --> Local Search Engine                 |
  |    |              (SQLite FTS5 + local vector index)  |
  |    |                        |                         |
  |    |                  ranked passages                 |
  |    |                        |                         |
  |    +--- passages --> Local Answer Engine              |
  |                      (optimized runtime +             |
  |                       domain specialist)              |
  |                             |                         |
  |                      +------+------+                  |
  |                      |             |                  |
  |                   Default      or sovereign LLM       |
  |                   model        (ALLaM, Mistral, etc.) |
  |                      |             |                  |
  |                      +------+------+                  |
  |                             |                         |
  |                      Answer + citations               |
  |                                                       |
  |  No external network access required                  |
  +-------------------------------------------------------+

The sovereign LLM adapter supports multiple model providers:

| Provider | Config value | Example models |
| --- | --- | --- |
| Local optimized runtime | local | Compact, language, and domain models (quantised) |
| Local Transformers | transformers-local | Any HuggingFace model |
| OpenAI-compatible API | openai-compatible | ALLaM, SwissGPT, vLLM, Ollama |

IATE Integration

IATE (Inter-Active Terminology for Europe) provides 2.4 million terms in 24 EU official languages. In the search and answer pipeline, IATE is injected into both engines:

Search engine: term expansion

When a query contains a term that exists in IATE, the search engine expands the query with equivalent terms in the same and related languages. For example, a query containing “data controller” is expanded with “Verantwortlicher” (DE), “responsable du traitement” (FR), and “rekisterinpitäjä” (FI). This expansion happens at the embedding level - the expanded terms are encoded and their vectors are averaged with the original query vector.

  IATE Term Expansion (Search Engine)
  ====================================

  Input query: "data controller obligations under GDPR"
                    |
                    v
  IATE lookup: "data controller" --> IATE ID 1688230
    |
    +-- EN: data controller
    +-- DE: Verantwortlicher
    +-- FR: responsable du traitement
    +-- FI: rekisterinpitäjä
    +-- ... (24 languages)
    |
    v
  Expanded query vector = avg(
    embed("data controller obligations under GDPR"),
    embed("Verantwortlicher obligations under GDPR"),
    embed("responsable du traitement obligations under GDPR")
  )
    |
    v
  Fan-out with expanded vector --> finds multilingual passages
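The embedding-level expansion from the diagram can be sketched as follows, assuming a generic embed() stand-in (here a deterministic hash-seeded vector) in place of the real multilingual embedding model.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in embedding model: deterministic hash-seeded unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def expand_query(query: str, term: str, translations: list[str]) -> np.ndarray:
    """Average the original query vector with variants in which the IATE
    term is swapped for its equivalents in other languages."""
    variants = [query] + [query.replace(term, t) for t in translations]
    vectors = np.stack([embed(v) for v in variants])
    mean = vectors.mean(axis=0)
    return mean / np.linalg.norm(mean)  # renormalise for similarity search
```

Because the expansion happens in vector space, the fan-out still dispatches a single query vector - one that now sits between the monolingual variants.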

Answer engine: term constraints

During generation, IATE terms serve as output constraints. When the decoder generates text that includes domain-specific terminology, the IATE database provides the canonical term form for the target language. This prevents the model from paraphrasing standardised terms - “data controller” remains “data controller”, not “person responsible for data”.
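A post-hoc sketch of that invariant: the production system constrains decoding directly, whereas this illustration only normalises terminology after generation, and the lookup table entries are hypothetical.

```python
# Hypothetical lookup: (paraphrase, lang) -> canonical IATE form.
# Real entries would come from the IATE database, keyed by term ID.
IATE_CANONICAL = {
    ("person responsible for data", "en"): "data controller",
    ("responsible person for processing", "en"): "data controller",
}

def enforce_terminology(text: str, lang: str = "en") -> str:
    """Replace known paraphrases of standardised terms with the canonical
    IATE form for the target language."""
    for (paraphrase, term_lang), canonical in IATE_CANONICAL.items():
        if term_lang == lang:
            text = text.replace(paraphrase, canonical)
    return text
```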

API Endpoints

The search and answer pipeline is accessible via the Pauhu EU API. All endpoints run in EU jurisdiction and return DSA Article 27 ranking metadata.

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/search | GET | Search only - returns ranked passages with provenance metadata |
| /v1/search/answer | POST | Full pipeline - search + grounded answer generation, returns answer with inline citations |
| /iate/lookup | GET | IATE term lookup - returns translations, definitions, reliability scores |
| /v1/classify | POST | Topic classification - identifies which of 21 topic domains a text belongs to |

Example: search with grounded answer

  POST /v1/search/answer
  Content-Type: application/json
  Authorization: Bearer pk_your_api_key

  {
    "query": "What are the NIS2 incident reporting deadlines?",
    "sources": ["eurlex", "lex", "commission"],
    "lang": "en",
    "top_k": 10,
    "model": "default"
  }

Response

  {
    "answer": "Under NIS2 (Directive 2022/2555), essential and important
      entities must report significant incidents in three stages:
      (1) early warning within 24 hours, (2) incident notification
      within 72 hours, and (3) final report within one month.",
    "citations": [
      {
        "source": "eurlex",
        "celex": "32022L2555",
        "article": "Art. 23(4)",
        "snippet": "...shall submit an early warning within 24 hours...",
        "semantic_score": 0.94,
        "keyword_score": 0.88,
        "combined_score": 0.92
      }
    ],
    "model": "default",
    "passages_used": 10,
    "ranking_transparency": {
      "algorithm": "hybrid-v1",
      "semantic_weight": 0.70,
      "keyword_weight": 0.30,
      "indexes_queried": 3,
      "total_candidates": 847
    }
  }
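Calling the endpoint from Python can be sketched with only the standard library. The API host is a placeholder, the request shape follows the example above, and error handling is omitted for brevity.

```python
import json
import urllib.request

API_BASE = "https://api.pauhu.example"  # placeholder host, not the real endpoint

def grounded_answer(query: str, api_key: str, sources: list[str],
                    top_k: int = 10, lang: str = "en") -> dict:
    """POST /v1/search/answer and return the parsed JSON response."""
    payload = json.dumps({
        "query": query,
        "sources": sources,
        "lang": lang,
        "top_k": top_k,
        "model": "default",
    }).encode()
    req = urllib.request.Request(
        f"{API_BASE}/v1/search/answer",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def format_citations(response: dict) -> list[str]:
    """Render citation metadata in the inline style used by the answers."""
    return [f"[source: CELEX {c['celex']}, {c['article']}]"
            for c in response.get("citations", [])]
```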

Comparison with RAG

Pauhu’s architecture addresses several limitations of standard Retrieval-Augmented Generation:

| Aspect | Standard RAG | Pauhu |
| --- | --- | --- |
| Passage handling | Passages concatenated as plain text in a single prompt | Each passage encoded independently, then fused at the decoder layer |
| Scaling with K | More passages = longer prompt = diluted attention | More passages = more encodings = richer cross-attention (no dilution) |
| Context window | Limited by LLM context window (4K–128K tokens) | Limited by memory for encoded tensors (typically supports 50+ passages) |
| Domain adaptation | General-purpose embeddings for retrieval | Domain specialist re-encoding enriches passage representations |
| Citation tracking | Heuristic (search generated text for passage overlap) | Structural (attention weights directly indicate source passage) |
| Multilingual | Depends on LLM's multilingual capability | Multilingual retrieval + cross-lingual encoding + IATE term expansion across 24 languages |
| Privacy | Passages typically sent to cloud LLM API | Generation runs in browser (Cloud mode) or on-premises (Sovereign mode) |
| Ranking transparency | Opaque - no standard for explaining why a passage was selected | DSA Article 27 compliant - semantic score, keyword score, provenance tier exposed per result |

When to use each mode

Security