Architecture
EU data intelligence infrastructure. Layered security architecture, automated task discovery, and multi-reviewer quality gates.
Overview
Pauhu® infrastructure uses a self-organizing architecture where each processing unit operates independently within its security zone, communicates via structured messages, and adapts to changing workload through automated task discovery and health monitoring.
Key principles:
- Zone isolation: Three security zones (Protected, Controlled, External) with controlled data paths between them
- Model Last: ML inference runs only after all security verification steps pass
- Browser-native: Optimized models run in the browser - no data leaves the user's device
- EU jurisdiction: All infrastructure runs in EU regions
- Post-quantum ready: AES-256 at rest (PQ-safe), hybrid PQ TLS on edge (X25519Kyber768), algorithm-agile crypto layer. See security docs.
Security architecture
Pauhu implements industrial-grade security with five distinct zones (Protected, Controlled, External, Business, and Audit) and controlled data paths between them.
```text
+-----------+      +------------+      +----------+
| Protected | ---> | Controlled | ---> | External |
|   Zone    |      |    Zone    |      |   Zone   |
+-----------+      +------------+      +----------+
      |
==============
|| Data Path ||
==============
      |
+-----------+      +----------+
| Business  | <--- |  Audit   |
|   Zone    |      |   Zone   |
+-----------+      +----------+
```
Each zone is assigned a security level appropriate to its function, ranging from protection against casual violation (external-facing) to protection against state-sponsored attack (protected zone). Data paths between zones are controlled and audited.
Request orchestration
A central orchestrator routes each request through zone-specific security checks. All validation passes before any ML model runs (Model Last principle).
```text
                 +-- Protected ------+-- Architecture
                 |   (Constraints)   +-- ML Pipeline
                 |                   +-- Data Engineering
                 |                   +-- Security Audit
                 |                   +-- Legal Review
                 |
Orchestrator ----+-- Controlled -----+-- Development
                 |   (Data Flow)     +-- Operations
                 |                   +-- Infrastructure
                 |
                 +-- External -------+-- Frontend
                     (Actions)       +-- Documentation
                                     +-- Internationalization
```
Model Last: All security checks pass FIRST → then AI inference
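The Model Last principle can be sketched as a simple gate: a request only reaches inference after every zone check has passed. This is an illustrative sketch, not the production orchestrator; the check functions and field names below are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Request:
    zone: str
    payload: str

# Hypothetical check signature: returns True when the request passes.
Check = Callable[[Request], bool]

def handle(request: Request, checks: List[Check],
           run_inference: Callable[[Request], str]) -> str:
    # Model Last: every security check must pass before any model runs.
    for check in checks:
        if not check(request):
            raise PermissionError(f"blocked by {check.__name__}")
    return run_inference(request)

# Example checks (illustrative only).
def no_pii(req: Request) -> bool:
    return "@" not in req.payload  # naive PII heuristic for the sketch

def zone_allowed(req: Request) -> bool:
    return req.zone in {"protected", "controlled", "external"}
```

A blocked check raises before `run_inference` is ever called, so no data touches a model unless validation succeeds.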
Validation classification
Each validation step enforces a specific level of strictness, determining how it controls data flow:
| Level | Behaviour | Example |
|---|---|---|
| Block | Reject if violated (MUST NOT) | Reject requests containing PII in search queries |
| Require | Require completion (MUST) | Enforce data license terms before export |
| Allow | Approve if proposed (MAY) | Allow optional semantic ranking add-on |
| Pass-through | No action needed (EXEMPT) | Pass-through for static documentation |
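The four strictness levels from the table can be modelled as a small decision function. This is a sketch of the semantics only; the enum and parameter names are assumptions, not Pauhu's internal API.

```python
from enum import Enum

class Level(Enum):
    BLOCK = "MUST NOT"       # reject if violated
    REQUIRE = "MUST"         # require completion
    ALLOW = "MAY"            # approve if proposed
    PASS_THROUGH = "EXEMPT"  # no action needed

def evaluate(level: Level, violated: bool = False,
             completed: bool = True, proposed: bool = False) -> str:
    """Map a validation level plus observed state to an action (sketch)."""
    if level is Level.BLOCK and violated:
        return "reject"       # e.g. PII found in a search query
    if level is Level.REQUIRE and not completed:
        return "hold"         # e.g. license terms not yet enforced
    if level is Level.ALLOW:
        return "approve" if proposed else "skip"
    return "pass"
```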
Document extraction on EU infrastructure
Document extraction runs on a dedicated server in Helsinki (EU jurisdiction). It provides server-side document extraction, PDF rendering, and accessibility tree snapshotting.
```text
+-----------------------+        +----------------------+
| Edge Services         |        | Helsinki Server      |
| (EU edge)             |        | (EU datacenter)      |
|                       |        |                      |
| API Router            |        | Document Extraction  |
| |                     |        | |                    |
| +-- /extract ------+--+------->| +-- Chrome CDP       |
| +-- /pdf-render ---+--+------->| +-- Tab lifecycle    |
|                       |        | +-- Text extract     |
| Vision Service        |        | +-- PDF render       |
| +-- annotate ------+--+        |                      |
| +-- terminology ---+--+        +----------------------+
|                       |
| Annotation Service    |        +-----------------------+
| +-- sidecar -------+--+------->| Object Storage (EU)   |
|                       |        |   {product}/          |
| Index Service         |        |     {hash}.json       |
| +-- DB + semantic     |        +-----------------------+
|     index             |
+-----------------------+
```
Data flow
- Client sends URL to /extractor/pdf-render via the API router
- The vision service opens a Chrome tab on Document extraction (Helsinki) via an authenticated bridge token
- Chrome navigates to the URL, extracts text (or renders PDF)
- Tab is closed in background - stateless, no data retained
- If annotate: true, text is sent to the annotation service for topic classification
- If terminology: true, IATE terms are extracted via the terminology service
- For /extract-and-index, annotation sidecar JSON is written to object storage for indexing
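The flow above could be traced with a small sketch. The request body fields mirror the flags named in the flow (annotate, terminology), but the exact schema is an assumption, not the documented API, and no network call is made here.

```python
import json

def build_extract_request(url: str, annotate: bool = False,
                          terminology: bool = False) -> str:
    # Hypothetical request body for the extraction endpoint.
    return json.dumps({"url": url, "annotate": annotate,
                       "terminology": terminology})

def post_extract_steps(body: str) -> list:
    """List the documented processing steps for a given request (sketch)."""
    req = json.loads(body)
    steps = ["open Chrome tab via bridge token",
             "navigate + extract text",
             "close tab (stateless)"]
    if req.get("annotate"):
        steps.append("send text to annotation service")
    if req.get("terminology"):
        steps.append("extract IATE terms via terminology service")
    return steps
```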
Security
- Document extraction is authenticated with a rotating token (rotated quarterly)
- Documents are processed in ephemeral Chrome tabs - no persistent storage on the server
- All data stays in EU jurisdiction (Helsinki datacenter)
- Document extraction server is hardened per industrial security standards
Automated task discovery
The health system automatically discovers work from four signal sources and creates tasks in the registry. Each task is routed to the appropriate team based on its zone and type.
| Source | Signal | Routing |
|---|---|---|
| Error collector | >10 same error in 24h | By phase: model→ML team, auth→security, ui→frontend, api→development |
| Git log | FIXME / TODO / HACK in recent commits | Development team |
| CI failures | Same workflow fails 3+ times in 7d | Operations team |
| Stale PRs | Open >7 days, no updates | Original author |
Failed tasks are automatically retried up to 3 times. After 3 failures, the task is marked as blocked and requires human intervention.
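The retry-then-block behaviour can be sketched as a small loop. The return shape is an assumption chosen for illustration; only the "3 attempts, then blocked for human intervention" rule comes from the text above.

```python
def run_with_retries(task, max_attempts: int = 3) -> dict:
    """Retry a failing task up to 3 times, then mark it blocked (sketch)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "done", "result": task(), "attempts": attempt}
        except Exception:
            continue  # failed attempt; the registry would log it here
    # After 3 failures the task needs human intervention.
    return {"status": "blocked", "attempts": max_attempts}
```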
Multi-reviewer quality
Every pull request is reviewed by three independent reviewers:
- CodeRabbit: Style, patterns, and best practices
- Pauhu Review: Multi-perspective analysis (security, compliance, correctness, performance, maintainability) with EU AI Act compliance checks
- Codex Review: Edge cases, logic errors, race conditions, and resource leaks
A consensus job runs after all three reviewers complete. If 2 out of 3 flag critical issues, the PR is blocked until the issues are resolved.
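The 2-of-3 consensus rule is a straightforward majority check. A minimal sketch, assuming each reviewer reports a boolean "critical issues found" flag:

```python
def consensus(reviews: dict) -> str:
    """Block the PR when at least 2 of the 3 reviewers flag critical issues."""
    critical = sum(1 for flagged in reviews.values() if flagged)
    return "blocked" if critical >= 2 else "mergeable"
```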
Health monitoring
A periodic health check runs across the entire infrastructure:
- Open PR status and check rollup
- CI workflow failure trends
- Task registry: auto-respawn failed tasks (up to 3 attempts)
- Task discovery: scan all 4 signal sources
Results are written to a JSON report and committed back to the repository automatically.
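A health report of this kind might be assembled as below. The field names are assumptions for illustration; only "JSON report committed back to the repository" comes from the text (the commit step itself is omitted here).

```python
import datetime
import json

def write_health_report(path, open_prs, ci_failures, respawned, discovered):
    """Write a periodic health snapshot to a JSON file (sketch)."""
    report = {
        "generated": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "open_prs": open_prs,                  # open PR status + check rollup
        "ci_failure_trends": ci_failures,      # failing workflow summaries
        "tasks_respawned": respawned,          # auto-respawned failed tasks
        "tasks_discovered": discovered,        # hits across the 4 signal sources
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return report
```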
Data pipeline
Every EU document flows through a 6-stage pipeline from source to searchable index. This is the same pipeline for all 20 data sources - only the seed script and product code differ.
```text
Seed Script         Object Storage     Queue              Annotation Service
(per source)        (per product)      (per product)      (topic classification)
     |                   |                  |                   |
Fetch from EU       Upload with        Storage event      Classify:
institution         metadata           notification       - language detection
(SPARQL, REST,      (celex_id,         triggers           - topic domain (1-21)
 SDMX, OAI-PMH)     product, lang)     consumer           - word/char count
     |                   |                  |                   |
     v                   v                  v                   v
Annotation Sidecar  Index Service      Database           Semantic Index
(.json)             (hybrid search)    (per product)      (multilingual embeddings)
     |                   |                  |                   |
Annotation          Index into         Structured         Semantic search
stored next to      database +         metadata for       across 20 indexes
source doc          semantic index     SQL queries        via relevance scoring
in storage                                                (70% semantic,
                                                           30% BM25)
```
Stage 1: Seed
Source-specific seed scripts fetch documents from EU institutions via their official APIs (SPARQL for EUR-Lex/CELLAR, REST for TED/ECHA/EMA, SDMX for ECB/Eurostat, OAI-PMH for CORDIS). Each document is uploaded to its product-specific storage bucket with metadata containing the CELEX ID, product code, language, and source URL.
Stage 2: Queue
Storage event notifications trigger a queue message for each new or updated document. The 20 products are split across two queue consumers for load balancing (products A–E and products E–W).
Stage 3: Annotate
The annotation worker classifies each document:
- Language detection: Identifies the document language (24 EU official languages)
- EuroVoc classification: Assigns one of 21 topic domains
- Metadata: Word count, character count, provenance tier (NATIVE 1.0, PARSED 0.95, KEYWORD ≤0.9)
Stage 4: Annotation sidecar
Annotations are stored as sidecar JSON files next to the source document in object storage. The stand-off annotation format keeps annotations separate from source text, enabling non-destructive updates and full provenance tracking.
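A sidecar might look like the sketch below. The {product}/{hash}.json layout comes from the storage diagram; the choice of SHA-256 as the hash and the exact field names are assumptions for illustration.

```python
import hashlib
import json

def sidecar_key(product: str, text: str) -> str:
    # Storage layout from the diagram: {product}/{hash}.json (hash algorithm assumed).
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{product}/{digest}.json"

def build_sidecar(text: str, language: str, domain: int, tier: float) -> dict:
    # Stand-off annotation: metadata lives next to, not inside, the source doc.
    return {
        "language": language,      # one of the 24 EU official languages
        "topic_domain": domain,    # EuroVoc domain 1-21
        "word_count": len(text.split()),
        "char_count": len(text),
        "provenance_tier": tier,   # NATIVE 1.0, PARSED 0.95, KEYWORD <= 0.9
    }
```

Because the sidecar is a separate object, re-annotating a document overwrites only the JSON file; the source text is never modified.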
Stage 5: Index
The indexing service reads annotated documents and indexes them into:
- Databases: Structured metadata for SQL queries (per product)
- Semantic indexes: Multilingual embeddings for semantic search
Stage 6: Search
The search engine fans out queries across all 20 semantic indexes simultaneously, combining semantic similarity with keyword matching. Results include DSA Article 27 ranking transparency metadata.
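Using the 70% semantic / 30% BM25 weighting from the pipeline diagram, the blended ranking can be sketched as follows (the fan-out and pooling shown here are a simplification; per-index retrieval details are assumptions):

```python
def hybrid_score(semantic: float, bm25: float) -> float:
    """Blend per the pipeline diagram: 70% semantic, 30% BM25."""
    return 0.7 * semantic + 0.3 * bm25

def rank_pooled(results: dict) -> list:
    """Rank documents pooled from all product indexes by blended score.

    results maps doc id -> (semantic_score, bm25_score), both normalized.
    """
    ordered = sorted(results.items(),
                     key=lambda kv: hybrid_score(*kv[1]),
                     reverse=True)
    return [doc for doc, _ in ordered]
```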
Dead letter queue
Documents that fail annotation after 3 retries are routed to a dead letter queue for manual inspection. The /backfill admin endpoint can re-index documents after the issue is resolved.
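The dead-letter path and the /backfill recovery step could be sketched as below; the queue is modelled as a plain list and the function names are assumptions, not the service's API.

```python
def annotate_with_dlq(doc_id, annotate, dlq, max_retries=3):
    """Try annotation up to 3 times, then park the document on the DLQ."""
    last_err = None
    for _ in range(max_retries):
        try:
            return annotate(doc_id)
        except Exception as err:
            last_err = err
    dlq.append({"doc_id": doc_id, "error": str(last_err)})
    return None

def backfill(dlq, annotate, index):
    """Re-process DLQ entries once the issue is fixed (sketch of /backfill).

    Returns the entries that still fail, so they stay queued for inspection.
    """
    remaining = []
    for entry in dlq:
        try:
            index(annotate(entry["doc_id"]))
        except Exception:
            remaining.append(entry)
    return remaining
```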
ML pipeline
Pauhu uses programmatic ML pipelines for all prompt optimization and orchestration. Instead of hand-tuned prompt engineering, each processing step is defined by a typed input/output contract, and composable processing units can be assembled into larger orchestration flows.
The optimization engine handles quality tuning across all services, ensuring consistent output quality and EU AI Act transparency compliance.
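The typed-contract idea can be illustrated with two toy processing units whose contracts chain together; the step names and types here are hypothetical examples, not Pauhu's actual pipeline API.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")

def compose(f: Callable[[A], B], g: Callable[[B], C]) -> Callable[[A], C]:
    """Chain two processing units; f's output contract is g's input contract."""
    return lambda x: g(f(x))

@dataclass(frozen=True)
class Extracted:
    text: str

@dataclass(frozen=True)
class Annotated:
    text: str
    language: str

def extract(url: str) -> Extracted:
    return Extracted(text=f"contents of {url}")   # stand-in extraction step

def annotate(doc: Extracted) -> Annotated:
    return Annotated(text=doc.text, language="en")  # stand-in classifier

pipeline = compose(extract, annotate)
```

Because each unit declares its input and output types, a type checker can verify that an orchestration flow is assembled from compatible steps before anything runs.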