Frequently Asked Questions
Last updated: February 2026
Architecture
What is the processing pipeline?
Documents pass through a multi-stage pipeline before reaching your data feed. Each stage performs a specific verification: source authentication, data quality checks, legal classification, and annotation. Only after all stages pass does a document enter the commercial feed.
Machine learning inference happens last, after all deterministic checks have passed.
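The staged flow above can be sketched as an ordered list of checks, where a document reaches ML annotation only after every deterministic stage accepts it. This is an illustrative sketch, not the actual implementation: the function bodies are placeholders and the stage names simply mirror the FAQ.

```python
# Hypothetical sketch of the staged pipeline: deterministic checks run in
# order, and ML inference (annotation) happens only after all of them pass.
from typing import Callable, Optional

def source_authentication(doc: dict) -> bool:
    return bool(doc.get("source"))        # placeholder check

def data_quality(doc: dict) -> bool:
    return bool(doc.get("text"))          # placeholder check

def legal_classification(doc: dict) -> bool:
    return "text" in doc                  # placeholder rule-based check

def annotate(doc: dict) -> dict:
    doc["annotations"] = ["placeholder"]  # ML inference runs last
    return doc

DETERMINISTIC_STAGES: list[Callable[[dict], bool]] = [
    source_authentication,
    data_quality,
    legal_classification,
]

def process(doc: dict) -> Optional[dict]:
    """Return the annotated document, or None if any stage rejects it."""
    for stage in DETERMINISTIC_STAGES:
        if not stage(doc):
            return None        # rejected: never enters the commercial feed
    return annotate(doc)       # ML only after all deterministic checks pass
```

Because rejection happens before any model is invoked, a failing document never consumes inference capacity.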
What is the context window of the pipeline?
The pipeline does not have a single context window in the way a large language model does. Because the architecture processes documents in stages rather than fitting an entire document into one model's attention window, documents of any length can be handled.
The verification stages are stateless boolean checks with no token limit. Documents are segmented before reaching ML models: sentence-level for translation, paragraph-level for classification, and chunk-level for embeddings.
| Stage | Context | Method |
|---|---|---|
| Source validation | Unlimited | Deterministic checks |
| Legal classification | Unlimited | Rule-based analysis |
| Semantic embeddings | 8,192 tokens | Multilingual embeddings, per chunk |
| Translation | 512 tokens | Neural MT, per sentence pair |
| Fast classification | 512 tokens | Fast classifier, per paragraph |
State between stages is stored in databases, not in model memory. This means the pipeline can process documents of any practical length without the information loss that occurs when exceeding a model's context window.
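The segmentation strategy in the table can be illustrated with a simple chunking sketch. This assumes whitespace-split words as a stand-in for real tokens; the limits match the table above, but the code is illustrative only.

```python
# Minimal sketch of per-stage segmentation: each ML stage sees at most its
# own token limit, so overall document length is unbounded.

def chunk(tokens: list[str], limit: int) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `limit` tokens."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]

tokens = ("word " * 10000).split()        # a long "document"

embedding_chunks = chunk(tokens, 8192)    # semantic embeddings, per chunk
mt_chunks = chunk(tokens, 512)            # translation / fast classification

# Every chunk fits its stage's window; nothing is truncated or dropped.
```

Since each stage's state is persisted between chunks rather than held in model memory, no information is lost at chunk boundaries the way it would be when overflowing a single context window.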
What models are used?
The annotation pipeline uses open-source models hosted in the EU:
- Multilingual embeddings for semantic search across 24 languages
- Neural machine translation covering 452 translation directions
- Fast text classification for document categorization
No proprietary or closed-source models are used in the data feed pipeline. All model weights are stored in EU-jurisdiction infrastructure.
Where is the data processed?
All processing happens within EU jurisdiction. Storage, compute, and model inference are deployed on Cloudflare's EU network with jurisdiction pinning enabled. No data leaves the EU at any stage of the pipeline.
Data Sources
Which EU sources are included?
The feeds currently include 19 EU institutional sources (including EUR-Lex, CURIA, TED, IATE, Eurostat, Data Europa, Publications Office, European Commission, Council, European Parliament, ECB, CORDIS, EMA, OEIL, and Who is Who) plus national law from 15 EU member states.
How often is data updated?
EU institutional sources and national law databases are both checked every 15 minutes. New documents are annotated automatically and appear in feeds within minutes of publication at source.
What licenses apply to the data?
All EU institutional data in the commercial feeds is available under CC-BY 4.0 or equivalent open reuse terms. Each feed includes machine-readable license policies with source attribution. National law data follows per-country open data policies, verified individually for each jurisdiction.
Integration
What formats are available?
Data feeds are available as DCAT-AP catalogs with individual resources in JSON-LD, RDF, and CSV. Standard connectors support automated discovery and subscription.
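As a rough sketch of what consuming a DCAT-AP catalog looks like, the fragment below shows a minimal JSON-LD catalog and how a client might list its datasets. The catalog contents here are invented examples, not real feed data; only the DCAT vocabulary terms (`dcat:Catalog`, `dcat:dataset`, `dct:title`, `dcat:distribution`) come from the standard.

```python
# Parse a minimal DCAT-AP catalog fragment (JSON-LD) and list dataset titles.
import json

catalog_jsonld = """
{
  "@context": {"dcat": "http://www.w3.org/ns/dcat#",
               "dct": "http://purl.org/dc/terms/"},
  "@type": "dcat:Catalog",
  "dcat:dataset": [
    {"@type": "dcat:Dataset",
     "dct:title": "Example dataset",
     "dcat:distribution": [{"dcat:mediaType": "application/ld+json"}]}
  ]
}
"""

catalog = json.loads(catalog_jsonld)
titles = [d["dct:title"] for d in catalog["dcat:dataset"]]
```

A production client would typically use an RDF library rather than raw JSON parsing, since the same catalog may also be served as RDF.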
Can I query specific regulations?
Yes. The API supports CELEX ID lookup, EuroVoc domain filtering, date range queries, and semantic search across all sources.
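To make the query types concrete, here is a hypothetical illustration of how a CELEX lookup, a EuroVoc filter, and a date range might combine in one request. The endpoint URL and parameter names are assumptions for illustration, not the documented API; consult the actual API reference for the real interface.

```python
# Hypothetical query construction only: BASE and the parameter names
# ("celex", "eurovoc", "from", "to") are placeholders, not the real API.
from urllib.parse import urlencode

BASE = "https://api.example.eu/v1/documents"  # placeholder URL

params = {
    "celex": "32016R0679",         # CELEX ID lookup (the GDPR, as an example)
    "eurovoc": "data-protection",  # EuroVoc domain filter
    "from": "2016-01-01",          # date range start
    "to": "2016-12-31",            # date range end
}
url = f"{BASE}?{urlencode(params)}"
```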
Is there an SLA?
Uptime and latency targets are published on the status page. Enterprise plans include contractual SLAs. Contact sales@pauhu.eu for details.