Frequently Asked Questions
Last updated: February 2026
Architecture
What is the processing pipeline?
Documents pass through a multi-stage pipeline before reaching your data feed. Each stage performs a specific verification: source authentication, data quality checks, legal classification, and annotation. Only after all stages pass does a document enter the commercial feed.
Machine learning inference happens last, after all deterministic checks have passed.
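The staged flow above can be sketched as an ordered list of checks, where a document reaches ML annotation only after every deterministic stage accepts it. This is an illustrative sketch, not the actual implementation: the function bodies are placeholders and the stage names simply mirror the FAQ.

```python
# Hypothetical sketch of the staged pipeline: deterministic checks run in
# order, and ML inference (annotation) happens only after all of them pass.
from typing import Callable, Optional

def source_authentication(doc: dict) -> bool:
    return bool(doc.get("source"))        # placeholder check

def data_quality(doc: dict) -> bool:
    return bool(doc.get("text"))          # placeholder check

def legal_classification(doc: dict) -> bool:
    return "text" in doc                  # placeholder rule-based check

def annotate(doc: dict) -> dict:
    doc["annotations"] = ["placeholder"]  # ML inference runs last
    return doc

DETERMINISTIC_STAGES: list[Callable[[dict], bool]] = [
    source_authentication,
    data_quality,
    legal_classification,
]

def process(doc: dict) -> Optional[dict]:
    """Return the annotated document, or None if any stage rejects it."""
    for stage in DETERMINISTIC_STAGES:
        if not stage(doc):
            return None        # rejected: never enters the commercial feed
    return annotate(doc)       # ML only after all deterministic checks pass
```

Because rejection happens before any model is invoked, a failing document never consumes inference capacity.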
What is the context window of the pipeline?
The pipeline does not have a single context window in the way a large language model does. Because the architecture processes documents in stages rather than fitting an entire document into one model's attention window, documents of any length can be handled.
The verification stages are stateless boolean checks with no token limit. Documents are segmented before reaching ML models: sentence-level for translation, paragraph-level for classification, and chunk-level for embeddings.
| Stage | Context | Method |
|---|---|---|
| Source validation | Unlimited | Deterministic checks |
| Legal classification | Unlimited | Rule-based analysis |
| Semantic embeddings | 8,192 tokens | Multilingual embeddings, per chunk |
| Translation | 512 tokens | Neural MT, per sentence pair |
| Fast classification | 512 tokens | Fast classifier, per paragraph |
State between stages is stored in databases, not in model memory. This means the pipeline can process documents of any practical length without the information loss that occurs when exceeding a model's context window.
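The segmentation strategy in the table can be illustrated with a simple chunking sketch. This assumes whitespace-split words as a stand-in for real tokens; the limits match the table above, but the code is illustrative only.

```python
# Minimal sketch of per-stage segmentation: each ML stage sees at most its
# own token limit, so overall document length is unbounded.

def chunk(tokens: list[str], limit: int) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `limit` tokens."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]

tokens = ("word " * 10000).split()        # a long "document"

embedding_chunks = chunk(tokens, 8192)    # semantic embeddings, per chunk
mt_chunks = chunk(tokens, 512)            # translation / fast classification

# Every chunk fits its stage's window; nothing is truncated or dropped.
```

Since each stage's state is persisted between chunks rather than held in model memory, no information is lost at chunk boundaries the way it would be when overflowing a single context window.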
What models are used?
The annotation pipeline uses open-source models hosted in the EU:
- Multilingual embeddings for semantic search across 24 languages
- Neural machine translation covering 452 translation directions
- Fast text classification for document categorization
No proprietary or closed-source models are used in the data feed pipeline. All model weights are stored in EU-jurisdiction infrastructure.
Where is the data processed?
All processing happens within EU jurisdiction. Storage, compute, and model inference are deployed on Cloudflare's EU network with jurisdiction pinning enabled. No data leaves the EU at any stage of the pipeline.
Data Sources
Which EU sources are included?
The feeds currently include 19 EU institutional sources (including EUR-Lex, CURIA, TED, IATE, Eurostat, Data Europa, Publications Office, European Commission, Council, European Parliament, ECB, CORDIS, EMA, OEIL, and Who is Who) plus national law from 15 EU member states.
How often is data updated?
EU institutional sources and national law databases are both checked every 15 minutes. New documents are annotated automatically and appear in feeds within minutes of publication at source.
What licenses apply to the data?
All EU institutional data in the commercial feeds is available under CC-BY 4.0 or equivalent open reuse terms. Each feed includes machine-readable license policies with source attribution. National law data follows per-country open data policies, verified individually for each jurisdiction.
Integration
What formats are available?
Data feeds are available as DCAT-AP catalogs with individual resources in JSON-LD, RDF, and CSV. Standard connectors support automated discovery and subscription.
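As a rough sketch of what consuming a DCAT-AP catalog looks like, the fragment below shows a minimal JSON-LD catalog and how a client might list its datasets. The catalog contents here are invented examples, not real feed data; only the DCAT vocabulary terms (`dcat:Catalog`, `dcat:dataset`, `dct:title`, `dcat:distribution`) come from the standard.

```python
# Parse a minimal DCAT-AP catalog fragment (JSON-LD) and list dataset titles.
import json

catalog_jsonld = """
{
  "@context": {"dcat": "http://www.w3.org/ns/dcat#",
               "dct": "http://purl.org/dc/terms/"},
  "@type": "dcat:Catalog",
  "dcat:dataset": [
    {"@type": "dcat:Dataset",
     "dct:title": "Example dataset",
     "dcat:distribution": [{"dcat:mediaType": "application/ld+json"}]}
  ]
}
"""

catalog = json.loads(catalog_jsonld)
titles = [d["dct:title"] for d in catalog["dcat:dataset"]]
```

A production client would typically use an RDF library rather than raw JSON parsing, since the same catalog may also be served as RDF.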
Can I query specific regulations?
Yes. The API supports CELEX ID lookup, EuroVoc domain filtering, date range queries, and semantic search across all sources.
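To make the query types concrete, here is a hypothetical illustration of how a CELEX lookup, a EuroVoc filter, and a date range might combine in one request. The endpoint URL and parameter names are assumptions for illustration, not the documented API; consult the actual API reference for the real interface.

```python
# Hypothetical query construction only: BASE and the parameter names
# ("celex", "eurovoc", "from", "to") are placeholders, not the real API.
from urllib.parse import urlencode

BASE = "https://api.example.eu/v1/documents"  # placeholder URL

params = {
    "celex": "32016R0679",         # CELEX ID lookup (the GDPR, as an example)
    "eurovoc": "data-protection",  # EuroVoc domain filter
    "from": "2016-01-01",          # date range start
    "to": "2016-12-31",            # date range end
}
url = f"{BASE}?{urlencode(params)}"
```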
Is there an SLA?
Uptime and latency targets are published on the status page. Enterprise plans include contractual SLAs. Contact sales@pauhu.eu for details.