Why Pauhu Runs on Any Chip
Browser-native inference via ONNX Runtime Web. No CUDA dependency, no GPU vendor lock-in. From a government server in Helsinki to a smartphone in Lisbon — the same models, the same accuracy, EUR 0 inference cost.
- 1. The runtime: ONNX in the browser
- 2. Adaptive model loading
- 3. Device detection and progressive download
- 4. No CUDA, no lock-in
- 5. Market impact
- 6. FAQ
1. The Runtime: ONNX in the Browser
Pauhu models are exported in the ONNX (Open Neural Network Exchange) format — an open standard supported by every major ML framework. At inference time, models run inside the browser via ONNX Runtime Web, which supports two execution backends:
| Backend | Technology | Best for |
|---|---|---|
| WebAssembly (WASM) | CPU-based, runs everywhere | Universal compatibility. Any browser, any device. |
| WebGPU | GPU-accelerated, shader-based | Faster inference on devices with a GPU. Automatic fallback to WASM if unavailable. |
The runtime selects the best backend automatically. On a laptop with a discrete GPU, WebGPU accelerates inference. On a smartphone or a locked-down government workstation without GPU drivers, WASM provides the same results at a slightly slower speed. The model weights are identical in both cases.
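In ONNX Runtime Web this fallback order is expressed through the `executionProviders` session option. A minimal sketch of the detection logic (the helper name and model path are illustrative, not Pauhu's actual code):

```javascript
// Prefer WebGPU when the browser exposes it, otherwise fall back to WASM.
// `navigator.gpu` is the standard WebGPU feature-detection entry point.
function pickExecutionProviders(nav) {
  return 'gpu' in nav ? ['webgpu', 'wasm'] : ['wasm'];
}

// Usage with onnxruntime-web (sketch):
//   import * as ort from 'onnxruntime-web';
//   const session = await ort.InferenceSession.create('/models/encoder.onnx', {
//     executionProviders: pickExecutionProviders(navigator),
//   });
```

Passing both providers lets the runtime try WebGPU first and silently fall back, which matches the behaviour described above.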
2. Adaptive Model Loading
Not every device has 16 GB of RAM. Pauhu detects available memory and loads the appropriate model tier automatically:
| Tier | Device memory | Model | Capability | Use case |
|---|---|---|---|---|
| A | ≤4 GB | Encoder-only (150 MB) | Search and retrieval. Instant lookups across 4.8M documents. No generation. | Smartphones, tablets, low-spec laptops, embedded kiosks. |
| B | 4–8 GB | mT5-small encoder + decoder (300 MB) | Search + grounded answer generation with citations. 24 EU languages. | Office laptops, standard government workstations. |
| C | 8 GB+ | mT5-base encoder + decoder (1.2 GB) | Full pipeline: search, generation, topic classification, deontic analysis, translation. | Developer machines, servers, self-hosted containers. |
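The cutoffs above can be sketched as a small selection function. `navigator.deviceMemory` reports approximate RAM in gigabytes; the function name and the choice to map exactly 8 GB to Tier C (per "8 GB+") are assumptions for illustration:

```javascript
// Map reported device memory (GB) to a model tier.
// A missing value (API unavailable) falls back to the safest tier.
function selectTier(deviceMemoryGb) {
  if (deviceMemoryGb == null) return 'A'; // no signal: encoder-only
  if (deviceMemoryGb <= 4) return 'A';    // encoder-only, 150 MB
  if (deviceMemoryGb < 8) return 'B';     // mT5-small, 300 MB
  return 'C';                             // mT5-base, 1.2 GB
}

// In page code: selectTier(navigator.deviceMemory)
```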
Tier selection is automatic but can be overridden. In a self-hosted deployment, you can pin a specific tier via environment variable:
# Force Tier C regardless of detected memory
PAUHU_MODEL_TIER=full docker compose up -d
3. Device Detection and Progressive Download
The loading sequence is designed to minimise time-to-first-result:
1. navigator.deviceMemory → detect available RAM
2. Select tier (A, B, or C)
3. Download encoder (search model) → search is available immediately
4. Download decoder (generation model) → generation available when ready
5. Cache both in IndexedDB → subsequent visits are instant
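The sequence above could be orchestrated roughly as follows. `download`, `cache`, and `onReady` are stand-ins for the real fetch, IndexedDB, and UI hooks; they are injected so the sketch stays self-contained:

```javascript
// Progressive loader sketch: encoder first (search available immediately),
// decoder lazily afterwards. Helper names are illustrative.
async function loadModels(tier, { download, cache, onReady }) {
  const encoder = await download(`encoder-${tier}`); // smaller model first
  await cache(`encoder-${tier}`, encoder);
  onReady('search');                                 // search usable now
  if (tier !== 'A') {                                // Tier A ships no decoder
    const decoder = await download(`decoder-${tier}`);
    await cache(`decoder-${tier}`, decoder);
    onReady('generation');
  }
}
```

Because `onReady('search')` fires before the decoder download even starts, users can query as soon as the encoder is cached.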
Encoder first, decoder lazy
The encoder (search/retrieval) downloads first because it is smaller and provides immediate value. Users can search and browse results while the decoder downloads in the background. If a user only needs search, the decoder is never downloaded at all.
Progressive download sizes
| Phase | Tier A | Tier B | Tier C |
|---|---|---|---|
| Encoder | 150 MB | 150 MB | 600 MB |
| Decoder | — | 150 MB | 600 MB |
| Total | 150 MB | 300 MB | 1.2 GB |
All models are quantized (INT8) using the ONNX quantization toolkit. This reduces file size and inference latency without measurable accuracy loss on EU legal benchmarks.
4. No CUDA, No Lock-In
Most AI systems require NVIDIA GPUs with CUDA drivers. This creates three problems for government and enterprise buyers:
- Hardware lock-in: You must buy NVIDIA GPUs, which are supply-constrained and expensive.
- Driver dependency: CUDA drivers must be installed and maintained, which requires system-level access that many IT policies restrict.
- Chip sovereignty: NVIDIA is a US company subject to US export controls. Dependency on a single non-EU chip vendor is a supply chain risk.
Pauhu avoids all three. ONNX Runtime Web's inference engine is compiled to WebAssembly, which runs on any CPU architecture:
| Architecture | Examples | Status |
|---|---|---|
| x86-64 | Intel, AMD (most desktops and servers) | Full support |
| ARM64 | Apple Silicon (M1–M4), Qualcomm Snapdragon, AWS Graviton | Full support |
| ARM32 | Older Android devices, Raspberry Pi | WASM only (Tier A) |
| RISC-V | Emerging open-standard processors | WASM only (Tier A) |
5. Market Impact
There are approximately 3.5 billion smartphone users worldwide. Every one of them has a device capable of running Pauhu Tier A inference — search across 4.8 million EU documents at zero marginal cost.
For government buyers, chip-agnostic inference means:
- No GPU procurement: Run Pauhu on existing hardware. No additional capital expenditure.
- No cloud dependency: Inference happens on-device or on-premises. No data leaves your environment.
- No per-query cost: Once models are downloaded, every query is free. There is no API metering, no token counting, no usage-based billing.
- Future-proof: As new chip architectures emerge (RISC-V, Arm Neoverse, custom EU silicon), WASM and WebGPU run on them without any changes to Pauhu.
The arithmetic
A cloud LLM charges EUR 0.01–0.03 per query. At 1,000 queries/day across a government ministry, that is EUR 10–30/day, or EUR 3,650–10,950/year — for a single ministry. Pauhu’s on-device inference costs EUR 0 per query after the initial subscription. The models run on hardware you already own.
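The figures above can be checked in a few lines (illustrative arithmetic only):

```javascript
// Annual cost range for 1,000 metered cloud queries/day at EUR 0.01-0.03 each.
const queriesPerDay = 1000;
const daysPerYear = 365;
const [low, high] = [0.01, 0.03].map(eur => eur * queriesPerDay * daysPerYear);
// low = 3650, high = 10950 (EUR/year, one ministry) -- versus EUR 0 on-device
```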
6. FAQ
Does WebAssembly inference have good performance?
For the model sizes Pauhu uses (150 MB–1.2 GB), WASM inference on a modern laptop completes search queries in under 30 ms and generation in 1–3 seconds. This is comparable to cloud API latency when you include network round-trip time.
What browsers support ONNX Runtime Web?
All modern browsers: Chrome 90+, Firefox 89+, Safari 15+, Edge 90+. WebGPU requires Chrome 113+ or Edge 113+. Older browsers fall back to WASM automatically.
Can I force a specific tier?
Yes. In the VS Code extension: pauhu.model.tier setting. In the container: PAUHU_MODEL_TIER environment variable. In the browser: ?tier=full URL parameter.
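In the browser, the `?tier=` override could be read like this; only the parameter name comes from the FAQ, the helper is a sketch:

```javascript
// Extract a tier override from a URL query string, e.g. "?tier=full".
function tierOverride(search) {
  return new URLSearchParams(search).get('tier'); // null when absent
}

// In page code: tierOverride(window.location.search)
```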
What about offline use?
Once models are cached in IndexedDB (browser) or on disk (container), Pauhu works fully offline. No network connection needed for search or generation. See the Data Sovereignty section for details.
Is ONNX an open standard?
Yes. ONNX is maintained by the LF AI & Data Foundation (part of the Linux Foundation). It is supported by Microsoft, Meta, Google, Intel, AMD, and others. There is no single-vendor dependency.