Confidential · Hypernym Internal · Strategic Direction Memo · 2026-05-11

For: Chris · CTO · Hypernym

Modulum × MTP + Inference Stack — Strategic Direction

2026-05-11 · drafted from R18 panel synthesis · prepared as R19 seed

Two architectural priorities + one product wedge + one developer-stack play, all converging on the inference-core thesis. Centered on the compound-3×-MTP × 3×-Modulum opportunity you flagged. Modulum API is currently live and responding at 200 OK on Q4_K_M Gemma 4 31B with active 2-token speculative drafting (verified 2026-05-11).

9× · compound decode target
+9pp · BABILong @128k (shipped)
2/2 · draft acceptance (live)
Q4_K_M · production quantization
01 · The MTP × Modulum compound architecture

The central technical problem — 3× MTP + 3× Modulum → 9× compound, not erosion

Modulum was developed pre-MTP. MTP changes the decode flow: each pass produces N tokens instead of 1. Modulum's per-pass attention savings compress across N tokens — the 3× decode advantage erodes when MTP is active. The architectural fix lives at the attention-mask level, not the decode-stage level.

The compound-erosion problem is concrete: MTP reduces pass count by ~N×; Modulum reduces per-pass attention cost by ~3×. They operate on different bottlenecks. When fused naively, the per-token win from Modulum compresses across N MTP-predicted tokens.

Three architectural paths that compound (not compete) with each MTP head:

Compound target math

3× MTP × 3× Modulum-fused = 9× compound decode at long context. Achievable in principle if the fusion happens at the attention-mask level. The fusion is the technical risk; the math holds if Modulum masks remain optimal per-MTP-head.
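The target math can be sketched as a back-of-envelope throughput model. The covered_fraction knob is hypothetical: it stands in for how much of the per-pass attention cost Modulum's masks still cover once MTP is active (1.0 = full attention-mask-level fusion):

```python
def decode_speedup(mtp_heads: int, attn_speedup: float, covered_fraction: float) -> float:
    """Idealized decode-throughput model (illustrative, not measured).

    Baseline pass cost is normalized to 1.0 and attributed to attention.
    MTP emits `mtp_heads` tokens per pass; Modulum divides the cost of the
    covered fraction of attention by `attn_speedup`. covered_fraction = 1.0
    means the masks stay optimal for every MTP head (full fusion); lower
    values model the naive retrofit where the per-pass win is amortized.
    """
    pass_cost = covered_fraction / attn_speedup + (1.0 - covered_fraction)
    return mtp_heads / pass_cost

fused = decode_speedup(3, 3.0, 1.0)   # 9.0, the memo's compound target
naive = decode_speedup(3, 3.0, 0.5)   # 4.5, erosion when fusion is partial
```

At covered_fraction = 0.0 the result collapses to the MTP-only 3×, which is the erosion scenario the memo warns about.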

JV implication

Few frontier labs have MTP enabled at production scale today. The compound moat (Modulum + MTP) is winnable as a partnered training run with a frontier lab, not as an inference-time-only retrofit. Both routes are valid; the JV path is the larger upside and the scaffolding for category-defining results across base-model families.

02 · Hypernym Router — the OpenRouter equivalent

Year-1 commercial wedge · adoption by base_url swap

OpenRouter sits between developers and frontier APIs, taking a unified-billing cut. ~100k developers use it as their single base_url. It adds no value beyond routing/billing/fallbacks. Hypernym does.

Hypernym Router (HR): drop-in base_url replacement for OpenRouter. Same OpenAI-compatible interface, same routing/billing/fallbacks, BUT every call routes through Modulum-enhanced inference where applicable. Customers pay rates similar to OpenRouter's and get +9pp retention at 128k for free underneath.

Cost beat · 3× decode

OpenRouter charges ~5% over wholesale. HR charges 0-3%, subsidized by Modulum's decode savings as margin. Compound MTP × Modulum amplifies this.
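The subsidy arithmetic, sketched with the memo's illustrative percentages. The assumption that serving cost scales inversely with decode speedup is idealized:

```python
def margin_per_dollar(wholesale: float, markup: float, decode_speedup: float) -> float:
    """Margin per $1 of wholesale-priced inference, assuming serving cost
    scales inversely with decode speedup (an idealized assumption)."""
    revenue = wholesale * (1.0 + markup)
    serving_cost = wholesale / decode_speedup
    return revenue - serving_cost

openrouter_cut = 1.00 * 0.05              # pure resell margin: the markup only
hr = margin_per_dollar(1.00, 0.03, 3.0)   # 1.03 - 0.333... per dollar, ~0.70
```

Even at a 0-3% markup, self-serving through Modulum leaves far more margin per dollar than a 5% pass-through cut, which is the structural basis of the cost beat.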

Speed beat · +9pp @128k

HR is genuinely faster on long-context calls vs vanilla OpenRouter. Retention is the speed multiplier — fewer retries because facts don't get lost.

Traceability beat · receipts

Every call returns a Retention Receipt (R18 convergent product). OpenRouter has nothing equivalent. Procurement-defensible.

# 30-second adoption — change one line
base_url = "https://api.openrouter.ai/v1"    # before
base_url = "https://router.hypernym.ai/v1"   # after — same code, +9pp retention
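Assuming the OpenAI-compatible chat-completions shape the memo describes, the whole client-side migration is the URL prefix. A self-contained sketch (the payload builder is illustrative, not an SDK):

```python
import json

HR_BASE_URL = "https://router.hypernym.ai/v1"  # the memo's drop-in endpoint

def chat_request(base_url: str, model: str, messages: list) -> tuple:
    """Build an OpenAI-compatible /chat/completions request. The base_url
    swap is the entire migration; the request body is unchanged."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return url, body

url, body = chat_request(HR_BASE_URL, "gemma-4-31b",
                         [{"role": "user", "content": "Summarize the doc."}])
```

The same function pointed at the OpenRouter base_url produces a byte-identical body, which is the point: nothing downstream of the URL changes.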

Distribution

Direct developer + small-startup market. No procurement, no contract. pip install hypernym-router or just the base_url swap. Wide-market reach by definition — exactly what we need for fast adoption while institutional pathways mature on slower clocks.

03 · Reasoning-state architecture · Day 1 deployment

Who uses it immediately — agents that fail at multi-turn tool use

Modulum-retained evidence plus an explicit proof/claim/obligation graph that the model updates and verifies during generation. The final answer includes a verifiable dependency trace. This is R19's flagship research stream — but the Day 1 customer is concrete.

Concrete deployment

API: https://reasoning.hypernym.ai/v1 — returns answer + dependency trace + claim graph state · SDK: pip install hypernym-reasoning adds proof/claim/obligation tracking to any agent loop · Pricing: $0.10-0.50 per traced query (premium over base Modulum) · Day-1 customer: any developer building an agent that needs to remember what it did 30 steps ago.

Reasoning-state architecture becomes a developer product before an institutional product. Wide-market reach (agent developers, tens of thousands) immediately; regulated-institution adoption follows once the trace is procurement-defensible.
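A minimal sketch of the proof/claim/obligation bookkeeping described above, with hypothetical names and structure (not the shipped hypernym-reasoning SDK):

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    cid: str
    text: str
    supports: list = field(default_factory=list)  # ids of supporting claims/evidence

class ClaimGraph:
    """Proof/claim/obligation bookkeeping for an agent loop."""
    def __init__(self):
        self.claims = {}
        self.obligations = set()  # claims asserted without evidence yet

    def add_evidence(self, cid, text):
        self.claims[cid] = Claim(cid, text)

    def assert_claim(self, cid, text, supports=()):
        self.claims[cid] = Claim(cid, text, list(supports))
        if not supports:
            self.obligations.add(cid)  # open obligation: must be discharged

    def discharge(self, cid, evidence_id):
        self.claims[cid].supports.append(evidence_id)
        self.obligations.discard(cid)

    def trace(self, cid):
        """Dependency trace for a final answer: evidence first, claim last."""
        seen, order = set(), []
        def walk(c):
            if c in seen or c not in self.claims:
                return
            seen.add(c)
            for s in self.claims[c].supports:
                walk(s)
            order.append(c)
        walk(cid)
        return order
```

An empty obligations set at answer time is what makes the final trace verifiable: every asserted claim either carried evidence or was discharged before emission.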

04 · Forge as Modulum deployment toolkit

The developer-relations pipeline runs through Forge

Forge is already a multi-track research orchestrator, LaunchAgent fleet manager, cross-model dispatch + cost tracking, and memory infrastructure. Extend it to also be the Modulum customer-deployment toolkit.

# one-command Modulum deployment
$ forge deploy modulum --model gemma-4-31b --endpoint customer.example.com

# benchmark in customer's own infrastructure
$ forge benchmark modulum --customer-corpus ./docs --baseline gemma-vanilla

# wire Modulum into an agent framework
$ forge integrate langgraph --hypernym-router router.hypernym.ai

Customer flow: deploy → benchmark → integrate in a single forge session, all open-source, all reproducible. Adoption pipeline runs through forge; customers verify the +9pp retention gain in their own benchmarks before committing. Direct top-of-funnel into Hypernym Cloud or self-hosted Modulum deployment.
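Assuming a forge benchmark run emits per-question correctness flags for the baseline and Modulum runs (the field shape here is an assumption, not forge's actual output), the customer-side verification reduces to:

```python
def retention_gain_pp(baseline: list, modulum: list) -> float:
    """Percentage-point retention gain on the customer's own corpus,
    computed from per-question correctness flags of the two runs."""
    assert len(baseline) == len(modulum) and baseline
    acc = lambda flags: 100.0 * sum(flags) / len(flags)
    return acc(modulum) - acc(baseline)

# e.g. 70/100 correct on vanilla, 79/100 with Modulum: +9.0pp
gain = retention_gain_pp([True] * 70 + [False] * 30,
                         [True] * 79 + [False] * 21)
```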

Forge becomes the developer-relations tool for Modulum. Customers who run forge to deploy Modulum locally become the warm leads for Hypernym Router, Hypernym Reasoning, and (eventually) institutional deals.

05 · Cross-model transfer · Llama 4 / DeepSeek V4 / Qwen 3

Modulum-as-Compiler — drop in any base model, get sparsity in <10 GPU-minutes

From R18 R1 (Grok origin, panel-converged). M5 patterns are partly architectural (transfer-friendly) and partly training-corpus-specific (transfer-unfriendly). Distill M5 into a 50M parameter "attention compiler" that emits per-layer sparsity for any new model in minutes.

You're already running Llama 4 infrastructure at per-head scale. The transfer experiment is cheap — warm-start from Gemma 4 31B patterns; calibrate on a single BABILong qa1 sweep at 128k; measure retention gain. Predicted floor: +5pp on Llama 4 70B, +4-6pp on Mistral Large 3, +5-7pp on Qwen 3 72B.

Why it matters

If M5 patterns are mostly model-agnostic (attention entropy profiles rather than weight-specific), the moat scales horizontally. Hypernym ships compile.hypernym.ai as a developer service — any new frontier or OSS model launches; we have Modulum-on-it running within a day. Counter to a frontier lab's "we built our own retention layer" — ours composes with theirs.
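A toy sketch of the compiler rule this implies: read per-layer attention-entropy profiles and emit per-layer keep-fractions. The mapping, thresholds, and function names below are invented for illustration, not the M5 distillation itself:

```python
import math

def attn_entropy(weights):
    """Shannon entropy of one attention distribution (higher = more diffuse)."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def compile_sparsity(layer_entropies, context_len):
    """Hypothetical compiler rule: diffuse (high-entropy) layers keep more of
    the context; peaked (low-entropy) layers tolerate aggressive sparsity."""
    max_h = math.log(context_len)  # entropy of a uniform distribution
    return [round(0.1 + 0.9 * min(h / max_h, 1.0), 3) for h in layer_entropies]

# keep-fractions rise with entropy; a peaked layer keeps ~16%, a diffuse one ~94%
budgets = compile_sparsity([0.5, 3.0, 6.5], context_len=1024)
```

The model-agnostic bet is exactly this: if the entropy profile, not the weights, drives the budget, the same 50M compiler transfers across base-model families.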

06 · World models · cheap, non-weather targets

Pick one bet at ≤$200K cap with clear validation pathway

Climate is hard, expensive, and slow to validate. Cheaper world-model targets exist where Hypernym's substrate-grounding moat applies directly and validation is tractable.

| Domain | Why cheap + tractable | Spend estimate |
|---|---|---|
| Economic / financial market simulators | Real-time data is free (yfinance, FRED, SEC EDGAR). Validation = backtest against historical prices. | $50K cloud compute / 6 months |
| Materials science (small-molecule + crystal) discovery | Open data (Materials Project, OQMD). Validation = predict synthesizable + DFT-stable. Pharma + battery + semiconductor TAM. | $100-150K |
| Logistics / supply-chain optimization | Synthetic + open data. Direct procurement appetite (every F500 has logistics teams). | $80-120K |
| Multi-agent game theory / mechanism design | Pure simulation, no real-world data costs. Crypto / DeFi applications. | $30-60K |
| Drug-protein binding simulators | AlphaFold-era data is open. Pharma customers pay seven figures. | $100-180K |

Recommendation

Economic-market simulators OR drug-protein binding as the first non-weather world-model bet. Both have clear validation (backtests / wet-lab experiments), both have direct enterprise customers, both fit the $200K cap. Both also let Hypernym productize substrate-grounded simulation distinct from climate's coupled-PDE complexity.
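For the economic-simulator option, the table's validation criterion (backtest against historical prices) can start as simple as a directional hit rate. An illustrative metric, not a claim about the eventual harness:

```python
def directional_hit_rate(predicted, realized):
    """Fraction of periods where the simulator called the sign of the move,
    about the simplest pass/fail backtest metric against historical prices."""
    assert len(predicted) == len(realized) and predicted
    hits = sum((p > 0) == (r > 0) for p, r in zip(predicted, realized))
    return hits / len(predicted)

rate = directional_hit_rate([0.01, -0.02, 0.03, 0.01],
                            [0.02, -0.01, -0.01, 0.005])  # 3 of 4 correct
```

A fixed metric like this, run on free FRED/yfinance history, is what keeps the bet inside the $200K cap: validation is compute, not data acquisition.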

07 · System status · today

Both up and running

Modulum API · live

200 OK in ~1.2s smoke test, 2026-05-11. Model: gemma-4-31B-it-Q4_K_M.gguf. Active 2-token speculative drafting with 100% acceptance on short prompts. Endpoint: https://gemma4.hypernym.ai/v1.

Forge infrastructure · 17/17 canonical · clean

All canonical services up. Zero drift. Zero orphans. Research LaunchAgents on-demand. Memory router 10/12 providers healthy. Ready to deploy Modulum benchmarks + customer integrations.

08 · Proposed R19 scope

Six streams · ~$40-60 panel spend · 1-2 day round

R19 = "Modulum × MTP + Inference Stack." Expands R18's reasoning-state direction to include the MTP compound work + commercial wedges.

  1. MTP × Modulum compound architecture — substrate-conditioned MTP heads + per-head Modulum draft + JV path with frontier-lab partner; target 9× compound at long context.
  2. Hypernym Router (HR) — OpenRouter-equivalent; OAI-compatible drop-in; cost / speed / traceability beat. Year-1 commercial wedge.
  3. Reasoning-state architecture + agent-stack deployment — Cursor / LangGraph / Devin / customer-support integration; reasoning.hypernym.ai/v1; SDK adoption.
  4. Forge as Modulum deployment toolkit — forge deploy / benchmark / integrate; developer-relations pipeline.
  5. Cross-model transfer · Llama 4 + DeepSeek V4 + Qwen 3 — M5 Compiler distillation; compile.hypernym.ai.
  6. One non-weather world-model bet — economic-market simulator OR drug-protein binding at ≤$200K; clear validation pathway.

Plus operational: R19 dispatch fixes (Gemini single-key auth path, Gemma mlx_vlm CLI invocation); IHC benchmarking integration as a sign-in page + proof doc URL + API access (no cost-tracking infrastructure — they'll do their own).

09 · IHC benchmarking

Minimum-viable: sign-in + proof doc + API key

Don't waste time on cost-tracking infrastructure. They'll do their own.

Stand up a small webpage at benchmark.hypernym.ai (or similar) where IHC contacts sign in, read the proof doc, and receive an API key.

Single-day buildout. Stripe + Clerk + the proof doc + the existing Modulum API. No telemetry, no cost-tracking — just hand them the keys.

10 · Closing

Lost-in-the-middle solved · Modulum is the foundation

The lost-in-the-middle results are the #1 thing Hypernym can hang its hat on right now. The proof doc validates +9pp at 128k against vanilla Gemma 4 31B on a public benchmark, with frontier labs visibly regressing 32-46pp on multi-needle at the same lengths. Hypernym is moving in the right direction at 5 orders of magnitude lower cost.

R19's job: layer MTP-compound architecture + Hypernym Router commercial wedge + reasoning-state for agents + forge deployment toolkit + cross-model transfer + one cheap world-model bet on top of this proven foundation. All of these adopt by API call, none require institutional procurement.

Modulum stays the core thing. Hypernym Router shows the world how it works. Reasoning state extends it to multi-turn agents. Forge ships it to developers. Cross-model transfer makes it universal. The MTP compound work — most importantly — keeps the decode advantage growing rather than eroding as base-model architectures evolve.