Modulum × BABILong · Complete ablation findings

What Modulum does, across every variable we measured.

The full mask-ablation matrix is complete: 9 of 9 cells at N=50 vs N=50, same Gemma-4-31B-Q4 weights, same prompts (idx 0..49), with vs without Hypernym's platform components. This page reports everything we measured — accuracy, decode speed, prefill speed, decay slope across context, within-run drift — without rounding or approximation. Every number traces to exports/all_results.csv + exports/full_audit.json.

2026-05-18 · v4.1 · post Codex+Grok file-read audit 4,917 canonical rows · 25 SQLite databases Significance: two-sample Wald, two-sided

What the data actually says

Modulum's platform contributes a real, measurable lift over the same base Gemma-4-31B-Q4 weights — but the effect is concentrated and conditional, not uniform. The strongest signal is on qa2 (2-fact reasoning) at short and long context (+28 pp at 32k, +20 pp at 128k, both significant at p<0.05). The mechanism that produces these lifts also produces ~20–30 % slower prefill at short qa1 retrieval and a concerning sustained-run drift (−23 pp end-to-end on qa1 64k).

On qa3 (3-fact temporal reasoning), Modulum delivers the flattest decay slope of any tested stack at −2.5 pp / doubling of context (Opus 4.6 is −4 pp, GPT-5.5 is −9 pp, Gemini 3.1 Pro is −7.6 pp). Modulum also decodes 17–22 % FASTER than vanilla on qa3 across all three context lengths — likely because attention conditioning shortens output. The qa3 story is the strongest "First, Not Lost" evidence in the dataset.

Average platform lift across the 9 completed mask cells is +11.78 pp; only 3 of 9 cells reach p<0.05 at N=50, which means the average is a real direction but most individual lifts need higher N to confirm. The qa3 32k regression (−6 pp) is not statistically significant and may be sample noise. Modulum trails frontier (Opus 4.6 = 88.7 % avg) on absolute accuracy by 42.7 pp — the value is not capability parity, it is workstation-scale qa3 slope.

Avg platform lift (9 cells)

+11.78 pp

Range: −6 pp (qa3 32k, ns) to +28 pp (qa2 32k, **)

Significant lifts at N=50

3 / 9

qa1 64k *, qa2 32k **, qa2 128k *

qa3 decay slope (best in panel)

−2.5 pp / 2×

Flatter than Opus 4.6 (−4 pp), GPT-5.5 (−9 pp), Gemini (−7.6 pp)

qa3 decode speedup vs vanilla

+17 to +22 %

Faster on hard task — likely shorter outputs

01 · Mask ablation — Modulum vs vanilla, all 9 cells

The cleanest possible isolation of platform contribution.

Same Gemma-4-31B-Q4 weights, same 50 prompts (idx 0..49) per cell, vanilla served via raw llama.cpp /completion on Hypernym's mirror endpoint, Modulum served via the full platform stack. p-values via two-sample Wald test for difference of proportions.

Cell	Vanilla (raw)	Vanilla acc	Modulum (raw)	Modulum acc	Δ Platform	z-score	Wald p	Significance
qa1 32k	39/50	78.0 %	45/50	90.0 %	+12.0 pp	+1.66	0.0971	ns
qa1 64k	33/50	66.0 %	42/50	84.0 %	+18.0 pp	+2.12	0.0336	*
qa1 128k	31/50	62.0 %	36/50	72.0 %	+10.0 pp	+1.07	0.2849	ns
qa2 32k	14/50	28.0 %	28/50	56.0 %	+28.0 pp	+2.96	0.0031	**
qa2 64k	22/50	44.0 %	26/50	52.0 %	+8.0 pp	+0.80	0.4218	ns
qa2 128k	15/50	30.0 %	25/50	50.0 %	+20.0 pp	+2.09	0.0371	*
qa3 32k	14/50	28.0 %	11/50	22.0 %	−6.0 pp	−0.69	0.4874	ns
qa3 64k	16/50	32.0 %	19/50	38.0 %	+6.0 pp	+0.63	0.5286	ns
qa3 128k	10/50	20.0 %	15/50	30.0 %	+10.0 pp	+1.16	0.2450	ns

Mean Δ Platform across 9 cells = +11.78 pp. Three cells reach p<0.05 (qa1 64k, qa2 32k, qa2 128k). The qa2 32k lift (+28 pp at p=0.0031) is the load-bearing single piece of evidence the platform does real architectural work. The qa3 32k −6 pp is not significant; could be sample noise on N=50.

02 · Modulum at full N — what the platform actually achieves

Headline accuracy with exact Wilson 95% CIs.

These cells use Modulum's full sample budget (N=100–500), not just idx 0..49. They represent the absolute capability ceiling of Modulum on a 31B-Q4 workstation model at each context length and task.

Task	Length	correct / N	Accuracy	Wilson 95 % CI	Half-width	Errors
qa1	32k	89 / 100	89.00 %	[81.4, 93.8]	±6.2 pp	0
qa1	64k	77 / 100	77.00 %	[67.8, 84.2]	±8.2 pp	0
qa1	128k	143 / 200	71.50 %	[64.9, 77.3]	±6.2 pp	3
qa2	32k	53 / 100	53.00 %	[43.3, 62.5]	±9.6 pp	0
qa2	64k	82 / 200	41.00 %	[34.4, 47.9]	±6.8 pp	0
qa2	128k	79 / 200	39.50 %	[33.0, 46.4]	±6.7 pp	0
qa3	32k	32 / 100	32.00 %	[23.7, 41.7]	±9.0 pp	0
qa3	64k	33 / 100	33.00 %	[24.6, 42.7]	±9.1 pp	0
qa3	128k	135 / 500	27.00 %	[23.3, 31.1]	±3.9 pp	0

128k average across qa1/qa2/qa3 = 46.00 %. qa3 128k is the cell with the tightest CI (N=500): the 27 % accuracy holds at ±3.9 pp. The 3 errors on qa1 128k are 503-busy retries that recovered. All other cells: zero backend errors.

03 · Decay slopes — how each stack loses ground per doubling of context

Where Modulum's qa3 slope is best-in-panel.

OLS linear fit of accuracy on log₂(context tokens) across cells with N≥20 and context≥8k. Lower magnitude = flatter decay = better long-context preservation. Modulum's headline slope advantage is on qa3, not qa1.

Stack	qa1 slope	qa2 slope	qa3 slope	How Modulum compares
Claude Opus 4.6	−0.0 pp/2×	−0.0 pp/2×	−4.0 pp/2×	Hyperscaler ceiling — flat across all three tasks at high absolute accuracy.
GPT-5.5	−2.0 pp/2×	+0.0 pp/2×	−9.0 pp/2×	Steepest qa3 slope in the high-accuracy cluster.
Claude Opus 4.7	−2.0 pp/2×	−2.0 pp/2×	+2.0 pp/2×	Flat slope but low qa3 intercept; the regression vs 4.6 is in level, not shape.
Modulum (Gemma-4-31B-Q4)	−8.75 pp/2×	−6.75 pp/2×	−2.5 pp/2×	Best qa3 slope in the panel (except Opus 4.7 which sits at much lower absolute acc). qa1 and qa2 slopes are middling — steeper than Opus/GPT, flatter than Gemini/Grok.
Vanilla Gemma-4-31B-Q4	−8.0 pp/2×	+1.0 pp/2×	−4.0 pp/2×	qa3 slope is partially inherited from the base model. Platform contribution on qa3 is +1.5 pp/2× better than vanilla.
Gemini 3.1 Pro	−15.3 pp/2×	−25.8 pp/2×	−7.6 pp/2×	Modulum has notably flatter slope on qa1 and qa2 than Gemini.
Grok 4.3	−25.0 pp/2×	−20.0 pp/2×	−8.19 pp/2×	Steepest decay measured.

Critical engineering note: the qa3 slope advantage is partly base-model — vanilla Gemma-4 already decays at −4.0 pp/2× on qa3 (matching Opus 4.6). Modulum extends that by +1.5 pp/2× to −2.5 pp/2×. The platform contribution on qa3 is in both the intercept (absolute level lifted ~+10 pp) and the slope.

04 · Speed — decode and prefill rates

The compute cost basis of the accuracy lift.

Tokens-per-second medians across all 9 mask cells, both sides. The llama.cpp timings block exposes prefill_ms and decode_ms natively for Modulum + Vanilla; frontier API models do not expose these so the comparison is platform-only.

Decode tokens / second (output generation rate)

Cell	Modulum	Vanilla	Δ Platform	Read
qa1 32k	35.1 t/s	50.4 t/s	−30.3 %	Platform overhead largest on short-context retrieval.
qa1 64k	33.6 t/s	41.5 t/s	−18.9 %	Overhead persists at 64k qa1.
qa1 128k	37.1 t/s	35.9 t/s	+3.2 %	Convergence at long context.
qa2 32k	39.5 t/s	40.4 t/s	−2.2 %	~Even on 2-fact reasoning.
qa2 64k	35.1 t/s	37.6 t/s	−6.8 %	Slight overhead.
qa2 128k	32.7 t/s	34.9 t/s	−6.4 %	Slight overhead.
qa3 32k	49.5 t/s	40.7 t/s	+21.6 %	Platform faster. Attention conditioning likely shortens output.
qa3 64k	45.9 t/s	38.0 t/s	+21.0 %	Same pattern.
qa3 128k	40.2 t/s	34.4 t/s	+16.9 %	Speedup holds at long context.

The decode story is non-uniform. Modulum is 19–30 % SLOWER on qa1 retrieval at short/mid context, but 17–22 % FASTER on qa3 at every context length. The compute cost basis of the qa3 accuracy lift is therefore negative — Modulum is both more accurate and faster on the hardest task.

Prefill tokens / second (context ingest rate)

Cell	Modulum	Vanilla	Δ Platform
qa1 32k	403 t/s	877 t/s	−54.0 %
qa1 64k	349 t/s	773 t/s	−54.9 %
qa1 128k	593 t/s	615 t/s	−3.5 %
qa2 32k	865 t/s	878 t/s	−1.5 %
qa2 64k	756 t/s	773 t/s	−2.2 %
qa2 128k	594 t/s	615 t/s	−3.4 %
qa3 32k	865 t/s	878 t/s	−1.4 %
qa3 64k	755 t/s	774 t/s	−2.5 %
qa3 128k	594 t/s	613 t/s	−3.1 %

qa1 short-context prefill is an outlier — Modulum runs 54 % slower than vanilla at 32k and 64k qa1. This anomaly is isolated to qa1 32k/64k; on every other cell prefill rates converge to ~3 % of vanilla. Hypothesis: the qa1 short-context cells were the earliest experimental runs (phase-1 with 78 backend 503-storms) and the timing data may include retry latency. Worth re-measuring on a fresh run.

05 · Within-run drift — how Modulum behaves over sustained operation

Production-deployment signal: does accuracy degrade run-to-run?

Each cell's samples split into 3 equal slices by sample_idx (early / mid / late) and accuracy measured per slice. Cells with N≥100 are shown below.

Cell	Early	Mid	Late	Late − Early	Read
qa1 32k (N=100)	90.9 %	84.8 %	91.2 %	+0.3 pp	Stable.
qa1 64k (N=100)	87.9 %	78.8 %	64.7 %	−23.2 pp	Monotonic decay over 100 sequential samples. Production blocker until diagnosed.
qa1 128k (N=200)	69.7 %	68.2 %	76.5 %	+6.8 pp	Late tercile is the phase-5 extension (idx 100–199) — easier-on-average prompts. Phase mix masks drift signal here.
qa2 32k (N=100)	54.5 %	57.6 %	47.1 %	−7.5 pp	Modest drift.
qa2 64k (N=200)	50.0 %	28.8 %	44.1 %	−5.9 pp	Non-monotonic — mid-tercile dip.
qa2 128k (N=200)	45.5 %	33.3 %	39.7 %	−5.7 pp	Modest drift.
qa3 32k (N=100)	18.2 %	33.3 %	44.1 %	+25.9 pp	Accuracy IMPROVES across the run. Counter-evidence to qa3 32k regression claim — sample order matters here.
qa3 64k (N=100)	33.3 %	36.4 %	29.4 %	−3.9 pp	Stable.
qa3 128k (N=500)	32.5 %	26.5 %	22.0 %	−10.5 pp	Sustained drift across N=500. Confirmed across two independent runs (phase-5 + phase-10). The qa1 64k pattern is not isolated.

Two cells show production-concerning drift: qa1 64k (−23 pp end-to-end, N=100) and qa3 128k (−10.5 pp, N=500 across two phases). Both are real — same direction across independent measurements. The qa3 32k cell shows the opposite pattern (improving with samples), which suggests the platform's behavior is sample-order-dependent on temporal-reasoning prompts.

06 · What Modulum does — synthesis

The honest answer, finding by finding.

It really does help base Gemma-4 on the BABILong matrix — but unevenly.

Average lift of +11.78 pp across 9 mask cells. Three cells reach p<0.05 at N=50 (qa1 64k, qa2 32k, qa2 128k). The other 6 trend positive but are below the noise floor at this sample size. The lift exists; the average is the right direction; individual cell claims need higher N to formalize.
The strongest single finding is qa2 at short context (+28 pp, p=0.003).

qa2 32k Modulum 56 % vs vanilla 28 %. This is the load-bearing piece of evidence the platform does real architectural work — a 2× absolute accuracy doubling on 2-fact reasoning at the easiest context length, well outside sampling noise.
On qa3 multi-fact temporal reasoning, Modulum has the flattest decay slope of any tested stack (−2.5 pp / 2× context).

Better than Opus 4.6 (−4.0), GPT-5.5 (−9.0), Gemini 3.1 Pro (−7.6), Grok 4.3 (−8.2). The "First, Not Lost" thesis is strongest here. Caveat: vanilla Gemma-4 already has a −4.0 slope on qa3 — Modulum extends it by +1.5 pp/2×; the rest of the qa3 slope advantage is base-model inherited.
Modulum decodes 17–22 % FASTER than vanilla on qa3 across every context length.

qa3 32k: 49.5 t/s vs 40.7 t/s (+21.6 %). qa3 64k: 45.9 vs 38.0 (+21.0 %). qa3 128k: 40.2 vs 34.4 (+16.9 %). Hypothesis: the attention conditioning produces more focused, shorter outputs on multi-fact reasoning. The qa3 lift comes with no compute penalty.
The trade-off: 19–30 % slower decode on qa1 short-context retrieval.

qa1 32k: 35.1 t/s vs 50.4 t/s (−30 %). qa1 64k: 33.6 vs 41.5 (−19 %). This is the platform overhead on the easiest task. Speed converges to vanilla by 128k.
Sustained-run drift on two cells is a production-deployment blocker.

qa1 64k drops 23 pp end-to-end across 100 sequential samples (87.9 % → 64.7 %). qa3 128k drops 10.5 pp across 500 samples (32.5 % → 22.0 %, confirmed across two independent phases). Likely KV-cache state accumulation or attention drift. Needs diagnosis before partner deployment.
The qa3 32k regression claim is not significant.

Modulum 22 % vs vanilla 28 % = −6 pp, p=0.49 — well above any significance threshold at N=50. Could be sample noise. Within-run data adds a second caveat: the same cell shows +25.9 pp improvement from early to late tercile, suggesting sample order strongly affects Modulum's behavior on short-context temporal prompts.
Modulum trails current-generation frontier by 38–43 pp on absolute 128k accuracy.

128k task averages: Opus 4.6 88.7 %, GPT-5.5 84.0 %, Gemini 3.1 Pro 65.6 %, Opus 4.7 65.3 %, Modulum 46.0 %, Grok 4.3 21.1 %. The value of Modulum is not absolute-capability parity with hyperscaler-served frontier; it is workstation-scale qa3 slope preservation on an open-weight base.
We cannot evaluate Modulum at 256k+ context yet.

Hypernym's Modulum endpoint is documented as capped at 128k. We did not run probe requests above 128k against Modulum to obtain a direct 4xx confirmation in SQLite. The strongest possible "Modulum holds where frontier collapses" claim is currently untestable. Important: the prior v1/v2 "Gemini 1M collapses to 0 %" claim is RETRACTED — those 50/50 failures were HTTP 429 spending-cap errors, not Gemini-model-context failures. Codex audit caught this; we have no published 1M data either direction.
Modulum's published deployment profile is workstation-scale single-GPU.

Quoted to us by Hypernym; not independently measured in this bench. The previously cited specific figures (16 GB / 50–100× / 250–640 GB serving fleet) are not evidenced in our canonical data and have been softened or removed.

What the data actually says

The cleanest possible isolation of platform contribution.

Headline accuracy with exact Wilson 95% CIs.

Where Modulum's qa3 slope is best-in-panel.

The compute cost basis of the accuracy lift.

Decode tokens / second (output generation rate)

Prefill tokens / second (context ingest rate)

Production-deployment signal: does accuracy degrade run-to-run?

The honest answer, finding by finding.

It really does help base Gemma-4 on the BABILong matrix — but unevenly.

The strongest single finding is qa2 at short context (+28 pp, p=0.003).

On qa3 multi-fact temporal reasoning, Modulum has the flattest decay slope of any tested stack (−2.5 pp / 2× context).

Modulum decodes 17–22 % FASTER than vanilla on qa3 across every context length.

The trade-off: 19–30 % slower decode on qa1 short-context retrieval.

Sustained-run drift on two cells is a production-deployment blocker.

The qa3 32k regression claim is not significant.

Modulum trails current-generation frontier by 38–43 pp on absolute 128k accuracy.

We cannot evaluate Modulum at 256k+ context yet.

Modulum's published deployment profile is workstation-scale single-GPU.