Modulum × BABILong · Complete ablation findings

What Modulum does, across every variable we measured.

The full mask-ablation matrix is complete: 9 of 9 cells at N=50 vs N=50, same Gemma-4-31B-Q4 weights, same prompts (idx 0..49), with vs without Hypernym's platform components. This page reports everything we measured — accuracy, decode speed, prefill speed, decay slope across context, within-run drift — without rounding or approximation. Every number traces to exports/all_results.csv + exports/full_audit.json.

2026-05-18 · v4.1 · post Codex+Grok file-read audit 4,917 canonical rows · 25 SQLite databases Significance: two-sample Wald, two-sided

What the data actually says

Modulum's platform contributes a real, measurable lift over the same base Gemma-4-31B-Q4 weights — but the effect is concentrated and conditional, not uniform. The strongest signal is on qa2 (2-fact reasoning) at short and long context (+28 pp at 32k, +20 pp at 128k, both significant at p<0.05). The mechanism that produces these lifts also produces ~20–30 % slower prefill at short qa1 retrieval and a concerning sustained-run drift (−23 pp end-to-end on qa1 64k).

On qa3 (3-fact temporal reasoning), Modulum delivers the flattest decay slope of any tested stack at −2.5 pp / doubling of context (Opus 4.6 is −4 pp, GPT-5.5 is −9 pp, Gemini 3.1 Pro is −7.6 pp). Modulum also decodes 17–22 % FASTER than vanilla on qa3 across all three context lengths — likely because attention conditioning shortens output. The qa3 story is the strongest "First, Not Lost" evidence in the dataset.

Average platform lift across the 9 completed mask cells is +11.78 pp; only 3 of 9 cells reach p<0.05 at N=50, which means the average is a real direction but most individual lifts need higher N to confirm. The qa3 32k regression (−6 pp) is not statistically significant and may be sample noise. Modulum trails frontier (Opus 4.6 = 88.7 % avg) on absolute accuracy by 42.7 pp — the value is not capability parity, it is workstation-scale qa3 slope.

Avg platform lift (9 cells)
+11.78 pp
Range: −6 pp (qa3 32k, ns) to +28 pp (qa2 32k, **)
Significant lifts at N=50
3 / 9
qa1 64k *, qa2 32k **, qa2 128k *
qa3 decay slope (best in panel)
−2.5 pp / 2×
Flatter than Opus 4.6 (−4 pp), GPT-5.5 (−9 pp), Gemini (−7.6 pp)
qa3 decode speedup vs vanilla
+17 to +22 %
Faster on hard task — likely shorter outputs

The cleanest possible isolation of platform contribution.

Same Gemma-4-31B-Q4 weights, same 50 prompts (idx 0..49) per cell, vanilla served via raw llama.cpp /completion on Hypernym's mirror endpoint, Modulum served via the full platform stack. p-values via two-sample Wald test for difference of proportions.

Cell Vanilla (raw) Vanilla acc Modulum (raw) Modulum acc Δ Platform z-score Wald p Significance
qa1 32k39/5078.0 %45/5090.0 %+12.0 pp+1.660.0971ns
qa1 64k33/5066.0 %42/5084.0 %+18.0 pp+2.120.0336*
qa1 128k31/5062.0 %36/5072.0 %+10.0 pp+1.070.2849ns
qa2 32k14/5028.0 %28/5056.0 %+28.0 pp+2.960.0031**
qa2 64k22/5044.0 %26/5052.0 %+8.0 pp+0.800.4218ns
qa2 128k15/5030.0 %25/5050.0 %+20.0 pp+2.090.0371*
qa3 32k14/5028.0 %11/5022.0 %−6.0 pp−0.690.4874ns
qa3 64k16/5032.0 %19/5038.0 %+6.0 pp+0.630.5286ns
qa3 128k10/5020.0 %15/5030.0 %+10.0 pp+1.160.2450ns

Mean Δ Platform across 9 cells = +11.78 pp. Three cells reach p<0.05 (qa1 64k, qa2 32k, qa2 128k). The qa2 32k lift (+28 pp at p=0.0031) is the load-bearing single piece of evidence the platform does real architectural work. The qa3 32k −6 pp is not significant; could be sample noise on N=50.

Headline accuracy with exact Wilson 95% CIs.

These cells use Modulum's full sample budget (N=100–500), not just idx 0..49. They represent the absolute capability ceiling of Modulum on a 31B-Q4 workstation model at each context length and task.

TaskLengthcorrect / NAccuracyWilson 95 % CIHalf-widthErrors
qa132k89 / 10089.00 %[81.4, 93.8]±6.2 pp0
qa164k77 / 10077.00 %[67.8, 84.2]±8.2 pp0
qa1128k143 / 20071.50 %[64.9, 77.3]±6.2 pp3
qa232k53 / 10053.00 %[43.3, 62.5]±9.6 pp0
qa264k82 / 20041.00 %[34.4, 47.9]±6.8 pp0
qa2128k79 / 20039.50 %[33.0, 46.4]±6.7 pp0
qa332k32 / 10032.00 %[23.7, 41.7]±9.0 pp0
qa364k33 / 10033.00 %[24.6, 42.7]±9.1 pp0
qa3128k135 / 50027.00 %[23.3, 31.1]±3.9 pp0

128k average across qa1/qa2/qa3 = 46.00 %. qa3 128k is the cell with the tightest CI (N=500): the 27 % accuracy holds at ±3.9 pp. The 3 errors on qa1 128k are 503-busy retries that recovered. All other cells: zero backend errors.

Where Modulum's qa3 slope is best-in-panel.

OLS linear fit of accuracy on log₂(context tokens) across cells with N≥20 and context≥8k. Lower magnitude = flatter decay = better long-context preservation. Modulum's headline slope advantage is on qa3, not qa1.

Stack qa1 slope qa2 slope qa3 slope How Modulum compares
Claude Opus 4.6−0.0 pp/2×−0.0 pp/2×−4.0 pp/2×Hyperscaler ceiling — flat across all three tasks at high absolute accuracy.
GPT-5.5−2.0 pp/2×+0.0 pp/2×−9.0 pp/2×Steepest qa3 slope in the high-accuracy cluster.
Claude Opus 4.7−2.0 pp/2×−2.0 pp/2×+2.0 pp/2×Flat slope but low qa3 intercept; the regression vs 4.6 is in level, not shape.
Modulum (Gemma-4-31B-Q4)−8.75 pp/2×−6.75 pp/2×−2.5 pp/2×Best qa3 slope in the panel (except Opus 4.7 which sits at much lower absolute acc). qa1 and qa2 slopes are middling — steeper than Opus/GPT, flatter than Gemini/Grok.
Vanilla Gemma-4-31B-Q4−8.0 pp/2×+1.0 pp/2×−4.0 pp/2×qa3 slope is partially inherited from the base model. Platform contribution on qa3 is +1.5 pp/2× better than vanilla.
Gemini 3.1 Pro−15.3 pp/2×−25.8 pp/2×−7.6 pp/2×Modulum has notably flatter slope on qa1 and qa2 than Gemini.
Grok 4.3−25.0 pp/2×−20.0 pp/2×−8.19 pp/2×Steepest decay measured.

Critical engineering note: the qa3 slope advantage is partly base-model — vanilla Gemma-4 already decays at −4.0 pp/2× on qa3 (matching Opus 4.6). Modulum extends that by +1.5 pp/2× to −2.5 pp/2×. The platform contribution on qa3 is in both the intercept (absolute level lifted ~+10 pp) and the slope.

The compute cost basis of the accuracy lift.

Tokens-per-second medians across all 9 mask cells, both sides. The llama.cpp timings block exposes prefill_ms and decode_ms natively for Modulum + Vanilla; frontier API models do not expose these so the comparison is platform-only.

Decode tokens / second (output generation rate)

CellModulumVanillaΔ PlatformRead
qa1 32k35.1 t/s50.4 t/s−30.3 %Platform overhead largest on short-context retrieval.
qa1 64k33.6 t/s41.5 t/s−18.9 %Overhead persists at 64k qa1.
qa1 128k37.1 t/s35.9 t/s+3.2 %Convergence at long context.
qa2 32k39.5 t/s40.4 t/s−2.2 %~Even on 2-fact reasoning.
qa2 64k35.1 t/s37.6 t/s−6.8 %Slight overhead.
qa2 128k32.7 t/s34.9 t/s−6.4 %Slight overhead.
qa3 32k49.5 t/s40.7 t/s+21.6 %Platform faster. Attention conditioning likely shortens output.
qa3 64k45.9 t/s38.0 t/s+21.0 %Same pattern.
qa3 128k40.2 t/s34.4 t/s+16.9 %Speedup holds at long context.

The decode story is non-uniform. Modulum is 19–30 % SLOWER on qa1 retrieval at short/mid context, but 17–22 % FASTER on qa3 at every context length. The compute cost basis of the qa3 accuracy lift is therefore negative — Modulum is both more accurate and faster on the hardest task.

Prefill tokens / second (context ingest rate)

CellModulumVanillaΔ Platform
qa1 32k403 t/s877 t/s−54.0 %
qa1 64k349 t/s773 t/s−54.9 %
qa1 128k593 t/s615 t/s−3.5 %
qa2 32k865 t/s878 t/s−1.5 %
qa2 64k756 t/s773 t/s−2.2 %
qa2 128k594 t/s615 t/s−3.4 %
qa3 32k865 t/s878 t/s−1.4 %
qa3 64k755 t/s774 t/s−2.5 %
qa3 128k594 t/s613 t/s−3.1 %

qa1 short-context prefill is an outlier — Modulum runs 54 % slower than vanilla at 32k and 64k qa1. This anomaly is isolated to qa1 32k/64k; on every other cell prefill rates converge to ~3 % of vanilla. Hypothesis: the qa1 short-context cells were the earliest experimental runs (phase-1 with 78 backend 503-storms) and the timing data may include retry latency. Worth re-measuring on a fresh run.

Production-deployment signal: does accuracy degrade run-to-run?

Each cell's samples split into 3 equal slices by sample_idx (early / mid / late) and accuracy measured per slice. Cells with N≥100 are shown below.

CellEarlyMidLateLate − EarlyRead
qa1 32k (N=100)90.9 %84.8 %91.2 %+0.3 ppStable.
qa1 64k (N=100)87.9 %78.8 %64.7 %−23.2 ppMonotonic decay over 100 sequential samples. Production blocker until diagnosed.
qa1 128k (N=200)69.7 %68.2 %76.5 %+6.8 ppLate tercile is the phase-5 extension (idx 100–199) — easier-on-average prompts. Phase mix masks drift signal here.
qa2 32k (N=100)54.5 %57.6 %47.1 %−7.5 ppModest drift.
qa2 64k (N=200)50.0 %28.8 %44.1 %−5.9 ppNon-monotonic — mid-tercile dip.
qa2 128k (N=200)45.5 %33.3 %39.7 %−5.7 ppModest drift.
qa3 32k (N=100)18.2 %33.3 %44.1 %+25.9 ppAccuracy IMPROVES across the run. Counter-evidence to qa3 32k regression claim — sample order matters here.
qa3 64k (N=100)33.3 %36.4 %29.4 %−3.9 ppStable.
qa3 128k (N=500)32.5 %26.5 %22.0 %−10.5 ppSustained drift across N=500. Confirmed across two independent runs (phase-5 + phase-10). The qa1 64k pattern is not isolated.

Two cells show production-concerning drift: qa1 64k (−23 pp end-to-end, N=100) and qa3 128k (−10.5 pp, N=500 across two phases). Both are real — same direction across independent measurements. The qa3 32k cell shows the opposite pattern (improving with samples), which suggests the platform's behavior is sample-order-dependent on temporal-reasoning prompts.

The honest answer, finding by finding.

  1. It really does help base Gemma-4 on the BABILong matrix — but unevenly.

    Average lift of +11.78 pp across 9 mask cells. Three cells reach p<0.05 at N=50 (qa1 64k, qa2 32k, qa2 128k). The other 6 trend positive but are below the noise floor at this sample size. The lift exists; the average is the right direction; individual cell claims need higher N to formalize.

  2. The strongest single finding is qa2 at short context (+28 pp, p=0.003).

    qa2 32k Modulum 56 % vs vanilla 28 %. This is the load-bearing piece of evidence the platform does real architectural work — a 2× absolute accuracy doubling on 2-fact reasoning at the easiest context length, well outside sampling noise.

  3. On qa3 multi-fact temporal reasoning, Modulum has the flattest decay slope of any tested stack (−2.5 pp / 2× context).

    Better than Opus 4.6 (−4.0), GPT-5.5 (−9.0), Gemini 3.1 Pro (−7.6), Grok 4.3 (−8.2). The "First, Not Lost" thesis is strongest here. Caveat: vanilla Gemma-4 already has a −4.0 slope on qa3 — Modulum extends it by +1.5 pp/2×; the rest of the qa3 slope advantage is base-model inherited.

  4. Modulum decodes 17–22 % FASTER than vanilla on qa3 across every context length.

    qa3 32k: 49.5 t/s vs 40.7 t/s (+21.6 %). qa3 64k: 45.9 vs 38.0 (+21.0 %). qa3 128k: 40.2 vs 34.4 (+16.9 %). Hypothesis: the attention conditioning produces more focused, shorter outputs on multi-fact reasoning. The qa3 lift comes with no compute penalty.

  5. The trade-off: 19–30 % slower decode on qa1 short-context retrieval.

    qa1 32k: 35.1 t/s vs 50.4 t/s (−30 %). qa1 64k: 33.6 vs 41.5 (−19 %). This is the platform overhead on the easiest task. Speed converges to vanilla by 128k.

  6. Sustained-run drift on two cells is a production-deployment blocker.

    qa1 64k drops 23 pp end-to-end across 100 sequential samples (87.9 % → 64.7 %). qa3 128k drops 10.5 pp across 500 samples (32.5 % → 22.0 %, confirmed across two independent phases). Likely KV-cache state accumulation or attention drift. Needs diagnosis before partner deployment.

  7. The qa3 32k regression claim is not significant.

    Modulum 22 % vs vanilla 28 % = −6 pp, p=0.49 — well above any significance threshold at N=50. Could be sample noise. Within-run data adds a second caveat: the same cell shows +25.9 pp improvement from early to late tercile, suggesting sample order strongly affects Modulum's behavior on short-context temporal prompts.

  8. Modulum trails current-generation frontier by 38–43 pp on absolute 128k accuracy.

    128k task averages: Opus 4.6 88.7 %, GPT-5.5 84.0 %, Gemini 3.1 Pro 65.6 %, Opus 4.7 65.3 %, Modulum 46.0 %, Grok 4.3 21.1 %. The value of Modulum is not absolute-capability parity with hyperscaler-served frontier; it is workstation-scale qa3 slope preservation on an open-weight base.

  9. We cannot evaluate Modulum at 256k+ context yet.

    Hypernym's Modulum endpoint is documented as capped at 128k. We did not run probe requests above 128k against Modulum to obtain a direct 4xx confirmation in SQLite. The strongest possible "Modulum holds where frontier collapses" claim is currently untestable. Important: the prior v1/v2 "Gemini 1M collapses to 0 %" claim is RETRACTED — those 50/50 failures were HTTP 429 spending-cap errors, not Gemini-model-context failures. Codex audit caught this; we have no published 1M data either direction.

  10. Modulum's published deployment profile is workstation-scale single-GPU.

    Quoted to us by Hypernym; not independently measured in this bench. The previously cited specific figures (16 GB / 50–100× / 250–640 GB serving fleet) are not evidenced in our canonical data and have been softened or removed.