The full mask-ablation matrix is complete: 9 of 9 cells at N=50 vs N=50, same Gemma-4-31B-Q4 weights, same prompts (idx 0..49), with vs without Hypernym's platform components. This page reports everything we measured — accuracy, decode speed, prefill speed, decay slope across context, within-run drift — without rounding or approximation. Every number traces to exports/all_results.csv + exports/full_audit.json.
Modulum's platform contributes a real, measurable lift over the same base Gemma-4-31B-Q4 weights — but the effect is concentrated and conditional, not uniform. The strongest signal is on qa2 (2-fact reasoning) at short and long context (+28 pp at 32k, +20 pp at 128k, both significant at p<0.05). The mechanism that produces these lifts also produces ~20–30 % slower prefill at short qa1 retrieval and a concerning sustained-run drift (−23 pp end-to-end on qa1 64k).
On qa3 (3-fact temporal reasoning), Modulum delivers the flattest decay slope of any tested stack at −2.5 pp / doubling of context (Opus 4.6 is −4 pp, GPT-5.5 is −9 pp, Gemini 3.1 Pro is −7.6 pp). Modulum also decodes 17–22 % FASTER than vanilla on qa3 across all three context lengths — likely because attention conditioning shortens output. The qa3 story is the strongest "First, Not Lost" evidence in the dataset.
Average platform lift across the 9 completed mask cells is +11.78 pp; only 3 of 9 cells reach p<0.05 at N=50, which means the average is a real direction but most individual lifts need higher N to confirm. The qa3 32k regression (−6 pp) is not statistically significant and may be sample noise. Modulum trails frontier (Opus 4.6 = 88.7 % avg) on absolute accuracy by 42.7 pp — the value is not capability parity, it is workstation-scale qa3 slope.
Same Gemma-4-31B-Q4 weights, same 50 prompts (idx 0..49) per cell, vanilla served via raw llama.cpp /completion on Hypernym's mirror endpoint, Modulum served via the full platform stack. p-values via two-sample Wald test for difference of proportions.
| Cell | Vanilla (raw) | Vanilla acc | Modulum (raw) | Modulum acc | Δ Platform | z-score | Wald p | Significance |
|---|---|---|---|---|---|---|---|---|
| qa1 32k | 39/50 | 78.0 % | 45/50 | 90.0 % | +12.0 pp | +1.66 | 0.0971 | ns |
| qa1 64k | 33/50 | 66.0 % | 42/50 | 84.0 % | +18.0 pp | +2.12 | 0.0336 | * |
| qa1 128k | 31/50 | 62.0 % | 36/50 | 72.0 % | +10.0 pp | +1.07 | 0.2849 | ns |
| qa2 32k | 14/50 | 28.0 % | 28/50 | 56.0 % | +28.0 pp | +2.96 | 0.0031 | ** |
| qa2 64k | 22/50 | 44.0 % | 26/50 | 52.0 % | +8.0 pp | +0.80 | 0.4218 | ns |
| qa2 128k | 15/50 | 30.0 % | 25/50 | 50.0 % | +20.0 pp | +2.09 | 0.0371 | * |
| qa3 32k | 14/50 | 28.0 % | 11/50 | 22.0 % | −6.0 pp | −0.69 | 0.4874 | ns |
| qa3 64k | 16/50 | 32.0 % | 19/50 | 38.0 % | +6.0 pp | +0.63 | 0.5286 | ns |
| qa3 128k | 10/50 | 20.0 % | 15/50 | 30.0 % | +10.0 pp | +1.16 | 0.2450 | ns |
Mean Δ Platform across 9 cells = +11.78 pp. Three cells reach p<0.05 (qa1 64k, qa2 32k, qa2 128k). The qa2 32k lift (+28 pp at p=0.0031) is the load-bearing single piece of evidence the platform does real architectural work. The qa3 32k −6 pp is not significant; could be sample noise on N=50.
These cells use Modulum's full sample budget (N=100–500), not just idx 0..49. They represent the absolute capability ceiling of Modulum on a 31B-Q4 workstation model at each context length and task.
| Task | Length | correct / N | Accuracy | Wilson 95 % CI | Half-width | Errors |
|---|---|---|---|---|---|---|
| qa1 | 32k | 89 / 100 | 89.00 % | [81.4, 93.8] | ±6.2 pp | 0 |
| qa1 | 64k | 77 / 100 | 77.00 % | [67.8, 84.2] | ±8.2 pp | 0 |
| qa1 | 128k | 143 / 200 | 71.50 % | [64.9, 77.3] | ±6.2 pp | 3 |
| qa2 | 32k | 53 / 100 | 53.00 % | [43.3, 62.5] | ±9.6 pp | 0 |
| qa2 | 64k | 82 / 200 | 41.00 % | [34.4, 47.9] | ±6.8 pp | 0 |
| qa2 | 128k | 79 / 200 | 39.50 % | [33.0, 46.4] | ±6.7 pp | 0 |
| qa3 | 32k | 32 / 100 | 32.00 % | [23.7, 41.7] | ±9.0 pp | 0 |
| qa3 | 64k | 33 / 100 | 33.00 % | [24.6, 42.7] | ±9.1 pp | 0 |
| qa3 | 128k | 135 / 500 | 27.00 % | [23.3, 31.1] | ±3.9 pp | 0 |
128k average across qa1/qa2/qa3 = 46.00 %. qa3 128k is the cell with the tightest CI (N=500): the 27 % accuracy holds at ±3.9 pp. The 3 errors on qa1 128k are 503-busy retries that recovered. All other cells: zero backend errors.
OLS linear fit of accuracy on log₂(context tokens) across cells with N≥20 and context≥8k. Lower magnitude = flatter decay = better long-context preservation. Modulum's headline slope advantage is on qa3, not qa1.
| Stack | qa1 slope | qa2 slope | qa3 slope | How Modulum compares |
|---|---|---|---|---|
| Claude Opus 4.6 | −0.0 pp/2× | −0.0 pp/2× | −4.0 pp/2× | Hyperscaler ceiling — flat across all three tasks at high absolute accuracy. |
| GPT-5.5 | −2.0 pp/2× | +0.0 pp/2× | −9.0 pp/2× | Steepest qa3 slope in the high-accuracy cluster. |
| Claude Opus 4.7 | −2.0 pp/2× | −2.0 pp/2× | +2.0 pp/2× | Flat slope but low qa3 intercept; the regression vs 4.6 is in level, not shape. |
| Modulum (Gemma-4-31B-Q4) | −8.75 pp/2× | −6.75 pp/2× | −2.5 pp/2× | Best qa3 slope in the panel (except Opus 4.7 which sits at much lower absolute acc). qa1 and qa2 slopes are middling — steeper than Opus/GPT, flatter than Gemini/Grok. |
| Vanilla Gemma-4-31B-Q4 | −8.0 pp/2× | +1.0 pp/2× | −4.0 pp/2× | qa3 slope is partially inherited from the base model. Platform contribution on qa3 is +1.5 pp/2× better than vanilla. |
| Gemini 3.1 Pro | −15.3 pp/2× | −25.8 pp/2× | −7.6 pp/2× | Modulum has notably flatter slope on qa1 and qa2 than Gemini. |
| Grok 4.3 | −25.0 pp/2× | −20.0 pp/2× | −8.19 pp/2× | Steepest decay measured. |
Critical engineering note: the qa3 slope advantage is partly base-model — vanilla Gemma-4 already decays at −4.0 pp/2× on qa3 (matching Opus 4.6). Modulum extends that by +1.5 pp/2× to −2.5 pp/2×. The platform contribution on qa3 is in both the intercept (absolute level lifted ~+10 pp) and the slope.
Tokens-per-second medians across all 9 mask cells, both sides. The llama.cpp timings block exposes prefill_ms and decode_ms natively for Modulum + Vanilla; frontier API models do not expose these so the comparison is platform-only.
| Cell | Modulum | Vanilla | Δ Platform | Read |
|---|---|---|---|---|
| qa1 32k | 35.1 t/s | 50.4 t/s | −30.3 % | Platform overhead largest on short-context retrieval. |
| qa1 64k | 33.6 t/s | 41.5 t/s | −18.9 % | Overhead persists at 64k qa1. |
| qa1 128k | 37.1 t/s | 35.9 t/s | +3.2 % | Convergence at long context. |
| qa2 32k | 39.5 t/s | 40.4 t/s | −2.2 % | ~Even on 2-fact reasoning. |
| qa2 64k | 35.1 t/s | 37.6 t/s | −6.8 % | Slight overhead. |
| qa2 128k | 32.7 t/s | 34.9 t/s | −6.4 % | Slight overhead. |
| qa3 32k | 49.5 t/s | 40.7 t/s | +21.6 % | Platform faster. Attention conditioning likely shortens output. |
| qa3 64k | 45.9 t/s | 38.0 t/s | +21.0 % | Same pattern. |
| qa3 128k | 40.2 t/s | 34.4 t/s | +16.9 % | Speedup holds at long context. |
The decode story is non-uniform. Modulum is 19–30 % SLOWER on qa1 retrieval at short/mid context, but 17–22 % FASTER on qa3 at every context length. The compute cost basis of the qa3 accuracy lift is therefore negative — Modulum is both more accurate and faster on the hardest task.
| Cell | Modulum | Vanilla | Δ Platform |
|---|---|---|---|
| qa1 32k | 403 t/s | 877 t/s | −54.0 % |
| qa1 64k | 349 t/s | 773 t/s | −54.9 % |
| qa1 128k | 593 t/s | 615 t/s | −3.5 % |
| qa2 32k | 865 t/s | 878 t/s | −1.5 % |
| qa2 64k | 756 t/s | 773 t/s | −2.2 % |
| qa2 128k | 594 t/s | 615 t/s | −3.4 % |
| qa3 32k | 865 t/s | 878 t/s | −1.4 % |
| qa3 64k | 755 t/s | 774 t/s | −2.5 % |
| qa3 128k | 594 t/s | 613 t/s | −3.1 % |
qa1 short-context prefill is an outlier — Modulum runs 54 % slower than vanilla at 32k and 64k qa1. This anomaly is isolated to qa1 32k/64k; on every other cell prefill rates converge to ~3 % of vanilla. Hypothesis: the qa1 short-context cells were the earliest experimental runs (phase-1 with 78 backend 503-storms) and the timing data may include retry latency. Worth re-measuring on a fresh run.
Each cell's samples split into 3 equal slices by sample_idx (early / mid / late) and accuracy measured per slice. Cells with N≥100 are shown below.
| Cell | Early | Mid | Late | Late − Early | Read |
|---|---|---|---|---|---|
| qa1 32k (N=100) | 90.9 % | 84.8 % | 91.2 % | +0.3 pp | Stable. |
| qa1 64k (N=100) | 87.9 % | 78.8 % | 64.7 % | −23.2 pp | Monotonic decay over 100 sequential samples. Production blocker until diagnosed. |
| qa1 128k (N=200) | 69.7 % | 68.2 % | 76.5 % | +6.8 pp | Late tercile is the phase-5 extension (idx 100–199) — easier-on-average prompts. Phase mix masks drift signal here. |
| qa2 32k (N=100) | 54.5 % | 57.6 % | 47.1 % | −7.5 pp | Modest drift. |
| qa2 64k (N=200) | 50.0 % | 28.8 % | 44.1 % | −5.9 pp | Non-monotonic — mid-tercile dip. |
| qa2 128k (N=200) | 45.5 % | 33.3 % | 39.7 % | −5.7 pp | Modest drift. |
| qa3 32k (N=100) | 18.2 % | 33.3 % | 44.1 % | +25.9 pp | Accuracy IMPROVES across the run. Counter-evidence to qa3 32k regression claim — sample order matters here. |
| qa3 64k (N=100) | 33.3 % | 36.4 % | 29.4 % | −3.9 pp | Stable. |
| qa3 128k (N=500) | 32.5 % | 26.5 % | 22.0 % | −10.5 pp | Sustained drift across N=500. Confirmed across two independent runs (phase-5 + phase-10). The qa1 64k pattern is not isolated. |
Two cells show production-concerning drift: qa1 64k (−23 pp end-to-end, N=100) and qa3 128k (−10.5 pp, N=500 across two phases). Both are real — same direction across independent measurements. The qa3 32k cell shows the opposite pattern (improving with samples), which suggests the platform's behavior is sample-order-dependent on temporal-reasoning prompts.
Average lift of +11.78 pp across 9 mask cells. Three cells reach p<0.05 at N=50 (qa1 64k, qa2 32k, qa2 128k). The other 6 trend positive but are below the noise floor at this sample size. The lift exists; the average is the right direction; individual cell claims need higher N to formalize.
qa2 32k Modulum 56 % vs vanilla 28 %. This is the load-bearing piece of evidence the platform does real architectural work — a 2× absolute accuracy doubling on 2-fact reasoning at the easiest context length, well outside sampling noise.
Better than Opus 4.6 (−4.0), GPT-5.5 (−9.0), Gemini 3.1 Pro (−7.6), Grok 4.3 (−8.2). The "First, Not Lost" thesis is strongest here. Caveat: vanilla Gemma-4 already has a −4.0 slope on qa3 — Modulum extends it by +1.5 pp/2×; the rest of the qa3 slope advantage is base-model inherited.
qa3 32k: 49.5 t/s vs 40.7 t/s (+21.6 %). qa3 64k: 45.9 vs 38.0 (+21.0 %). qa3 128k: 40.2 vs 34.4 (+16.9 %). Hypothesis: the attention conditioning produces more focused, shorter outputs on multi-fact reasoning. The qa3 lift comes with no compute penalty.
qa1 32k: 35.1 t/s vs 50.4 t/s (−30 %). qa1 64k: 33.6 vs 41.5 (−19 %). This is the platform overhead on the easiest task. Speed converges to vanilla by 128k.
qa1 64k drops 23 pp end-to-end across 100 sequential samples (87.9 % → 64.7 %). qa3 128k drops 10.5 pp across 500 samples (32.5 % → 22.0 %, confirmed across two independent phases). Likely KV-cache state accumulation or attention drift. Needs diagnosis before partner deployment.
Modulum 22 % vs vanilla 28 % = −6 pp, p=0.49 — well above any significance threshold at N=50. Could be sample noise. Within-run data adds a second caveat: the same cell shows +25.9 pp improvement from early to late tercile, suggesting sample order strongly affects Modulum's behavior on short-context temporal prompts.
128k task averages: Opus 4.6 88.7 %, GPT-5.5 84.0 %, Gemini 3.1 Pro 65.6 %, Opus 4.7 65.3 %, Modulum 46.0 %, Grok 4.3 21.1 %. The value of Modulum is not absolute-capability parity with hyperscaler-served frontier; it is workstation-scale qa3 slope preservation on an open-weight base.
Hypernym's Modulum endpoint is documented as capped at 128k. We did not run probe requests above 128k against Modulum to obtain a direct 4xx confirmation in SQLite. The strongest possible "Modulum holds where frontier collapses" claim is currently untestable. Important: the prior v1/v2 "Gemini 1M collapses to 0 %" claim is RETRACTED — those 50/50 failures were HTTP 429 spending-cap errors, not Gemini-model-context failures. Codex audit caught this; we have no published 1M data either direction.
Quoted to us by Hypernym; not independently measured in this bench. The previously cited specific figures (16 GB / 50–100× / 250–640 GB serving fleet) are not evidenced in our canonical data and have been softened or removed.