Ablations & Negative Results

Every complexity increase tested, including those that reduced the score

Summary

Negative results are logged; no model that failed to help is dropped silently. Across the study, nearly every increase in model complexity reduced the score. At 393 positives, the efficient, trivial choices (a small backbone, a rank-average combiner, intrinsic tabular features) are both cheaper and more accurate. That negative result is the contribution.

Tabular GBDT progression

Step	Change	OOF pAUC	Δ
0	Broken: `is_unbalance` + AUC early-stop → `best_iter = 1`	0.09941	—
1	Fix early-stopping / objective (drop `is_unbalance`, early-stop on pAUC)	0.11826	+0.01885
2	Wide patient-relative ugly-duckling features (`pdev_` / `prank_` / `pxc_`)	0.14420	+0.02594
3	Greysky-bagged LightGBM + `pxc_` + CatBoost rank-blend	0.16890	+0.02470

The broken config scored only about 5× random: under extreme imbalance, the training setup matters more than the model. Fixing early stopping alone (Step 1) recovers +0.019. The patient-relative feature block (Step 2) is the single largest jump (+0.026), which supports the ugly-duckling thesis. Bagging plus CatBoost (Step 3) adds +0.025 from variance reduction across seeds and model families.
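For readers who want to reproduce the early-stopping fix, the sketch below shows one way to compute a partial AUC in pure Python. It assumes that "pAUC" here means the area under the ROC curve that lies above a true-positive rate of 0.8 (so the maximum is 0.2 and an uninformative scorer lands near 0.02); this section does not state the definition, so treat both `min_tpr` and the function name `pauc_above_tpr` as illustrative assumptions.

```python
def pauc_above_tpr(y_true, y_score, min_tpr=0.8):
    """Area under the ROC curve that lies above TPR = min_tpr.

    Assumption: this is the pAUC variant used in the tables; the
    section itself does not define the metric.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Walk thresholds from the highest score down; ties form one ROC step.
    order = sorted(range(len(y_true)), key=lambda i: -y_score[i])
    area, tp, fp = 0.0, 0, 0
    f0, t0 = 0.0, 0.0  # previous ROC point (FPR, TPR)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and y_score[order[j]] == y_score[order[i]]:
            tp += y_true[order[j]] == 1
            fp += y_true[order[j]] != 1
            j += 1
        f1, t1 = fp / n_neg, tp / n_pos
        if t1 > min_tpr:
            if t0 >= min_tpr:
                # Whole segment is above the TPR floor: trapezoid.
                area += (f1 - f0) * ((t0 - min_tpr) + (t1 - min_tpr)) / 2
            else:
                # Segment crosses the floor: keep only the triangle above it.
                fc = f0 + (min_tpr - t0) / (t1 - t0) * (f1 - f0)
                area += (f1 - fc) * (t1 - min_tpr) / 2
        f0, t0 = f1, t1
        i = j
    return area
```

A function like this can be wrapped as a custom validation metric so that early stopping tracks the target metric rather than full AUC, which is the change Step 1 describes.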

Image expert — resolution & regularizer

Config	OOF pAUC	Verdict
`convnextv2_nano` @128	0.15311	baseline image point
`convnextv2_nano` @224 + EMA0.995 + mixup	0.15821	best image
↳ same, EMA 0.999	~0.146	reduced (over-smooths at ~23 steps/epoch)
`convnextv2_tiny` @224	0.15824	comparable to nano, ~2× the cost
`swinv2_tiny` @256	0.10422	collapse — near random
`eva02_small` @336	0.09984	collapse — near random

Higher resolution plus light EMA (0.995) and mixup on the small nano backbone is the best trade-off. Stronger EMA (0.999) over-smooths because the undersampled epochs are short (~23 steps each). Heavy transformers (Swin, EVA-02) collapse to near-random, consistent with overfitting at 393 positives. convnextv2_tiny (roughly 2× nano’s params) only matches nano (0.15824 vs 0.15821), so the extra cost buys no meaningful gain.
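The over-smoothing argument can be made concrete with a little arithmetic. An exponential moving average with decay d averages over roughly 1 / (1 - d) recent steps; the sketch below converts that window into epochs using the ~23 steps/epoch figure from the table. The helper names are illustrative, and the total number of training epochs is not given in this section.

```python
def ema_horizon_steps(decay):
    """Approximate averaging window of an EMA, in optimizer steps."""
    return 1.0 / (1.0 - decay)


def ema_horizon_epochs(decay, steps_per_epoch):
    """The same window expressed in epochs."""
    return ema_horizon_steps(decay) / steps_per_epoch


for decay in (0.995, 0.999):
    steps = round(ema_horizon_steps(decay))
    epochs = round(ema_horizon_epochs(decay, 23), 1)
    print(decay, steps, epochs)
# 0.995 200 8.7
# 0.999 1000 43.5
```

At 0.999 the averaged weights span about 43 epochs of history, five times the window at 0.995. If the run is shorter than that window, the EMA model is still dominated by early-training weights when it is evaluated.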

Stack combiners

Combiner	OOF pAUC	Verdict
rank-avg [GBDT, nano@224]	0.17376	winner
meta-LGBM stacker	0.17108	loses to rank-average
rank [+ weak images]	0.16679	dilutes
learned per-lesion gate (MoE)	0.15007	loss — below tabular 0.16890
+ PCA image embeddings into GBDT	< 0.17376	reduces

The trivial 2-way rank-average wins. Every learned or heavier combiner loses, and the MoE gate falls below the tabular expert alone (0.150 < 0.169). Adding weak image models to the rank blend, or feeding PCA image embeddings into the GBDT, lowers the score.
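A rank-average combiner of the kind described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the study's code: the function names, the tie handling (tied scores share their mean position), and the scaling of ranks to [0, 1] are all assumptions.

```python
def average_ranks(scores):
    """Ranks scaled to [0, 1]; tied scores share the mean of their positions."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_pos = (i + j) / 2  # average 0-based position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_pos / (n - 1) if n > 1 else 0.0
        i = j + 1
    return ranks


def rank_average(*expert_scores):
    """Average the per-expert ranks; there are no parameters to fit."""
    ranked = [average_ranks(s) for s in expert_scores]
    return [sum(col) / len(ranked) for col in zip(*ranked)]


# Hypothetical out-of-fold scores from two experts for four lesions.
gbdt_oof = [0.10, 0.40, 0.35, 0.80]
image_oof = [0.20, 0.90, 0.30, 0.60]
blended = rank_average(gbdt_oof, image_oof)
```

Because ranks discard each expert's score scale, the two models do not need to be calibrated against each other, and nothing is estimated from the validation folds.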

Reconstruction note. A recomputed rank-average over {GBDT + 4 image OOFs} scores 0.16917; the logged “rank [+ weak images]” run (0.16679) used a different weak-image set and is kept as the authoritative logged number. The conclusion holds either way: adding weak images dilutes.

Negative results summary

Idea tried	Its pAUC	Beaten by	Finding
Learned per-lesion gate (MoE)	0.15007	0.16890 (tabular)	loses to tabular alone
Meta-LGBM stacker	0.17108	0.17376 (rank-avg)	loses to rank-average
+ PCA image embeddings into GBDT	< 0.17376	0.17376 (rank-avg)	embeddings reduce the score
Heavy backbones (SwinV2 / EVA-02)	0.104 / 0.100	0.15821 (nano@224)	dominated — overfit at 393 pos
EMA 0.999 (over-smoothing)	~0.146	0.15821 (EMA0.995)	stronger EMA reduces

Why complexity reduces the score

Each learned add-on must estimate its parameters from the same 393 positives the base experts already consumed. A rank-average has zero parameters and cannot overfit the validation folds; a meta-learner, a gate, or extra embedding dimensions all have parameters to fit and insufficient fresh positive signal to fit them well. The result extends a classical lesson to the extreme-imbalance regime: when positives are scarce, model selection should bias toward zero-parameter fusion and small backbones. That choice is also the cheapest, which places it on the efficiency frontier.
