Ablations & Negative Results
Every complexity increase tested, including those that reduced the score
Negative results are logged; no model that failed to help is dropped silently. Across the study, nearly every increase in model complexity reduced the score. At 393 positives, the efficient, trivial choices (a small backbone, a rank-average combiner, intrinsic tabular features) are both cheaper and more accurate. That negative result is the contribution.
Tabular GBDT progression
| Step | Change | OOF pAUC | Δ |
|---|---|---|---|
| 0 | Broken: is_unbalance + AUC early-stop → best_iter = 1 | 0.09941 | n/a |
| 1 | Fix early-stopping / objective (drop is_unbalance, early-stop on pAUC) | 0.11826 | +0.01885 |
| 2 | Wide patient-relative ugly-duckling features (pdev_ / prank_ / pxc_) | 0.14420 | +0.02594 |
| 3 | Greysky-bagged LightGBM + pxc_ + CatBoost rank-blend | 0.16890 | +0.02470 |
The broken config scored about 5× random: under extreme imbalance the training setup dominates the model. The patient-relative feature block (Step 2) is the single largest jump (+0.026), which supports the ugly-duckling thesis. Bagging plus CatBoost (Step 3) adds a further +0.025, attributed to variance reduction across seeds and model families.
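The patient-relative block from Step 2 can be sketched as below. The exact definitions of pdev_, prank_, and pxc_ are not given in this section, so the sketch assumes pdev_ is a lesion's z-score against its own patient's lesions and prank_ is its percentile rank within that patient; pxc_ is left out. The function name `patient_relative` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def patient_relative(patient_ids, values):
    """Return (pdev, prank) lists aligned with the input rows.

    pdev  : (value - patient mean) / patient std, 0.0 when the std is 0
    prank : fraction of the same patient's lesions with a value <= this one
    Both definitions are assumptions made for illustration.
    """
    # Group each patient's lesion values together.
    groups = defaultdict(list)
    for pid, v in zip(patient_ids, values):
        groups[pid].append(v)

    # Per-patient mean and population standard deviation.
    stats = {pid: (mean(vs), pstdev(vs)) for pid, vs in groups.items()}

    pdev, prank = [], []
    for pid, v in zip(patient_ids, values):
        mu, sd = stats[pid]
        pdev.append((v - mu) / sd if sd > 0 else 0.0)
        vs = groups[pid]
        prank.append(sum(u <= v for u in vs) / len(vs))
    return pdev, prank

# Toy data: patient "A" has one lesion much larger than the other three.
pids = ["A", "A", "A", "A", "B", "B"]
size = [2.0, 2.0, 2.0, 10.0, 5.0, 7.0]
pdev, prank = patient_relative(pids, size)
```

In the toy data the outlier lesion of patient "A" gets the largest deviation and the top within-patient rank, even though its raw size says nothing about how typical it is for that patient. That is the "ugly duckling" signal the intrinsic features alone cannot express.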
Image expert — resolution & regularizer
| Config | OOF pAUC | Verdict |
|---|---|---|
| convnextv2_nano @128 | 0.15311 | baseline image point |
| convnextv2_nano @224 + EMA 0.995 + mixup | 0.15821 | best image |
| ↳ same, EMA 0.999 | ~0.146 | reduced (over-smooths at ~23 steps/epoch) |
| convnextv2_tiny @224 | 0.15824 | comparable to nano, ~2× the cost |
| swinv2_tiny @256 | 0.10422 | collapse, near random |
| eva02_small @336 | 0.09984 | collapse, near random |
Raising resolution to 224 with light EMA (0.995) and mixup on the small nano backbone is the optimum. Stronger EMA (0.999) over-smooths because the undersampled epochs are short (~23 steps each). Heavy transformers (SwinV2, EVA-02) collapse to near-random, consistent with overfitting at 393 positives. convnextv2_tiny (2× nano’s params) matches nano to within 0.00003 (0.15824 vs 0.15821), so the extra capacity buys no meaningful gain.
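The over-smoothing argument can be checked with rough arithmetic. An EMA with decay d averages over roughly 1/(1 - d) optimizer steps; the sketch below converts that window into epochs using the ~23 steps/epoch figure from the table. The 1/(1 - d) window is a standard rule of thumb, not a measurement from these runs, and the function name `ema_horizon` is hypothetical.

```python
import math

def ema_horizon(decay, steps_per_epoch):
    """Return (horizon_steps, horizon_epochs, half_life_steps) for an EMA.

    horizon_steps   ~ 1 / (1 - decay): effective averaging window
    half_life_steps = ln(0.5) / ln(decay): steps until a weight's share halves
    """
    horizon_steps = 1.0 / (1.0 - decay)
    half_life_steps = math.log(0.5) / math.log(decay)
    return horizon_steps, horizon_steps / steps_per_epoch, half_life_steps

for decay in (0.995, 0.999):
    steps, epochs, half_life = ema_horizon(decay, steps_per_epoch=23)
    print(f"decay={decay}: ~{steps:.0f} steps, ~{epochs:.1f} epochs, "
          f"half-life ~{half_life:.0f} steps")
```

At 23 steps per epoch, decay 0.995 averages over about 200 steps (under 9 epochs), while decay 0.999 averages over about 1000 steps (over 43 epochs). If training runs for fewer epochs than that window, the averaged weights still lean heavily on early, poorly trained states, which is one plausible reading of the drop to ~0.146.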
Stack combiners
| Combiner | OOF pAUC | Verdict |
|---|---|---|
| rank-avg [GBDT, nano@224] | 0.17376 | winner |
| meta-LGBM stacker | 0.17108 | loses to rank-average |
| rank [+ weak images] | 0.16679 | dilutes |
| learned per-lesion gate (MoE) | 0.15007 | loss — below tabular 0.16890 |
| + PCA image embeddings into GBDT | < 0.17376 | reduces |
The trivial 2-way rank-average wins. Every learned or heavier combiner loses, and the MoE gate falls below the tabular expert alone (0.150 < 0.169). Adding weak images, or PCA embeddings into the GBDT, dilutes.
Reconstruction note. A recomputed rank-average over {GBDT + 4 image OOFs} scores 0.16917; the logged “rank [+ weak images]” run (0.16679) used a different weak-image set and is kept as the authoritative logged number. The conclusion holds either way: adding weak images dilutes.
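A two-way rank-average of the kind that wins above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function names, the tie handling (tied scores share their average rank), and the toy scores are all assumptions.

```python
def rank_normalize(scores):
    """Map scores to average ranks scaled into (0, 1]; ties share a rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores starting at sorted position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks

def rank_average(*experts):
    """Average the rank-normalized scores of several experts, row by row."""
    normalized = [rank_normalize(e) for e in experts]
    return [sum(col) / len(col) for col in zip(*normalized)]

# Toy out-of-fold scores on four lesions, deliberately on different scales.
gbdt_oof = [0.02, 0.90, 0.10, 0.40]   # probabilities
image_oof = [-3.1, 2.5, 0.4, -0.2]    # raw logits
blend = rank_average(gbdt_oof, image_oof)
```

Because each expert is reduced to ranks before averaging, the combiner is insensitive to calibration and scale differences between the GBDT and the image model, and it has no weights to fit on the out-of-fold predictions.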
Negative results summary
| Idea tried | Its pAUC | Beaten by | Finding |
|---|---|---|---|
| Learned per-lesion gate (MoE) | 0.15007 | 0.16890 (tabular) | loses to tabular alone |
| Meta-LGBM stacker | 0.17108 | 0.17376 (rank-avg) | loses to rank-average |
| + PCA image embeddings into GBDT | < 0.17376 | 0.17376 (rank-avg) | embeddings reduce |
| Heavy backbones (SwinV2 / EVA-02) | 0.104 / 0.100 | 0.15821 (nano@224) | dominated — overfit at 393 pos |
| EMA 0.999 (over-smoothing) | ~0.146 | 0.15821 (EMA0.995) | stronger EMA reduces |
Why complexity reduces the score
Each learned add-on must estimate its parameters from the same 393 positives the base experts already consumed. A rank-average has no fitted parameters, so it has nothing to overfit on the validation folds; a meta-learner, a gate, or extra embedding dimensions all have parameters to fit and too little fresh positive signal to fit them well. The result extends a classical lesson to the extreme-imbalance regime: when positives are scarce, model selection should bias toward zero-parameter fusion and small backbones. That choice is also the cheapest, which places it on the efficiency frontier.