Ablations & Negative Results

Every complexity increase tested, including those that reduced the score

Important: Summary

Negative results are logged; no model that failed to help is dropped silently. Across the study, nearly every increase in model complexity reduced the score. At 393 positives, the efficient, trivial choices (a small backbone, a rank-average combiner, intrinsic tabular features) are both cheaper and more accurate. That negative result is the contribution.

Tabular GBDT progression

Step | Change | OOF pAUC | Δ
0 | Broken: is_unbalance + AUC early-stop → best_iter = 1 | 0.09941 | n/a
1 | Fix early-stopping / objective (drop is_unbalance, early-stop on pAUC) | 0.11826 | +0.01885
2 | Wide patient-relative ugly-duckling features (pdev_ / prank_ / pxc_) | 0.14420 | +0.02594
3 | Greysky-bagged LightGBM + pxc_ + CatBoost rank-blend | 0.16890 | +0.02470

The broken config scored barely above 5× random: on extreme imbalance the training setup dominates the model. The patient-relative feature block (Step 2) is the single largest jump (+0.026), confirming the ugly-duckling thesis. Bagging plus CatBoost (Step 3) adds +0.025 from variance reduction across seeds and families.
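The patient-relative idea behind Step 2 can be sketched as follows. This is a minimal illustration and not the study's feature code: the column names (patient_id, area_mm2), the helper add_patient_relative, and the exact formulas for pdev_ and prank_ are assumptions. The meaning of pxc_ is not given in this section, so it is left out.

```python
import pandas as pd

def add_patient_relative(df, cols, patient_col="patient_id"):
    """Append pdev_/prank_ columns that compare each lesion with the
    same patient's other lesions (hypothetical definitions)."""
    out = df.copy()
    grouped = df.groupby(patient_col)
    for col in cols:
        mean = grouped[col].transform("mean")
        # std is NaN for patients with a single lesion; treat it as 0
        std = grouped[col].transform("std").fillna(0.0)
        # pdev_: z-score-like deviation from the patient's own mean
        out[f"pdev_{col}"] = (df[col] - mean) / (std + 1e-6)
        # prank_: percentile rank of the lesion within the patient
        out[f"prank_{col}"] = grouped[col].rank(pct=True)
    return out

# Toy data: patient A has one lesion much larger than the others.
lesions = pd.DataFrame({
    "patient_id": ["A", "A", "A", "B", "B"],
    "area_mm2": [4.0, 5.0, 12.0, 3.0, 3.5],
})
features = add_patient_relative(lesions, ["area_mm2"])
print(features)
```

In the toy data, the 12.0 lesion of patient A gets the highest within-patient rank and a deviation above one standard deviation, which is the kind of "ugly duckling" signal a GBDT can split on.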

Image expert — resolution & regularizer

Config | OOF pAUC | Verdict
convnextv2_nano @128 | 0.15311 | baseline image point
convnextv2_nano @224 + EMA 0.995 + mixup | 0.15821 | best image
↳ same, EMA 0.999 | ~0.146 | reduced (over-smooths at ~23 steps/epoch)
convnextv2_tiny @224 | 0.15824 | comparable to nano, ~2× the cost
swinv2_tiny @256 | 0.10422 | collapse, near random
eva02_small @336 | 0.09984 | collapse, near random

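Higher resolution (224) plus light EMA (0.995) and mixup on the small nano backbone is the best image configuration for its cost. Stronger EMA (0.999) over-smooths because the undersampled epochs are short (~23 steps). Heavy transformers (Swin, EVA-02) collapse to near-random, which is attributed to overfitting at 393 positives. convnextv2_tiny (2× nano’s params) matches nano to within 0.00003 pAUC (0.15824 vs 0.15821), a difference too small to justify the extra cost.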
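A rough way to see why EMA 0.999 over-smooths at ~23 steps per epoch is the usual rule of thumb that an exponential moving average with decay d averages over about 1 / (1 - d) updates. The sketch below applies that rule; the helper ema_horizon is hypothetical, and the total number of training epochs is not stated here, so treat the comparison as indicative only.

```python
def ema_horizon(decay, steps_per_epoch):
    """Approximate averaging window of an EMA, in steps and in epochs,
    using the 1 / (1 - decay) rule of thumb."""
    steps = 1.0 / (1.0 - decay)
    return steps, steps / steps_per_epoch

for decay in (0.995, 0.999):
    steps, epochs = ema_horizon(decay, steps_per_epoch=23)
    print(f"decay={decay}: ~{steps:.0f} steps, ~{epochs:.1f} epochs")
# decay=0.995: ~200 steps, ~8.7 epochs
# decay=0.999: ~1000 steps, ~43.5 epochs
```

Under this approximation, decay 0.999 averages weights over roughly 43 epochs of the undersampled schedule, five times the window of decay 0.995, so the averaged model lags well behind the live weights.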

Stack combiners

Combiner | OOF pAUC | Verdict
rank-avg [GBDT, nano@224] | 0.17376 | winner
meta-LGBM stacker | 0.17108 | loses to rank-average
rank [+ weak images] | 0.16679 | dilutes
learned per-lesion gate (MoE) | 0.15007 | loses, below tabular alone (0.16890)
+ PCA image embeddings into GBDT | < 0.17376 | reduces

The trivial 2-way rank-average wins. Every learned or heavier combiner loses, and the MoE gate falls below the tabular expert alone (0.150 < 0.169). Adding weak images, or PCA embeddings into the GBDT, dilutes.
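A 2-way rank-average can be written in a few lines. The sketch below is an illustration under assumptions, not the study's code: the function name rank_average, the variable names, and the toy scores are all made up. It converts each expert's out-of-fold scores to percentile ranks and averages them, so nothing is fitted.

```python
import numpy as np
import pandas as pd

def rank_average(*score_arrays):
    """Average the percentile ranks of several score vectors.
    No parameters are learned; ties receive their average rank."""
    ranks = [pd.Series(s).rank(pct=True).to_numpy() for s in score_arrays]
    return np.mean(ranks, axis=0)

# Toy out-of-fold scores for four lesions from two experts.
gbdt_oof = np.array([0.02, 0.91, 0.10, 0.40])
image_oof = np.array([0.30, 0.80, 0.10, 0.95])

blend = rank_average(gbdt_oof, image_oof)
print(blend)
```

Because only the ordering of each expert's scores enters the blend, the two experts do not need calibrated or comparable probability scales.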

Reconstruction note. A recomputed rank-average over {GBDT + 4 image OOFs} scores 0.16917; the logged “rank [+ weak images]” run (0.16679) used a different weak-image set and is kept as the authoritative logged number. The conclusion holds either way: adding weak images dilutes.

Negative results summary

Idea tried | Its pAUC | Beaten by | Finding
Learned per-lesion gate (MoE) | 0.15007 | 0.16890 (tabular) | loses to tabular alone
Meta-LGBM stacker | 0.17108 | 0.17376 (rank-avg) | loses to rank-average
+ PCA image embeddings into GBDT | < 0.17376 | 0.17376 (rank-avg) | embeddings reduce
Heavy backbones (SwinV2 / EVA-02) | 0.104 / 0.100 | 0.15821 (nano@224) | dominated, overfit at 393 positives
EMA 0.999 (over-smoothing) | ~0.146 | 0.15821 (EMA 0.995) | stronger EMA reduces

Why complexity reduces the score

Each learned add-on must estimate its parameters from the same 393 positives the base experts already consumed. A rank-average has zero parameters and cannot overfit the validation folds; a meta-learner, a gate, or extra embedding dimensions all have parameters to fit and insufficient fresh positive signal to fit them well. The result extends a classical lesson to the extreme-imbalance regime: when positives are scarce, model selection should bias toward zero-parameter fusion and small backbones. That choice is also the cheapest, which places it on the efficiency frontier.

