Methods
The metric, the CV spine, the features, the GBDT, the image experts, and the stack
The pipeline is presented bottom-up: the metric, then the cross-validation spine, the tabular features, the GBDT expert, the image experts, and the combiner.
The metric: partial AUC above 80% TPR
Why not accuracy or plain AUC
At 0.098% prevalence, accuracy is uninformative: a constant “benign” predictor scores 99.9% and catches zero malignancies. Plain ROC-AUC is threshold-free and prevalence-robust but averages over all operating points, including low-sensitivity ones irrelevant to clinical triage, where only the high-sensitivity tail matters.
Definition
The official ISIC-2024 metric is the partial area under the ROC curve restricted to the region where the true-positive rate (sensitivity) is at least 80%:
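\[ \text{pAUC}_{80} \;=\; \int_{0}^{1} \max\bigl(\text{TPR}(f) - 0.80,\; 0\bigr)\, df, \qquad f = \text{FPR} \]
This is the area between the ROC curve and the line TPR = 0.80, i.e. the part of the ROC that lies inside the band TPR ∈ [0.80, 1.0] (equivalently, the region FPR ≤ 0.20 of the ROC with the two classes swapped). It is reported in TPR-units × FPR-units. Because the band is 0.20 tall, the score is bounded in [0, 0.20]: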
- random classifier ≈ 0.02,
- perfect classifier = 0.20.
A move from 0.169 → 0.174 is a relative gain of about +3%; improvements are stated in relative terms throughout.
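The definition above can be sketched in a few lines of standard-library Python. This is an illustration only, not the project's src/cv.py:pauc_above_tpr nor the official scorer; the function name pauc_above_tpr_sketch and the assumption that labels are 0/1 integers are introduced here.

```python
def pauc_above_tpr_sketch(y_true, y_score, min_tpr=0.80):
    """Area between the ROC curve and the line TPR = min_tpr (labels are 0/1)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Sweep the threshold from high to low; tied scores move together.
    order = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # ROC vertices as (FPR, TPR)
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and order[j][0] == order[i][0]:
            tp += order[j][1]
            fp += 1 - order[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    # Trapezoid rule on max(TPR - min_tpr, 0), clipping the crossing segment.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        h0, h1 = y0 - min_tpr, y1 - min_tpr
        if h1 <= 0:
            continue  # TPR is non-decreasing, so this segment is below the band
        if h0 < 0:
            t = -h0 / (h1 - h0)  # fraction of the segment below the band
            x0 += t * (x1 - x0)
            h0 = 0.0
        area += 0.5 * (h0 + h1) * (x1 - x0)
    return area
```

With all scores tied, the ROC is the chance diagonal and the sketch returns the triangle 0.5 × 0.20 × 0.20 = 0.02; a perfectly separating score returns the full band, 0.20.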
Two pinned conventions
The public ISIC metrics repository README mentions an 88% TPR threshold (which bounds the score in [0, 0.12]), but the live competition used 80%. Mixing the two makes numbers incomparable. This study pins 80% for every leaderboard-comparable claim, and the scorer is hard-coded to it.
src/cv.py:pauc_above_tpr is the only function permitted to compute the metric. A unit test (tests/test_cv.py, part of the 9/9 passing suite) verifies that it is numerically identical to the vendored official scorer (src/metric_official.py, © 2024 N. R. Kurtansky, MSKCC).
The CV spine
Leaky cross-validation moved teams ~200 places on the private split in ISIC-2024. Validation is the foundation of every reported number.
- Patient-grouped, target-stratified folds. StratifiedGroupKFold(5 folds, SEED = 42) with the patient as the group and the malignant label as the stratification target.
  - Grouped → no patient straddles folds; a patient’s lesions are correlated, so a single straddling patient leaks.
  - Stratified → each fold holds 77–83 of the 393 positives, so no fold is starved of signal.
- Frozen once. The split is written to data/folds.parquet and read unchanged by every expert and by the stack. There is exactly one split in the project.
- No-leak guarantee and veto. The cv-guardian agent holds veto power over any change to the split or the metric; a change that would leak patients across folds, recompute the metric independently, or compare models on mismatched splits is blocked.
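The two properties the split must satisfy (no patient in more than one fold, positives spread across folds) can be checked mechanically. The following is a minimal sketch of such a check; the helper audit_folds and its argument layout are hypothetical and are not taken from the project's cv-guardian.

```python
from collections import defaultdict

def audit_folds(patient_ids, folds, targets):
    """Return positives per fold; raise if any patient appears in two folds."""
    folds_of_patient = defaultdict(set)
    positives_per_fold = defaultdict(int)
    for pid, fold, y in zip(patient_ids, folds, targets):
        folds_of_patient[pid].add(fold)
        positives_per_fold[fold] += int(y)

    # A grouped split is leak-free only if every patient maps to one fold.
    leaked = sorted(p for p, fs in folds_of_patient.items() if len(fs) > 1)
    if leaked:
        raise ValueError(f"patients straddling folds: {leaked}")
    return dict(positives_per_fold)
```

Run against the frozen split, a check of this kind would be expected to report the 77–83 positives per fold quoted above and to raise on any regrouping that splits a patient.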
Out-of-fold (OOF) predictions. For each fold, the model trains on the other four and predicts on the held-out fold; concatenating the five held-out predictions gives one prediction per row, none of which saw its own row in training. pAUC is scored once on this full OOF vector. Because pAUC is non-additive across folds, the full-OOF number (0.17376) differs slightly from the mean of per-fold pAUCs (0.17487); both are reported.
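The OOF assembly described above can be sketched as a loop over held-out folds. The function name and the fit/predict callables are placeholders introduced for illustration; the toy "model" simply predicts the mean training target and stands in for a real expert.

```python
def out_of_fold_predictions(rows, folds, fit, predict):
    """One prediction per row, each from a model that never saw that row's fold."""
    oof = [None] * len(rows)
    for held_out in sorted(set(folds)):
        # Train on every row whose fold differs from the held-out fold.
        train_rows = [r for r, f in zip(rows, folds) if f != held_out]
        model = fit(train_rows)
        for i, (row, f) in enumerate(zip(rows, folds)):
            if f == held_out:
                oof[i] = predict(model, row)
    return oof

# Toy stand-in for an expert: predict the mean target of the training rows.
def fit_mean(train_rows):
    return sum(y for _, y in train_rows) / len(train_rows)

def predict_mean(model, row):
    return model

rows = [(0.1, 0), (0.4, 0), (0.2, 0), (0.9, 1)]  # (feature, target)
folds = [0, 0, 1, 1]
oof = out_of_fold_predictions(rows, folds, fit_mean, predict_mean)
```

Scoring the metric once on the concatenated oof vector, rather than averaging five per-fold scores, is what produces the full-OOF number reported in this study.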
Tabular feature engineering
The predictive signal lives largely in the tabular metadata. src/features.py builds, from ISIC-2024 columns only (no external data, each feature leak-checked):
- Geometry & size ratios: normalized lesion size, perimeter, area, eccentricity, axis ratios, border composites; ratios make raw size scale-robust.
- Color / hue / luminance contrasts: tbp_lv_H (hue angle), lesion-vs-skin contrast, color uniformity, ΔA/ΔB color deltas. Hue alone reaches univariate AUC 0.81.
- Border & shape composites: irregularity and asymmetry combinations.
- 3D position: the lesion’s location on the body surface (TBP provides true 3D coordinates).
- Patient-relative “ugly-duckling” deviations: for each lesion and each base feature, how anomalous it is relative to the same patient’s other lesions:
  - pdev_*: signed/standardized distance from the patient’s own mean,
  - prank_*: within-patient percentile rank,
  - pxc_*: patient × category interactions (e.g. hue × body location).
- Per-fold smoothed target encoding: for high-cardinality categoricals (e.g. location), computed inside each fold’s training data only and applied to the held-out fold, smoothed toward the global mean.
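As an illustration of the patient-relative features, the sketch below computes a standardized deviation and a within-patient percentile rank for one base feature. The exact formulas in src/features.py are not shown in this document, so the choices here (population standard deviation, "share of lesions at or below" as the rank, zero deviation for constant patients) are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def patient_relative(patient_ids, values):
    """Return (pdev, prank) lists aligned with the input rows."""
    by_patient = defaultdict(list)
    for pid, v in zip(patient_ids, values):
        by_patient[pid].append(v)

    pdev, prank = [], []
    for pid, v in zip(patient_ids, values):
        group = by_patient[pid]
        mu, sigma = mean(group), pstdev(group)
        # Single-lesion or constant patients carry no deviation signal.
        pdev.append((v - mu) / sigma if sigma > 0 else 0.0)
        # Percentile rank: share of the patient's lesions at or below this value.
        prank.append(sum(g <= v for g in group) / len(group))
    return pdev, prank
```

For a patient with lesion sizes 1, 2, 3 and 10, the last lesion receives the largest deviation and a rank of 1.0, which is the ugly-duckling pattern the features are meant to expose.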
How the tabular metadata predicts the class
The metadata are per-lesion numeric measurements derived by the TBP system: lesion size in mm, L*/A*/B* and hue color coordinates, lesion-to-skin color contrast, border irregularity, eccentricity, and 3D body position. Malignancy shifts the marginal distributions of these measurements: malignant lesions are roughly twice as large, higher-contrast, more border-irregular, and concentrated in particular hue ranges and body sites. The engineered patient-relative features (pdev_*, prank_*, pxc_*) re-express each measurement relative to the same patient’s other lesions, capturing the ugly-duckling sign: the lesion that stands out from a patient’s normal moles. A gradient-boosted tree ensemble fits these signals directly: each split thresholds a measurement, and successive trees compose non-linear threshold interactions (e.g. large size and high hue deviation and head/neck location), producing a malignancy score without any image pixels.
Every patient-relative statistic and every target encoding is computed fold-locally (training rows only) and applied to held-out rows; no held-out information flows into a feature. The cv-guardian audits this. Patient-relative deviations are safe because they are built from a patient’s own lesions, all of which sit in the same fold by construction.
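A fold-local smoothed target encoding can be sketched as a fit step on training rows and an apply step on held-out rows. The additive smoothing formula and the smoothing constant below are a common choice assumed for illustration; the document does not state which formula src/features.py uses, and both function names are hypothetical.

```python
from collections import defaultdict

def fit_target_encoding(train_categories, train_targets, smoothing=20.0):
    """Learn a smoothed mean target per category from training rows only."""
    global_mean = sum(train_targets) / len(train_targets)
    total, count = defaultdict(float), defaultdict(int)
    for c, y in zip(train_categories, train_targets):
        total[c] += y
        count[c] += 1
    # Rare categories are pulled toward the global mean; frequent ones keep their own rate.
    table = {c: (total[c] + smoothing * global_mean) / (count[c] + smoothing)
             for c in count}
    return table, global_mean

def apply_target_encoding(categories, table, global_mean):
    """Encode held-out rows; unseen categories fall back to the global mean."""
    return [table.get(c, global_mean) for c in categories]
```

Because the table is fitted without the held-out fold, the held-out labels never influence the encoded values they are scored on.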
The GBDT expert
src/gbdt.py trains the tabular expert.
The underfit bug and the fix
The initial GBDT scored 0.09941, barely 5× random. The cause was an imbalance foot-gun: is_unbalance=True up-weighted the 393 positives ~1000×, and LightGBM early-stopping was driven by built-in AUC, which under that weighting peaked at best_iter = 1 — the model effectively never trained. The fix is twofold:
- Drop is_unbalance: undersampling (below) handles the imbalance instead of extreme instance weights.
- Early-stop on the official pAUC: a custom feval in src/gbdt.py evaluates pAUC@80%TPR directly, so early stopping optimizes the scored metric.
This change moved the model from 0.09941 → 0.11826 (+0.019); see Ablations → Tabular.
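The custom-metric hook can be sketched as a small factory. LightGBM's training API accepts a callable that receives the predictions and the evaluation dataset and returns a (name, value, is_higher_better) tuple. The factory name make_pauc_feval and the metric label are introduced here for illustration, and the scorer argument stands in for the project's src/cv.py:pauc_above_tpr.

```python
def make_pauc_feval(scorer, name="pauc80"):
    """Wrap a scorer(y_true, y_score) in LightGBM's custom-metric contract."""
    def feval(preds, eval_data):
        # eval_data is expected to be a lightgbm.Dataset; get_label() returns its targets.
        y_true = eval_data.get_label()
        # Tuple layout: (metric name, metric value, is_higher_better).
        return name, float(scorer(y_true, preds)), True
    return feval
```

In a training call, a callable like this would be passed as the feval argument of lightgbm.train, typically with the built-in metric disabled (the parameter "metric" set to the string "None") so that an early-stopping callback tracks only the custom pAUC.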
Production recipe
The final tabular expert (greysky-lineage hyperparameters; see Credits) is a bagged ensemble:
- Manual undersampling to ~1% (neg_ratio = 0.01 → ~100 negatives per positive, ~4k rows per fold/seed) so each booster trains quickly and is not dominated by benigns.
- 5-seed bagging (seeds 12/22/32/42/52) of LightGBM, averaged on rank.
- CatBoost rank-blend: a second GBDT family bagged identically; the final tabular OOF is 0.8·rank(LightGBM-bag) + 0.2·rank(CatBoost-bag). CatBoost is the weaker family here, so an equal 0.5 blend dilutes (0.16809); 0.2 is the swept optimum (0.16890).
Result: OOF pAUC 0.16890 at near-zero inference cost (0.86 M tree “params”, ~0 GFLOPs, 0.02 ms/img) — the efficient anchor of the frontier.
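The rank-blend in the recipe can be sketched as follows. This is not the code in src/gbdt.py; the helper names and the average-rank treatment of ties are assumptions. With equal weights the same helper expresses a plain rank-average of two experts.

```python
def percentile_rank(scores):
    """Average-tie ranks scaled to (0, 1]."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg / len(scores)
        i = j + 1
    return ranks

def rank_blend(score_lists, weights):
    """Weighted average of percentile ranks across models."""
    ranked = [percentile_rank(s) for s in score_lists]
    return [sum(w * r[i] for w, r in zip(weights, ranked))
            for i in range(len(score_lists[0]))]
```

Calling rank_blend([lgb_oof, cat_oof], [0.8, 0.2]) on two OOF vectors (hypothetical variable names) reproduces the form of the blend above. Ranking first makes the blend insensitive to the different score scales of the two GBDT families.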
The image experts
src/vision/* trains the image expert. Each backbone is one point on the efficiency frontier, measured by the efficiency-auditor.
The 12-backbone frontier sweep
The sweep spans the cost axis from ~2.5 M to ~294 M params: mnv4_small, effvit_b0, starnet_s1, fastvit_t8, effnetv2_b0, ghostnetv3, vit_tiny, vit_small, convnextv2_nano (128 px and 224 px), convnextv2_tiny, swinv2_tiny, eva02_small, and the heavy anchor mnv5_300m. Heavy transformers collapse to near-random at 393 positives (SwinV2@256 → 0.104, EVA-02@336 → 0.100); the small convnextv2_nano is the optimum (see Results, Ablations).
The stack (combiner)
src/stack.py fuses the experts. Three combiners were compared on the same OOF vectors:
| Combiner | Mechanism | OOF pAUC |
|---|---|---|
| Rank-average | average the percentile ranks of GBDT and image OOF — param-free | 0.17376 |
| Meta-LightGBM | a second-level GBDT over the expert OOFs | 0.17108 |
| Learned per-lesion gate (MoE) | a network selects expert weights per lesion | 0.15007 |
Two further variants, feeding the image embedding (PCA components) into the GBDT and adding an image-space ugly-duckling feature, both reduced the score (< 0.1738).
A rank-average has zero parameters to fit and cannot overfit the validation folds. Every learned combiner must estimate its parameters from the same 393 positives the experts already used; there is insufficient positive signal to learn a better fusion than averaging ranks. The learned gate falls below the tabular expert alone (0.150 < 0.169). At this scale, the simpler combiner is both cheaper and more accurate. Full numbers in Ablations.
Efficiency as a first-class axis
Every reported model logs parameters, FLOPs, and single-thread CPU latency (src/efficiency.py) alongside its pAUC, placing each model as one point on a quality-vs-cost plane. The Pareto-optimal set (points no other model beats on both axes) is the headline figure. A model that does not earn its cost is not retained. See Results.
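Selecting the Pareto-optimal set can be sketched with the usual dominance test: a model is dropped if another model is at least as good on both axes and strictly better on one. The function name, the (pAUC, cost) layout, and the example values in the usage note are invented for illustration and are not measurements from this study.

```python
def pareto_front(models):
    """Names of non-dominated models, ordered by increasing cost.

    `models` maps a name to (pauc, cost); higher pAUC and lower cost are better.
    """
    front = []
    for name, (pauc, cost) in models.items():
        dominated = any(
            other != name
            and o_pauc >= pauc and o_cost <= cost
            and (o_pauc > pauc or o_cost < cost)
            for other, (o_pauc, o_cost) in models.items()
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: models[n][1])
```

With made-up points {"tiny": (0.150, 1.0), "mid": (0.170, 5.0), "heavy": (0.160, 50.0)}, the heavy model is dominated by the mid one (lower score at higher cost) and only tiny and mid remain on the frontier.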