Methods

The metric, the CV spine, the features, the GBDT, the image experts, and the stack

The pipeline is presented bottom-up: the metric, then the cross-validation spine, the tabular features, the GBDT expert, the image experts, and the combiner.

The metric: partial AUC above 80% TPR

Why not accuracy or plain AUC

At 0.098% prevalence, accuracy is uninformative: a constant “benign” predictor scores 99.9% and catches zero malignancies. Plain ROC-AUC is threshold-free and prevalence-robust but averages over all operating points, including low-sensitivity ones irrelevant to clinical triage, where only the high-sensitivity tail matters.

Definition

The official ISIC-2024 metric is the partial area under the ROC curve restricted to the region where the true-positive rate (sensitivity) is at least 80%:

\[ \text{pAUC}_{80} \;=\; \int_{0}^{1} \max\bigl(\text{TPR}(f) - 0.80,\; 0\bigr)\, df, \qquad f = \text{FPR} \]

This is the area between the ROC curve and the horizontal line TPR = 0.80, i.e. the part of the ROC area that lies inside the band TPR ∈ [0.80, 1.0], reported in TPR-units × FPR-units. Because the band is 0.20 tall and FPR spans [0, 1], the score is bounded in [0, 0.20]:

random classifier ≈ 0.02,
perfect classifier = 0.20.

A move from 0.169 → 0.174 is a relative gain of about +3% (+2.9% on the unrounded 0.16890 → 0.17376); improvements are stated in relative terms throughout.
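As an illustration of the definition above, the following is a minimal pure-Python sketch, not the project's src/cv.py:pauc_above_tpr. The names roc_points and pauc_above_tpr_sketch are hypothetical. It builds the ROC polyline, clips it at TPR = 0.80 with linear interpolation, and integrates the clipped height over FPR.

```python
def roc_points(y_true, y_score):
    """ROC polyline as (FPR, TPR) points, one per distinct score threshold."""
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Tied scores enter together, giving one diagonal ROC segment.
        while i < len(pairs) and pairs[i][0] == threshold:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def pauc_above_tpr_sketch(y_true, y_score, min_tpr=0.80):
    """Area between the ROC curve and the line TPR = min_tpr (illustrative)."""
    points = roc_points(y_true, y_score)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr:
            continue  # segment lies entirely below the band
        if y0 < min_tpr:
            # Segment crosses the line: start it at the crossing point.
            x0 = x0 + (x1 - x0) * (min_tpr - y0) / (y1 - y0)
            y0 = min_tpr
        # Trapezoid of the heights measured above min_tpr.
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2
    return area


perfect = pauc_above_tpr_sketch([0, 0, 0, 1, 1], [0.1, 0.2, 0.3, 0.8, 0.9])
constant = pauc_above_tpr_sketch([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

Under these assumptions a perfectly separating score reaches the ceiling of 0.20, and a constant score (a diagonal ROC) gives 0.20² / 2 = 0.02, matching the bounds listed above.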

Two pinned conventions

80% vs 88% TPR

The public ISIC metrics repository README mentions an 88% TPR threshold (which bounds the score in [0, 0.12]), but the live competition used 80%. Mixing the two makes numbers incomparable. This study pins 80% for every leaderboard-comparable claim, and the scorer is hard-coded to it.

src/cv.py:pauc_above_tpr is the only function permitted to compute the metric. A unit test (tests/test_cv.py, part of the 9/9 passing suite) checks that it is numerically identical to the vendored official scorer (src/metric_official.py, © 2024 N. R. Kurtansky, MSKCC).

The CV spine

Leaky cross-validation moved teams ~200 places on the private split in ISIC-2024, so the validation scheme is the foundation of every reported number.

Patient-grouped, target-stratified folds. StratifiedGroupKFold (5 folds, SEED = 42) with the patient as the group and the malignant label as the stratification target.
- Grouped → no patient straddles folds; a patient’s lesions are correlated, so a single straddling patient leaks.
- Stratified → each fold holds 77–83 of the 393 positives, so no fold is starved of signal.
Frozen once. The split is written to data/folds.parquet and read unchanged by every expert and by the stack. There is exactly one split in the project.
No-leak guarantee and veto. The cv-guardian agent holds veto power over any change to the split or the metric; a change that would leak patients across folds, recompute the metric independently, or compare models on mismatched splits is blocked.
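The two fold properties above (no patient straddles folds, no fold is starved of positives) can be checked mechanically. The sketch below is an assumption about what such a check might look like, not the cv-guardian's actual code; check_folds and its argument names are hypothetical.

```python
from collections import defaultdict


def check_folds(patient_ids, labels, folds):
    """Return (patients found in more than one fold, positives per fold)."""
    folds_by_patient = defaultdict(set)
    positives_by_fold = defaultdict(int)
    for patient, label, fold in zip(patient_ids, labels, folds):
        folds_by_patient[patient].add(fold)
        positives_by_fold[fold] += label
    leaking = sorted(p for p, seen in folds_by_patient.items() if len(seen) > 1)
    return leaking, dict(positives_by_fold)


# Toy data: three patients, one fold each, so nothing leaks.
patients = ["a", "a", "b", "c", "c"]
labels = [0, 1, 0, 0, 1]
leaking, positives = check_folds(patients, labels, folds=[0, 0, 1, 2, 2])

# Moving one of patient "a"'s lesions to another fold is reported as a leak.
leaking_bad, _ = check_folds(patients, labels, folds=[0, 1, 1, 2, 2])
```

A check of this kind is cheap enough to run every time the frozen split is loaded.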

Out-of-fold (OOF) predictions. For each fold, the model trains on the other four folds and predicts on the held-out fold; concatenating the five held-out predictions gives one prediction per row, each produced by a model that never saw that row in training. pAUC is scored once on this full OOF vector. Because pAUC is non-additive across folds, the full-OOF number (0.17376) differs slightly from the mean of per-fold pAUCs (0.17487); both are reported.
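The OOF assembly can be sketched as follows. This is a minimal illustration with hypothetical names (oof_predictions, fit, predict); the toy "model" simply predicts the training-set positive rate, which makes it easy to see that each row's prediction depends only on the other folds.

```python
def oof_predictions(rows, labels, folds, fit, predict):
    """One held-out prediction per row, each from a model trained on other folds."""
    oof = [None] * len(rows)
    for k in sorted(set(folds)):
        train_idx = [i for i, f in enumerate(folds) if f != k]
        valid_idx = [i for i, f in enumerate(folds) if f == k]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        held_out = predict(model, [rows[i] for i in valid_idx])
        for i, p in zip(valid_idx, held_out):
            oof[i] = p
    return oof


def fit_base_rate(train_rows, train_labels):
    # Toy model: remember the positive rate of the training rows.
    return sum(train_labels) / len(train_labels)


def predict_base_rate(model, valid_rows):
    return [model] * len(valid_rows)


oof = oof_predictions(
    rows=list(range(6)),
    labels=[0, 0, 1, 0, 1, 1],
    folds=[0, 0, 1, 1, 2, 2],
    fit=fit_base_rate,
    predict=predict_base_rate,
)
```

The metric is then computed once on the full oof vector rather than averaged over folds.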

Tabular feature engineering

The predictive signal lives largely in the tabular metadata. src/features.py builds, from ISIC-2024 columns only (no external data, each feature leak-checked):

Geometry & size ratios — normalized lesion size, perimeter, area, eccentricity, axis ratios, border composites; ratios make raw size scale-robust.
Color / hue / luminance contrasts — tbp_lv_H (hue angle), lesion-vs-skin contrast, color uniformity, ΔA/ΔB color deltas. Hue alone reaches univariate AUC 0.81.
Border & shape composites — irregularity and asymmetry combinations.
3D position — the lesion’s location on the body surface (TBP provides true 3D coordinates).
Patient-relative “ugly-duckling” deviations — for each lesion and each base feature, how anomalous it is relative to the same patient’s other lesions:
- pdev_* — signed/standardized distance from the patient’s own mean,
- prank_* — within-patient percentile rank,
- pxc_* — patient × category interactions (e.g. hue × body location).
These encode the clinical ugly-duckling sign directly and account for ~65% of total GBDT gain (see Results and Ablations).
Per-fold smoothed target encoding — for high-cardinality categoricals (e.g. location), computed inside each fold’s training data only and applied to the held-out fold, smoothed toward the global mean.
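To make the patient-relative features concrete, here is a minimal sketch of a pdev-style standardized deviation and a prank-style within-patient percentile rank for a single base feature. It is an illustration under assumed definitions (population standard deviation, "fraction of the patient's lesions at or below this value"), not the code in src/features.py; patient_relative and eps are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def patient_relative(patient_ids, values, eps=1e-9):
    """Per-lesion (pdev, prank) relative to the same patient's other lesions."""
    by_patient = defaultdict(list)
    for patient, value in zip(patient_ids, values):
        by_patient[patient].append(value)
    # Per-patient mean and spread, computed once per patient.
    stats = {p: (mean(v), pstdev(v)) for p, v in by_patient.items()}

    pdev, prank = [], []
    for patient, value in zip(patient_ids, values):
        mu, sd = stats[patient]
        group = by_patient[patient]
        pdev.append((value - mu) / (sd + eps))  # signed, standardized distance
        prank.append(sum(g <= value for g in group) / len(group))
    return pdev, prank


# Patient "a" has two ordinary lesions and one outlier; "b" has a single lesion.
pdev, prank = patient_relative(["a", "a", "a", "b"], [1.0, 1.0, 4.0, 2.0])
```

The outlier of patient "a" gets a large positive deviation and the top rank, while a patient with a single lesion gets a deviation of zero because there is nothing to compare against.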

How the tabular metadata predicts the class

The metadata are per-lesion numeric measurements derived by the TBP system: lesion size in mm, L*/A*/B* and hue color coordinates, lesion-to-skin color contrast, border irregularity, eccentricity, and 3D body position. Malignancy shifts the marginal distributions of these measurements — malignant lesions are roughly twice as large, higher-contrast, more border-irregular, and concentrated in particular hue ranges and body sites. The engineered patient-relative deviations (pdev_, prank_, pxc_) re-express each measurement as a deviation from the same patient’s own lesions, capturing the ugly-duckling sign: the lesion that stands out from a patient’s normal moles. A gradient-boosted tree ensemble fits these signals directly: each split thresholds a measurement, and successive trees compose non-linear threshold interactions (e.g. large size and high hue deviation and head/neck location), producing a malignancy score without any image pixels.

Leak-safety of features

Every patient-relative statistic and every target encoding is computed fold-locally (training rows only) and applied to held-out rows; no held-out information flows into a feature. The cv-guardian audits this. Patient-relative deviations are safe because they use the patient’s own benign moles, which sit in the same fold by construction.
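The fold-local target encoding can be sketched as a fit/apply pair: the mapping is estimated from training rows only and then looked up for held-out rows. This is a minimal illustration with a common additive-smoothing formula; the function names and the smoothing value are assumptions, not taken from src/features.py.

```python
from collections import defaultdict


def fit_target_encoding(train_categories, train_labels, smoothing=20.0):
    """Smoothed per-category target mean, estimated from training rows only."""
    global_mean = sum(train_labels) / len(train_labels)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, label in zip(train_categories, train_labels):
        sums[category] += label
        counts[category] += 1
    # Shrink each category mean toward the global mean; rare categories shrink most.
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Look up held-out rows; unseen categories fall back to the global mean."""
    return [encoding.get(c, global_mean) for c in categories]


encoding, prior = fit_target_encoding(
    ["head", "head", "arm", "arm"], [1, 0, 0, 0], smoothing=2.0
)
held_out = apply_target_encoding(["head", "leg"], encoding, prior)
```

Because the held-out labels never enter fit_target_encoding, the encoded feature cannot carry held-out target information.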

The GBDT expert

src/gbdt.py trains the tabular expert.

The underfit bug and the fix

The initial GBDT scored 0.09941, barely 5× random. The cause was an imbalance foot-gun: is_unbalance=True up-weighted the 393 positives ~1000×, and LightGBM early-stopping was driven by built-in AUC, which under that weighting peaked at best_iter = 1 — the model effectively never trained. The fix is twofold:

Drop is_unbalance — undersampling (below) handles the imbalance instead of extreme instance weights.
Early-stop on the official pAUC — a custom feval in src/gbdt.py evaluates pAUC@80%TPR directly, so early stopping optimizes the scored metric.

This change moved the model from 0.09941 → 0.11826 (+0.019); see Ablations → Tabular.
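The second fix can be sketched as a small adapter that wraps any metric function in the shape LightGBM expects from a custom evaluation function: a callable taking the predictions and the evaluation dataset and returning (name, value, is_higher_better). The helper name make_pauc_feval is hypothetical, and the commented usage is an assumption about how it would be wired in, not a copy of src/gbdt.py.

```python
def make_pauc_feval(metric_fn, name="pauc80"):
    """Wrap metric_fn(y_true, y_score) as a LightGBM-style custom eval function."""

    def feval(preds, eval_data):
        # eval_data is expected to expose get_label(), as lightgbm.Dataset does.
        y_true = eval_data.get_label()
        # Third element True: higher is better, so early stopping maximizes it.
        return name, float(metric_fn(y_true, preds)), True

    return feval


# Assumed usage (commented out so the sketch runs without LightGBM installed):
#
#   import lightgbm as lgb
#   params = {"objective": "binary", "metric": "None"}  # disable built-in AUC
#   booster = lgb.train(
#       params, train_set, valid_sets=[valid_set],
#       feval=make_pauc_feval(pauc_above_tpr),
#       callbacks=[lgb.early_stopping(stopping_rounds=patience)],  # patience: assumed
#   )
```

Disabling the built-in metric matters here: if AUC stays registered alongside the custom metric, early stopping may still follow AUC rather than the scored pAUC.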

Production recipe

The final tabular expert (greysky-lineage hyperparameters; see Credits) is a bagged ensemble:

Manual undersampling to ~1% (neg_ratio = 0.01 → ~100 negatives per positive, ~4k rows per fold/seed) so each booster trains quickly and is not dominated by benigns.
5-seed bagging (seeds 12/22/32/42/52) of LightGBM, averaged on rank.
CatBoost rank-blend — a second GBDT family bagged identically; the final tabular OOF is 0.8·rank(LightGBM-bag) + 0.2·rank(CatBoost-bag). CatBoost is the weaker family here, so an equal 0.5 blend dilutes (0.16809); 0.2 is the swept optimum (0.16890).
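A rank blend of this kind can be sketched in a few lines. The code below is illustrative only (percentile_rank and rank_blend are hypothetical names); it converts each model's scores to percentile ranks, with ties sharing their average rank, and then takes a weighted average, so the two GBDT families can be combined even though their raw scores live on different scales.

```python
def percentile_rank(scores):
    """Ranks in (0, 1]; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank / len(scores)
        i = j + 1
    return ranks


def rank_blend(score_lists, weights):
    """Weighted average of per-model percentile ranks."""
    ranked = [percentile_rank(scores) for scores in score_lists]
    total = sum(weights)
    return [
        sum(w * r[i] for w, r in zip(weights, ranked)) / total
        for i in range(len(ranked[0]))
    ]


# Toy scores on different scales; the weights mirror the 0.8 / 0.2 blend above.
lightgbm_bag = [0.1, 0.9, 0.5]
catboost_bag = [30.0, 10.0, 20.0]
blend = rank_blend([lightgbm_bag, catboost_bag], weights=[0.8, 0.2])
```

With equal weights the same function reduces to a plain rank-average.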

Result: OOF pAUC 0.16890 at near-zero inference cost (0.86 M tree “params”, ~0 GFLOPs, 0.02 ms/img) — the efficient anchor of the frontier.

The image experts

src/vision/* trains the image experts. Each backbone is one point on the efficiency frontier, measured by the efficiency-auditor.

Recipe (shared across backbones)

Small natural-image-pretrained backbones at 128 px (and 224 px for the resolution axis). Pretraining is ImageNet / FCMAE only — no skin-cancer labels, so the no-external-data claim holds (pretrained weights are allowed; external training data is not).
Optimiser / schedule. AdamW, learning rate 1e-4, weight decay 1e-3; cosine schedule with 5% warmup and a 1% floor; 30 epochs; batch size 128.
Per-epoch negative undersampling to ~1:1 (neg_ratio=7, pos_mult=2). Each epoch sees all positives plus a fresh random ~equal sample of negatives, so an epoch is a few thousand images (~4.5 s); a full 5-fold backbone trains in ~8 minutes. Re-sampling negatives each epoch exposes the whole benign distribution over time.
Loss: BCE with label smoothing 0.05.
Augmentation (transV2, classical only — no generative augmentation). Albumentations: random transpose and vertical/horizontal flips; random brightness/contrast; one of {motion, median, Gaussian blur, Gaussian noise}; one of {optical, grid, elastic distortion}; CLAHE; hue/saturation/value jitter; shift–scale–rotate (±15°); and a single CoarseDropout hole (~0.375× the image side); then resize and ImageNet normalisation. A gentler light variant (flips + mild brightness/contrast + shift–scale–rotate, no heavy distortion/blur/dropout) is used for ViT backbones; evaluation applies resize + normalisation only (no augmentation).
Test-time augmentation. Optional flip/rot90 averaging (n_tta); the reported runs use n_tta=1.
EMA(0.995) + mixup(α=0.2). Weight EMA stabilizes the noisy undersampled updates; mixup regularizes. EMA decay is consequential: at ~23 steps/epoch a 30-epoch run has only ~690 steps, so decay 0.999 (time constant 1/(1 − 0.999) = 1000 steps) lags the weights and reduces the score (fold-0: 0.146 vs 0.156); 0.995 (time constant 200 steps) with warmup is retained (0.163).
Resolution as a frontier axis. Both convnextv2_nano@128 (0.15311) and @224 (0.15821) are kept as distinct points, trading accuracy for cost.
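The learning-rate schedule in the recipe can be written out as a short function. The sketch below assumes one plausible reading of "cosine schedule with 5% warmup and a 1% floor": a linear ramp to the base rate over the first 5% of steps, followed by a cosine decay to 1% of the base rate. The function name lr_at and the exact warmup shape are assumptions; the step count of 690 is 30 epochs × ~23 steps/epoch.

```python
import math


def lr_at(step, total_steps, base_lr=1e-4, warmup_frac=0.05, floor_frac=0.01):
    """Learning rate at a 0-based step: linear warmup, then cosine to a floor."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at start, 0 at end
    return base_lr * (floor_frac + (1.0 - floor_frac) * cosine)


total_steps = 30 * 23  # epochs x approximate steps per epoch
schedule = [lr_at(s, total_steps) for s in range(total_steps)]
```

Under this reading the rate peaks at 1e-4 right after warmup and ends close to 1e-6.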

The 12-backbone frontier sweep

The sweep spans the cost axis from ~2.5 M to ~294 M params: mnv4_small, effvit_b0, starnet_s1, fastvit_t8, effnetv2_b0, ghostnetv3, vit_tiny, vit_small, convnextv2_nano (128 px and 224 px), convnextv2_tiny, swinv2_tiny, eva02_small, and the heavy anchor mnv5_300m. At 393 positives the heavy transformers collapse to about 5× the random baseline of 0.02 (SwinV2@256 → 0.104, EVA-02@336 → 0.100); the small convnextv2_nano is the optimum (see Results, Ablations).

The stack (combiner)

src/stack.py fuses the experts. Three combiners were compared on the same OOF vectors:

Combiner	Mechanism	OOF pAUC
Rank-average	average the percentile ranks of GBDT and image OOF — param-free	0.17376
Meta-LightGBM	a second-level GBDT over the expert OOFs	0.17108
Learned per-lesion gate (MoE)	a network selects expert weights per lesion	0.15007

Two further variants fed image information into the GBDT directly: the image embedding (as PCA components) and an image-space ugly-duckling feature. Each reduced the score below the rank-average (< 0.1738).

Why the trivial combiner wins at 393 positives

A rank-average has zero parameters to fit and cannot overfit the validation folds. Every learned combiner must estimate its parameters from the same 393 positives the experts already used; there is insufficient positive signal to learn a better fusion than averaging ranks. The learned gate falls below the tabular expert alone (0.150 < 0.169). At this scale, the simpler combiner is both cheaper and more accurate. Full numbers in Ablations.

Efficiency as a first-class axis

Every reported model logs parameters, FLOPs, and single-thread CPU latency (src/efficiency.py) alongside its pAUC, placing each model as one point on a quality-vs-cost plane. The Pareto-optimal set (points no other model beats on both axes) is the headline figure. A model that does not earn its cost is not retained. See Results.
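The Pareto-optimal set can be computed with a simple dominance check. The sketch below is illustrative: pareto_front is a hypothetical name, cost stands for whichever cost axis is plotted (lower is better), and the toy numbers are invented for the example rather than taken from the results.

```python
def pareto_front(models):
    """Names of models not dominated on (cost, score).

    models: list of (name, cost, score); lower cost and higher score are better.
    A model is dominated if another is at least as good on both axes and
    strictly better on at least one.
    """
    front = []
    for name, cost, score in models:
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for _, c, s in models
        )
        if not dominated:
            front.append(name)
    return front


# Invented example points: "wasteful" costs more than "mid" and scores lower.
toy_models = [
    ("cheap", 1.0, 0.10),
    ("mid", 2.0, 0.15),
    ("wasteful", 3.0, 0.12),
    ("big", 4.0, 0.16),
]
front = pareto_front(toy_models)
```

A model left off the front is exactly one that "does not earn its cost": some other model is no more expensive and no less accurate.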
