Results

The three-axis Pareto frontier, the scored ROC region, per-fold stability, and calibration

All numbers are out-of-fold on the frozen patient-grouped folds (5 folds, SEED = 42), scored only by src/cv.py at 80% TPR, and independently re-derived from the OOF parquet files by the cv-guardian agent. Figures come from reports/frontier.py and reports/perf/make_perf.py.

Headline numbers

Model	OOF pAUC@80%TPR
Tabular GBDT	0.16890
Best image (ConvNeXtV2-nano @224)	0.15821
Stack — rank-avg[GBDT, nano@224]	0.17376

The stack reconstruction rank-avg[gbdt, r224] recomputes to 0.17376, matching the canonical stack_oof.parquet exactly; the combiner is a 2-way rank average. Image and tabular make complementary errors, so the rank-average yields +0.0049 over tabular alone.
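A 2-way rank average can be sketched as follows. This is a minimal illustration, not the project's combiner: the function names and the toy scores are hypothetical, and it assumes ranks are computed over the whole prediction vector with ties sharing their mean rank.

```python
def average_ranks(scores):
    """Return 1-based ranks; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal scores that starts at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_average(*experts):
    """Scale each expert's ranks to (0, 1] and average them per sample."""
    n = len(experts[0])
    ranked = [[r / n for r in average_ranks(e)] for e in experts]
    return [sum(col) / len(ranked) for col in zip(*ranked)]


# Toy scores for four lesions (illustrative values only).
gbdt = [0.02, 0.91, 0.40, 0.40]
image = [0.10, 0.30, 0.80, 0.05]
stack = rank_average(gbdt, image)
```

Because only ranks enter the average, the two experts need not share a score scale, which suits a rank-based metric such as pAUC.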

Tabular progression

Table 1. Tabular LightGBM progression (out-of-fold, leak-verified).
Configuration	OOF pAUC@80%TPR
Step 0 — broken: is_unbalance + AUC early-stop	0.09941
Step 1 — fixed early-stopping / objective	0.11826
Step 2 — wide patient-relative ugly-duckling features	0.14420
Step 3 — greysky-bagged + cross feats + CatBoost ensemble	0.16890

The tabular expert progresses from a broken baseline (0.09941) to the production LightGBM+CatBoost ensemble (0.16890). The patient-relative feature block (Step 1 → Step 2, +0.0259) is the single largest step, narrowly ahead of the bagging and ensembling step (+0.0247); see Ablations → Tabular.

The quality-vs-cost frontier (three axes)

A point is Pareto-optimal if no other model is at least as cheap and at least as accurate while being strictly better on one of the two. The frontier is drawn against all three cost axes (parameters, FLOPs, CPU latency), since cost means different things in different deployments.

Combined view

Figure 1: pAUC@80%TPR vs cost — the headline frontier.

Versus parameters

Figure 2: pAUC vs parameter count (model size / memory).

Versus FLOPs

Figure 3: pAUC vs GFLOPs (compute per image).

Versus CPU latency

Figure 4: pAUC vs single-thread CPU latency (deployability).

All experiments and the Pareto-optimal set

Table 2. Quality–efficiency frontier across all experiments (out-of-fold pAUC; single-thread CPU cost).
Model	OOF pAUC@80%TPR	Params (M)	GFLOPs	CPU latency (ms)	Pareto-optimal
Stack (rank-avg: GBDT + nano@224)	0.17376	15.84	2.4550	60.91	Yes
Stack (GBDT + 3 images + ugly-duckling)	0.17117	23.84	1.2200	32.86	Yes
Tabular GBDT (LightGBM + CatBoost)	0.16890	0.86	0.0000	0.02	Yes
ConvNeXtV2-tiny @224	0.15824	27.87	4.4697	104.14	—
ConvNeXtV2-nano @224	0.15821	14.98	2.4550	60.89	—
ConvNeXtV2-nano @128	0.15311	14.98	0.8016	21.61	—
EfficientViT-b0 @128	0.13706	2.13	0.0345	3.45	—
ViT-tiny @128	0.13654	5.50	0.3560	8.17	—
MobileNetV4-small @128	0.11242	2.49	0.0624	3.06	—
SwinV2-tiny @256	0.10422	27.58	5.9615	138.07	—
EVA-02-small @336	0.09984	21.74	12.4085	249.18	—

CPU latency is single-thread, one image; GBDT GFLOPs are reported as 0.0 (tree ensemble, not FLOP-comparable). The Pareto-optimal column is evaluated on the pAUC-vs-latency axis.
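The Pareto-optimal column of Table 2 can be re-derived from the pAUC and CPU-latency columns alone. The sketch below is not reports/frontier.py; `pareto_front` is a hypothetical helper that applies the usual dominance rule (no worse on both axes, strictly better on at least one).

```python
# (model, OOF pAUC@80%TPR, single-thread CPU latency in ms), copied from Table 2.
ROWS = [
    ("Stack (rank-avg: GBDT + nano@224)", 0.17376, 60.91),
    ("Stack (GBDT + 3 images + ugly-duckling)", 0.17117, 32.86),
    ("Tabular GBDT (LightGBM + CatBoost)", 0.16890, 0.02),
    ("ConvNeXtV2-tiny @224", 0.15824, 104.14),
    ("ConvNeXtV2-nano @224", 0.15821, 60.89),
    ("ConvNeXtV2-nano @128", 0.15311, 21.61),
    ("EfficientViT-b0 @128", 0.13706, 3.45),
    ("ViT-tiny @128", 0.13654, 8.17),
    ("MobileNetV4-small @128", 0.11242, 3.06),
    ("SwinV2-tiny @256", 0.10422, 138.07),
    ("EVA-02-small @336", 0.09984, 249.18),
]


def pareto_front(rows):
    """Return names of models that no other model dominates."""
    front = []
    for name, score, cost in rows:
        dominated = any(
            other_score >= score and other_cost <= cost
            and (other_score > score or other_cost < cost)
            for _, other_score, other_cost in rows
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(ROWS)
```

On these rows the helper returns the two stacks and the tabular GBDT, matching the three "Yes" entries; every image-only model is dominated by the GBDT.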

Reading the frontier

The GBDT alone is near-free and already at 0.169; it beats every image-only model in Table 2 on both cost and accuracy. This motivates the GBDT-first architecture.
The image expert earns its place only inside the stack: +0.0049 pAUC for +60 ms. Whether that trade is worthwhile is a deployment decision, which is why a frontier is reported, not a single number.
The heavy backbones (mnv5_300m at 294 M / 765 ms, eva02_small, swinv2_tiny) are far off the frontier, dominated on both axes. With only 393 positives, added capacity lowers the score instead of raising it.

The scored ROC region

Figure 5: OOF ROC with the pAUC@80%TPR band shaded.

Figure 6: Zoom on the high-sensitivity tail (TPR ≥ 0.80) that the metric integrates.

The shaded band is the region the metric integrates: the area between the ROC curve and the line TPR = 0.80, so a perfect ranker scores 0.20. This is a constraint on sensitivity, not on FPR; the curve may enter the band at any false-positive rate. In that band the stack sits above both single experts: the +0.0049 lift is concentrated where it is scored, not in low-sensitivity operating points.
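To make the scored region concrete, here is a minimal pure-Python sketch of the partial AUC above a TPR floor. It is not src/cv.py; the function names are hypothetical, the ROC is linearly interpolated between points, and both classes are assumed to be present.

```python
def roc_points(y_true, scores):
    """ROC points (FPR, TPR) from highest to lowest threshold; ties are grouped."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample that shares this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def pauc_above_tpr(y_true, scores, min_tpr=0.80):
    """Area between the ROC curve and the line TPR = min_tpr."""
    points = roc_points(y_true, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr:
            continue  # segment lies entirely at or below the floor
        if y0 < min_tpr:
            # Segment crosses the floor: start integrating at the crossing.
            t = (min_tpr - y0) / (y1 - y0)
            x0, y0 = x0 + t * (x1 - x0), min_tpr
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2
    return area


perfect = pauc_above_tpr([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
mixed = pauc_above_tpr([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

A perfect ranking reaches the ceiling of 1 − 0.80 = 0.20, while the mixed toy ranking, which needs FPR = 0.5 to recover its second positive, scores 0.10.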

Per-fold stability

Figure 7: Per-fold pAUC for tabular, best image, and stack.

Fold	Tabular	Best image	Stack
0	0.18014	0.16087	0.18061
1	0.15424	0.16947	0.17246
2	0.17614	0.16499	0.18003
3	0.17871	0.15733	0.17898
4	0.15708	0.14277	0.16139
mean ± std	0.1693 ± 0.0112	0.1591 ± 0.0091	0.1747 ± 0.0073

Fold variance is the dominant uncertainty

At 393 positives, per-fold pAUC swings by ~0.026 (tabular ranges 0.154 → 0.180), and the per-fold std (0.007–0.011) exceeds the +0.0049 gap between stack and tabular, so a single-fold number would be untrustworthy. The stack has the lowest std (0.0073) and beats both single experts on every fold, which is the strongest evidence that the +0.0049 is real. It is both the most accurate and the most stable estimator.
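The summary row of the per-fold table can be reproduced from the five fold scores. The sketch below assumes the reported std is the population standard deviation (ddof = 0), which is the convention that matches the printed values; the variable names are illustrative.

```python
from statistics import mean, pstdev

# Per-fold pAUC@80%TPR, copied from the table above (folds 0-4).
FOLDS = {
    "tabular": [0.18014, 0.15424, 0.17614, 0.17871, 0.15708],
    "image": [0.16087, 0.16947, 0.16499, 0.15733, 0.14277],
    "stack": [0.18061, 0.17246, 0.18003, 0.17898, 0.16139],
}

# Mean and population std per model, rounded as in the table.
summary = {
    name: (round(mean(vals), 4), round(pstdev(vals), 4))
    for name, vals in FOLDS.items()
}

# True when the stack strictly beats both single experts on every fold.
stack_wins_every_fold = all(
    s > max(t, i)
    for t, i, s in zip(FOLDS["tabular"], FOLDS["image"], FOLDS["stack"])
)
```

Note that the mean of per-fold scores (0.1747 for the stack) need not equal the pooled OOF score (0.17376), because pAUC is not additive across folds.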

Feature importance

Figure 8: Top-20 GBDT features by mean gain (25 LightGBM + 25 CatBoost boosters).

Aggregated mean gain over 25 LightGBM + 25 CatBoost boosters (each booster normalized to sum = 1 for cross-library comparability, then averaged):

Rank	Feature	Gain %
1	`tbp_lv_H` (hue)	3.28
2	`pdev_tbp_lv_H`	3.05
3	`pdev_clin_size_long_diam_mm`	2.19
4	`pxc_tbp_lv_H_tbp_lv_location`	1.88
5	`clin_size_long_diam_mm`	1.68
6	`prank_tbp_lv_H`	1.48
7	`z_clin_size_long_diam_mm`	1.46
8	`pdev_tbp_lv_Hext`	1.25
9	`pxc_tbp_lv_H_anatom_site_general`	1.09
10	`f_normalized_lesion_size`	1.07

Patient-relative ugly-duckling features (pdev_ / prank_ / pxc_) make up 56% of top-20 gain and ~65% of total gain. A lesion’s deviation from its own patient’s distribution is the dominant signal, confirming the tabular architecture.
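The aggregation described above (normalize each booster's gain to sum to 1, then average) can be sketched as follows. The helper names and toy gains are hypothetical; in practice each dict would be filled from the library's own gain-importance output.

```python
def aggregate_gain(boosters):
    """Average per-booster gain shares; each booster is a {feature: gain} dict."""
    totals = {}
    for gains in boosters:
        booster_sum = sum(gains.values())
        for feature, gain in gains.items():
            # Normalizing first keeps one library's raw scale from dominating.
            totals[feature] = totals.get(feature, 0.0) + gain / booster_sum
    return {feature: total / len(boosters) for feature, total in totals.items()}


def prefix_share(aggregated, prefixes=("pdev_", "prank_", "pxc_")):
    """Fraction of aggregated gain carried by features with the given prefixes."""
    selected = sum(v for f, v in aggregated.items() if f.startswith(prefixes))
    return selected / sum(aggregated.values())


# Two toy boosters on very different raw gain scales (illustrative values).
toy = [
    {"tbp_lv_H": 1.0, "pdev_tbp_lv_H": 3.0},
    {"tbp_lv_H": 100.0, "pdev_tbp_lv_H": 100.0},
]
aggregated = aggregate_gain(toy)
share = prefix_share(aggregated)
```

Without the per-booster normalization, the second toy booster's larger raw gains would swamp the first, which is the cross-library problem the normalization addresses.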

Score separation & calibration

Figure 9: Left: stack score histogram by class. Right: reliability curve.

Malignant score-mass concentrates near 1.0 and benign mass near 0.0, with the mid-range overlap penalized by the partial-AUC region. The reliability curve shows the model separates classes well by rank but is far from probability-calibrated (observed malignant fraction stays ≈0 even in the top bin). At roughly 0.1% prevalence this is expected, and it does not affect the reported metric: pAUC depends only on the rank order of scores, so any monotone recalibration leaves it unchanged. Calibration would matter only if the scores were consumed as probabilities.
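A reliability curve of the kind shown in Figure 9 can be computed by binning scores and comparing each bin's mean score with its observed positive fraction. This is a minimal sketch with equal-width bins; the function name, bin count, and toy data are assumptions, not the project's plotting code.

```python
def reliability_curve(y_true, scores, n_bins=10):
    """Return (mean score, observed positive fraction, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for label, score in zip(y_true, scores):
        # Scores are assumed to lie in [0, 1]; a score of 1.0 goes in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))
    curve = []
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        curve.append((mean_score, observed, len(members)))
    return curve


# Toy data: the top bin holds three high scores but only one true positive.
curve = reliability_curve([0, 0, 0, 1], [0.05, 0.95, 0.95, 0.95], n_bins=2)
```

In the toy data the top bin has a mean score near 0.95 but an observed positive fraction of 1/3, the same kind of gap between score and frequency that the reliability panel shows.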

The CV → private projection

What to expect on a hidden test set

The field saw a ~0.013–0.021 public→private drop on this task even with clean CV, driven by the 393-positive scale and (for the winners) reliance on external/synthetic data. The headline is OOF CV 0.1738 (stack) / 0.1689 (tabular); a fair private-test projection is headline minus ~0.01–0.02, or roughly 0.154–0.164 for the stack. Because no part of the pipeline saw out-of-distribution data, the OOF number should be read as an optimistic estimate of generalization, not a conservative one.

Leaderboard context

Table 3. Honest comparison to top Kaggle solutions (champion private-LB pAUC; ours is leak-audited cross-validation).
Solution	External data?	Synthetic data?	pAUC
1st — Ilya Novoselskiy (EVA-02 + EdgeNeXt + GBDT)	Yes	Yes (~30k)	0.17264
2nd — uchiyama33 (image + tabular ensemble)	Yes	Yes	—
3rd — kyohei-123 (image + tabular blend)	Yes	Yes	—
Ours (single-dataset, no external, no synthetic)	No	No	CV 0.17376

Champions used external ISIC-archive dermoscopy and synthetic positives, both banned here; their own ablation reports the ~30k synthetic lesions added only +0.0007 pAUC. Ours is an out-of-fold CV number, not private LB.

The constrained CV result (0.17376) sits in the same range as the unconstrained 1st-place private pAUC (0.17264), achieved without the external and synthetic data the champions relied on. The out-of-fold CV is not directly comparable to a private-LB number; see Submission and Credits.

Continue to Ablations →