flowchart TD
A["ISIC-2024 SLICE-3D<br/>401,059 crops · 1,042 patients<br/>393 malignant / 400,666 benign"] --> B
B["Patient-grouped<br/>StratifiedGroupKFold<br/>(5 folds · SEED=42 · frozen)<br/><b>no patient straddles folds</b>"]
B --> C["Tabular feature engineering<br/>geometry · color/hue · size ratios<br/>3D position · patient-relative<br/>ugly-duckling (pdev / prank / pxc)<br/>per-fold smoothed target encoding"]
B --> E["Image crops @128 / @224<br/>small ImageNet-pretrained backbone<br/>per-epoch neg-undersample ~1:1<br/>BCE+LS · EMA(0.995) · mixup · TTA"]
C --> D["Bagged GBDT expert<br/>LightGBM (5-seed) + CatBoost<br/>rank-blend<br/><b>OOF pAUC 0.16890</b>"]
E --> F["Image expert<br/>OOF malignancy prob<br/>(+ embedding, ablated)<br/><b>OOF pAUC 0.15821</b>"]
D --> G["Combiner = rank-average<br/>(param-free)"]
F --> G
G --> H["Final OOF score<br/><b>pAUC 0.17376</b><br/>15.8M params · 2.46 GFLOPs · 61 ms CPU"]
H --> I["Submission / frontier point"]
style A fill:#0b6e4f,color:#fff
style B fill:#117a65,color:#fff
style H fill:#1e8449,color:#fff
style I fill:#073b2c,color:#fff
ISIC-2024 SLICE-3D
An efficiency-frontier, single-dataset, no-external-data study of skin-lesion classification
Pakistan.AI · Neural Networks · National Yunlin University of Science and Technology
ISIC-2024 SLICE-3D — quality–efficiency frontier
Automated triage of skin-lesion crops from 3D total-body photography: classifying 393 malignant lesions among 400,666 benign crops (roughly a 1,020:1 imbalance) using only the ISIC-2024 SLICE-3D dataset, with no external dermoscopy and no synthetic positives.
Claim under test: state-of-the-art among single-dataset, no-external-data, no-synthetic solutions, reported as a quality-vs-cost Pareto frontier rather than a single number.
0.1738 Stack OOF pAUC@80%TPR
0.1689 Tabular GBDT OOF pAUC
0.098% Malignant prevalence
401k Lesion crops, 1 dataset
12 Backbones on the frontier
Abstract
The ISIC-2024 SLICE-3D task requires classifying 401,059 lesion crops, extracted from 3D total-body photography (TBP), as malignant or benign. The constraint regime is deliberately strict. The competition’s top private-leaderboard solutions reached pAUC ≈ 0.173 by importing external dermoscopy archives and approximately 30,000 diffusion-synthesized malignant lesions; both are banned here. Training or validating on fabricated pathology opens a label-validity hole, and external data breaks the single-dataset premise. This study isolates what remains once both are removed.
Because the absolute leaderboard number is then out of reach by construction, the problem is re-posed as a Pareto frontier: the best official pAUC above 80% TPR achievable per unit of inference cost (parameters, FLOPs, single-thread CPU latency) on SLICE-3D alone. The architecture is a hybrid. A bagged GBDT tabular expert (LightGBM plus CatBoost, rank-blended) over intrinsic engineered features (geometry, color/hue contrast, size ratios, patient-relative “ugly-duckling” deviations) provides the efficient anchor. A small ImageNet-pretrained image backbone contributes an out-of-fold (OOF) malignancy probability, which a trivial rank-average combiner fuses with the tabular score.
Every tested increase in model complexity reduced the score. The trivial two-way rank-average exceeds a meta-LightGBM stacker, a learned per-lesion gate, and the addition of image embeddings; the small convnextv2_nano backbone exceeds heavier transformers (SwinV2, EVA-02), which collapse to near-random at 393 positives. The best frontier point, rank-avg[GBDT, convnextv2_nano@224], reaches OOF pAUC 0.17376 at 15.8 M params / 2.46 GFLOPs / 61 ms CPU per image.
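The rank-average combiner described above can be illustrated with a minimal sketch. The function name `rank_average` and the toy score arrays are hypothetical and are not taken from the project code; the sketch assumes each expert supplies one OOF score per lesion in the same row order.

```python
import numpy as np
from scipy.stats import rankdata


def rank_average(*score_arrays):
    """Average the normalized ranks of several score vectors.

    Ranking removes each expert's calibration, so only the ordering
    of lesions within each expert contributes to the blend.
    """
    n = len(score_arrays[0])
    # Ties receive their average rank; dividing by n maps ranks into (0, 1].
    ranks = [rankdata(s, method="average") / n for s in score_arrays]
    return np.mean(ranks, axis=0)


# Hypothetical OOF scores for four lesions (illustration only).
gbdt_oof = np.array([0.01, 0.20, 0.05, 0.90])
img_oof = np.array([0.30, 0.10, 0.20, 0.80])

blend = rank_average(gbdt_oof, img_oof)
```

Because the combiner has no learned parameters, it cannot overfit the 393 positives, which is one plausible reading of why it holds up against the learned stackers.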
Headline results
All numbers are out-of-fold (OOF) on the frozen patient-grouped folds (data/folds.parquet, 5 folds, SEED = 42). Each pAUC is computed only by the official scorer (src/cv.py), pinned at 80% TPR (range [0, 0.20]; random ≈ 0.02, perfect = 0.20). All values were independently re-derived from the OOF parquet files by the cv-guardian agent.
| Model | OOF pAUC@80%TPR | Params (M) | GFLOPs | CPU ms/img | Role |
|---|---|---|---|---|---|
| GBDT (tabular only) | 0.16890 | 0.86 | ~0 | 0.02 | efficient anchor (near-free) |
| Best image (convnextv2_nano@224) | 0.15821 | 14.98 | 2.46 | 60.9 | primary backbone |
| Stack: rank-avg[GBDT, nano@224] | 0.17376 | 15.84 | 2.46 | 60.9 | best overall; Pareto-optimal |
| Cheap stack (gbdt+3img+udk) | 0.17117 | 23.84 | 1.22 | 32.9 | frontier point (cheaper) |
| Cheap frontier (effvit_b0) | 0.13706 | 2.13 | 0.034 | 3.5 | most accuracy per millisecond |
| Cost floor (mnv4_small) | 0.11242 | 2.49 | 0.062 | 3.1 | latency floor |
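A Pareto frontier over a single cost axis can be extracted with a short sweep. The sketch below uses the CPU-latency and pAUC columns of the table above; the function `pareto_frontier` and the shortened row labels are illustrative and are not the logic of reports/frontier.py, which is not shown here.

```python
def pareto_frontier(points):
    """Return names of non-dominated points.

    points: iterable of (name, cost, score); lower cost and higher
    score are better. A point is kept only if it scores strictly
    higher than every point that costs the same or less.
    """
    # Sort by cost ascending; among equal costs, best score first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier, best_score = [], float("-inf")
    for name, cost, score in ordered:
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# (label, CPU ms/img, OOF pAUC) copied from the headline table.
rows = [
    ("GBDT", 0.02, 0.16890),
    ("convnextv2_nano@224", 60.9, 0.15821),
    ("stack rank-avg", 60.9, 0.17376),
    ("cheap stack", 32.9, 0.17117),
    ("effvit_b0", 3.5, 0.13706),
    ("mnv4_small", 3.1, 0.11242),
]

front = pareto_frontier(rows)
```

On latency alone, the sweep keeps the GBDT anchor and the two stacks. The image-only rows are dominated by the GBDT row on this axis, so they are best read as reference points within the image-backbone family.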
The metric. pAUC@80%TPR integrates the ROC curve only in the high-sensitivity tail (TPR ≥ 0.80). Because it is bounded in [0, 0.20], absolute differences look small: 0.1689 → 0.1738 is a +2.9% relative improvement. See Methods → The metric.
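As a rough illustration of the metric, the area above TPR = 0.80 can be computed by flipping the labels and scores, which turns the high-TPR tail into a low-FPR head that is easy to truncate. This is a sketch under that assumption, not the contents of src/cv.py; the name `pauc_above_tpr` is hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve


def pauc_above_tpr(y_true, y_score, min_tpr=0.80):
    """Area between the ROC curve and the line TPR = min_tpr.

    Range is [0, 1 - min_tpr]: 0.20 for a perfect ranking at
    min_tpr = 0.80, and about 0.02 for a random one.
    """
    # Flip the problem: benign becomes the positive class and scores
    # are negated, so original TPR >= min_tpr maps to flipped
    # FPR <= 1 - min_tpr.
    flipped_true = 1 - np.asarray(y_true)
    flipped_score = -np.asarray(y_score, dtype=float)
    max_fpr = 1.0 - min_tpr

    fpr, tpr, _ = roc_curve(flipped_true, flipped_score)

    # Truncate the curve at max_fpr, interpolating the final point.
    stop = np.searchsorted(fpr, max_fpr, side="right")
    last_tpr = np.interp(max_fpr, fpr[stop - 1:stop + 1], tpr[stop - 1:stop + 1])
    fpr = np.append(fpr[:stop], max_fpr)
    tpr = np.append(tpr[:stop], last_tpr)
    return auc(fpr, tpr)


# Toy check: a perfect ranking reaches the upper bound of 0.20.
y = np.array([0, 0, 0, 1, 1])
perfect = pauc_above_tpr(y, np.array([0.1, 0.2, 0.3, 0.8, 0.9]))

# Relative gain of the stack over the tabular anchor (headline values).
rel_gain = (0.17376 - 0.16890) / 0.16890
```

The relative gain evaluates to roughly 0.029, matching the +2.9% quoted above.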
Source of the score
The dominant signal is not the raw lesion but how anomalous a lesion is relative to the same patient’s other moles — the ugly-duckling sign. Patient-relative features (pdev_*, prank_*, pxc_*) account for ~65% of total GBDT gain.
| Rank | Feature | Mean gain % | Definition |
|---|---|---|---|
| 1 | tbp_lv_H | 3.28 | lesion hue angle (univariate AUC 0.81) |
| 2 | pdev_tbp_lv_H | 3.05 | hue deviation from the patient’s mean |
| 3 | pdev_clin_size_long_diam_mm | 2.19 | size deviation vs the patient’s own lesions |
| 4 | pxc_tbp_lv_H_location | 1.88 | hue × body-location patient interaction |
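The patient-relative features in the table can be sketched with a pandas group-by. The `pdev_` column follows the table's definition (deviation from the patient's mean); the `prank_` column is assumed here to be a within-patient percentile rank, and the `patient_id` column name and the helper `add_patient_relative` are likewise assumptions for illustration.

```python
import pandas as pd


def add_patient_relative(df, col, group="patient_id"):
    """Add ugly-duckling style features for one numeric column."""
    grouped = df.groupby(group)[col]
    out = df.copy()
    # Deviation of each lesion from its own patient's mean.
    out[f"pdev_{col}"] = df[col] - grouped.transform("mean")
    # Percentile rank of each lesion among the same patient's lesions
    # (assumed definition of prank_*).
    out[f"prank_{col}"] = grouped.rank(pct=True)
    return out


# Toy metadata: two patients, five lesions (values are made up).
demo = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1", "p2", "p2"],
    "tbp_lv_H": [50.0, 52.0, 60.0, 40.0, 44.0],
})

feats = add_patient_relative(demo, "tbp_lv_H")
```

In the toy frame, the third lesion of patient p1 stands out with a deviation of +6 and the top percentile rank. These features use no labels, so they can be computed once over all rows without leaking the target across folds.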
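Pipeline
The flowchart at the top of this page traces the pipeline: frozen patient-grouped folds feed a tabular GBDT expert and an image expert, whose OOF outputs are combined by a parameter-free rank average.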
Scope
- This study does not claim to beat the unconstrained leaderboard winner’s private pAUC ≈ 0.1755. That score used external data plus ~30k synthetic positives; matching it under these rules is impossible by construction, which is the purpose of the experiment.
- The claim is SOTA within the no-external-data / no-synthetic / single-dataset class, on a transparent quality-vs-cost frontier with leak-audited CV.
- The 1st-place team’s own ablation reports their ~30k synthetic lesions added only +0.0007 pAUC, a quantitative bound on the value of synthetic augmentation for a 393-positive task.
- The OOF number is a conservative generalization estimate, because no part of the pipeline saw out-of-distribution data. The expected public→private drop on this task is ~0.013–0.021; the OOF CV is reported with that projection stated.
Every figure on this site is generated by reports/eda/make_eda.py, reports/perf/make_perf.py, and reports/frontier.py, plus Mermaid diagrams. No external copyrighted images are used, consistent with the dataset’s CC BY-NC 4.0 terms.