ISIC-2024 SLICE-3D

An efficiency-frontier, single-dataset, no-external-data study of skin-lesion classification

Authors

Raja, Muhammad Junaid Ali Asif

Sultan, Adil

Hassan, Shahzaib Ahmed

Published

June 18, 2026

Pakistan.AI · Neural Networks · National Yunlin University of Science and Technology

ISIC-2024 SLICE-3D — quality–efficiency frontier

Automated triage of skin-lesion crops from 3D total-body photography: classifying 393 malignant lesions among 400,666 benign crops (a 1,021:1 imbalance) using only the ISIC-2024 SLICE-3D dataset, with no external dermoscopy and no synthetic positives.

Claim under test: state-of-the-art among single-dataset, no-external-data, no-synthetic solutions, reported as a quality-vs-cost Pareto frontier rather than a single number.

  • Stack OOF pAUC@80%TPR: 0.1738
  • Tabular GBDT OOF pAUC: 0.1689
  • Malignant prevalence: 0.098%
  • Lesion crops: 401k (1 dataset)
  • Backbones on the frontier: 12

Note: Contents

This site is the companion report. Each page is self-contained and quantitative.

  • Data & EDA — dataset, imbalance, exploratory figures, dropped leak columns.
  • Methods — the metric, the CV spine, tabular features, the GBDT, the image experts, the stack.
  • Results — the three-axis Pareto frontier, the scored ROC region, per-fold stability, calibration.
  • Ablations — each complexity increase tested, including those that reduced the score.
  • Reproduce — environment, seeds, commands, and the re-verification of each number.
  • Credits — attribution to the leaderboard solutions and the public-notebook lineage.
  • References — citations for the dataset, metric, backbones, and GBDTs.

Presentation: slide deck (PDF)

Abstract

The ISIC-2024 SLICE-3D task requires classifying 401,059 lesion crops, extracted from 3D total-body photography (TBP), as malignant or benign. The constraint regime is deliberately strict. The competition’s top private-leaderboard solutions reached pAUC ≈ 0.173 by importing external dermoscopy archives and approximately 30,000 diffusion-synthesized malignant lesions; both are banned here. Training or validating on fabricated pathology undermines label validity, and external data breaks the single-dataset premise. The effect of removing both is what this study isolates.

Because the absolute leaderboard number is then out of reach by construction, the problem is re-posed as a Pareto frontier: the best official pAUC above 80% TPR achievable per unit of inference cost (parameters, FLOPs, single-thread CPU latency) on SLICE-3D alone. The architecture is a hybrid. A LightGBM tabular expert over intrinsic engineered features (geometry, color/hue contrast, size ratios, patient-relative “ugly-duckling” deviations) provides the efficient anchor. A small ImageNet-pretrained image backbone contributes an out-of-fold (OOF) malignancy probability, fused into the final score by a trivial rank-average combiner.

Every tested increase in model complexity reduced the score. The trivial two-way rank-average exceeds a meta-LightGBM stacker, a learned per-lesion gate, and the addition of image embeddings; the small convnextv2_nano backbone exceeds heavier transformers (SwinV2, EVA-02), which collapse to near-random at 393 positives. The best frontier point, rank-avg[GBDT, convnextv2_nano@224], reaches OOF pAUC 0.17376 at 15.8 M params / 2.46 GFLOPs / 61 ms CPU per image.
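The rank-average combiner described above can be sketched in a few lines. This is a minimal illustration and not the project's code: the function names `rank_normalize` and `rank_average` and the toy scores are hypothetical, and giving tied scores their average rank is an assumption.

```python
from typing import List, Sequence


def rank_normalize(scores: Sequence[float]) -> List[float]:
    """Map scores to average ranks scaled into (0, 1]; tied scores share a rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied scores that starts at sorted position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


def rank_average(*experts: Sequence[float]) -> List[float]:
    """Unweighted mean of per-expert normalized ranks (no fitted parameters)."""
    normalized = [rank_normalize(e) for e in experts]
    return [sum(col) / len(col) for col in zip(*normalized)]


# Toy out-of-fold scores on very different scales (hypothetical values).
gbdt_oof = [0.001, 0.020, 0.003, 0.900]
image_oof = [0.10, 0.35, 0.40, 0.95]
fused = rank_average(gbdt_oof, image_oof)
```

Because only the ordering within each expert survives the rank transform, the two experts need no calibration against each other, and there is nothing to fit on 393 positives.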

Headline results

All numbers are out-of-fold (OOF) on the frozen patient-grouped folds (data/folds.parquet, 5 folds, SEED = 42). Each pAUC is computed only by the official scorer (src/cv.py), pinned at 80% TPR (range [0, 0.20]; random ≈ 0.02, perfect = 0.20). All values were independently re-derived from the OOF parquet files by the cv-guardian agent.
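The patient-grouped guarantee behind these numbers can be checked with a short audit. The sketch below assumes the fold assignment can be read as (patient, fold) pairs, one per lesion crop; the function name and the sample rows are hypothetical, and the actual schema of data/folds.parquet is not shown on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def patients_straddling_folds(rows: Iterable[Tuple[str, int]]) -> Dict[str, Set[int]]:
    """Return every patient whose crops appear in more than one fold."""
    folds_by_patient: Dict[str, Set[int]] = defaultdict(set)
    for patient_id, fold in rows:
        folds_by_patient[patient_id].add(fold)
    return {p: f for p, f in folds_by_patient.items() if len(f) > 1}


# Hypothetical (patient_id, fold) pairs, one per lesion crop.
clean = [("P1", 0), ("P1", 0), ("P2", 1), ("P3", 4)]
leaky = clean + [("P2", 3)]  # P2 now appears in folds 1 and 3

offenders_clean = patients_straddling_folds(clean)
offenders_leaky = patients_straddling_folds(leaky)
```

An empty result means no patient straddles folds. Any non-empty result would let patient-relative features leak between the training and validation sides of a split.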

| Model | OOF pAUC@80%TPR | Params (M) | GFLOPs | CPU ms/img | Role |
|---|---|---|---|---|---|
| GBDT (tabular only) | 0.16890 | 0.86 | ~0 | 0.02 | efficient anchor (near-free) |
| Best image (convnextv2_nano@224) | 0.15821 | 14.98 | 2.46 | 60.9 | primary backbone |
| Stack: rank-avg[GBDT, nano@224] | 0.17376 | 15.84 | 2.46 | 60.9 | best overall; Pareto-optimal |
| Cheap stack (gbdt+3img+udk) | 0.17117 | 23.84 | 1.22 | 32.9 | frontier point (cheaper) |
| Cheap frontier (effvit_b0) | 0.13706 | 2.13 | 0.034 | 3.5 | most accuracy per millisecond |
| Cost floor (mnv4_small) | 0.11242 | 2.49 | 0.062 | 3.1 | latency floor |
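A Pareto filter over rows like these can be sketched as follows. This is an illustration under stated assumptions and not reports/frontier.py: the `Point` type and both functions are hypothetical, dominance is taken over all four columns at once (pAUC maximized, the three costs minimized), and the comparison is restricted to the three image-only rows of the table.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    name: str
    pauc: float      # higher is better
    params_m: float  # lower is better
    gflops: float    # lower is better
    cpu_ms: float    # lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (a.pauc >= b.pauc and a.params_m <= b.params_m
                and a.gflops <= b.gflops and a.cpu_ms <= b.cpu_ms)
    strictly_better = (a.pauc > b.pauc or a.params_m < b.params_m
                       or a.gflops < b.gflops or a.cpu_ms < b.cpu_ms)
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Image-only rows copied from the table above.
image_only = [
    Point("convnextv2_nano@224", 0.15821, 14.98, 2.46, 60.9),
    Point("effvit_b0", 0.13706, 2.13, 0.034, 3.5),
    Point("mnv4_small", 0.11242, 2.49, 0.062, 3.1),
]
front = pareto_front(image_only)
```

Under these assumptions none of the three image-only backbones dominates another: each is best on at least one axis, which is what keeps a point on a multi-axis frontier.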

The metric. pAUC@80%TPR integrates the ROC curve only in the high-sensitivity tail (TPR ≥ 0.80). It is bounded in [0, 0.20], so small absolute differences are meaningful: 0.169 → 0.174 is a +2.9% relative improvement. See Methods → The metric.
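To make the bounds concrete, here is a minimal pure-Python sketch of a partial AUC above a TPR floor. It is not the official scorer in src/cv.py; the function name `pauc_above_tpr` is hypothetical, and the linear interpolation where the ROC curve crosses the floor is an assumption of this sketch.

```python
from typing import Sequence


def pauc_above_tpr(y_true: Sequence[int], y_score: Sequence[float],
                   min_tpr: float = 0.80) -> float:
    """Area between the ROC curve and the line TPR = min_tpr."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    pairs = sorted(zip(y_score, y_true), key=lambda t: -t[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for idx, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit one ROC point per distinct threshold, so ties form one segment.
        if idx + 1 == len(pairs) or pairs[idx + 1][0] != score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    area = 0.0
    for x0, y0, x1, y1 in zip(fpr, tpr, fpr[1:], tpr[1:]):
        if y1 <= min_tpr:
            continue  # segment lies entirely at or below the floor
        if y0 < min_tpr:
            # Segment crosses the floor: start integrating at the crossing.
            x0 = x0 + (x1 - x0) * (min_tpr - y0) / (y1 - y0)
            y0 = min_tpr
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2
    return area


perfect = pauc_above_tpr([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
uninformative = pauc_above_tpr([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])

# Arithmetic on the headline numbers reported on this page.
relative_gain = (0.17376 - 0.16890) / 0.16890
headroom_share = (0.17376 - 0.16890) / (0.20 - 0.16890)
```

A perfect ranking reaches the 0.20 ceiling, and fully tied scores give the 0.02 chance level. The stack's gain over the GBDT is about 2.9% in relative terms and about 16% of the headroom that remained below 0.20.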

Source of the score

The dominant signal is not the raw lesion but how anomalous a lesion is relative to the same patient’s other moles — the ugly-duckling sign. Patient-relative features (pdev_*, prank_*, pxc_*) account for ~65% of total GBDT gain.

| Rank | Feature | Mean gain % | Definition |
|---|---|---|---|
| 1 | tbp_lv_H | 3.28 | lesion hue angle (univariate AUC 0.81) |
| 2 | pdev_tbp_lv_H | 3.05 | hue deviation from the patient’s mean |
| 3 | pdev_clin_size_long_diam_mm | 2.19 | size deviation vs the patient’s own lesions |
| 4 | pxc_tbp_lv_H_location | 1.88 | hue × body-location patient interaction |
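The patient-relative idea can be illustrated with a small sketch. The pdev_ definition (value minus the patient's mean) follows the table above; the prank_ definition used here (within-patient percentile rank), the `patient_relative` helper, the `patient_id` key, and the toy rows are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def patient_relative(rows: List[dict], col: str) -> List[dict]:
    """Add pdev_<col> and prank_<col> computed within each patient."""
    by_patient: Dict[str, List[float]] = defaultdict(list)
    for r in rows:
        by_patient[r["patient_id"]].append(r[col])
    out = []
    for r in rows:
        vals = by_patient[r["patient_id"]]
        pdev = r[col] - mean(vals)  # deviation from the patient's own mean
        # Fraction of this patient's lesions with a value <= this one.
        prank = sum(v <= r[col] for v in vals) / len(vals)
        out.append({**r, f"pdev_{col}": pdev, f"prank_{col}": prank})
    return out


# Hypothetical hue values: patient A has one lesion far from the others.
rows = [
    {"patient_id": "A", "tbp_lv_H": 50.0},
    {"patient_id": "A", "tbp_lv_H": 52.0},
    {"patient_id": "A", "tbp_lv_H": 54.0},
    {"patient_id": "A", "tbp_lv_H": 72.0},
    {"patient_id": "B", "tbp_lv_H": 60.0},
    {"patient_id": "B", "tbp_lv_H": 62.0},
]
features = patient_relative(rows, "tbp_lv_H")
```

In this toy data the fourth lesion of patient A sits 15 hue units above that patient's mean, while patient B's lesions deviate by only 1. The same raw hue would look different against a different patient's baseline, which is the point of the ugly-duckling features.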

Pipeline

flowchart TD
    A["ISIC-2024 SLICE-3D<br/>401,059 crops · 1,042 patients<br/>393 malignant / 400,666 benign"] --> B

    B["Patient-grouped<br/>StratifiedGroupKFold<br/>(5 folds · SEED=42 · frozen)<br/><b>no patient straddles folds</b>"]

    B --> C["Tabular feature engineering<br/>geometry · color/hue · size ratios<br/>3D position · patient-relative<br/>ugly-duckling (pdev / prank / pxc)<br/>per-fold smoothed target encoding"]
    B --> E["Image crops @128 / @224<br/>small ImageNet-pretrained backbone<br/>per-epoch neg-undersample ~1:1<br/>BCE+LS · EMA(0.995) · mixup · TTA"]

    C --> D["Bagged GBDT expert<br/>LightGBM (5-seed) + CatBoost<br/>rank-blend<br/><b>OOF pAUC 0.16890</b>"]
    E --> F["Image expert<br/>OOF malignancy prob<br/>(+ embedding, ablated)<br/><b>OOF pAUC 0.15821</b>"]

    D --> G["Combiner = rank-average<br/>(param-free)"]
    F --> G

    G --> H["Final OOF score<br/><b>pAUC 0.17376</b><br/>15.8M params · 2.46 GFLOPs · 61 ms CPU"]
    H --> I["Submission / frontier point"]

    style A fill:#0b6e4f,color:#fff
    style B fill:#117a65,color:#fff
    style H fill:#1e8449,color:#fff
    style I fill:#073b2c,color:#fff
Figure 1: End-to-end pipeline. Patient-grouped folds are frozen once; every expert and the stack read the same split. No model trains until the no-leak guarantee passes.
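The per-epoch negative undersampling step in the image branch could look roughly like the sketch below. It is an assumption-laden illustration, not the training code: `epoch_indices`, the `neg_per_pos` parameter, and the way the seed is mixed with the epoch number are all hypothetical.

```python
import random
from typing import List, Sequence


def epoch_indices(labels: Sequence[int], epoch: int, seed: int = 42,
                  neg_per_pos: float = 1.0) -> List[int]:
    """All positives plus a fresh random subset of negatives for one epoch."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Seed depends on the epoch: a new draw each epoch, reproducible across runs.
    rng = random.Random(seed * 100_003 + epoch)
    k = min(len(neg), round(len(pos) * neg_per_pos))
    idx = pos + rng.sample(neg, k)
    rng.shuffle(idx)
    return idx


# Toy labels: 3 positives among 1,000 negatives (hypothetical).
labels = [1] * 3 + [0] * 1000
epoch0 = epoch_indices(labels, epoch=0)
epoch0_again = epoch_indices(labels, epoch=0)
```

Every positive is seen in every epoch, while the negatives are resampled each time, so over many epochs the model is exposed to a much larger share of the benign pool than any single balanced epoch contains.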

Scope

Important: What is and is not claimed
  • This study does not claim to beat the unconstrained leaderboard winner’s private pAUC ≈ 0.1755. That score used external data plus ~30k synthetic positives; matching it under these rules is impossible by construction, which is the purpose of the experiment.
  • The claim is SOTA within the no-external-data / no-synthetic / single-dataset class, on a transparent quality-vs-cost frontier with leak-audited CV.
  • The 1st-place team’s own ablation reports their ~30k synthetic lesions added only +0.0007 pAUC, a quantitative bound on the value of synthetic augmentation for a 393-positive task.
  • The OOF number is a conservative generalization estimate, because no part of the pipeline saw out-of-distribution data. The expected public→private drop on this task is ~0.013–0.021; the OOF CV is reported with that projection stated.
Note: Figures

Every figure on this site is either generated by reports/eda/make_eda.py, reports/perf/make_perf.py, or reports/frontier.py, or drawn as a Mermaid diagram. No external copyrighted images are used, consistent with the dataset’s CC BY-NC 4.0 terms.

