Submission & Leaderboard

What a Kaggle CODE competition submits, what was produced, and where the constrained score sits

ISIC-2024 was a Kaggle CODE competition: submission is a notebook that Kaggle re-runs on a hidden test set, not a CSV of predictions. The public test shipped with the data is a 3-row placeholder, so a private score cannot be computed locally. This page states what can be produced today, what the CV implies, and the path to a real late-submission score.

How scoring works

ImportantA CODE competition scores a notebook, not a file
  • A notebook is attached to the competition. On submit, Kaggle executes it against the hidden test set (the real evaluation images) and scores the submission.csv it writes.
  • The test-image.hdf5 and test-metadata.csv shipped in the data bundle are a 3-row placeholder, sufficient only to let the notebook run end-to-end.
  • No private pAUC can be produced offline; a private score computed locally would be a number off three rows.

The placeholder held on disk, verified directly:

Artifact Local On Kaggle
data/test-image.hdf5 ≈ 11 KB, 3 images (ISIC_0015657, ISIC_0015729, ISIC_0015740) tens of thousands of hidden crops
data/test-metadata.csv 3 rows (header + 3) the full hidden cohort’s metadata
data/sample_submission.csv 3 rows, constant target = 0.3 format template only

A local run can therefore only confirm that the inference pipeline is wired correctly and emits a format-valid file.

What was produced

A format-valid submission.csv was generated from the saved bagged GBDT — 50 boosters (5 folds × 5 LightGBM seeds + 5 CatBoost seeds, persisted in experiments/gbdt_boosters.joblib) — via a single command:

python -m src.submit          # GBDT-only; reads experiments/gbdt_boosters.joblib

Inference mirrors the leak-free training transform: every booster carries its own per-fold feature state, predict_gbdt re-applies that state before predicting, and the per-fold scores are averaged. The output is the required two-column schema:

isic_id,target
ISIC_0015657,0.41827...
ISIC_0015729,0.06913...
ISIC_0015740,0.22540...
Column Type Meaning
isic_id string lesion-crop identifier, one row per test image
target float predicted malignancy score (rank-comparable; pAUC is rank-based, so calibration is irrelevant)

Scope of the validation. Running python -m src.submit on the 3-row placeholder confirms the inference path end-to-end: boosters load, the fold-local feature transform replays without leakage, and the writer emits a valid 2-column file. It does not produce a real score, which requires Kaggle to run the notebook on the hidden test (a late submission, since the competition is closed).

Estimated private position (from CV)

Because the hidden set cannot be scored offline, the projection is from the out-of-fold CV, which — having seen no out-of-distribution data — is a conservative generalization estimate. The field’s observed CV → private drop on this task was roughly 0.013–0.021.

TipCV → estimated private
Quantity Value Source
Stack OOF pAUC@80%TPR 0.17376 stack_oof.parquet, re-derived by cv-guardian
Tabular GBDT OOF pAUC 0.16890 gbdt_oof.parquet
Observed CV → private drop (field) 0.013 – 0.021 post-competition write-ups
Estimated private (stack) ≈ 0.155 – 0.165 0.1738 − drop
Estimated private (tabular-only, ready today) ≈ 0.148 – 0.156 0.1689 − drop

A range is reported, not a rank: fold variance (~±0.008 std) and the unknown shake-up make any single claimed position unreliable.

Leaderboard context (~2,700 teams)

NoteThe unconstrained ceiling

1st place — private pAUC 0.17264. Used external ISIC-archive dermoscopy data and ~30,000 Stable-Diffusion-synthesized malignant lesions, both banned here. Their own ablation: the 30k synthetic lesions added only +0.0007 pAUC.

NoteThe constrained ceiling

Best no-external-data solutions sat around 0.16 private; a clean tabular-only team reached 0.162. This is the bracket the present work competes in, and the estimated private ≈ 0.155–0.165 lands inside it.

Reference point Private pAUC External data? Synthetic positives?
1st place (unconstrained) 0.17264 yes yes (≈ 30k)
Best no-external solutions ≈ 0.16 no varies
Clean tabular-only team 0.162 no no
Ours — estimated private (stack) ≈ 0.155 – 0.165 no no
Random baseline ≈ 0.02
ImportantThe claim, restated

The estimated private ≈ 0.155–0.165 is competitive in the silver/gold zone and SOTA-class among no-external / no-synthetic / single-dataset solutions, the class the hard rules define. The unconstrained 0.17264 used the two banned ingredients and is impossible to match by construction. See Scope.

Stack vs. tabular submission

The tabular GBDT submission is ready today; the full stack submission is not.

WarningWhat the stack requires that is not yet persisted

A stack submission must run image inference on the hidden test set, which requires the trained per-fold image checkpoints. The image expert’s OOF probabilities and embeddings are persisted (enough to score and ablate the stack on CV), but the per-fold model weights are not. Therefore:

  • Ready now: GBDT-only submission.csv (python -m src.submit), validating the full inference path on the placeholder.
  • Next step: a stack notebook that loads saved image checkpoints, scores the hidden test, and rank-blends with the GBDT. src/submit.py exposes the hook (--image-scores ... --image-weight ...); it is left unwired rather than fabricating image predictions until the checkpoints exist.

The late-submission path

flowchart LR
    A["Trained models<br/>GBDT boosters (done)<br/>+ per-fold image ckpts (next)"] --> B
    B["Upload as a<br/>Kaggle Dataset"] --> C
    C["Inference notebook<br/>attach competition + dataset<br/>load models, predict hidden test<br/>rank-blend GBDT + image"] --> D
    D["Write submission.csv<br/>(isic_id, target)"] --> E["Submit (late)<br/>→ real private pAUC"]

    A -. "image branch<br/>needs saved ckpts" .-> C
    style A fill:#0b6e4f,color:#fff
    style E fill:#1e8449,color:#fff
Figure 1: The path to a real private score via a Kaggle late submission. Steps 1–3 are done today (GBDT-only); the dashed image branch is the next step.

To obtain a genuine private score:

  1. Train + persist per-fold image checkpoints (src/vision/train.py) — the only missing artifact; OOF probs/embeddings already exist.
  2. Upload the trained models (GBDT joblib + image checkpoints) as a private Kaggle dataset.
  3. Author an inference notebook that attaches the competition data plus the model dataset, runs predict_test(...) with --image-scores, and writes the 2-column submission.csv.
  4. Submit as a late submission — Kaggle runs it on the hidden test and returns the private pAUC, projected to land in the 0.155–0.165 band.

The tabular-only path (steps 2–4 without the image branch) is runnable immediately and is the ready-today deliverable; the stack path is a fully-specified follow-up.


Continue to Results →