Submission & Leaderboard

What a Kaggle CODE competition submits, what was produced, and where the constrained score sits

ISIC-2024 was a Kaggle CODE competition: submission is a notebook that Kaggle re-runs on a hidden test set, not a CSV of predictions. The public test shipped with the data is a 3-row placeholder, so a private score cannot be computed locally. This page states what can be produced today, what the CV implies, and the path to a real late-submission score.

How scoring works

A CODE competition scores a notebook, not a file

A notebook is attached to the competition. On submit, Kaggle executes it against the hidden test set (the real evaluation images) and scores the submission.csv it writes.
The test-image.hdf5 and test-metadata.csv shipped in the data bundle are a 3-row placeholder, sufficient only to let the notebook run end-to-end.
No private pAUC can be produced offline; any score computed locally would rest on three placeholder rows and carry no information.

The placeholder files on disk, verified directly:

Artifact	Local	On Kaggle
`data/test-image.hdf5`	≈ 11 KB, 3 images (`ISIC_0015657`, `ISIC_0015729`, `ISIC_0015740`)	tens of thousands of hidden crops
`data/test-metadata.csv`	3 rows (header + 3)	the full hidden cohort’s metadata
`data/sample_submission.csv`	3 rows, constant `target = 0.3`	format template only

A local run can therefore only confirm that the inference pipeline is wired correctly and emits a format-valid file.
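
The format check that a local run can perform might look like the sketch below. It is an illustration only, not the project's code: `validate_submission` is a hypothetical helper, and the rules it enforces (exact header, one row per test id, finite numeric target) are assumptions drawn from the sample file described above.

```python
import csv
import io
import math


def validate_submission(csv_text, expected_ids):
    """Return a list of problems; an empty list means the file is format-valid."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["isic_id", "target"]:
        return [f"unexpected header: {header}"]
    seen = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, got {len(row)}")
            continue
        isic_id, target = row
        seen.append(isic_id)
        try:
            ok = math.isfinite(float(target))
        except ValueError:
            ok = False
        if not ok:
            problems.append(f"line {line_no}: target is not a finite number: {target!r}")
    # Every test id must appear exactly once, with no extras.
    if sorted(seen) != sorted(expected_ids):
        problems.append("row ids do not match the test ids one-to-one")
    return problems


# The three placeholder ids listed in the table above.
placeholder_ids = ["ISIC_0015657", "ISIC_0015729", "ISIC_0015740"]
sample = "isic_id,target\nISIC_0015657,0.3\nISIC_0015729,0.3\nISIC_0015740,0.3\n"
print(validate_submission(sample, placeholder_ids))
```

A check of this kind says nothing about model quality; it only guards against a submission that Kaggle would reject or mis-score for structural reasons.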

What was produced

A format-valid submission.csv was generated from the saved bagged GBDT ensemble of 50 boosters (5 folds × (5 LightGBM seeds + 5 CatBoost seeds), persisted in experiments/gbdt_boosters.joblib) via a single command:

python -m src.submit          # GBDT-only; reads experiments/gbdt_boosters.joblib

Inference mirrors the leak-free training transform: every booster carries its own per-fold feature state, predict_gbdt re-applies that state before predicting, and the per-fold scores are averaged. The output is the required two-column schema:

isic_id,target
ISIC_0015657,0.41827...
ISIC_0015729,0.06913...
ISIC_0015740,0.22540...

Column	Type	Meaning
`isic_id`	string	lesion-crop identifier, one row per test image
`target`	float	predicted malignancy score (rank-comparable; pAUC is rank-based, so calibration is irrelevant)
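
The rank-based nature of the metric can be illustrated with a small pure-Python sketch. It computes the area under the ROC curve that lies above 80% TPR and shows that a strictly monotone rescaling of the scores leaves the value unchanged. This is not the competition's official scoring code; `pauc_above_tpr` and the toy labels and scores are hypothetical.

```python
import math


def pauc_above_tpr(y_true, y_score, min_tpr=0.8):
    """Area under the ROC curve lying above TPR = min_tpr (maximum 1 - min_tpr)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    # Build ROC points, adding one point per group of tied scores.
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    # Integrate (TPR - min_tpr) over FPR, clipping each segment at min_tpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr:
            continue
        if y0 < min_tpr:
            x0 += (x1 - x0) * (min_tpr - y0) / (y1 - y0)
            y0 = min_tpr
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2
    return area


labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]
# A strictly increasing transform changes calibration but not the ordering.
squashed = [1 / (1 + math.exp(-10 * (s - 0.5))) for s in scores]
print(pauc_above_tpr(labels, scores) == pauc_above_tpr(labels, squashed))
```

Under this definition a perfect ranking scores 0.2, and a vector of identical scores (a chance-level diagonal ROC) scores 0.02.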

Scope of the validation. Running python -m src.submit on the 3-row placeholder confirms the inference path end-to-end: boosters load, the fold-local feature transform replays without leakage, and the writer emits a valid 2-column file. It does not produce a real score, which requires Kaggle to run the notebook on the hidden test (a late submission, since the competition is closed).
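
The replay-then-average pattern described above can be sketched as follows. This is a toy illustration under stated assumptions, not the project's `predict_gbdt`: `FoldBooster`, `apply_fold_state`, and `predict_averaged` are hypothetical names, and mean-centering stands in for whatever fold-local feature state the real boosters carry.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class FoldBooster:
    """Hypothetical stand-in for one persisted booster and its fold-local state."""
    feature_means: Row             # fitted on that fold's training rows only
    score: Callable[[Row], float]  # the trained model's predict function


def apply_fold_state(row: Row, feature_means: Row) -> Row:
    """Re-apply the fold's own centering; nothing is re-fitted on test rows."""
    return {name: value - feature_means.get(name, 0.0) for name, value in row.items()}


def predict_averaged(boosters: List[FoldBooster], rows: List[Row]) -> List[float]:
    """Score every row with every booster under its own state, then average."""
    return [
        fmean(b.score(apply_fold_state(row, b.feature_means)) for b in boosters)
        for row in rows
    ]


# Toy usage: two "boosters" that differ only in their fold-local means.
boosters = [
    FoldBooster({"size_mm": 4.0}, lambda r: 0.5 + 0.1 * r["size_mm"]),
    FoldBooster({"size_mm": 6.0}, lambda r: 0.5 + 0.1 * r["size_mm"]),
]
rows = [{"size_mm": 5.0}, {"size_mm": 7.0}]
scores = predict_averaged(boosters, rows)
```

The point of the design is that test rows never influence the transform: each booster sees features prepared exactly as they were for its own fold during training.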

Estimated private position (from CV)

Because the hidden set cannot be scored offline, the projection is taken from the out-of-fold CV. Having seen no out-of-distribution data, the CV is an optimistic generalization estimate, so a drop is expected on the private set. The field’s observed CV → private drop on this task was roughly 0.013–0.021.

CV → estimated private

Quantity	Value	Source
Stack OOF pAUC@80%TPR	0.17376	`stack_oof.parquet`, re-derived by `cv-guardian`
Tabular GBDT OOF pAUC	0.16890	`gbdt_oof.parquet`
Observed CV → private drop (field)	0.013 – 0.021	post-competition write-ups
Estimated private (stack)	≈ 0.155 – 0.165	0.1738 − drop
Estimated private (tabular-only, ready today)	≈ 0.148 – 0.156	0.1689 − drop

A range is reported, not a rank: fold variance (~±0.008 std) and the unknown shake-up make any single claimed position unreliable.
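
The "OOF minus drop" arithmetic in the table can be reproduced with a few lines. This is a minimal sketch; `private_band` is a hypothetical helper, and the inputs are the OOF values and drop range quoted in the table.

```python
def private_band(cv_pauc, drop_low=0.013, drop_high=0.021):
    """Subtract the observed CV -> private drop range from an OOF score."""
    return cv_pauc - drop_high, cv_pauc - drop_low


for name, cv in [("stack", 0.17376), ("tabular GBDT", 0.16890)]:
    low, high = private_band(cv)
    print(f"{name}: {low:.3f} to {high:.3f}")
```

Plain subtraction gives about 0.153 to 0.161 for the stack, slightly below the rounded band quoted in the table, and 0.148 to 0.156 for the tabular model. Fold variance is not included in either figure.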

Leaderboard context (~2,700 teams)

The unconstrained ceiling

1st place — private pAUC 0.17264. Used external ISIC-archive dermoscopy data and ~30,000 Stable-Diffusion-synthesized malignant lesions, both banned here. Their own ablation: the 30k synthetic lesions added only +0.0007 pAUC.

The constrained ceiling

Best no-external-data solutions sat around 0.16 private; a clean tabular-only team reached 0.162. This is the bracket the present work competes in, and the estimated private ≈ 0.155–0.165 lands inside it.

Reference point	Private pAUC	External data?	Synthetic positives?
1st place (unconstrained)	0.17264	yes	yes (≈ 30k)
Best no-external solutions	≈ 0.16	no	varies
Clean tabular-only team	0.162	no	no
Ours — estimated private (stack)	≈ 0.155 – 0.165	no	no
Random baseline	≈ 0.02	—	—

The claim, restated

The estimated private ≈ 0.155–0.165 is competitive in the silver/gold zone and SOTA-class among no-external / no-synthetic / single-dataset solutions, the class the hard rules define. The unconstrained 0.17264 used the two banned ingredients and is impossible to match by construction. See Scope.

Stack vs. tabular submission

The tabular GBDT submission is ready today; the full stack submission is not.

What the stack requires that is not yet persisted

A stack submission must run image inference on the hidden test set, which requires the trained per-fold image checkpoints. The image expert’s OOF probabilities and embeddings are persisted (enough to score and ablate the stack on CV), but the per-fold model weights are not. Therefore:

Ready now: GBDT-only submission.csv (python -m src.submit), validating the full inference path on the placeholder.
Next step: a stack notebook that loads saved image checkpoints, scores the hidden test, and rank-blends with the GBDT. src/submit.py exposes the hook (--image-scores ... --image-weight ...); it stays unwired until the checkpoints exist, rather than fabricating image predictions.
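
A rank-blend of two score vectors can be sketched as below. This is one common way to do it, not necessarily what src/submit.py implements: `to_unit_ranks`, `rank_blend`, and the default `image_weight=0.3` are hypothetical and chosen only for illustration.

```python
def to_unit_ranks(scores):
    """Map scores to average ranks scaled to [0, 1]; tied scores share a rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    denom = max(len(scores) - 1, 1)
    return [r / denom for r in ranks]


def rank_blend(gbdt_scores, image_scores, image_weight=0.3):
    """Weighted average of the two models' unit ranks (weight is an assumption)."""
    g = to_unit_ranks(gbdt_scores)
    m = to_unit_ranks(image_scores)
    return [(1 - image_weight) * a + image_weight * b for a, b in zip(g, m)]


blended = rank_blend([0.1, 0.4, 0.9], [0.8, 0.2, 0.5], image_weight=0.5)
```

Blending ranks rather than raw scores keeps the two models on a common scale regardless of how each is calibrated, which suits a rank-based metric.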

The late-submission path

flowchart LR
    A["Trained models<br/>GBDT boosters (done)<br/>+ per-fold image ckpts (next)"] --> B
    B["Upload as a<br/>Kaggle Dataset"] --> C
    C["Inference notebook<br/>attach competition + dataset<br/>load models, predict hidden test<br/>rank-blend GBDT + image"] --> D
    D["Write submission.csv<br/>(isic_id, target)"] --> E["Submit (late)<br/>→ real private pAUC"]

    A -. "image branch<br/>needs saved ckpts" .-> C
    style A fill:#0b6e4f,color:#fff
    style E fill:#1e8449,color:#fff

Figure 1: The path to a real private score via a Kaggle late submission. The GBDT-only route (solid arrows) is runnable today; the dashed image branch is the next step.

To obtain a genuine private score:

1. Train and persist per-fold image checkpoints (src/vision/train.py). This is the only missing artifact; OOF probs/embeddings already exist.
2. Upload the trained models (GBDT joblib + image checkpoints) as a private Kaggle dataset.
3. Author an inference notebook that attaches the competition data plus the model dataset, runs predict_test(...) with --image-scores, and writes the 2-column submission.csv.
4. Submit as a late submission: Kaggle runs it on the hidden test set and returns the private pAUC, projected to land in the 0.155–0.165 band.

The tabular-only path (steps 2–4 without the image branch) is runnable immediately and is the ready-today deliverable; the stack path is a fully specified follow-up.
