flowchart LR
A["Trained models<br/>GBDT boosters (done)<br/>+ per-fold image ckpts (next)"] --> B
B["Upload as a<br/>Kaggle Dataset"] --> C
C["Inference notebook<br/>attach competition + dataset<br/>load models, predict hidden test<br/>rank-blend GBDT + image"] --> D
D["Write submission.csv<br/>(isic_id, target)"] --> E["Submit (late)<br/>→ real private pAUC"]
A -. "image branch<br/>needs saved ckpts" .-> C
style A fill:#0b6e4f,color:#fff
style E fill:#1e8449,color:#fff
Submission & Leaderboard
What a Kaggle CODE competition submits, what was produced, and where the constrained score sits
ISIC-2024 was a Kaggle CODE competition: submission is a notebook that Kaggle re-runs on a hidden test set, not a CSV of predictions. The public test shipped with the data is a 3-row placeholder, so a private score cannot be computed locally. This page states what can be produced today, what the CV implies, and the path to a real late-submission score.
How scoring works
The placeholder held on disk, verified directly:
| Artifact | Local | On Kaggle |
|---|---|---|
| data/test-image.hdf5 | ≈ 11 KB, 3 images (ISIC_0015657, ISIC_0015729, ISIC_0015740) | tens of thousands of hidden crops |
| data/test-metadata.csv | 3 rows (header + 3) | the full hidden cohort’s metadata |
| data/sample_submission.csv | 3 rows, constant target = 0.3 | format template only |
A local run can therefore only confirm that the inference pipeline is wired correctly and emits a format-valid file.
What was produced
A format-valid submission.csv was generated from the saved bagged GBDT, 50 boosters in total (5 folds × (5 LightGBM seeds + 5 CatBoost seeds), persisted in experiments/gbdt_boosters.joblib), via a single command:
python -m src.submit # GBDT-only; reads experiments/gbdt_boosters.joblib

Inference mirrors the leak-free training transform: every booster carries its own per-fold feature state, predict_gbdt re-applies that state before predicting, and the per-fold scores are averaged. The output is the required two-column schema:
isic_id,target
ISIC_0015657,0.41827...
ISIC_0015729,0.06913...
ISIC_0015740,0.22540...
| Column | Type | Meaning |
|---|---|---|
| isic_id | string | lesion-crop identifier, one row per test image |
| target | float | predicted malignancy score (rank-comparable; pAUC is rank-based, so calibration is irrelevant) |
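The two-column schema above can be checked mechanically before uploading. The following is a minimal sketch, not part of src/submit.py: the function name `validate_submission` and its checks are illustrative assumptions about what "format-valid" should mean, and Kaggle's own validation may differ.

```python
import csv
import io
import math
from typing import Iterable, Optional


def validate_submission(text: str, expected_ids: Optional[Iterable[str]] = None) -> list[str]:
    """Return a list of problems found in a submission CSV (empty list = looks valid).

    Hypothetical helper: checks the header, field count, numeric targets,
    duplicate ids, and (optionally) that the ids match the test metadata.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != ["isic_id", "target"]:
        return ["header must be exactly: isic_id,target"]

    problems: list[str] = []
    seen: set[str] = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, got {len(row)}")
            continue
        isic_id, target = row
        if isic_id in seen:
            problems.append(f"line {line_no}: duplicate isic_id {isic_id}")
        seen.add(isic_id)
        try:
            value = float(target)
        except ValueError:
            problems.append(f"line {line_no}: target is not a number")
            continue
        if not math.isfinite(value):
            problems.append(f"line {line_no}: target is not finite")

    if expected_ids is not None and seen != set(expected_ids):
        problems.append("isic_id set does not match the test metadata")
    return problems


# Illustrative values only; these are not the scores produced by the pipeline.
sample = "isic_id,target\nISIC_0015657,0.4\nISIC_0015729,0.1\nISIC_0015740,0.2\n"
print(validate_submission(sample))
```

Passing the ids from data/test-metadata.csv as `expected_ids` would also catch a missing or extra row.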
Scope of the validation. Running python -m src.submit on the 3-row placeholder confirms the inference path end-to-end: boosters load, the fold-local feature transform replays without leakage, and the writer emits a valid 2-column file. It does not produce a real score, which requires Kaggle to run the notebook on the hidden test (a late submission, since the competition is closed).
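The fold-local replay described above can be illustrated in a few lines. This is a sketch under stated assumptions, not the code in src/submit.py: `FoldBooster`, `transform`, and `predict_bagged` are hypothetical names, and a per-fold dictionary of feature means stands in for whatever state the real boosters carry.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, Sequence

Row = dict[str, float]


@dataclass(frozen=True)
class FoldBooster:
    """Hypothetical container: one model plus the feature state fitted on its own fold."""
    feature_means: dict[str, float]             # stand-in for the per-fold feature state
    predict: Callable[[Sequence[float]], float]  # stand-in for a trained booster


def transform(row: Row, feature_means: dict[str, float]) -> list[float]:
    """Re-apply one fold's stored state; nothing is refitted on the test rows."""
    return [row[name] - mean for name, mean in feature_means.items()]


def predict_bagged(boosters: Sequence[FoldBooster], rows: Sequence[Row]) -> list[float]:
    """Score each row with every booster under that booster's own state, then average."""
    averaged = []
    for row in rows:
        scores = [b.predict(transform(row, b.feature_means)) for b in boosters]
        averaged.append(fmean(scores))
    return averaged


# Toy boosters that simply return their first centered feature.
toy = [
    FoldBooster({"x": 1.0}, lambda feats: feats[0]),
    FoldBooster({"x": 3.0}, lambda feats: feats[0]),
]
print(predict_bagged(toy, [{"x": 2.0}]))
```

The point of the design is that test rows never influence the transform: each booster sees features built only from statistics of its own training fold.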
Estimated private position (from CV)
Because the hidden set cannot be scored offline, the private score is projected from the out-of-fold CV. The folds contain no out-of-distribution data, so the CV figure is marked down by the CV → private drop the field observed on this task, roughly 0.013–0.021.
A range is reported, not a rank: fold variance (~±0.008 std) and the unknown shake-up make any single claimed position unreliable.
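The arithmetic behind such a projection is a subtraction of the observed drop from the CV score. A minimal sketch follows; the function name `project_private` and the CV value 0.176 are hypothetical placeholders (this page does not state the CV figure), and only the 0.013–0.021 drop comes from the text above. The sketch ignores fold variance, so it does not reproduce the band reported here.

```python
def project_private(cv_pauc: float, drop_low: float = 0.013, drop_high: float = 0.021) -> tuple[float, float]:
    """Return (pessimistic, optimistic) private estimates from a CV score.

    The larger drop gives the lower bound, the smaller drop the upper bound.
    """
    if not 0.0 <= drop_low <= drop_high:
        raise ValueError("expected 0 <= drop_low <= drop_high")
    return cv_pauc - drop_high, cv_pauc - drop_low


# 0.176 is a made-up CV score used purely for illustration.
low, high = project_private(0.176)
print(f"projected private pAUC: {low:.3f} to {high:.3f}")
```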
Leaderboard context (~2,700 teams)
1st place — private pAUC 0.17264. Used external ISIC-archive dermoscopy data and ~30,000 Stable-Diffusion-synthesized malignant lesions, both banned here. Their own ablation: the 30k synthetic lesions added only +0.0007 pAUC.
Best no-external-data solutions sat around 0.16 private; a clean tabular-only team reached 0.162. This is the bracket the present work competes in, and the estimated private ≈ 0.155–0.165 lands inside it.
| Reference point | Private pAUC | External data? | Synthetic positives? |
|---|---|---|---|
| 1st place (unconstrained) | 0.17264 | yes | yes (≈ 30k) |
| Best no-external solutions | ≈ 0.16 | no | varies |
| Clean tabular-only team | 0.162 | no | no |
| Ours — estimated private (stack) | ≈ 0.155 – 0.165 | no | no |
| Random baseline | ≈ 0.02 | — | — |
The estimated private ≈ 0.155–0.165 sits in the silver/gold zone and is SOTA-class among no-external / no-synthetic / single-dataset solutions, the class the hard rules define. The unconstrained 0.17264 relied on the two banned ingredients, so it is not a like-for-like target under these rules. See Scope.
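To put the scale of the numbers in the table in context, the sketch below computes a partial AUC in plain Python. It assumes pAUC here means the area under the ROC curve above an 80% true-positive rate, which caps the score at 0.2 and gives the ≈ 0.02 random baseline; the function name and the `min_tpr` default are illustrative, and this is not the official scorer.

```python
from typing import Sequence


def pauc_above_tpr(y_true: Sequence[int], y_score: Sequence[float], min_tpr: float = 0.8) -> float:
    """Area between the ROC curve and the line TPR = min_tpr (at most 1 - min_tpr)."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Walk thresholds from high to low; tied scores move along the ROC together.
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    # Trapezoid rule on the part of each ROC segment that lies above min_tpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr or x1 == x0:
            continue
        if y0 < min_tpr:
            # Clip the segment at the point where it crosses TPR = min_tpr.
            x0 = x0 + (min_tpr - y0) * (x1 - x0) / (y1 - y0)
            y0 = min_tpr
        area += (x1 - x0) * ((y0 + y1) / 2.0 - min_tpr)
    return area


labels = [0, 1, 0, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.30]
# Any strictly increasing rescaling leaves the result unchanged (rank-based metric).
print(pauc_above_tpr(labels, scores) == pauc_above_tpr(labels, [100 * s for s in scores]))
```

Because only the ordering of the scores enters the computation, rescaling or recalibrating the `target` column cannot change the result.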
Stack vs. tabular submission
The tabular GBDT submission is ready today; the full stack submission is not.
A stack submission must run image inference on the hidden test set, which requires the trained per-fold image checkpoints. The image expert’s OOF probabilities and embeddings are persisted (enough to score and ablate the stack on CV), but the per-fold model weights are not. Therefore:
- Ready now: GBDT-only submission.csv (python -m src.submit), validating the full inference path on the placeholder.
- Next step: a stack notebook that loads saved image checkpoints, scores the hidden test, and rank-blends with the GBDT. src/submit.py exposes the hook (--image-scores ... --image-weight ...); it is left unwired rather than fabricating image predictions until the checkpoints exist.
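Rank-blending, as named in the next step above, replaces each model's scores with their ranks before averaging, so the two branches combine on a common scale regardless of calibration. The sketch below is one plausible form of it and is not taken from src/submit.py: `rank01` and `rank_blend` are hypothetical names, and `image_weight` only mirrors the `--image-weight` flag by assumption.

```python
from typing import Sequence


def rank01(scores: Sequence[float]) -> list[float]:
    """Map scores to average ranks scaled into [0, 1]; tied scores share a rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda idx: scores[idx])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2.0  # mean 0-based position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    denom = max(n - 1, 1)
    return [r / denom for r in ranks]


def rank_blend(gbdt: Sequence[float], image: Sequence[float], image_weight: float = 0.5) -> list[float]:
    """Weighted average of the two models' scaled ranks."""
    if len(gbdt) != len(image):
        raise ValueError("both score vectors must cover the same rows")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must lie in [0, 1]")
    gbdt_ranks, image_ranks = rank01(gbdt), rank01(image)
    return [(1.0 - image_weight) * g + image_weight * m for g, m in zip(gbdt_ranks, image_ranks)]


# Toy scores on deliberately different scales.
print(rank_blend([0.4, 0.1, 0.2], [9.0, 2.0, 1.0], image_weight=0.25))
```

With `image_weight=0.0` the blend reduces to the GBDT ranking, which yields the same pAUC as the raw GBDT scores because the metric is rank-based.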
The late-submission path
To obtain a genuine private score:
1. Train + persist per-fold image checkpoints (src/vision/train.py), the only missing artifact; OOF probs/embeddings already exist.
2. Upload the trained models (GBDT joblib + image checkpoints) as a private Kaggle dataset.
3. Author an inference notebook that attaches the competition data plus the model dataset, runs predict_test(...) with --image-scores, and writes the 2-column submission.csv.
4. Submit as a late submission: Kaggle runs it on the hidden test and returns the private pAUC, projected to land in the 0.155–0.165 band.
The tabular-only path (steps 2–4 without the image branch) is runnable immediately and is the ready-today deliverable; the stack path is a fully-specified follow-up.
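The last action of either notebook is writing the file. A minimal sketch of that step, assuming the scores are already computed; `write_submission` is a hypothetical helper and does not reflect how src/submit.py or predict_test(...) are actually written.

```python
import csv
import io
from typing import Iterable, TextIO


def write_submission(isic_ids: Iterable[str], scores: Iterable[float], handle: TextIO) -> int:
    """Write the isic_id,target CSV to an open text handle; return the number of data rows."""
    ids = list(isic_ids)
    values = [float(s) for s in scores]
    if len(ids) != len(values):
        raise ValueError("one score is required per isic_id")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["isic_id", "target"])
    writer.writerows(zip(ids, values))
    return len(ids)


# In-memory demonstration with illustrative scores.
buffer = io.StringIO()
write_submission(["ISIC_0015657", "ISIC_0015729"], [0.25, 0.75], buffer)
print(buffer.getvalue())
```

In a Kaggle notebook the handle would be `open("submission.csv", "w", newline="")` in the working directory, with attached datasets read from under /kaggle/input.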