Reproducibility

The environment, seeds, commands, and re-verification of each number

Every result is config-driven, single-seeded, and re-runnable from disk. The repro and cv-guardian agents re-derived the headline numbers from the OOF parquet files; the spine test suite passes 9/9.

Environment

| Component | Pin |
|---|---|
| Conda env | isic2024 |
| Python | 3.12 |
| PyTorch | 2.11.0 + cu128 (CUDA 12.8) |
| GBDTs | LightGBM, CatBoost |
| Backbones | timm (PyTorch Image Models) |
| Seed | SEED = 42 everywhere stochastic |
| Frozen split | data/folds.parquet (5 folds, patient-grouped, target-stratified) |
| Metric | src/cv.py:pauc_above_tpr @ 80% TPR, proven equal to the vendored official scorer |

# create + activate
conda create -y -n isic2024 python=3.12
conda activate isic2024
# GPU build (CUDA 12.8); CPU graders swap the index for .../whl/cpu
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -e ".[dev]"
# one-shot alternative:
# conda env create -f environment.yml
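After installing, the pins in the table can be compared against what is actually importable. The snippet below is a minimal sketch and not part of the repository: `environment_report` is a hypothetical helper, and the package list assumes the PyPI distribution names torch, lightgbm, catboost, and timm.

```python
import sys
from importlib import metadata


def environment_report(packages=("torch", "lightgbm", "catboost", "timm")):
    """Collect the interpreter version and the installed version of each package.

    Packages that are not installed are reported as None instead of raising.
    """
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


report = environment_report()
for component, version in report.items():
    print(f"{component:10s} {version}")
```

A Python entry that does not start with 3.12, or a torch entry of None, would indicate that the isic2024 environment is not the active one.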

Hardware

A single NVIDIA RTX 4070 Ti SUPER (16 GB). The study fits within the 2-day budget because of per-epoch negative undersampling:

  • an “epoch” sees all 393 positives + a fresh ~equal sample of negatives → a few thousand images,
  • ≈ 4.5 s / epoch,
  • a full 5-fold backbone trains in ≈ 8 minutes.
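The per-epoch undersampling described above can be sketched as follows. This is an illustration under stated assumptions, not the project's sampler: `epoch_indices` is a hypothetical name, the negative-to-positive ratio is exposed as a parameter because the exact ratio is not given here, and deriving the RNG from the seed and the epoch number is one possible way to keep each draw fresh yet reproducible.

```python
import random


def epoch_indices(labels, seed, epoch, neg_per_pos=1.0):
    """Return shuffled sample indices for one epoch.

    Every positive is kept; negatives are re-drawn each epoch without
    replacement. neg_per_pos is an assumed knob, not a documented setting.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # A per-epoch RNG: same (seed, epoch) gives the same draw,
    # a different epoch gives a different draw.
    rng = random.Random(seed * 1_000_003 + epoch)
    n_neg = min(len(neg), round(len(pos) * neg_per_pos))
    idx = pos + rng.sample(neg, n_neg)
    rng.shuffle(idx)
    return idx


# Toy data: 5 positives among 100 samples.
labels = [1] * 5 + [0] * 95
e0 = epoch_indices(labels, seed=42, epoch=0)
e1 = epoch_indices(labels, seed=42, epoch=1)
print(len(e0), sorted(i for i in e0 if labels[i] == 1))
```

Because the epoch size depends on the number of positives rather than on the full dataset, training time per epoch stays small even though negatives vastly outnumber positives.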

CPU latency on the frontier is measured single-threaded (src/efficiency.py) so the cost axis reflects a worst-case deployable setting.
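A single-threaded latency measurement can be sketched with the standard library alone. The helper below is hypothetical and is not the contents of src/efficiency.py; the warm-up count, the repeat count, and the use of the median are assumptions. With a PyTorch model one would additionally call `torch.set_num_threads(1)` before timing and run the forward pass under `torch.inference_mode()`.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, repeats=20):
    """Median wall-clock time of fn() in milliseconds.

    Warm-up calls are discarded so one-time costs (allocation, caches)
    do not inflate the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def dummy_forward():
    # Stand-in for a model forward pass on a single image.
    return sum(i * i for i in range(10_000))


latency_ms = median_latency_ms(dummy_forward)
print(f"median latency: {latency_ms:.3f} ms")
```

The median is used here rather than the mean because occasional scheduler stalls would otherwise skew a small sample.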

The recipe

The pipeline is driven by make targets (each wraps a python -m src.* invocation). make folds must run first; no model trains until the no-leak guarantee passes.

# 0. fetch SLICE-3D into data/ (accept the comp rules on Kaggle once, else 403)
make data

# 1. freeze folds + metric sanity check  — ALWAYS FIRST
make folds                                   # python -m src.cv

# 2. tabular expert -> OOF + pAUC (banks ~90% of achievable score)
make gbdt                                     # python -m src.gbdt

# 3. image expert(s) as a stacked feature (one config = one frontier point)
make vision CFG=configs/vision/convnextv2_nano_r224.yaml
#   ... repeat across the 12-backbone sweep ...

# 4. fuse + log the frontier (params / FLOPs / CPU latency next to pAUC)
make frontier                                 # python -m src.stack ; reports/frontier.py

# 5. spine tests: metric anchors, no-leak, official-equivalence  (9/9)
make test

# 6. build this site (renders to ../docs for GitHub Pages)
make site

Each run is logged under experiments/ (one config + one CSV row per run, seeds fixed) and the frontier is appended to reports/frontier.csv / reports/frontier_cost.csv.
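The no-leak guarantee that gates step 1 amounts to checking that no patient appears in more than one fold. A minimal sketch of that check follows; the function name and the (patient_id, fold) row layout are assumptions for illustration, and the real test would read its rows from data/folds.parquet.

```python
from collections import defaultdict


def patients_straddling_folds(rows):
    """Return {patient_id: sorted folds} for patients found in more than one fold.

    rows is an iterable of (patient_id, fold) pairs, one per lesion image.
    An empty result means the split is patient-grouped.
    """
    folds_by_patient = defaultdict(set)
    for patient_id, fold in rows:
        folds_by_patient[patient_id].add(fold)
    return {p: sorted(f) for p, f in folds_by_patient.items() if len(f) > 1}


clean = [("P1", 0), ("P1", 0), ("P2", 1), ("P3", 2)]
leaky = clean + [("P2", 3)]

assert patients_straddling_folds(clean) == {}
print(patients_straddling_folds(leaky))  # {'P2': [1, 3]}
```

Returning the offending patients, rather than a bare boolean, makes a failing test immediately actionable.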

Production and re-verification of each number

| Number | Produced by | Re-verified |
|---|---|---|
| Tabular GBDT 0.16890 | src/gbdt.py bagged LGB(×5 seeds) + CatBoost rank-blend | cv-guardian recomputed cv.oof_pauc on gbdt_oof.parquet |
| Best image 0.15821 | configs/vision/convnextv2_nano_r224.yaml | recomputed from convnextv2_nano_r224 OOF parquet |
| Stack 0.17376 | rank-avg[gbdt, r224] in src/stack.py | reconstruction recomputes to 0.17376, an exact match to stack_oof.parquet |
| Per-fold table | reports/perf/make_perf.py | each fold scored independently via src/cv.py |
| Feature importance | mean gain over 25 LGB + 25 CatBoost boosters | normalized + averaged in make_perf.py |
| Costs (params/FLOPs/ms) | src/efficiency.py | reports/frontier_cost.csv (authoritative) |
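The stack row is a rank average of two out-of-fold prediction vectors. The sketch below shows one common way to compute such a blend; it is not the code in src/stack.py, and the equal weights, the tie handling (tied scores share their mean rank), and the scaling of ranks into [0, 1] are all assumptions.

```python
def rank_normalize(scores):
    """Convert scores to average ranks scaled into [0, 1]."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of items tied with item i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0  # 0-based mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    denom = max(n - 1, 1)
    return [r / denom for r in ranks]


def rank_average(*score_lists):
    """Equal-weight mean of the rank-normalized score vectors."""
    normalized = [rank_normalize(s) for s in score_lists]
    return [sum(column) / len(column) for column in zip(*normalized)]


gbdt = [0.10, 0.80, 0.30, 0.30]   # toy tabular OOF scores
image = [0.20, 0.90, 0.10, 0.40]  # toy image OOF scores
blend = rank_average(gbdt, image)
print(blend)
```

Averaging ranks instead of raw probabilities makes the blend insensitive to the two models' different score calibrations, which suits a ranking metric such as pAUC.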

Tip: Verification properties
  • One split, one metric. Every model reads the same frozen data/folds.parquet and is scored by the same src/cv.py, unit-tested to match the official scorer.
  • No patient straddles folds. Verified by cv-guardian; this property guards against the ~200-place private-leaderboard shake-up other teams encountered.
  • Leak columns dropped. iddx_*, mel_*, lesion_id, tbp_lv_dnn_lesion_confidence never enter any model.
  • Recomputed from disk. The headline pAUCs were re-derived from the OOF parquet files, not copied from training stdout.
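For readers who want to re-derive a headline number by hand, the metric itself can be sketched in pure Python: the area under the ROC curve that lies above the TPR = 0.80 line, so that a perfect ranking scores 0.20. This is an illustrative re-implementation of the definition, not the function in src/cv.py and not the vendored official scorer; tied scores are grouped into one ROC point and the curve is linearly interpolated between points.

```python
def pauc_above_tpr(y_true, y_score, min_tpr=0.80):
    """Area between the ROC curve and the horizontal line TPR = min_tpr."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Build ROC points by sweeping the threshold from high to low scores.
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    # Integrate only the part of each segment that lies above min_tpr.
    area = 0.0
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        if t1 <= min_tpr:
            continue
        if t0 >= min_tpr:
            area += (f1 - f0) * ((t0 + t1) / 2.0 - min_tpr)
        else:
            # Segment crosses the line: keep the triangle above it.
            f_cross = f0 + (min_tpr - t0) / (t1 - t0) * (f1 - f0)
            area += (f1 - f_cross) * (t1 - min_tpr) / 2.0
    return area


perfect = pauc_above_tpr([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
reversed_rank = pauc_above_tpr([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
print(round(perfect, 5), round(reversed_rank, 5))
```

Applied to the labels and predictions stored in an OOF parquet file, a function of this shape is what "recomputed from disk" refers to; equivalence with the official scorer should still be established by the project's own tests rather than assumed from this sketch.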

Determinism caveats

Bit-exact determinism across GPUs and driver versions is not guaranteed for deep-learning training (cuDNN nondeterminism, atomic reductions). SEED = 42 is fixed, every config is logged, and OOF pAUC is reported to 5 decimals. The GBDT and stack numbers are fully deterministic from the frozen folds; the image numbers reproduce to within fold variance (a standard deviation of about 0.008, far smaller than the reported effects).
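Fixing the seed in every stochastic component usually looks like the sketch below. `seed_everything` is a hypothetical helper, not a function from this repository; NumPy and PyTorch are seeded only if they are installed, and the cuDNN flags reduce nondeterminism but do not remove it, consistent with the caveat above.

```python
import os
import random


def seed_everything(seed=42):
    """Seed the Python, NumPy, and PyTorch RNGs where available."""
    random.seed(seed)
    # Only affects hash randomization in subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        # Prefer deterministic cuDNN kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)  # True
```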

