Reproducibility
The environment, seeds, commands, and re-verification of each number
Every result is config-driven, single-seeded, and re-runnable from disk. The repro and cv-guardian agents re-derived the headline numbers from the OOF parquet files; the spine test suite passes 9/9.
Environment
| Component | Pin |
|---|---|
| Conda env | isic2024 |
| Python | 3.12 |
| PyTorch | 2.11.0 + cu128 (CUDA 12.8) |
| GBDTs | LightGBM, CatBoost |
| Backbones | timm (PyTorch Image Models) |
| Seed | SEED = 42 everywhere stochastic |
| Frozen split | data/folds.parquet (5 folds, patient-grouped, target-stratified) |
| Metric | src/cv.py:pauc_above_tpr @ 80% TPR, unit-tested equal to the vendored official scorer |
# create + activate
conda create -y -n isic2024 python=3.12
conda activate isic2024
# GPU build (CUDA 12.8); CPU graders swap the index for .../whl/cpu
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -e ".[dev]"
# one-shot alternative:
# conda env create -f environment.yml
Hardware
A single NVIDIA RTX 4070 Ti SUPER (16 GB). The study fits within the 2-day budget because of per-epoch negative undersampling:
- an “epoch” sees all 393 positives + a fresh ~equal sample of negatives → a few thousand images,
- ≈ 4.5 s / epoch,
- a full 5-fold backbone trains in ≈ 8 minutes.
CPU latency on the frontier is measured single-threaded (src/efficiency.py) so the cost axis reflects a worst-case deployable setting.
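The per-epoch negative undersampling described above can be sketched as follows. This is a minimal illustration, not the repository's sampler: the function name `epoch_indices` and the `neg_per_pos` ratio parameter are hypothetical, introduced only to show how every positive is kept while a fresh subset of negatives is drawn each epoch.

```python
import random


def epoch_indices(labels, rng, neg_per_pos=1.0):
    """Return shuffled sample indices for one epoch.

    Keeps every positive and draws a fresh random subset of negatives.
    `neg_per_pos` is a hypothetical knob (negatives drawn per positive).
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than exist.
    k = min(len(neg), round(len(pos) * neg_per_pos))
    idx = pos + rng.sample(neg, k)
    rng.shuffle(idx)
    return idx


# Toy data: 5 positives followed by 100 negatives.
labels = [1] * 5 + [0] * 100
rng = random.Random(42)
first_epoch = epoch_indices(labels, rng)
second_epoch = epoch_indices(labels, rng)
```

Because the generator state advances between calls, each epoch sees a different draw of negatives, while re-creating the generator from the same seed replays the same sequence of epochs.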
The recipe
The pipeline is driven by make targets (each wraps a python -m src.* invocation). make folds must run first; no model trains until the no-leak guarantee passes.
# 0. fetch SLICE-3D into data/ (accept the comp rules on Kaggle once, else 403)
make data
# 1. freeze folds + metric sanity check — ALWAYS FIRST
make folds # python -m src.cv
# 2. tabular expert -> OOF + pAUC (banks ~90% of achievable score)
make gbdt # python -m src.gbdt
# 3. image expert(s) as a stacked feature (one config = one frontier point)
make vision CFG=configs/vision/convnextv2_nano_r224.yaml
# ... repeat across the 12-backbone sweep ...
# 4. fuse + log the frontier (params / FLOPs / CPU latency next to pAUC)
make frontier # python -m src.stack ; reports/frontier.py
# 5. spine tests: metric anchors, no-leak, official-equivalence (9/9)
make test
# 6. build this site (renders to ../docs for GitHub Pages)
make site
Each run is logged under experiments/ (one config + one CSV row per run, seeds fixed), and the frontier is appended to reports/frontier.csv and reports/frontier_cost.csv.
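The no-leak guarantee that gates step 1 amounts to checking that no patient appears in more than one fold. The sketch below shows one way such a check could look. It is not the code in src/cv.py, and the function name and the `(patient_id, fold)` row layout are assumptions made for illustration.

```python
from collections import defaultdict


def patients_straddling_folds(rows):
    """Map each leaking patient to the sorted list of folds it appears in.

    `rows` is an iterable of (patient_id, fold) pairs; this layout is an
    assumption, not the confirmed schema of data/folds.parquet.
    An empty result means the split is patient-grouped.
    """
    folds_by_patient = defaultdict(set)
    for patient_id, fold in rows:
        folds_by_patient[patient_id].add(fold)
    return {p: sorted(f) for p, f in folds_by_patient.items() if len(f) > 1}


clean = [("P1", 0), ("P1", 0), ("P2", 1), ("P3", 2)]
leaky = clean + [("P2", 0)]

clean_report = patients_straddling_folds(clean)
leaky_report = patients_straddling_folds(leaky)
```

A gate of this kind would fail the build whenever the report is non-empty, so no model could train on a split in which a patient's lesions are divided between training and validation folds.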
Production and re-verification of each number
| Number | Produced by | Re-verified |
|---|---|---|
| Tabular GBDT 0.16890 | src/gbdt.py bagged LGB(×5 seeds) + CatBoost rank-blend | cv-guardian recomputed cv.oof_pauc on gbdt_oof.parquet |
| Best image 0.15821 | configs/vision/convnextv2_nano_r224.yaml | recomputed from convnextv2_nano_r224 OOF parquet |
| Stack 0.17376 | rank-avg[gbdt, r224] in src/stack.py | reconstruction recomputes to 0.17376, an exact match to stack_oof.parquet |
| Per-fold table | reports/perf/make_perf.py | each fold scored independently via src/cv.py |
| Feature importance | mean gain over 25 LGB + 25 CatBoost boosters | normalized + averaged in make_perf.py |
| Costs (params/FLOPs/ms) | src/efficiency.py | reports/frontier_cost.csv (authoritative) |
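The rank-avg entry in the table refers to blending models on the rank scale rather than on raw scores. The following is a minimal sketch of that idea, assuming equal weights, average ranks for ties, and normalization by sample count. The helper names are hypothetical, and src/stack.py may differ in any of these details.

```python
def rank_normalize(scores):
    """Convert scores to average 1-based ranks divided by n, so values lie in (0, 1]."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of items tied with position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


def rank_average(*model_scores):
    """Equal-weight mean of the rank-normalized scores from each model."""
    ranked = [rank_normalize(s) for s in model_scores]
    return [sum(col) / len(col) for col in zip(*ranked)]


# Toy OOF scores for three lesions from two hypothetical experts.
gbdt_oof = [0.1, 0.9, 0.5]
image_oof = [0.2, 0.3, 0.8]
blended = rank_average(gbdt_oof, image_oof)
```

Working on ranks makes the blend insensitive to each model's score calibration, which matters when a GBDT and a CNN output probabilities on very different scales.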
- One split, one metric. Every model reads the same frozen data/folds.parquet and is scored by the same src/cv.py, unit-tested to match the official scorer.
- No patient straddles folds. Verified by cv-guardian; this property prevents the ~200-place private shake-up other teams encountered.
- Leak columns dropped. iddx_*, mel_*, lesion_id, tbp_lv_dnn_lesion_confidence never enter any model.
- Recomputed from disk. The headline pAUCs were re-derived from the OOF parquet files, not copied from training stdout.
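For readers unfamiliar with the metric, the partial AUC above 80% TPR is the area between the ROC curve and the horizontal line TPR = 0.8. The sketch below is a standalone, dependency-free illustration of that geometric quantity. It is neither the repository's `pauc_above_tpr` nor the official scorer, and the name `pauc_above_tpr_sketch` is hypothetical.

```python
def pauc_above_tpr_sketch(y_true, y_score, min_tpr=0.8):
    """Area between the ROC curve and the line TPR = min_tpr (linear interpolation)."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Build ROC points by sweeping the threshold from high to low scores,
    # emitting one point per distinct score so ties move diagonally.
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))

    # Trapezoid rule on the part of each segment that lies above min_tpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr:
            continue
        if y0 < min_tpr:
            # Clip the segment where it crosses the min_tpr line.
            x0 = x0 + (x1 - x0) * (min_tpr - y0) / (y1 - y0)
            y0 = min_tpr
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2.0
    return area


perfect = pauc_above_tpr_sketch([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
all_tied = pauc_above_tpr_sketch([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

Under this definition a perfect ranking scores 0.2 and an all-tied (chance-level) ranking scores 0.02, which gives a scale for reading the headline numbers in the table.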
Determinism caveats
Bit-exact determinism across GPUs and driver versions is not guaranteed for deep-learning training (cuDNN nondeterminism, atomic reductions). SEED = 42 is fixed, every config is logged, and OOF pAUC is reported to 5 decimals. The GBDT and stack numbers are fully deterministic from the frozen folds; the image numbers reproduce to within fold-variance (~±0.008 std, far smaller than the reported effects).
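A typical way to apply a single fixed seed across libraries is sketched below. This is an assumption about how such seeding is commonly done, not a copy of the repository's code. The helper name `seed_everything` is hypothetical, and NumPy and PyTorch are imported lazily so the snippet also runs where they are not installed.

```python
import os
import random

SEED = 42


def seed_everything(seed=SEED):
    """Seed the Python, NumPy, and PyTorch RNGs where available."""
    # Read at interpreter start-up, so this only affects child processes.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # Prefer deterministic cuDNN kernels and disable the autotuner.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
first_draw = random.random()
seed_everything()
second_draw = random.random()
```

Even with these settings, some GPU kernels use atomic reductions whose summation order varies between runs, which is why the image numbers are expected to reproduce only within fold variance.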