Reproducibility

The environment, seeds, commands, and re-verification of each number

Every result is config-driven, single-seeded, and re-runnable from disk. The repro and cv-guardian agents re-derived the headline numbers from the OOF parquet files; the spine test suite passes 9/9.

Environment

Component	Pin
Conda env	`isic2024`
Python	3.12
PyTorch	2.11.0 + cu128 (CUDA 12.8)
GBDTs	LightGBM, CatBoost
Backbones	`timm` (PyTorch Image Models)
Seed	`SEED = 42` everywhere stochastic
Frozen split	`data/folds.parquet` (5 folds, patient-grouped, target-stratified)
Metric	`src/cv.py:pauc_above_tpr` @ 80% TPR, unit-tested to match the vendored official scorer

# create + activate
conda create -y -n isic2024 python=3.12
conda activate isic2024
# GPU build (CUDA 12.8); CPU graders swap the index for .../whl/cpu
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -e ".[dev]"
# one-shot alternative:
# conda env create -f environment.yml

Hardware

The study ran on a single NVIDIA RTX 4070 Ti SUPER (16 GB). It fits within the 2-day budget because of per-epoch negative undersampling:

an “epoch” sees all 393 positives + a fresh ~equal sample of negatives → a few thousand images,
≈ 4.5 s / epoch,
a full 5-fold backbone trains in ≈ 8 minutes.
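The per-epoch undersampling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's sampler: the function name `epoch_indices`, the `neg_per_pos` ratio, and the per-epoch seed derivation (`seed + epoch`) are all hypothetical.

```python
import random

SEED = 42  # the pinned seed from the Environment table


def epoch_indices(labels, epoch, neg_per_pos=1.0, seed=SEED):
    """Indices for one epoch: every positive plus a fresh draw of negatives.

    `neg_per_pos` and the `seed + epoch` derivation are assumptions made
    for this sketch; the source only says the negative sample is "~equal".
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed + epoch)  # new draw each epoch, still reproducible
    n_neg = min(len(neg), round(len(pos) * neg_per_pos))
    chosen = pos + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


# Toy data: 5 positives among 1000 negatives.
labels = [1] * 5 + [0] * 1000
epoch0 = epoch_indices(labels, epoch=0)
epoch1 = epoch_indices(labels, epoch=1)
```

Deriving the generator from the base seed and the epoch number keeps each epoch's negative sample different while leaving the whole schedule re-runnable.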

CPU latency on the frontier is measured single-threaded (src/efficiency.py) so the cost axis reflects a worst-case deployable setting.
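A single-threaded latency measurement of this kind might look like the sketch below. It is not the contents of `src/efficiency.py`; the helper name `median_latency_ms` and its `warmup` and `repeats` defaults are assumptions, and the timed workload is a stand-in for a model's forward pass.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, repeats=20):
    """Median wall-clock latency of `fn()` in milliseconds."""
    for _ in range(warmup):  # warm caches before timing
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# For a PyTorch model, one would typically call torch.set_num_threads(1)
# before timing so the measurement is single-threaded.
latency = median_latency_ms(lambda: sum(range(10_000)))
```

The median is used rather than the mean so that a single scheduler hiccup does not distort the reported cost.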

The recipe

The pipeline is driven by make targets (each wraps a python -m src.* invocation). Once the data is fetched, make folds must run before any model target; no model trains until the no-leak guarantee passes.

# 0. fetch SLICE-3D into data/ (accept the comp rules on Kaggle once, else 403)
make data

# 1. freeze folds + metric sanity check  — ALWAYS FIRST
make folds                                   # python -m src.cv

# 2. tabular expert -> OOF + pAUC (banks ~90% of achievable score)
make gbdt                                     # python -m src.gbdt

# 3. image expert(s) as a stacked feature (one config = one frontier point)
make vision CFG=configs/vision/convnextv2_nano_r224.yaml
#   ... repeat across the 12-backbone sweep ...

# 4. fuse + log the frontier (params / FLOPs / CPU latency next to pAUC)
make frontier                                 # python -m src.stack ; reports/frontier.py

# 5. spine tests: metric anchors, no-leak, official-equivalence  (9/9)
make test

# 6. build this site (renders to ../docs for GitHub Pages)
make site

Each run is logged under experiments/ (one config and one CSV row per run, with fixed seeds), and the frontier is appended to reports/frontier.csv and reports/frontier_cost.csv.
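The no-leak guarantee that gates model training can be illustrated with a small check for patients assigned to more than one fold. This is a sketch only: the `(patient_id, fold)` row shape is an assumption, since the schema of `data/folds.parquet` is not shown here, and `patients_straddling_folds` is a hypothetical helper rather than part of `src/cv.py`.

```python
from collections import defaultdict


def patients_straddling_folds(rows):
    """Map each patient that appears in more than one fold to its folds.

    `rows` is an iterable of (patient_id, fold) pairs; this row shape is
    an assumption made for the sketch.
    """
    folds_by_patient = defaultdict(set)
    for patient_id, fold in rows:
        folds_by_patient[patient_id].add(fold)
    return {p: sorted(f) for p, f in folds_by_patient.items() if len(f) > 1}


clean = [("P1", 0), ("P1", 0), ("P2", 1), ("P3", 2)]
leaky = clean + [("P2", 3)]

clean_report = patients_straddling_folds(clean)  # {}
leaky_report = patients_straddling_folds(leaky)  # {"P2": [1, 3]}
```

An empty result is the passing condition; any entry names a patient whose lesions would otherwise appear on both sides of a train/validation boundary.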

Production and re-verification of each number

Number	Produced by	Re-verified
Tabular GBDT 0.16890	`src/gbdt.py` bagged LGB(×5 seeds) + CatBoost rank-blend	`cv-guardian` recomputed `cv.oof_pauc` on `gbdt_oof.parquet`
Best image 0.15821	`configs/vision/convnextv2_nano_r224.yaml`	recomputed from `convnextv2_nano_r224` OOF parquet
Stack 0.17376	`rank-avg[gbdt, r224]` in `src/stack.py`	reconstruction recomputes to 0.17376 — exact match to `stack_oof.parquet`
Per-fold table	`reports/perf/make_perf.py`	each fold scored independently via `src/cv.py`
Feature importance	mean gain over 25 LGB + 25 CatBoost boosters	normalized + averaged in `make_perf.py`
Costs (params/FLOPs/ms)	`src/efficiency.py`	`reports/frontier_cost.csv` (authoritative)
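For readers unfamiliar with the `rank-avg` blend named in the Stack row, a minimal sketch follows. Whether `src/stack.py` normalizes ranks, breaks ties, or weights its inputs this way is not stated in this section, so the tie handling (mean rank) and the equal weighting here are assumptions, as are the function names.

```python
def average_ranks(values):
    """Ranks scaled to (0, 1]; tied values share their mean rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank / n
        i = j + 1
    return ranks


def rank_average(*prediction_lists):
    """Equal-weight average of per-model ranks, one value per sample."""
    ranked = [average_ranks(p) for p in prediction_lists]
    return [sum(col) / len(col) for col in zip(*ranked)]


gbdt_oof = [0.10, 0.80, 0.30, 0.30]   # toy scores, not project data
image_oof = [0.20, 0.60, 0.90, 0.10]
blend = rank_average(gbdt_oof, image_oof)
# blend == [0.375, 0.875, 0.8125, 0.4375]
```

Averaging ranks instead of raw scores makes the blend insensitive to each model's score calibration, which suits a ranking metric such as pAUC.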

Verification properties

One split, one metric. Every model reads the same frozen data/folds.parquet and is scored by the same src/cv.py, unit-tested to match the official scorer.
No patient straddles folds. Verified by cv-guardian; grouping by patient guards against the ~200-place private-leaderboard shake-up that other teams encountered.
Leak columns dropped. iddx_*, mel_*, lesion_id, tbp_lv_dnn_lesion_confidence never enter any model.
Recomputed from disk. The headline pAUCs were re-derived from the OOF parquet files, not copied from training stdout.
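To make the "one metric" property concrete, here is a dependency-free sketch of a partial AUC above a TPR floor: the area under the ROC curve that lies above the line TPR = 0.80. It is not the code in `src/cv.py` or the official scorer; the name `pauc_above_tpr_sketch` is hypothetical, and the sketch assumes binary labels with both classes present. Under this definition a perfect ranking scores 0.20 and an uninformative one scores 0.02.

```python
def pauc_above_tpr_sketch(y_true, y_score, min_tpr=0.80):
    """Area between the ROC curve and the line TPR = min_tpr."""
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos

    # Build ROC points (FPR, TPR), stepping once per distinct score.
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))

    # Trapezoid-integrate max(TPR - min_tpr, 0) over FPR.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr:
            continue
        if y0 < min_tpr:  # clip the segment where it crosses the floor
            t = (min_tpr - y0) / (y1 - y0)
            x0, y0 = x0 + t * (x1 - x0), min_tpr
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2
    return area


perfect = pauc_above_tpr_sketch([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
chance = pauc_above_tpr_sketch([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])
worst = pauc_above_tpr_sketch([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
```

The three toy calls act as anchors: any reimplementation of the metric should return the same values on them up to floating-point error.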

Determinism caveats

Bit-exact determinism across GPUs and driver versions is not guaranteed for deep-learning training (cuDNN nondeterminism, atomic reductions). To bound the drift, SEED = 42 is fixed, every config is logged, and OOF pAUC is reported to 5 decimals. The GBDT number is fully deterministic from the frozen folds, and the stack is deterministic given the saved OOF files it blends; the image numbers reproduce to within fold variance (std ≈ 0.008, far smaller than the reported effects).
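A typical way to fix the seed "everywhere stochastic" is sketched below. The helper name `seed_everything` is hypothetical and this is not necessarily how the project seeds its runs; the NumPy and PyTorch calls are guarded so the function also works where those libraries are absent. Even with these settings, results can still differ across GPUs and driver versions, as noted above.

```python
import random

SEED = 42


def seed_everything(seed=SEED):
    """Seed the common RNGs; a sketch, not the project's own helper."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    # Prefer deterministic cuDNN kernels over autotuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
first = random.random()
seed_everything()
second = random.random()  # same draw as `first` after re-seeding
```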
