Healthcare & Life Sciences

Belgian medical imaging startup, CE-marked diagnostic devices

How We Would Get Diagnostic AI Out of Jupyter Notebooks and Into Hospitals

MLOps · Healthcare · Compliance · AI/ML

<2 weeks

Projected release cycle (was 6–8 weeks)

<15 min

Drift detection alert time

94%

Expected training reproducibility

8 months

Regulatory submission timeline

01 — The challenge

The startup's AI model for detecting early-stage diabetic retinopathy worked beautifully in the lab: 94.7% accuracy on their validation set. In production, it failed silently. Nurses stopped trusting it after 3 months because it flagged healthy patients 12% of the time on real hospital data, a training-serving skew that went undetected until a clinician complained. They were 8 months from a regulatory submission and had no reproducible pipeline, no audit trail, and no way to explain why the model made a specific prediction.

02 — Our approach

Our approach would start with data archaeology, not engineering. We would trace every version of the training dataset, expecting to find preprocessing steps that were changed in a notebook months earlier without version control; the production model may have been trained with a different image normalisation from the one used in the lab. We would then freeze the preprocessing, design a pipeline with versioned datasets and model artifacts, and add a gated release process in which every model must pass A/B testing on anonymised real hospital data before deployment. For compliance, we would configure immutable audit logs for every training run, every inference, and every human override. Explainability would be the hardest requirement: regulators want to know why the model flagged a specific scan, so we would integrate saliency maps that highlight the regions influencing each prediction.
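To make the frozen preprocessing and the audit trail more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the pipeline itself: the `PREPROCESSING` settings, the event names, and the `AuditLog` class are all hypothetical. The idea is to fingerprint the preprocessing configuration so that a silent change shows up as a different hash, and to chain each log entry to the previous one so that later edits become detectable. A real deployment would likely also need write-once storage; hash chaining only makes tampering evident, it does not prevent it.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(obj) -> str:
    """Stable SHA-256 of a JSON-serialisable object (key order is normalised)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class AuditLog:
    """Append-only log in which each entry commits to the hash of the previous one."""
    entries: list = field(default_factory=list)

    def append(self, event: str, details: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {
            "seq": len(self.entries),
            "event": event,
            "details": details,
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": fingerprint(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check the chain from the first entry onwards."""
        prev_hash = "0" * 64
        for seq, entry in enumerate(self.entries):
            body = {k: v for k, v in entry.items() if k != "hash"}
            if (
                entry["seq"] != seq
                or entry["prev_hash"] != prev_hash
                or entry["hash"] != fingerprint(body)
            ):
                return False
            prev_hash = entry["hash"]
        return True


# Hypothetical frozen preprocessing settings; the real values would come from
# the data archaeology step, not from this sketch.
PREPROCESSING = {
    "resize": [512, 512],
    "normalisation": "per_channel_zscore",
    "colour_space": "RGB",
}

log = AuditLog()
log.append("training_run", {
    "dataset_version": "v1",  # hypothetical identifier
    "preprocessing_sha256": fingerprint(PREPROCESSING),
})
log.append("inference", {"scan_id": "example-scan", "model_version": "m1", "flagged": True})
log.append("human_override", {"scan_id": "example-scan", "clinician_decision": "not_flagged"})
```

Calling `log.verify()` returns `True` for an untouched log and `False` once any stored entry has been altered. Because the training entry records the preprocessing hash, a model trained with a different normalisation would carry a different fingerprint from the one used at serving time.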

03 — Expected outcomes

Model release cycle projected to drop from 6–8 weeks to under 2 weeks once the pipeline is operational

Training run reproducibility expected to reach 94%; 100% is unlikely because legacy dependency versions may not be fully recoverable

Drift detection configured to flag likely model regression within 15 minutes of an abnormal input distribution appearing

Privacy-by-design posture aligned with third-party security and privacy review requirements
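The drift detection outcome above could be approached in several ways. One common option is the Population Stability Index (PSI), sketched below in plain Python. This is an assumption about method, not a description of the configured system: the monitored feature (mean pixel intensity per scan), the synthetic data, the bin count, and the 0.25 alert threshold (a widely used rule of thumb) are all illustrative. In practice the check would run on a sliding window of recent scans short enough to meet the alerting target.

```python
import math
import random


def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one scalar feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))


def drift_alert(reference, live, threshold=0.25):
    """True when the live window has drifted past the (assumed) PSI threshold."""
    return psi(reference, live) > threshold


# Synthetic stand-ins for per-scan mean pixel intensity (hypothetical feature).
rng = random.Random(0)
reference = [rng.gauss(0.45, 0.05) for _ in range(5000)]  # training-time distribution
same = [rng.gauss(0.45, 0.05) for _ in range(1000)]       # healthy production window
shifted = [rng.gauss(0.60, 0.05) for _ in range(1000)]    # e.g. a changed normalisation

print("same window alert:", drift_alert(reference, same))
print("shifted window alert:", drift_alert(reference, shifted))
```

A check on a single scalar feature is deliberately simple. It would catch a gross shift such as a changed normalisation, but subtler drift would probably need several features or a comparison of model output distributions.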
