Operational runbook for the EcoMarineAI inference service. Read this when you’re:
Source of truth: models/registry.json.
    {
      "schema_version": "1.0",
      "active": "effb4_15k",
      "models": [
        {
          "name": "effb4_15k",
          "display_name": "EfficientNet-B4 ArcFace 13 837-class",
          "version": "1.0.0",
          "architecture": "CetaceanIdentificationModel",
          "weights_url": "https://huggingface.co/0x0000dead/ecomarineai-cetacean-effb4/resolve/main/efficientnet_b4_512_fold0.ckpt",
          "sha256": null,
          "trained_at": "2022-04-18",
          "metrics_snapshot": "reports/metrics_baseline.json"
        },
        ...
      ]
    }
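As a rough illustration of how the `active` pointer selects an entry, here is a minimal sketch. It works on a trimmed copy of the registry above; `resolve_active` is a hypothetical helper written for this example and is not taken from the service code.

```python
import json

# Trimmed copy of the registry shown above; only the fields used here.
REGISTRY_JSON = """
{
  "schema_version": "1.0",
  "active": "effb4_15k",
  "models": [
    {"name": "effb4_15k", "version": "1.0.0", "sha256": null}
  ]
}
"""


def resolve_active(registry: dict) -> dict:
    """Return the model entry whose name matches registry["active"].

    Hypothetical helper: the real loader may behave differently.
    """
    by_name = {model["name"]: model for model in registry["models"]}
    try:
        return by_name[registry["active"]]
    except KeyError:
        raise ValueError(
            f'active model {registry["active"]!r} is not listed in "models"'
        ) from None


active = resolve_active(json.loads(REGISTRY_JSON))
print(active["name"], active["version"])
```

Failing loudly when `active` names a missing entry is a design choice of this sketch: a typo in the registry then surfaces at startup rather than at the first prediction.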
1. Train the new model (whales_identify/train.py or an external notebook).
2. Save the checkpoint under a versioned name, e.g. efficientnet_b4_v1.1.0.ckpt.
3. Run python3 scripts/compute_metrics.py. Confirm TNR ≥ 0.90, TPR ≥ 0.85.
4. Add an entry to models/registry.json (increment version, update weights_url).
5. Update scripts/download_models.sh to match.
6. Check .github/workflows/metrics.yml; it runs the regression gate vs metrics_baseline.json.
7. Set "active" in registry.json and merge.

To roll back, change "active" back to the previous entry in registry.json and commit. No code change is needed. Restart the pods.
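The acceptance thresholds above (TNR ≥ 0.90, TPR ≥ 0.85) can be checked with a few lines of arithmetic. The sketch below assumes the standard definitions TPR = TP / (TP + FN) and TNR = TN / (TN + FP). The function names and the confusion counts are made up for illustration, and the actual output format of scripts/compute_metrics.py is not assumed here.

```python
def rates(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (TPR, TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of real positives that were accepted
    tnr = tn / (tn + fp)  # share of real negatives that were rejected
    return tpr, tnr


def passes_release_gate(tpr: float, tnr: float,
                        min_tpr: float = 0.85, min_tnr: float = 0.90) -> bool:
    """Apply the runbook thresholds: both rates must meet their minimum."""
    return tpr >= min_tpr and tnr >= min_tnr


# Made-up counts, for illustration only.
tpr, tnr = rates(tp=90, fn=10, tn=95, fp=5)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} pass={passes_release_gate(tpr, tnr)}")
```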
Drift signals are exposed via two endpoints, /metrics and /v1/drift-stats:
| Signal | Where | Alarm threshold |
|---|---|---|
| cetacean_score_avg | /metrics gauge | Drops > 0.10 pp from baseline |
| score_mean (window) | /v1/drift-stats | Drops > 0.10 pp from baseline |
| alarms_total | /v1/drift-stats | Non-zero = active drift |
| rejections_by_reason{reason="not_a_marine_mammal"} rate | /metrics | > 50 % of requests = systemic issue |
whales_be_service/src/whales_be_service/monitoring/drift.py keeps a 1 000-sample rolling deque of cetacean_score and probability values. On every prediction, record() appends and — if the window holds ≥ 50 samples AND a baseline is set — checks (baseline - window_mean) ≥ alarm_drop. A positive check logs WARNING and bumps alarms_total.
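The check described above can be sketched as follows. This is a simplified stand-in, not the contents of drift.py: the class name `DriftMonitor` and its constructor parameters are assumptions, and only cetacean_score is tracked here, whereas the real module also keeps probability values.

```python
import logging
from collections import deque

logger = logging.getLogger("drift_sketch")


class DriftMonitor:
    """Rolling-window drift check (illustrative; names are hypothetical)."""

    def __init__(self, baseline=None, alarm_drop=0.10,
                 window=1000, min_samples=50):
        self.baseline = baseline          # None means "no baseline set"
        self.alarm_drop = alarm_drop
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # oldest samples fall off
        self.alarms_total = 0

    def record(self, cetacean_score: float) -> bool:
        """Append a score; return True if this call raised an alarm."""
        self.scores.append(cetacean_score)
        if self.baseline is None or len(self.scores) < self.min_samples:
            return False
        window_mean = sum(self.scores) / len(self.scores)
        if self.baseline - window_mean >= self.alarm_drop:
            logger.warning("drift: window mean %.3f vs baseline %.3f",
                           window_mean, self.baseline)
            self.alarms_total += 1
            return True
        return False


monitor = DriftMonitor(baseline=0.90)
fired = [monitor.record(0.70) for _ in range(60)]
print("first alarm at sample", fired.index(True) + 1)
```

In this sketch, as in the description above, every record() call that fails the check bumps alarms_total, so a sustained drop produces a growing counter rather than a single event.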
Baseline is not set automatically (to avoid silent drift acceptance). To set it, parse metrics_baseline.json and pass it at pipeline construction. See whales_be_service/src/whales_be_service/inference/registry.py.
rejections_by_reason — if not_a_marine_mammal is suddenly dominant, either:
- Run python3 scripts/calibrate_clip_threshold.py on the latest data/test_split/. Does it pick a very different threshold?
- Update configs/anti_fraud_threshold.yaml and restart.
- Extend data/test_split/ so the threshold can be recalibrated on realistic data.

To pin the backend to a published image tag:

    docker pull ghcr.io/0x0000dead/ecomarine-backend:v1.0.0
    docker compose up -d backend

To revert a bad commit and redeploy:

    git revert <bad commit>
    git push
    # wait for CI + new image build
    kubectl rollout restart deployment/ecomarine-backend
If the gate is misbehaving and rejecting everything, set threshold: 0.0 in configs/anti_fraud_threshold.yaml and restart. This effectively accepts everything. Treat as a fire-drill measure — re-calibrate within 24 h.
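To see why a threshold of 0.0 disables the gate, consider the sketch below. It assumes the gate accepts an image when its score is greater than or equal to the threshold and that scores lie in [0, 1]. Both are assumptions about the service, and `gate_accepts` is a hypothetical name.

```python
def gate_accepts(cetacean_score: float, threshold: float) -> bool:
    # Assumed rule: accept when the score reaches the threshold.
    return cetacean_score >= threshold


scores = [0.0, 0.03, 0.42, 0.97]  # made-up scores in [0, 1]

# With a mid-range threshold some inputs are rejected ...
print([gate_accepts(s, 0.5) for s in scores])
# ... while threshold 0.0 lets every non-negative score through.
print(all(gate_accepts(s, 0.0) for s in scores))
```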
availability_percent < 95 %

Check errors_total rate. Common causes:

- Check docker-entrypoint.sh logs. Fallback: pre-populate /app/src/whales_be_service/models/ via a PVC.

p95 latency > 2 s

- Scale out (kubectl scale --replicas=4).
- torch.cuda.is_available() may have returned False; check driver logs.

Remember: the identification model knows 30 species but only ~13 837 individuals. For unseen individuals the top-1 prediction is unreliable. Tell the user to rely on id_animal (species) rather than class_animal (individual). The cetacean_score value is the “is this a cetacean at all” signal.
| Release type | Frequency | Approval |
|---|---|---|
| Patch (bugfix) | On demand | 1 reviewer, CI green |
| Minor (new feature) | Bi-weekly | 1 reviewer + metrics CI |
| Major (model v2.0) | Quarterly | 2 reviewers + manual QA |
| Hotfix | Within 1 h | 1 reviewer, skipping labs |
Every prediction is reproducible given:
- version (from /metrics or model_version field in the response).
- threshold (from configs/anti_fraud_threshold.yaml).

We do not log input images by default (privacy + storage). If a reviewer needs to replay a specific case, they can:
- Call /v1/predict-single against the pinned version.
- Compare against metrics_baseline.json, and replay the calibration numbers.

Secrets and credentials:

- Kaggle credentials: ~/.kaggle/kaggle.json on developer machines and CI secrets on GitHub. Never committed.
- Hugging Face token for 0x0000dead/ecomarineai-cetacean-effb4. Stored in GH secrets as HF_TOKEN. The service itself only reads from the repo (public), so no token is needed at runtime.
- If a secret is ever committed, purge it from history (git filter-repo).

Recommended Grafana panels:
- availability_percent over 7 days, threshold line at 95%.
- rate(predictions_total[5m]) + rate(rejections_total[5m]) stacked.
- latency_avg_ms + histogram of request durations.
- rejections_by_reason pie chart.
- cetacean_score_avg with a baseline reference line.
- rate(errors_total[5m]) alert-backed.