Short answers to the questions we get most often. Longer answers link to the relevant docs.
A free, open-source AI library and web service for identifying individual whales and dolphins from aerial photographs. See SOLUTION_OVERVIEW.md.
For research and conservation work, yes. The anti-fraud gate reaches TNR = 90.2% and sensitivity = 95.0%, and the project offers linear scalability (R² = 1.000), real measured metrics, and a full Docker deployment. For safety-critical decisions affecting endangered-species populations, use it as input to a biologist-reviewed workflow, not as a final oracle. See “Область применения и ограничения” (“Scope and limitations”) in RESEARCH_NOTES.md §7.
Source code is MIT — yes, free forever. Model weights inherit CC-BY-NC-4.0 from the Happy Whale training data, which means non-commercial use only. Commercial users must obtain a separate licence from Happy Whale. See COMPLIANCE.md §2.
@software{ecomarineai2025,
title = {EcoMarineAI: Open Library for Cetacean Identification from Aerial Imagery},
author = {Baltsat, K.I. and Tarasov, A.A. and Vandanov, S.A. and Serov, A.I.},
year = {2025},
url = {https://github.com/0x0000dead/whales-identification}
}
Also cite the Happy Whale dataset (CC-BY-NC-4.0) and, if relevant, the upstream checkpoint author (ktakita/happywhale-exp004-effb4-trainall).
git clone + docker compose up --build. See README.md or DEPLOYMENT.md.
No. CPU-only inference gives p95 latency ≈ 540 ms, which is 12× under the budget set in the technical specification (TZ). A GPU brings it down to ~25 ms if you need higher throughput.
models/).
Yes to all three. Docker Desktop works on all; for native Linux you can run poetry install directly.
docker compose up fail on first boot?
Most common causes:
docker compose down, change the mapping in docker-compose.yml.
HTTPS_PROXY or pre-populate models/ manually via scripts/download_models.sh.
30 cetacean species. Full list in MODEL_CARD.md. The species with the most training data are humpback whale, bottlenose dolphin, and blue whale.
13 837 unique individuals. These are the animals the model saw during training. It can recognise them again in new photos with top-1 accuracy that varies by species (humpbacks: high; minkes: lower due to fewer training examples).
The model still predicts a species (because species-level features generalise), but the individual_id will point to the closest known match. In that case, trust id_animal (the species name) more than class_animal (the individual match), and check probability: low values (<0.1) signal an “unseen individual”.
The CLIP anti-fraud gate rejects it with rejected: true, rejection_reason: "not_a_marine_mammal". Returns HTTP 200 (a rejection is a valid classification outcome). See API_REFERENCE.md.
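Because a rejection comes back with HTTP 200, client code has to inspect the response body rather than the status code. The following is a minimal sketch, assuming the body is JSON carrying the rejected and rejection_reason fields described above. The helper name is_rejected is hypothetical, and the sample payloads are illustrative rather than captured from a running service.

```python
import json
from typing import Optional, Tuple


def is_rejected(body: str) -> Tuple[bool, Optional[str]]:
    """Return (rejected, reason) for a JSON response body.

    Assumes the `rejected` / `rejection_reason` fields named in the FAQ.
    A missing `rejected` key is treated as "not rejected" (an assumption).
    """
    payload = json.loads(body)
    rejected = bool(payload.get("rejected", False))
    reason = payload.get("rejection_reason") if rejected else None
    return rejected, reason


# Illustrative payloads only; real responses carry more fields.
rejected_body = '{"rejected": true, "rejection_reason": "not_a_marine_mammal"}'
accepted_body = '{"rejected": false, "id_animal": "humpback_whale"}'

print(is_rejected(rejected_body))
print(is_rejected(accepted_body))
```

Checking the body first keeps a rejected upload from being logged as a successful identification just because the status code was 200.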
Look at cetacean_score. If it’s between 0.2 and 0.5, the image is borderline (heavy crop, low light, partial animal). Try:
If cetacean_score is very high but the identification confidence is low, rely on the species name (id_animal) rather than the individual ID.
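The two rules of thumb above can be combined into a small triage helper. This is only a sketch: the function name and the returned labels are hypothetical, the 0.2 to 0.5 band and the 0.1 probability cut-off are taken from this FAQ, and treating a score below 0.2 as a separate "low_gate_score" case is an assumption.

```python
def triage(cetacean_score: float, probability: float) -> str:
    """Map the gate score and identification probability to a coarse label.

    Thresholds follow the FAQ text: a cetacean_score of 0.2-0.5 is
    borderline, and probability < 0.1 suggests an unseen individual.
    """
    if cetacean_score < 0.2:
        # Assumption: scores this low are handled separately by the caller.
        return "low_gate_score"
    if cetacean_score <= 0.5:
        return "borderline_image"
    if probability < 0.1:
        # Confident it is a cetacean, but not a known individual:
        # rely on the species name (id_animal) only.
        return "species_only"
    return "individual_match"


print(triage(0.35, 0.80))  # borderline_image
print(triage(0.90, 0.05))  # species_only
print(triage(0.90, 0.70))  # individual_match
```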
Yes — lower the threshold in whales_be_service/src/whales_be_service/configs/anti_fraud_threshold.yaml. For example, 0.2 accepts more borderline photos at the cost of specificity. Re-run scripts/calibrate_clip_threshold.py after adding more real examples to data/test_split/.
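To see why a lower threshold trades specificity for sensitivity, it can help to compute both on a labelled set of scores. The sketch below is not the logic of scripts/calibrate_clip_threshold.py; the function name, the made-up scores, and the assumption that an image is accepted when its score is greater than or equal to the threshold are all illustrative.

```python
from typing import Sequence, Tuple


def gate_metrics(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for a given acceptance threshold.

    labels[i] is True for a real cetacean photo. An image is accepted
    when score >= threshold (an assumption about the gate's comparison).
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y and s >= threshold)
    fn = sum(1 for s, y in pairs if y and s < threshold)
    tn = sum(1 for s, y in pairs if not y and s < threshold)
    fp = sum(1 for s, y in pairs if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up scores: four cetacean photos followed by four non-cetacean photos.
scores = [0.90, 0.70, 0.45, 0.25, 0.40, 0.30, 0.15, 0.05]
labels = [True, True, True, True, False, False, False, False]

for t in (0.5, 0.2):
    sens, spec = gate_metrics(scores, labels, t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, dropping the threshold from 0.5 to 0.2 raises sensitivity from 0.50 to 1.00 and lowers specificity from 1.00 to 0.50.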
individual_id values.
whales_identify/train.py); no backbone retraining is needed if the new species is visually close to existing ones.
models/registry.json.
See ROADMAP.md for the long-term extensibility plan.
The publicly verifiable portion is 51 034 images of 15 587 individuals from the Happy Whale Kaggle competition. The technical specification (ТЗ) also cites an additional private subset from the Ministry of Natural Resources of the Russian Federation (~29 k images). The currently deployed EfficientNet-B4 checkpoint was trained on the public Happy Whale set only (fold 0); see MODEL_CARD.md §Training Data for the full breakdown. The 80 k aggregate in the ТЗ refers to the combined Happy Whale + Ministry RF corpus.
Yes, the public part from Kaggle: https://www.kaggle.com/competitions/happy-whale-and-dolphin/data — requires a Kaggle account and accepting the competition rules. The in-repo scripts/populate_test_split.py reproduces a 100-positive subset for evaluation.
The Ministry RF dataset is covered by the grant agreement and cannot be redistributed (terms in LICENSE_DATA.md §2). As a practical consequence:
Absolutely — the pipeline is data-agnostic. Train a new ArcFace head on your data, upload weights to HuggingFace, set HF_REPO=yourorg/yourweights in the Docker environment, and the service picks up your model on next restart.
Use integrations/sqlite_sink.py for SQLite or integrations/postgres_sink.py for PostgreSQL. See INTEGRATION_GUIDE.md.
Yes. The API is plain HTTP + multipart form data. Examples for curl, Python, JavaScript, and R are in API_REFERENCE.md and INTEGRATION_GUIDE.md.
Scrape /metrics with Prometheus (it’s already Prometheus-formatted), visualize in Grafana. Recommended panels in DEPLOYMENT.md §”Monitoring stack” and MLOPS_PLAYBOOK.md §8.
Not directly yet — but the SQLite schema is Darwin Core-friendly. A darwin_core_sink.py is planned for Q3 2026 (see ROADMAP.md).
Per-image p95 latency of 540 ms on a single CPU worker. With 4 workers on a 4-core VM you get ~7 images/second. With a GPU (T4 class) it’s ~40 images/second per worker. See PERFORMANCE_REPORT.md §3 and §8.
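The throughput figures follow from simple arithmetic on the per-image latency. A rough sketch, assuming each worker processes one image at a time and that workers do not contend for resources; using the p95 latency as the per-image time makes this a conservative estimate, and the function name is hypothetical.

```python
def throughput(latency_ms: float, workers: int = 1) -> float:
    """Estimated images/second if every worker handles one image at a time."""
    return workers * 1000.0 / latency_ms


cpu = throughput(540, workers=4)  # 4 CPU workers at 540 ms per image
gpu = throughput(25, workers=1)   # 1 GPU worker at ~25 ms per image

print(f"CPU, 4 workers: {cpu:.1f} images/s")
print(f"GPU, 1 worker:  {gpu:.1f} images/s")
```

This gives about 7.4 images/second for the 4-worker CPU setup and 40 images/second for one GPU worker, in line with the figures quoted above.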
Yes — ZIP them and POST to /v1/predict-batch. The whole batch counts as one rate-limited request.
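A minimal sketch of the batch flow: build the ZIP in memory, then POST it as multipart form data. The /v1/predict-batch path comes from the text above; the form field name "file", the base URL, the timeout, and both function names are assumptions, so check API_REFERENCE.md for the actual contract. The upload helper uses the third-party requests package and is defined here but not called.

```python
import io
import zipfile
from typing import Dict


def build_zip(images: Dict[str, bytes]) -> bytes:
    """Pack {filename: raw image bytes} into an in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in images.items():
            zf.writestr(name, data)
    return buf.getvalue()


def post_batch(base_url: str, archive: bytes):
    """Upload the archive; the field name "file" is an assumption."""
    import requests  # third-party: pip install requests

    return requests.post(
        f"{base_url}/v1/predict-batch",
        files={"file": ("batch.zip", archive, "application/zip")},
        timeout=120,
    )


# Placeholder bytes stand in for real JPEG contents.
archive = build_zip({"whale_001.jpg": b"\xff\xd8", "whale_002.jpg": b"\xff\xd8"})
print(len(archive), "bytes in archive")
# response = post_batch("http://localhost:8000", archive)  # hypothetical URL
```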
Yes. scripts/benchmark_scalability.py sweeps 10/25/50/100 images and fits a linear regression with R² = 1.000. Slope ≈ 482 ms/image.
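For readers who want to check the slope and R² on their own timings, an ordinary least-squares fit takes a few lines. This is an illustrative sketch, not the code in scripts/benchmark_scalability.py: the timings below are synthetic, generated from the 482 ms/image slope plus a hypothetical 300 ms fixed overhead, so a perfect fit is expected by construction.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


batch_sizes = [10, 25, 50, 100]
# Synthetic totals in ms: 482 ms/image plus a hypothetical 300 ms overhead.
total_ms = [482 * n + 300 for n in batch_sizes]

slope, intercept, r2 = fit_line(batch_sizes, total_ms)
print(f"slope={slope:.1f} ms/image, intercept={intercept:.1f} ms, R^2={r2:.3f}")
```

With measured timings, an R² close to 1 indicates that total processing time grows in proportion to batch size, which is what linear scalability means here.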
~1.5 GB RSS after warmup (CLIP ~720 MB + EffB4 ~540 MB + overhead). See PERFORMANCE_REPORT.md §7.
ModuleNotFoundError: open_clip_torch
The anti-fraud gate falls back to permissive mode (lets everything through) with an ERROR log. Install with pip install open-clip-torch; it is already listed in whales_be_service/pyproject.toml.
rembg crashes on Python 3.14
Known upstream bug (rembg calls sys.exit(1) during import on 3.14). The service catches this and returns mask: null. Either downgrade to Python 3.11 or just ignore masks.
That’s the cold-load path — CLIP + EffB4 load on demand the first time. Subsequent requests are fast. Warmup happens inside the FastAPI lifespan(), so in Docker the first user request after container start is fast.
Either:
scripts/compute_metrics.py.
cp reports/metrics_latest.json reports/metrics_baseline.json, commit, and retry.
Open an issue at https://github.com/0x0000dead/whales-identification/issues with:
docker compose logs output if relevant.
Fork the repo, create a branch, run pytest -m "not slow" locally, open a PR against main. CI runs lint/unit/security/docker-build. See wiki_content/Contributing.md.
Yes — MIT licence on code, CC-BY-NC-4.0 on data/models for non-commercial academic use. Please cite the project (see “How do I cite it?” above).