Community Fish Detection Dataset

Overview

This dataset is a curated compilation of fish imagery and annotations from a variety of publicly available datasets, with a focus on object detection. It includes >1.9M images/frames and >935,000 bounding boxes from around 17 individual datasets, spanning freshwater, marine, and lab-based captures.

The original datasets include bounding box annotations and metadata in a variety of formats, and the original labels vary in granularity from species-level identifications to generic “fish” labels. This dataset harmonizes the source metadata into a single COCO-formatted archive, with a single category (“fish”).

License and contact information

Each dataset is licensed under its own terms. Most are under permissive licenses including CC-BY, CC-BY-SA, CC0, or CDLA, but some licenses include an NC restriction. See below for individual dataset license information.

For questions about this dataset, contact Filippo Varini.

Data format

Data is provided in COCO .json format, with a single category (“fish”). In addition to the standard COCO fields, the following fields are included:

  • dataset: the name of the original dataset from which this image (or frame) was collected
  • original_data_source: the filename of this image in the original dataset
  • is_train: indicates whether this image belongs to the training portion of a reference train/val split. The split was selected using location information whenever it was available, so that each location appears in either the training split or the validation split, but not both. Location information was not always available, so there is no guarantee that a given background appears exclusively in one split.
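
As a minimal sketch of how the reference split might be used, the snippet below separates image IDs by the is_train field and collects the matching annotations. It assumes that is_train is stored on each image record and is truthy (for example true or 1) for training images; neither detail is confirmed above. The helper names and the metadata filename in the comment are hypothetical.

```python
import json


def load_coco(path):
    """Load a COCO-formatted .json file into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def split_image_ids(coco):
    """Return (train_ids, val_ids) based on the is_train field.

    Assumption: is_train lives on each image record and is truthy
    for training images; a missing value is treated as validation.
    """
    train_ids, val_ids = set(), set()
    for im in coco["images"]:
        if im.get("is_train"):
            train_ids.add(im["id"])
        else:
            val_ids.add(im["id"])
    return train_ids, val_ids


def annotations_for(coco, image_ids):
    """Return the annotations whose image_id is in image_ids."""
    return [a for a in coco["annotations"] if a["image_id"] in image_ids]


# Toy metadata standing in for the real file, e.g.
# coco = load_coco("metadata.json")  # hypothetical filename
toy = {
    "images": [
        {"id": 1, "file_name": "a.jpg", "is_train": True},
        {"id": 2, "file_name": "b.jpg", "is_train": False},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [5, 5, 20, 10]},
        {"id": 11, "image_id": 2, "category_id": 1, "bbox": [0, 0, 8, 8]},
    ],
    "categories": [{"id": 1, "name": "fish"}],
}
train_ids, val_ids = split_image_ids(toy)
train_annotations = annotations_for(toy, train_ids)
```

If is_train turns out to be stored as a string rather than a boolean or integer, the truthiness check above would need to be replaced with an explicit comparison.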

Downloading the data

Metadata is available here.

Images are available in the following cloud storage folders:

  • gs://public-datasets-lila/community-fish-detection-dataset (GCP)
  • s3://us-west-2.opendata.source.coop/agentmorris/lila-wildlife/community-fish-detection-dataset (AWS)
  • https://lilawildlife.blob.core.windows.net/lila-wildlife/community-fish-detection-dataset (Azure)

We recommend downloading images (the whole folder, or a subset of the folder) using gsutil (for GCP), aws s3 (for AWS), or AzCopy (for Azure). For more information about using gsutil, aws s3, or AzCopy, check out our guidelines for accessing images without using giant zipfiles.

If you prefer to download individual images via http, you can. For example, the thumbnail below appears in the metadata as:

JPEGImages/marine_detect_3759335805_train.jpg

This image can be downloaded directly from any of the following URLs (one for each cloud):
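
The per-cloud URLs can be derived from the storage folders listed above. The sketch below joins a relative path from the metadata onto an HTTP(S) base for each cloud. The Azure base is taken directly from the folder list. The GCP base assumes the usual storage.googleapis.com form for public buckets, and the AWS base assumes an S3 virtual-hosted HTTP endpoint for the bucket named above. Both of those are assumptions rather than URLs stated in this document, and the function name is hypothetical.

```python
from urllib.parse import quote

# Base URLs per cloud. Only the Azure entry is copied from the folder
# list; the GCP and AWS entries are assumed HTTP(S) equivalents of the
# gs:// and s3:// folders.
BASE_URLS = {
    "gcp": "https://storage.googleapis.com/public-datasets-lila/community-fish-detection-dataset",
    "aws": "http://us-west-2.opendata.source.coop.s3.amazonaws.com/agentmorris/lila-wildlife/community-fish-detection-dataset",
    "azure": "https://lilawildlife.blob.core.windows.net/lila-wildlife/community-fish-detection-dataset",
}


def image_urls(relative_path):
    """Map a relative image path from the metadata to one URL per cloud."""
    # quote() percent-encodes unsafe characters but keeps "/" separators.
    rel = quote(relative_path.lstrip("/"))
    return {cloud: f"{base}/{rel}" for cloud, base in BASE_URLS.items()}


urls = image_urls("JPEGImages/marine_detect_3759335805_train.jpg")
for cloud, url in urls.items():
    print(cloud, url)
```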

Having trouble downloading? Check out our FAQ.

Constituent datasets

Brief descriptions and license information for each dataset are included here; for more detailed information and thumbnails from each dataset, see the fish-datasets GitHub repo. The dataset from which each image was drawn is available in the “dataset” field in the .json file.
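
To see how images and boxes are distributed across the constituent datasets, the “dataset” field can be tallied. This is a minimal sketch that assumes the field is stored on each image record; the function name and the dataset names in the toy example are hypothetical and may not match the values used in the real .json file.

```python
from collections import Counter


def per_dataset_counts(coco):
    """Count images and bounding boxes per source dataset.

    Assumption: every image record carries a "dataset" field.
    """
    image_to_dataset = {im["id"]: im["dataset"] for im in coco["images"]}
    image_counts = Counter(image_to_dataset.values())
    box_counts = Counter(
        image_to_dataset[a["image_id"]] for a in coco["annotations"]
    )
    return image_counts, box_counts


# Toy metadata; the dataset names here are placeholders.
toy_metadata = {
    "images": [
        {"id": 1, "dataset": "dataset_a"},
        {"id": 2, "dataset": "dataset_a"},
        {"id": 3, "dataset": "dataset_b"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [0, 0, 4, 4]},
        {"id": 11, "image_id": 1, "bbox": [2, 2, 6, 6]},
        {"id": 12, "image_id": 3, "bbox": [1, 1, 3, 3]},
    ],
}
image_counts, box_counts = per_dataset_counts(toy_metadata)
```

The same mapping from image ID to dataset name could also be used to exclude datasets whose licenses do not fit a given use, such as those with an NC restriction.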

The Brackish Dataset

This dataset includes 89 short videos (~14,764 extracted frames) from brackish water in Denmark, annotated with boxes on fish, crabs and other fauna. This dataset is released under CC BY-SA 4.0.

Coralscapes

This dataset contains 2,027 images captured by diver-borne GoPro cameras from a variety of global coral reefs. This dataset is licensed under Apache 2.0.

Deep Vision Fish Dataset

This dataset is constructed from images collected using the Deep Vision system during two surveys, in 2017 and 2018, that targeted economically important pelagic species. The fish species annotated are blue whiting, Atlantic herring, mesopelagic fishes, and Atlantic mackerel. This dataset is licensed under CC-BY 4.0.

DeepFish

This dataset includes ~40k seafloor images from 20 tropical habitats in northern Australia, aimed at benchmarking habitat‑aware fish detection and segmentation. The original dataset contains a mix of classification, segmentation, and counting labels; only images with segmentation masks were used here, resulting in 310 images with 338 annotations. The subset used here is redistributed under the MIT License.

F4K Detection and Tracking

This dataset includes 17 10-minute videos representing a variety of water conditions, with segmentation masks on a subset of frames. This dataset’s license is not explicitly stated.

FathomNet

FathomNet is an image database containing a variety of marine taxa; the subset used here is intended to capture fish. Specifically, this subset includes descendants of Actinopterygii, Sarcopterygii, Chondrichthyes, and Myxini. This dataset is released under CC-BY-ND 4.0.

FishCLEF‑2015

This dataset includes 93 videos annotated with bounding boxes and fish species (14k annotations on 20k frames). This dataset’s license is not explicitly stated.

Tropical freshwater fish in Northern Australia

This dataset includes 44,112 images with 82,904 bounding box annotations for 23 tropical freshwater fish taxa from northern Australia. Images were derived from Remote Underwater Video (RUV) deployments in deep channel and shallow lowland billabongs in Kakadu National Park, Northern Territory, Australia. RUV deployments were conducted during the Supervising Scientist's annual fish monitoring program in the 2016, 2017, and 2018 recessional flow periods (dry season). This dataset is released under CC-BY 4.0.

Marine Detect

This dataset is itself a composite of several other publicly available datasets; most of the sub-datasets come either from GBIF or from Roboflow Universe. A list of the sub-datasets is available here. Datasets from Roboflow Universe are licensed under CC-BY 4.0; datasets from GBIF are licensed under CC-BY-NC 4.0.

MIT Sea Grant River Herring

This dataset was developed from 1,434 video clips from rivers in Massachusetts, USA, comprising a total of ~262k frames, with ~91k bounding boxes annotated on ~60k of those frames (the remaining ~162k frames were reviewed and classified as empty). This dataset is released under CDLA-permissive 1.0.

Puget Sound Nearshore Fish

This dataset contains 77,739 images sampled from video collected on and around shellfish aquaculture farms in an estuary in the Northeast Pacific, in which 67,990 objects (fish and crustaceans) have been annotated on 30,384 images (the remainder have been annotated as “empty”). This dataset is released under CDLA-permissive 1.0.

Project Natick Underwater Video

This auxiliary release from Microsoft’s “Project Natick” underwater data center experiment provides 1,072 RGB images (with box annotations on fish and squid) recorded by cameras fixed on the outside of the data center pod. This dataset is released as part of an MIT-licensed software project, but the dataset license is not stated explicitly.

Roboflow Fish Dataset

This dataset is a collection of 1,350 images and boxes covering 26 fish categories, sourced from the Web. This dataset is released under CC0.

Salmon Computer Vision

This dataset includes 532,000 frames labeled with bounding boxes on 15 species from a weir on the Kitwanga River in BC, Canada. This dataset is licensed under CC-BY-NC-SA 4.0.

Tasmanian Orange Roughy Stereo Image Machine Learning Dataset (TORSI)

This dataset is a collection of annotated stereo image pairs collected by a net-attached Acoustic and Optical System (AOS) during orange roughy (Hoplostethus atlanticus) biomass surveys off the northeast coast of Tasmania, Australia in July 2019. This dataset is released under CC-BY-NC-SA 4.0.

FishTrack23

This dataset includes approximately 850k bounding boxes across 26 tracks in a variety of ocean conditions. This dataset is released under CC-BY 4.0.

AAU Zebrafish Re-Identification Dataset

This dataset includes ~2200 images of zebrafish in a lab environment with bounding box annotations and individual identifications. Six fish were recorded in total, in two sessions with three fish at a time. A 32 x 32 x 32 cm fish tank was used, with the fish constrained to swim within the first 3.5 cm of the tank. This dataset is released under CC BY 4.0.

[Thumbnail: fish with bounding boxes]

Posted by Dan Morris.