List of other conservation data sets

There are lots of labeled data sets relevant to conservation that are not on LILA, of course, and rather than copying them all to LILA, we’ll use this post to track other data sets we know about. This list is intended to track data sets that are nearly “machine-learning-ready”, i.e., data sets where an interesting set of labels is more or less attached to an interesting set of images/documents/etc. We are not tracking, for example, large repositories of unlabeled satellite data. We are also not tracking private data repositories; roughly, if access requires anything more than filling out a form that’s almost auto-approved, it doesn’t go on this list.

If you know of data sets not on this list, or if you own one of these data sets but can no longer maintain it and would like to transfer it over to LILA, email us!

Table of contents

Image data sets (terrestrial animals)
Image data sets (domestic animals)
Image data sets (marine/freshwater)
     …where fish look fishy
     …where marine life is more nuanced
Image data sets (plants)
Image data sets (geospatial)
Image data sets (other)
Acoustic data sets
Competitions
Other lists of data sets

Terrestrial wild animal images

iNaturalist images (animal photos)

The iNaturalist competition provides around 500k labeled handheld-camera photos of around 8k species, varying a bit from year to year. Data originate from iNaturalist, a citizen-science platform for wildlife observation. You can download a much larger – albeit much less curated – subset of iNaturalist data from GBIF.

NABirds (bird photos)

Around 48k images of 400 species of birds, with gender and age labels in many cases.

Caltech-UCSD birds (bird photos)

Caltech-UCSD Birds 200 (CUB-200) is an image dataset with photos of 200 bird species (mostly North American), including species labels, bounding boxes, and coarse segmentation masks.

Animals with attributes (animal photos)

37,322 images of 50 animal classes, with pre-extracted feature representations for each image.

Carrizo Camera Traps (camera traps)

100k camera trap images from California

Denison University Camera Trap Data (unlabeled camera traps)

~200 camera trap images from Denison University Biological Reserve (unlabeled)

MammalWeb OSF Data (camera traps)

~35k camera trap images from MammalWeb, with species labels

Penguin Counting in the Wild (camera trap photos with keypoints)

73,802 images taken by 15 different cameras from the Penguin Watch project, including keypoints indicating penguin locations within each image. Also available here in .mat and .json formats.
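For readers new to keypoint labels: each annotation is a single (x, y) point per animal rather than a box, so a per-image count is just the number of points on that image. A minimal sketch in Python, assuming a hypothetical record layout (the field names `image`, `x`, and `y` and the file names are illustrative, not the actual Penguin Watch file format):

```python
from collections import Counter

# Hypothetical keypoint records: one (x, y) point per annotated animal.
# Field names and file names are illustrative, not the dataset's real schema.
keypoints = [
    {"image": "cam01_0001.jpg", "x": 120.5, "y": 340.0},
    {"image": "cam01_0001.jpg", "x": 410.0, "y": 295.5},
    {"image": "cam01_0002.jpg", "x": 88.0, "y": 512.0},
]

def count_per_image(records):
    """Return a Counter mapping image name -> number of keypoints."""
    return Counter(r["image"] for r in records)

counts = count_per_image(keypoints)
for image, n in sorted(counts.items()):
    print(f"{image}: {n} animals")
```

Images with no points simply do not appear in the result, so a real pipeline would also need the full image list to report zero counts.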

The Aerial Elephant Dataset (aerial images)

>2k images containing >15k annotated elephants

apic.ai bee poses (annotated bee images)

~200 images of bees with keypoint annotations

Arribada Human-Wildlife Conflict (annotated elephant images)

~76k thermal images of elephants, humans, and goats

Chimpanzee Faces in the Wild (individual ID)

Around 80 labeled examples each from around 25 individual chimps, with individual identifications.

Wildlife Image and Localization Dataset (species and bounding box labels)

Around 6k handheld-camera images and around 12k bounding boxes for 28 species.

Global Model of Bird Detection Dataset (birds in aerial images)

Images and around 250,000 keypoint annotations from 13 bird detection projects.

Images and annotations to automate the classification of avian species (birds cropped from aerial images)

~11k birds cropped from aerial images, in 10 categories

Aerial Photo Imagery from Fall Waterfowl Surveys (birds in aerial images)

~130k aerial images with keypoint annotations on birds

Drones and deep learning for seabird colonies (birds in drone images)

28 drone mosaics with ~40k annotations on penguins and albatrosses

Counting animals in aerial images with a density map estimation model (penguins in aerial images)

Keypoint annotations on penguins in aerial images

Identification of free-ranging mugger crocodiles (crocodiles in drone images)

Individual ID annotations on crocodiles in drone images

Drones count wildlife more accurately and precisely than humans (counts and drone images)

Counts of fake bird colonies in drone images

UAV-derived waterfowl thermal imagery dataset (thermal and RGB images)

Waterfowl annotated in thermal drone images

Improving the precision and accuracy of animal population estimates with aerial image object detection

Point and species annotations on aerial images of savanna

Plittersdorf dataset (stereo camera trap images)

221 stereo camera trap videos of deer, with instance masks

Domestic animals

Stanford Dogs (dog photos with bounding boxes)

Around 20k images of 120 dog breeds, with both class labels and bounding boxes. Conservation-related? Not exactly, but let’s face it, lots of us work on this kind of data because we like looking at pictures of animals.

Oxford Pets (pet photos with bounding boxes and masks)

Around 7500 images of pets in 37 classes, with class labels, bounding boxes, and segmentation masks. Again, maybe not squarely related to conservation or biology, but finding furry things with machine learning is finding furry things with machine learning, right?

Friesian Cattle 2015 (individual ID)

~350 images labeled as ~50 individual cows

Naemura Lab Cattle Detection

~2000 boxes on cattle in aerial imagery

Cattle Noseprints for Individual ID

~5000 images of cattle muzzles with individual IDs

Marine/freshwater images

This section is broken into datasets where marine life looks like what a little kid thinks a fish looks like, and datasets with a more diverse concept of marine life.

Marine/freshwater images (where fish look fishy)

Project Natick Underwater Video (marine species)

~1k images of fish w/bounding boxes

Application of a Deep Learning Image Classifier for Identification of Amazonian Fishes (segmented fish)

~3k images of out-of-water fish w/species labels and segmentation masks

Roboflow Fish Dataset (boxes on fish)

680 images of fish w/bounding boxes

Labeled Fishes in the Wild (boxes on fish)

~1k images of fish w/boxes, ~3k blanks

Fishnet.AI (images of fishing vessels)

~163k bounding boxes on ~35k images of fish and people on fishing vessels

Croatian Fish (cropped images of fish)

800 images of fish in 12 classes (description)

DeepFish (annotated fish images)

~40k images with a mix of classification, segmentation, and counting labels

The Brackish Dataset (annotated videos of fish)

~90 videos with bounding boxes on fish

Deep Vision Fish Dataset (segmented fish)

Segmented fish and associated empty backgrounds, intended for training data generation

BrackishMOT (annotated videos of fish)

98 videos of fish with tracking boxes (i.e., boxes with stable frame-to-frame IDs)
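To make "stable frame-to-frame IDs" concrete: because every box carries an ID that persists across frames, per-frame boxes can be regrouped into one trajectory per animal. A minimal sketch in Python, assuming a hypothetical annotation layout of (frame, track ID, box) tuples, which is not necessarily how BrackishMOT stores its labels:

```python
from collections import defaultdict

# Hypothetical tracking annotations:
# (frame index, track ID, box as (x, y, width, height)).
detections = [
    (0, 1, (10, 20, 30, 15)),
    (0, 2, (200, 80, 25, 12)),
    (1, 1, (14, 21, 30, 15)),
    (2, 1, (19, 23, 30, 15)),
    (2, 2, (190, 82, 25, 12)),
]

def build_tracks(dets):
    """Group boxes by track ID, ordered by frame, to form one path per animal."""
    tracks = defaultdict(list)
    for frame, track_id, box in sorted(dets, key=lambda d: d[0]):
        tracks[track_id].append((frame, box))
    return dict(tracks)

tracks = build_tracks(detections)
for track_id, path in tracks.items():
    frames = [frame for frame, _ in path]
    print(f"track {track_id}: frames {frames}")
```

In this toy example, track 2 is absent from frame 1; gaps like that are common in tracking labels when an animal is occluded or leaves the field of view.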

Visual Marine Animal Tracking (VMAT)

32 video sequences with bounding boxes on a variety of species

OzFish (BRUV images w/boxes)

80k cropped fish images with 45k bounding boxes

VIAME FishTrack (BRUV images w/boxes)

Several thousand BRUV images with bounding boxes on fish and bait

F4K Detection and Tracking (videos with tracking points)

17 10-minute videos with tracking points

FishCLEF-2015 (videos with boxes)

14k boxes on fish in 20k images

Brackish Underwater Dataset (images with boxes)

12.5k boxes on fish and other species in 15k images

WildFish (cropped images of fish from online sources)

54,459 images of fish in 1000 categories

Object detection of tropical freshwater fish in Australia (freshwater species)

~44k images of fish w/ ~83k bounding boxes

AFFiNe (images of fish)

~7k labeled images of freshwater fish, generally not in the water, cropped close

Marine/freshwater images (where marine life doesn’t exactly look fishy)

Sea turtles in drone imagery

Point annotations on sea turtles in drone images

FathomNet (annotated images of ocean life/structures)

~70k labeled images representing a variety of marine entities

NOAA Dolphin ID

1011 dolphin fin images with individual IDs

Whales from Space (exactly what it says)

633 boxes on whales in satellite imagery

SMarTar-ID (Standardised Marine Taxon Reference Image Database) (ocean imagery)

Database of ocean species images, particularly cnidarians and sponges

Eagle rays images (boxes on rays)

~500 aerial images w/boxes on rays

Caltech Fish Counting (freshwater sonar)

>500k annotations on fish in sonar video

Plants

Tree species in Northern Australia (trees in UAV imagery)

2547 polygons on 36 Australian tree species

Urban Tree Detection Data (trees in aerial imagery)

Keypoints on ~40k trees in NAIP data.

The Auto Arborist Dataset (trees in street-level imagery)

>2M trees in street-level imagery annotated by genus

Oxford Flowers (flower photos)

Images of 102 flower species, with between 40 and 258 images of each.

Pl@ntNet-300K (images of plants)

~300k labeled images of plants

Healthy vs. Diseased Leaf Image Dataset (images of leaves)

~4k images labeled w/species and disease status

CanaTree100 (images of trees)

100 images with 920 trees w/segmentation masks

NeonTreeEvaluation (trees in aerial and lidar surveys)

~3k bounding box annotations on RGB, hyperspectral, and lidar survey data

Pasadena Urban Trees (trees in aerial and street view photos)

Around 30k trees, imaged from aerial and street views, with location and species information.

Image data sets (geospatial)

BigEarthNet (land cover, satellite)

>500k Sentinel images with patch-level land cover labels

EuroSAT (land cover, satellite)

>27k Sentinel images with patch-level land cover labels

Image data sets (other)

TACO (trash)

Segmentation labels and taxonomic identifiers for garbage.

Bioacoustic data sets

Fully-Annotated Soundscape Recordings from the Northeastern United States

285 hour-long recordings with 50,760 bounding boxes on 81 bird species

An annotated set of audio recordings of Eastern NA birds

16,052 annotations on 48 species in 385 minutes

BirdVox

Several hundred thousand labeled audio clips of North American birds

xeno-canto

>1M bird IDs in nearly 1M recordings, covering >10k species

Watkins Marine Mammal Sounds (marine mammal recordings)

~15,000 high-quality excerpts from 32 marine mammal species, and additional lower-quality or unannotated data

Orcasound data (orca recordings)

Annotated orca recordings

A bunch of relevant competitions

Competitions are a great way to get started doing machine learning for environmental science, and each comes with a data set. Here are a few competitions that involve wildlife…

Competitions: terrestrial animal images

Conser-vision Practice Area (camera trap image classification)

Deep Chimpact (depth estimation for wildlife conservation)

Hakuna Ma-data (camera trap image classification)

Pri-matrix Factorization (individual chimp recognition)

iNaturalist computer vision competition (handheld photos of animals)

iWildCam (camera trap images)

Amur Tiger Re-identification in the Wild (individual ID for tigers from video)

NOAA Fisheries Steller Sea Lion Population Count (aerial images)

Snake Species ID Challenge (w/ ~100k images of ~4k snake species)

SnakeCLEF (snake images)

Competitions: marine/freshwater images

N+1 fish, N+2 fish (fish detection and classification in ship-deck photos)

Where’s Whale-do? (individual whale identification)

Great Barrier Reef Crown of Thorns Detection (marine video with object labels on starfish)

NOAA Right Whale Recognition (aerial images)

Humpback Whale Identification (individual ID from fluke photos)

Nature Conservancy Fisheries Monitoring (species ID from ship-deck photos)

ImageCLEF coral (coral localization and identification in images)

SeaCLEF (marine animal identification in images and video)

Sea Turtle Face Detection (bounding box annotations on turtles)

Turtle Recall (individual ID)

FathomNet 2023 (species ID)

Competitions: plant images

PlantCLEF (plant identification in images)

Competitions: geospatial images

Amazon Rainforest Challenge (deforestation monitoring from Landsat/Sentinel images)

Understanding the Amazon from Space (land cover from satellite images)

ICLR workshop challenge on crop detection from satellite imagery

Competitions: bioacoustics

Whale Detection Challenge (bioacoustics)

DCASE Bird Audio Detection challenge (bird detection from audio)

BirdCLEF (bird identification in audio)

Cornell Birdcall Identification (bird detection and identification from audio)

NIPS4B Multilabel Bird Species Classification (from audio)

Competitions: other

Random Walk of the Penguins (penguin population change prediction)

GeoLifeCLEF (species distribution estimation)

FungiCLEF (fungi images)

Other useful lists of open data sets

…that are relevant to environmental science, though maybe not directly focused on conservation.

Radiant Earth ML Hub

Esri Living Atlas of the World

Microsoft Planetary Computer Data Catalog

Earth on AWS

Google Earth Engine Data Catalog

GBIF Datasets

Awesome Deep Ecology (GitHub)

Kaggle dataset/competition search for “wildlife”

Kaggle dataset/competition search for “conservation”

Kaggle dataset/competition search for “animals”

Google Dataset Search for “wildlife”

Google Dataset Search for “conservation”

Google Dataset Search for “animals”

Posted by Dan Morris.