List of other conservation data sets

There are lots of labeled data sets relevant to conservation that are not on LILA, of course, and rather than copying them all to LILA, we’ll use this post to track the other data sets we know about. This list covers data sets that are nearly “machine-learning-ready”, i.e. those with an interesting set of labels that’s more or less attached to an interesting set of images/documents/etc. We are not tracking, for example, large repositories of unlabeled satellite data. We are also not tracking private data repositories; roughly, if access requires anything more than filling out a form that’s almost auto-approved, it doesn’t go on this list.

If you know of data sets not on this list, or if you own one of these data sets but can no longer maintain it and would like to transfer it over to LILA, email us!

Table of contents

Terrestrial wild animal images
Domestic animals
Marine/freshwater images
Plants
Image data sets (geospatial)
Image data sets (other)
Bioacoustic data sets
A bunch of relevant competitions
Other useful lists of open data sets

Terrestrial wild animal images

iNaturalist images (animal photos)

The iNaturalist competition provides around 500k labeled handheld-camera photos of around 8k species, varying a bit from year to year. Data originate from iNaturalist, a citizen-science platform for wildlife observation. You can download a much larger – albeit much less curated – subset of iNaturalist data from GBIF.

NABirds (bird photos)

Around 48k images of 400 species of birds, with sex and age labels in many cases.

Caltech-UCSD Birds (bird photos)

Caltech-UCSD Birds 200 (CUB-200) is an image data set with photos of 200 bird species (mostly North American), including species labels, bounding boxes, and coarse segmentation masks.

Animals with Attributes (animal photos)

37,322 images of 50 animal classes, with pre-extracted feature representations for each image.

Carrizo Camera Traps (camera traps)

100k camera trap images from California

Denison University Camera Trap Data (unlabeled camera traps)

~200 camera trap images from Denison University Biological Reserve (unlabeled)

MammalWeb OSF Data (camera traps)

~35k camera trap images from MammalWeb, with species labels

Penguin Counting in the Wild (camera trap photos with keypoints)

73,802 images taken by 15 different cameras from the Penguin Watch project, including keypoints indicating penguin locations within each image. Also available here in .mat and .json formats.

The Aerial Elephant Dataset (aerial images)

>2k images containing >15k annotated elephants

apic.ai bee poses (annotated bee images)

~200 images of bees with keypoint annotations

Arribada Human-Wildlife Conflict (annotated elephant images)

~76k thermal images of elephants, humans, and goats

Chimpanzee Faces in the Wild (individual ID)

Around 80 labeled examples each from around 25 individual chimps, with individual identifications.

Wildlife Image and Localization Dataset (species and bounding box labels)

Around 6k handheld-camera images and around 12k bounding boxes for 28 species.

Domestic animals

Stanford Dogs (dog photos with bounding boxes)

Around 20k images of 120 dog breeds, with both class labels and bounding boxes. Conservation-related? Not exactly, but let’s face it, lots of us work on this kind of data because we like looking at pictures of animals.

Oxford Pets (pet photos with bounding boxes and masks)

Around 7,500 images of pets in 37 classes, with class labels, bounding boxes, and segmentation masks. Again, maybe not squarely related to conservation or biology, but finding furry things with machine learning is finding furry things with machine learning, right?

Friesian Cattle 2015 (individual ID)

~350 images labeled as ~50 individual cows

Naemura Lab Cattle Detection

~2000 boxes on cattle in aerial imagery

Marine/freshwater images

Project Natick Underwater Video (marine species)

~1k images of fish w/bounding boxes

Roboflow Fish Dataset (boxes on fish)

680 images of fish w/bounding boxes

Labeled Fishes in the Wild (boxes on fish)

~1k images of fish w/boxes, ~3k blanks

Eagle rays images (boxes on rays)

~500 aerial images w/boxes on rays

Caltech Fish Counting (freshwater sonar)

>500k annotations on fish in sonar video

Fishnet.AI (images of fishing vessels)

~163k bounding boxes on ~35k images of fish and people on fishing vessels

Croatian Fish (cropped images of fish)

800 images of fish in 12 classes

DeepFish (annotated fish images)

~40k images with a mix of classification, segmentation, and counting labels

FathomNet (annotated images of ocean life/structures)

~70k labeled images representing a variety of marine entities

The Brackish Dataset (annotated videos of fish)

~90 videos with bounding boxes on fish

AFFiNe (images of fish)

~7k labeled images of freshwater fish

Whales from Space (exactly what it says)

633 boxes on whales in satellite imagery

SMarTaR-ID (Standardised Marine Taxon Reference Image Database) (ocean imagery)

Database of ocean species images, particularly cnidarians and sponges

Deep Vision Fish Dataset (segmented fish)

Segmented fish and associated empty backgrounds, intended for training data generation

Plants

Urban Tree Detection Data (trees in aerial imagery)

Keypoints on ~40k trees in NAIP data.

The Auto Arborist Dataset (trees in street-level imagery)

>2M trees in street-level imagery annotated by genus

Oxford Flowers (flower photos)

Images of 102 flower species, with between 40 and 258 images of each.

Pl@ntNet-300K (images of plants)

~300k labeled images of plants

Healthy vs. Diseased Leaf Image Dataset (images of leaves)

~4k images labeled w/species and disease status

CanaTree100 (images of trees)

100 images with 920 trees w/segmentation masks

Image data sets (geospatial)

NeonTreeEvaluation (trees in aerial and lidar surveys)

~3k bounding box annotations on RGB, hyperspectral, and lidar survey data

Pasadena Urban Trees (trees in aerial and street view photos)

Around 30k trees, imaged from aerial and street views, with location and species information.

BigEarthNet (land cover, satellite)

>500k Sentinel images with patch-level land cover labels

EuroSAT (land cover, satellite)

>27k Sentinel images with patch-level land cover labels

Image data sets (other)

TACO (trash)

Segmentation labels and taxonomic identifiers for garbage.

Bioacoustic data sets

Fully-Annotated Soundscape Recordings from the Northeastern United States

285 hour-long recordings with 50,760 bounding boxes on 81 bird species

An annotated set of audio recordings of Eastern North American birds

16,052 annotations on 48 species in 385 minutes

BirdVox

Several hundred thousand labeled audio clips of North American birds

xeno-canto

>1M bird IDs in nearly 1M recordings, covering >10k species

MobySound (marine mammal recordings) (paper)

~14,000 annotations on eight species of baleen whales, plus a mix of annotated and unannotated data on other marine mammals

Watkins Marine Mammal Sounds (marine mammal recordings)

~15,000 high-quality excerpts from 32 marine mammal species, and additional lower-quality or unannotated data

Orcasound data (orca recordings)

Annotated orca recordings

A bunch of relevant competitions

Competitions are a great way to get started doing machine learning for environmental science, and each comes with a data set. Here are a few competitions that involve wildlife…

Competitions: terrestrial animal images

Conser-vision Practice Area (camera trap image classification)

Deep Chimpact (depth estimation for wildlife conservation)

Hakuna Ma-data (camera trap image classification)

Pri-matrix Factorization (individual chimp recognition)

iNaturalist computer vision competition (handheld photos of animals)

iWildCam (camera trap images)

Amur Tiger Re-identification in the Wild (individual ID for tigers from video)

NOAA Fisheries Steller Sea Lion Population Count (aerial images)

Snake Species ID Challenge (w/ ~100k images of ~4k snake species)

SnakeCLEF (snake images)

Competitions: marine/freshwater images

N+1 fish, N+2 fish (fish detection and classification in ship-deck photos)

Where’s Whale-do? (individual whale identification)

Great Barrier Reef Crown of Thorns Detection (marine video with object labels on starfish)

NOAA Right Whale Recognition (aerial images)

Humpback Whale Identification (individual ID from fluke photos)

Nature Conservancy Fisheries Monitoring (species ID from ship-deck photos)

ImageCLEF coral (coral localization and identification in images)

SeaCLEF (marine animal identification in images and video)

Sea Turtle Face Detection (bounding box annotations on turtles)

Turtle Recall (individual ID)

Competitions: plant images

PlantCLEF (plant identification in images)

Competitions: geospatial images

Amazon Rainforest Challenge (deforestation monitoring from Landsat/Sentinel images)

Understanding the Amazon from Space (land cover from satellite images)

ICLR workshop challenge on crop detection from satellite imagery

Competitions: bioacoustics

Whale Detection Challenge (bioacoustics)

DCASE Bird Audio Detection challenge (bird detection from audio)

BirdCLEF (bird identification in audio)

Cornell Birdcall Identification (bird detection and identification from audio)

NIPS4B Multilabel Bird Species Classification (from audio)

Competitions: other

Random Walk of the Penguins (penguin population change prediction)

GeoLifeCLEF (species distribution estimation)

FungiCLEF (fungi images)

Other useful lists of open data sets

…that are relevant to environmental science, though maybe not directly focused on conservation.

Radiant Earth ML Hub

Esri Living Atlas of the World

Microsoft Planetary Computer Data Catalog

Earth on AWS

Google Earth Engine Data Catalog

GBIF Datasets

Awesome Deep Ecology (GitHub)

Kaggle dataset/competition search for “wildlife”

Kaggle dataset/competition search for “conservation”

Kaggle dataset/competition search for “animals”

Google Dataset Search for “wildlife”

Google Dataset Search for “conservation”

Google Dataset Search for “animals”

Posted by Dan Morris.