List of other conservation data sets

There are lots of labeled data sets relevant to conservation that are not on LILA, of course, and rather than copying them all to LILA, we’ll use this post to track other data sets we know about. The list is meant to cover data sets that are nearly “machine-learning-ready”, i.e. data sets where an interesting set of labels is more or less attached to an interesting set of images/documents/etc. We are not tracking, for example, large repositories of unlabeled satellite data. We are also not tracking private data repositories; roughly, if access requires anything more than filling out a form that’s almost auto-approved, it doesn’t go on this list.

If you know of data sets not on this list, or if you own one of these data sets but can no longer maintain it and would like to transfer it over to LILA, email us!

Table of contents

Image data sets (wildlife)
Image data sets (geospatial)
Image data sets (other)
Bioacoustic data sets
Competitions
Other lists of data sets

Image data sets (wildlife)

iNaturalist images (animal photos)

The iNaturalist competition provides around 500k labeled handheld-camera photos of around 8k species, varying a bit from year to year. Data originate from iNaturalist, a citizen-science platform for wildlife observation. You can download a much larger – albeit much less curated – subset of iNaturalist data from GBIF.

Caltech-UCSD birds (bird photos)

Caltech-UCSD Birds 200 (CUB-200) is an image dataset with photos of 200 bird species (mostly North American), including species labels, bounding boxes, and coarse segmentation masks.

Oxford Flowers (flower photos)

Images of 102 flower species, with between 40 and 258 images of each.

Animals with attributes (animal photos)

37,322 images of 50 animal classes, with pre-extracted feature representations for each image.

Penguin Counting in the Wild (camera trap photos with keypoints)

73,802 images taken by 15 different cameras from the Penguin Watch project, including keypoints indicating penguin locations within each image. Also available here in .mat and .json formats.

Chimpanzee Faces in the Wild (individual ID)

Around 80 labeled examples each from around 25 individual chimps, with individual identifications.

Wildlife Image and Localization Dataset (species and bounding box labels)

Around 6k handheld-camera images and around 12k bounding boxes for 28 species.

Stanford Dogs (dog photos with bounding boxes)

Around 20k images of 120 dog breeds, with both class labels and bounding boxes. Conservation-related? Not exactly, but let’s face it, lots of us work on this kind of data because we like looking at pictures of animals.

Oxford Pets (pet photos with bounding boxes and masks)

Around 7500 images of pets in 37 classes, with class labels, bounding boxes, and segmentation masks. Again, maybe not squarely related to conservation or biology, but finding furry things with machine learning is finding furry things with machine learning, right?

NABirds (bird photos)

Around 48k images of 400 species of birds, with gender and age labels in many cases.

Pasadena Urban Trees (trees in aerial and street view photos)

Around 30k trees, imaged from aerial and street views, with location and species information.

Project Natick Underwater Video (marine species)

~1k images of fish w/bounding boxes

Carrizo Camera Traps (camera traps)

100k camera trap images from California

Friesian Cattle 2015 (individual ID)

~350 images labeled as ~50 individual cows

Denison University Camera Trap Data (unlabeled camera traps)

~200 camera trap images from Denison University Biological Reserve (unlabeled)

MammalWeb OSF Data (camera traps)

~35k camera trap images from MammalWeb, with species labels

Fishnet.AI (images of fishing vessels)

~163k bounding boxes on ~35k images of fish and people on fishing vessels

Croatian Fish (cropped images of fish)

800 images of fish in 12 classes

The Aerial Elephant Dataset (aerial images)

>2k images containing >15k annotated elephants

apic.ai bee poses (annotated bee images)

~200 images of bees with keypoint annotations

Arribada Human-Wildlife Conflict (annotated elephant images)

~76k thermal images of elephants, humans, and goats

DeepFish (annotated fish images)

~40k images with a mix of classification, segmentation, and counting labels

Image data sets (geospatial)

BigEarthNet (land cover, satellite)

>500k Sentinel images with patch-level land cover labels

EuroSAT (land cover, satellite)

>27k Sentinel images with patch-level land cover labels

Image data sets (other)

TACO (trash)

Segmentation labels and taxonomic identifiers for garbage.

Bioacoustic data sets

MobySound (marine mammal recordings)

Maintained by the CIMRS Bioacoustics Group

Watkins Marine Mammal Sounds (marine mammal recordings)

Maintained by Woods Hole

OrcaSound training data (orca recordings)

Maintained by OrcaSound

North American bird audio with bbox spectral annotations

Maintained by the Kitzes Lab

The following table was curated by the Kitzes Lab at the University of Pittsburgh.

Also see this fantastic list of bioacoustic data sets curated by Justin Salamon.

| Name | Minutes of audio | File length | Number of recordings | Number of species | Type | Label type: time/freq | Label type: classes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Animal Sound Archive | | | | | foreground species | entire file | species |
| BirdVox-70k | ~3600 | ~10hr | 6 | NA | NFCs in context | time & frequency | presence of any bird call, not labeled to species |
| BirdVox-DCASE-20k | 3333.3 | 10s | 20000 | | NFC | entire file | presence of any bird call, not labeled to species |
| BirdVox-full-night | 3720 | | | | | | |
| CLO-43SD | at maximum: ~90 | <1s | 5428 | 43 | single NFCs | entire file | species |
| CLO-WTSP | at maximum: ~278 | <1s | 16703 | 1 | single NFCs | entire file | WTSP present/other species present/no NFCs present |
| CLO-SWTH | at maximum: ~298 | <1s | 17911 | 1 | single NFCs | entire file | SWTH present/other species present/no NFCs present |
| BirdDB (paper) | | | 689 | ~153 | foreground species | time | species/call type for single species in each recording |
| Freefield 1010 | 1281.7 | 10s | 7690 | NA | soundscape | entire file | presence of any bird call, not labeled to species |
| NIPS4B | ~24 | various | 687 | 87 (classes) | unsure; foreground species or soundscape | entire file | species (call and song are separate classes in some species) |
| BirdCLEF 2016+2017/2018 Xeno-Canto (training) | guess: 14,000 | various | 36,496 | 1500 | foreground species | entire file | species |
| BirdCLEF 2016+2017/2018 Peru/Colombia (validation) | 20 | 4min | 5 | | soundscape | time | species |
| BirdCLEF 2019 (training) | | various | ~50,000 | 659 | foreground species | entire file | species |
| BirdCLEF 2019 (validation) | ~4345 | to be split into 5s segments | | | soundscape | 5s increments | all species present in 5s segment |
| warblrb10k | 1333.3 | 10s | 8000 | NA | soundscape | entire file | presence of any bird call, not labeled to species |
| MLSP 2013 | 107.5 | 10s | 645 | 19 | soundscape | entire file | which species are in the file |
| Macaulay Master Set | guess: 2,000 | various | 4928 | 737 | foreground species | entire file | foreground species |
| Xeno Canto | | | | | | | |
| Xeno Canto ~600 US birds | guess: 18,000 | various | 77173 | >600 | foreground species | entire file | foreground species |
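
For the data sets made up of fixed-length clips, the "Minutes of audio" figures in the table appear to be just the number of recordings multiplied by the clip length. The short Python sketch below reproduces that arithmetic for four rows. The function name `total_minutes` and the dictionary are illustrative and not part of any of these data sets' tooling; the counts and clip lengths are copied from the table.

```python
def total_minutes(num_recordings: int, clip_seconds: float) -> float:
    """Total audio duration, in minutes, for a set of fixed-length clips."""
    return num_recordings * clip_seconds / 60.0


# (number of recordings, clip length in seconds), copied from the table
fixed_length_sets = {
    "BirdVox-DCASE-20k": (20000, 10),
    "Freefield 1010": (7690, 10),
    "warblrb10k": (8000, 10),
    "MLSP 2013": (645, 10),
}

for name, (count, seconds) in fixed_length_sets.items():
    print(f"{name}: {total_minutes(count, seconds):.1f} minutes")
```

The same calculation, with a clip length of one second, gives the "at maximum" values listed for the CLO data sets, whose clips are each shorter than one second.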

A bunch of relevant competitions

Competitions are a great way to get started doing machine learning for environmental science, and each comes with a data set. Here are a few competitions that involve wildlife…

iNaturalist (handheld photos of animals)

iWildCam (camera trap images)

Amur Tiger Re-identification in the Wild (individual ID for tigers from video)

Whale Detection Challenge (bioacoustics)

NOAA Fisheries Steller Sea Lion Population Count (aerial images)

NOAA Right Whale Recognition (aerial images)

Humpback Whale Identification (individual ID from fluke photos)

Nature Conservancy Fisheries Monitoring (species ID from ship-deck photos)

ImageCLEF coral (coral localization and identification in images)

BirdCLEF (bird identification in audio recordings)

PlantCLEF (plant identification in images)

GeoLifeCLEF (species distribution estimation)

SeaCLEF (marine animal identification in images and video)

Snake Species ID Challenge (w/ ~100k images of ~4k snake species)

NIPS4B Multilabel Bird Species Classification (from audio recordings)

DCASE Bird Audio Detection challenge

ICLR workshop challenge on crop detection from satellite imagery

Other useful lists of open data sets

…that at least touch on environmental science.

Radiant Earth ML Hub

Awesome Public Datasets (GitHub)

Awesome Deep Ecology (GitHub)

Datasets on Kaggle

Registry of Open Data on AWS

Earth on AWS

Google Cloud Public Datasets

Google Earth Engine Data Catalog

Google AI Datasets

Microsoft Research Open Data

Posted by Dan Morris.