List of other conservation data sets

There are lots of labeled data sets relevant to conservation that are not on LILA, of course, and rather than copying them all to LILA, we’ll use this post to track other data sets we know about. This is intended to track data sets that are nearly “machine-learning-ready”, i.e. an interesting set of labels that’s more or less attached to an interesting set of images/documents/etc. We are not tracking, for example, large repositories of unlabeled satellite data. We are also not tracking private data repositories; roughly, if access requires anything more than filling out a form that’s almost auto-approved, it doesn’t go on this list.

If you know of data sets not on this list, or if you own one of these data sets but can no longer maintain it and would like to transfer it over to LILA, email us!

iNaturalist images (animal photos)

The iNaturalist competition provides around 500k labeled handheld-camera photos of around 8k species, varying a bit from year to year. Data originate from iNaturalist, a citizen-science platform for wildlife observation. You can download a much larger – albeit much less curated – subset of iNaturalist data from GBIF.
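As a rough illustration of pulling iNaturalist records from GBIF, here is a minimal sketch that builds a query against GBIF's public occurrence search API and picks image URLs out of a response. The helper names (`build_gbif_image_query`, `extract_image_urls`) are hypothetical. The dataset key is our assumption for the "iNaturalist Research-grade Observations" data set on GBIF, so check it on gbif.org before relying on it. The sketch does no network I/O; fetch the URL with whatever HTTP client you prefer.

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"

# Assumption: GBIF dataset key for "iNaturalist Research-grade Observations".
# Verify this on gbif.org before use.
INAT_DATASET_KEY = "50c9509d-22c7-4a22-a47d-8c48425ef4a7"


def build_gbif_image_query(scientific_name, limit=20, offset=0):
    """Build an occurrence search URL restricted to records with still images."""
    params = {
        "datasetKey": INAT_DATASET_KEY,
        "scientificName": scientific_name,
        "mediaType": "StillImage",
        "limit": limit,
        "offset": offset,
    }
    return GBIF_OCCURRENCE_SEARCH + "?" + urlencode(params)


def extract_image_urls(response_json):
    """Collect image URLs from a parsed occurrence search response."""
    urls = []
    for record in response_json.get("results", []):
        for media in record.get("media", []):
            if media.get("type") == "StillImage" and "identifier" in media:
                urls.append(media["identifier"])
    return urls


if __name__ == "__main__":
    print(build_gbif_image_query("Puma concolor", limit=5))
```

Paging through results means bumping `offset` by `limit` on each request. For anything beyond a quick look, GBIF's bulk download service is likely a better fit than paging through the search API.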

Caltech-UCSD birds (bird photos)

Caltech-UCSD Birds 200 (CUB-200) is an image dataset with photos of 200 bird species (mostly North American), including species labels, bounding boxes, and coarse segmentation masks.

Oxford Flowers (flower photos)

Images of 102 flower categories, with between 40 and roughly 250 images of each.

Animals with attributes (animal photos)

37,322 images of 50 animal classes, with pre-extracted feature representations for each image.

Penguin Counting in the Wild (camera trap photos with keypoints)

73,802 images taken by 15 different cameras from the Penguin Watch project, including keypoints indicating penguin locations within each image. Also available here in .mat and .json formats.

Chimpanzee Faces in the Wild (individual ID)

Around 80 labeled examples each from around 25 individual chimps, with individual identifications.

Wildlife Image and Localization Dataset (species and bounding box labels)

Around 6k handheld-camera images and around 12k bounding boxes for 28 species.

Stanford Dogs (dog photos with bounding boxes)

Around 20k images of 120 dog breeds, with both class labels and bounding boxes. Conservation-related? Not exactly, but let’s face it, lots of us work on this kind of data because we like looking at pictures of animals.

Oxford Pets (pet photos with bounding boxes and masks)

Around 7500 images of pets in 37 classes, with class labels, bounding boxes, and segmentation masks. Again, maybe not squarely related to conservation or biology, but finding furry things with machine learning is finding furry things with machine learning, right?

NABirds (bird photos)

Around 48k images of 400 species of birds, with gender and age labels in many cases.

Pasadena Urban Trees (trees in aerial and street view photos)

Around 30k trees, imaged from aerial and street views, with location and species information.

Project Natick Underwater Video (marine species)

~1k images of fish with bounding boxes

Carrizo Camera Traps (camera traps)

100k camera trap images from California

Friesian Cattle 2015 (individual ID)

~350 images labeled as ~50 individual cows

Denison University Camera Trap Data (unlabeled camera traps)

~200 camera trap images from Denison University Biological Reserve (unlabeled)

MammalWeb OSF Data (camera traps)

~35k camera trap images from MammalWeb, with species labels

Fishnet.AI (images of fishing vessels)

~163k bounding boxes on ~35k images of fish and people on fishing vessels

Competitions are a great way to get started doing machine learning for environmental science, and each comes with a data set. Here are a few competitions that involve wildlife…

iNaturalist (handheld photos of animals)

iWildCam (camera trap images)

Amur Tiger Re-identification in the Wild (individual ID for tigers from video)

Whale Detection Challenge (bioacoustics)

NOAA Fisheries Steller Sea Lion Population Count (aerial images)

NOAA Right Whale Recognition (aerial images)

Humpback Whale Identification (individual ID from fluke photos)

Nature Conservancy Fisheries Monitoring (species ID from ship-deck photos)

ImageCLEF coral (coral localization and identification in images)

BirdCLEF (bird identification in audio recordings)

PlantCLEF (plant identification in images)

GeoLifeCLEF (species distribution estimation)

SeaCLEF (marine animal identification in images and video)

Snake Species ID Challenge (w/ ~100k images of ~4k snake species)

NIPS4B Multilabel Bird Species Classification (from audio recordings)

DCASE Bird Audio Detection challenge

Other lists of data sets that at least touch on environmental science:

Awesome Public Datasets (GitHub)

Awesome Deep Ecology (GitHub)

Datasets on Kaggle

Registry of Open Data on AWS

Earth on AWS

Google Cloud Public Datasets

Google Earth Engine Data Catalog

Google AI Datasets

Microsoft Research Open Data