There are lots of labeled data sets relevant to conservation that are not on LILA, of course, and rather than copying them all to LILA, we’ll use this post to track other data sets we know about. This list is intended for data sets that are nearly “machine-learning-ready”, i.e. those with an interesting set of labels that’s more or less attached to an interesting set of images/documents/etc. We are not tracking, for example, large repositories of unlabeled satellite data. We are also not tracking private data repositories; roughly, if access requires anything more than filling out a form that’s almost auto-approved, it doesn’t go on this list.
If you know of data sets not on this list, or if you own one of these data sets but can no longer maintain it and would like to transfer it over to LILA, email us!
Image data sets (terrestrial animals)
Image data sets (domestic animals)
Image data sets (marine/freshwater)
Image data sets (plants)
Image data sets (geospatial)
Image data sets (other)
Acoustic data sets
Other lists of data sets
The iNaturalist competition provides around 500k labeled handheld-camera photos of around 8k species, varying a bit from year to year. Data originate from iNaturalist, a citizen-science platform for wildlife observation. You can download a much larger – albeit much less curated – subset of iNaturalist data from GBIF.
Around 48k images of 400 species of birds, with gender and age labels in many cases.
Caltech-UCSD Birds 200 (CUB-200) is an image dataset with photos of 200 bird species (mostly North American), including species labels, bounding boxes, and coarse segmentation masks.
37,322 images of 50 animal classes, with pre-extracted feature representations for each image.
100k camera trap images from California
~200 camera trap images from Denison University Biological Reserve (unlabeled)
~35k camera trap images from MammalWeb, with species labels
>2k images containing >15k annotated elephants
~200 images of bees with keypoint annotations
~76k thermal images of elephants, humans, and goats
Around 80 labeled examples each from around 25 individual chimps, with individual identifications.
Around 6k handheld-camera images and around 12k bounding boxes for 28 species.
Around 20k images of 120 dog breeds, with both class labels and bounding boxes. Conservation-related? Not exactly, but let’s face it, lots of us work on this kind of data because we like looking at pictures of animals.
Around 7500 images of pets in 37 classes, with class labels, bounding boxes, and segmentation masks. Again, maybe not squarely related to conservation or biology, but finding furry things with machine learning is finding furry things with machine learning, right?
~350 images labeled as ~50 individual cows
~2000 boxes on cattle in aerial imagery
~1k images of fish w/bounding boxes
680 images of fish w/bounding boxes
~1k images of fish w/boxes, ~3k blanks
~500 aerial images w/boxes on rays
>500k annotations on fish in sonar video
~163k bounding boxes on ~35k images of fish and people on fishing vessels
800 images of fish in 12 classes
~40k images with a mix of classification, segmentation, and counting labels
~70k labeled images representing a variety of marine entities
~90 videos with bounding boxes on fish
~7k labeled images of freshwater fish
633 boxes on whales in satellite imagery
Database of ocean species images, particularly cnidarians and sponges
Segmented fish and associated empty backgrounds, intended for training data generation
Keypoints on ~40k trees in NAIP data.
>2M trees in street-level imagery annotated by genus
Images of approximately 120 flower species, with between 40 and 250 images of each.
~300k labeled images of plants
~4k images labeled w/species and disease status
100 images with 920 trees w/segmentation masks
~3k bounding box annotations on RGB, hyperspectral, and lidar survey data
Around 30k trees, imaged from aerial and street views, with location and species information.
>500k Sentinel images with patch-level land cover labels
>27k Sentinel images with patch-level land cover labels
Segmentation labels and taxonomic identifiers for garbage.
285 hour-long recordings with 50,760 bounding boxes on 81 bird species
16,052 annotations on 48 species in 385 minutes
Several hundred thousand labeled audio clips of North American birds
>1M bird IDs in nearly 1M recordings, covering >10k species
~14,000 annotations on eight species of baleen whales, plus a mix of annotated and un-annotated data on other marine mammals
~15,000 high-quality excerpts from 32 marine mammal species, and additional lower-quality or unannotated data
Annotated orca recordings
Competitions are a great way to get started doing machine learning for environmental science, and each comes with a data set. Here are a few competitions that involve wildlife…
Competitions: terrestrial animal images
Conser-vision Practice Area (camera trap image classification)
Deep Chimpact (depth estimation for wildlife conservation)
Hakuna Ma-data (camera trap image classification)
Pri-matrix Factorization (individual chimp recognition)
iNaturalist computer vision competition (handheld photos of animals)
iWildCam (camera trap images)
Amur Tiger Re-identification in the Wild (individual ID for tigers from video)
NOAA Fisheries Steller Sea Lion Population Count (aerial images)
Snake Species ID Challenge (w/ ~100k images of ~4k snake species)
SnakeCLEF (snake images)
Competitions: marine/freshwater images
N+1 fish, N+2 fish (fish detection and classification in ship-deck photos)
Where’s Whale-do? (individual whale identification)
Great Barrier Reef Crown of Thorns Detection (marine video with object labels on starfish)
NOAA Right Whale Recognition (aerial images)
Humpback Whale Identification (individual ID from fluke photos)
Nature Conservancy Fisheries Monitoring (species ID from ship-deck photos)
ImageCLEF coral (coral localization and identification in images)
SeaCLEF (marine animal identification in images and video)
Sea Turtle Face Detection (bounding box annotations on turtles)
Turtle Recall (individual ID)
Competitions: plant images
PlantCLEF (plant identification in images)
Competitions: geospatial images
Amazon Rainforest Challenge (deforestation monitoring from Landsat/Sentinel images)
Understanding the Amazon from Space (land cover from satellite images)
Whale Detection Challenge (bioacoustics)
DCASE Bird Audio Detection challenge (bird detection from audio)
BirdCLEF (bird identification in audio)
Cornell Birdcall Identification (bird detection and identification from audio)
NIPS4B Multilabel Bird Species Classification (from audio)
Random Walk of the Penguins (penguin population change prediction)
GeoLifeCLEF (species distribution estimation)
FungiCLEF (fungi images)
…that are relevant to environmental science, though maybe not directly focused on conservation.
Posted by Dan Morris.