There are lots of labeled data sets relevant to conservation that are not on LILA, of course, and rather than copying them all to LILA, this page tracks other data sets we know about. This page complements the list of LILA datasets; the union of these two lists is every conservation-related labeled dataset we’re aware of.
A few boundaries we draw around this list:
- This page is intended to track data sets that are nearly “machine-learning-ready”, i.e. an interesting set of labels that’s more or less attached to an interesting set of images/documents/etc. We are not tracking, for example, large repositories of unlabeled data.
- We are also not tracking private data repositories; roughly, if access requires anything more than filling out a form that’s almost auto-approved, it doesn’t go on this list.
- This page is not intended to track labeled geospatial data, even though many labeled geospatial datasets are critical for conservation. We do try to track other lists of labeled geospatial data in the “other lists of data sets” section at the end of this page.
If you know of data sets not on this list, or if you own one of these data sets but can no longer maintain it and would like to transfer it over to LILA, email us!
Image data sets (terrestrial wild animals) (ground-based sensors)
Image data sets (terrestrial wild animals) (aerial/drone)
Image data sets (domestic animals)
Image data sets (marine/freshwater)
…where marine life looks like a fish
…where marine life is more nuanced
Image data sets (plants)
Image data sets (geospatial)
Image data sets (other)
Acoustic data sets
Other lists of data sets
The iNaturalist competition provides around 500k labeled handheld-camera photos of around 8k species, varying a bit from year to year. Data originate from iNaturalist, a citizen-science platform for wildlife observation.
Around 48k images of 400 species of birds, with gender and age labels in many cases.
Caltech-UCSD Birds 200 (CUB-200) is an image dataset with photos of 200 bird species (mostly North American), including species labels, bounding boxes, and coarse segmentation masks.
37322 images of 50 animal classes with pre-extracted feature representations for each image.
100k camera trap images from California
~200 camera trap images from Denison University Biological Reserve (unlabeled)
~35k camera trap images from MammalWeb, with species labels
~200 images of bees with keypoint annotations
~76k thermal images of elephants, humans, and goats
Around 80 labeled examples each from around 25 individual chimps, with individual identifications.
Around 6k handheld-camera images and around 12k bounding boxes for 28 species.
221 stereo camera trap videos of deer, with instance masks
1431 video sequences of 13 individual captive polar bears
~5k frames of ants with boxes
8524 images of grazing animals in Kenya from custom camera traps
~6k 224×224 crops from camera trap images in New Zealand
~30k boxes on nine taxa of insects
>2k images containing >15k annotated elephants
Images and around 250,000 keypoint annotations from 13 bird detection projects.
~130k aerial images with keypoint annotations on birds
28 drone mosaics with ~40k annotations on penguins and albatrosses
Keypoint annotations on penguins in aerial images
Individual ID annotations on crocodiles in drone images
Counts of fake bird colonies in drone images
Waterfowl annotated in thermal drone images
Point and species annotations on aerial images of savanna
Drone images of ungulates and geladas with 40532 bounding boxes
Oblique aerial videos of zebras with 162931 bounding boxes and behavioral labels (standing, grazing, etc.).
Around 20k images of 120 dog breeds, with both class labels and bounding boxes. Conservation-related? Not exactly, but let’s face it, lots of us work on this kind of data because we like looking at pictures of animals.
Around 7500 images of pets in 37 classes, with class labels, bounding boxes, and segmentation masks. Again, maybe not squarely related to conservation or biology, but finding furry things with machine learning is finding furry things with machine learning, right?
~350 images labeled as ~50 individual cows
~2000 boxes on cattle in aerial imagery
~5000 images of cattle muzzles with individual IDs
~10k images and ~300 videos of in-barn cattle with boxes and individual IDs
~3700 images of in-barn cattle with boxes and individual IDs
This section is broken into datasets where marine life looks like what a little kid thinks a fish looks like, and datasets with a more diverse concept of marine life.
~1k images of fish w/bounding boxes
~3k images of out-of-water fish w/species labels and segmentation masks
680 images of fish w/bounding boxes
~1k images of fish w/boxes, ~3k blanks
~163k bounding boxes on ~35k images of fish and people on fishing vessels
800 images of fish in 12 classes
~40k images with a mix of classification, segmentation, and counting labels
~90 videos with bounding boxes on fish
Segmented fish and associated empty backgrounds, intended for training data generation
98 videos of fish with tracking boxes (i.e., boxes with stable frame-to-frame IDs)
32 video sequences with bounding boxes on a variety of species
80k cropped fish images with 45k bounding boxes
Several thousand BRUV images with bounding boxes on fish and bait
17 10-minute videos with tracking points
14k boxes on fish in 20k images
12.5k boxes on fish and other species in 15k images
54,459 images of fish in 1000 categories
~44k images of fish w/ ~83k bounding boxes
~7k labeled images of freshwater fish, generally not in the water, cropped close
~68k boxes on fish and crabs. (I don’t generally include LILA datasets on this page, but I’m breaking my own rule just this once, because I use this section as a de facto list of public fish-y-fish datasets.)
435 images of brook trout with individual ID labels
~2200 images of zebrafish with individual IDs
Eight long stereo video sequences of zebrafish with boxes and keypoints
Boxes on 532,000 frames from 1,567 videos of salmon in two weirs
Point annotations on sea turtles in drone images
~70k labeled images representing a variety of marine entities
1011 dolphin fin images with individual IDs
633 boxes on whales in satellite imagery
Database of ocean species images, particularly cnidarians and sponges
~500 aerial images w/boxes on rays
>500k annotations on fish in sonar video
Images of ringed seals with individual ID labels and segmentation masks
Aerial imagery from New Zealand in which Kahikatea trees have been masked
2547 polygons on 36 Australian tree species
Keypoints on ~40k trees in NAIP data.
>2M trees in street-level imagery annotated by genus
Images of approximately 120 flower species, with between 40 and 250 images of each.
~300k labeled images of plants
~4k images labeled w/species and disease status
100 images with 920 trees w/segmentation masks
~3k bounding box annotations on RGB, hyperspectral, and lidar survey data
Around 30k trees, imaged from aerial and street views, with location and species information.
1130 trees (with classes) manually segmented in airborne lidar data
986 drone images of avocado plantations with segmented individual trees
>500k Sentinel images with patch-level land cover labels
>27k Sentinel images with patch-level land cover labels
Segmentation labels and taxonomic identifiers for garbage.
285 hour-long recordings with 50,760 bounding boxes on 81 bird species
33 hour-long recordings with 20,147 bounding boxes on 56 bird species
21 hour-long recordings with 14,798 bounding boxes on 132 bird species
100 10-minute recordings with 10,296 bounding boxes on 21 bird species
635 recordings with 59,583 bounding boxes on 27 bird species
34 hour-long recordings with 6,952 bounding boxes on 89 bird species
16,052 annotations on 48 species in 385 minutes
Several hundred thousand labeled audio clips of North American birds
>1M bird IDs in nearly 1M recordings, covering >10k species
93k recordings of frogs
~40k annotations on ~12 hours of audio for 118 sound types including 58 bird species
~15,000 high-quality excerpts from 32 marine mammal species, and additional lower-quality or unannotated data
Annotated orca recordings
~50k recordings with species labels
Competitions are a great way to get started doing machine learning for environmental science, and each comes with a data set. Here are a few competitions that involve wildlife…
Competitions: terrestrial animal images
Conser-vision Practice Area (camera trap image classification)
Deep Chimpact (depth estimation for wildlife conservation)
Hakuna Ma-data (camera trap image classification)
Pri-matrix Factorization (individual chimp recognition)
iNaturalist computer vision competition (handheld photos of animals)
iWildCam (camera trap images)
Amur Tiger Re-identification in the Wild (individual ID for tigers from video)
NOAA Fisheries Steller Sea Lion Population Count (aerial images)
Snake Species ID Challenge (w/ ~100k images of ~4k snake species)
SnakeCLEF (snake images)
Competitions: marine/freshwater images
N+1 fish, N+2 fish (fish detection and classification in ship-deck photos)
Where’s Whale-do? (individual whale identification)
Great Barrier Reef Crown of Thorns Detection (marine video with object labels on starfish)
NOAA Right Whale Recognition (aerial images)
Humpback Whale Identification (individual ID from fluke photos)
Nature Conservancy Fisheries Monitoring (species ID from ship-deck photos)
ImageCLEF coral (coral localization and identification in images)
SeaCLEF (marine animal identification in images and video)
Sea Turtle Face Detection (bounding box annotations on turtles)
Turtle Recall (individual ID)
FathomNet 2023 (species ID)
Competitions: plant images
PlantCLEF (plant identification in images)
Competitions: geospatial images
Amazon Rainforest Challenge (deforestation monitoring from Landsat/Sentinel images)
Understanding the Amazon from Space (land cover from satellite images)
Whale Detection Challenge (bioacoustics)
DCASE Bird Audio Detection challenge (bird detection from audio)
BirdCLEF (bird identification in audio)
Cornell Birdcall Identification (bird detection and identification from audio)
NIPS4B Multilabel Bird Species Classification (from audio)
Random Walk of the Penguins (penguin population change prediction)
GeoLifeCLEF (species distribution estimation)
FungiCLEF (fungi images)
Other lists of data sets that are relevant to environmental science, though maybe not directly focused on conservation:
LILA datasets (just in case someone landed here and isn’t aware of the context… the page you’re looking at right now is a list of datasets that aren’t hosted on LILA)
OpenForest (a list of open-access forestry data)
Posted by Dan Morris.