There are lots of labeled data sets relevant to conservation that are not on LILA, of course, and rather than copying them all to LILA, we’ll use this post to track the other data sets we know about. The list is limited to data sets that are nearly “machine-learning-ready”, i.e. an interesting set of labels that’s more or less attached to an interesting set of images, documents, etc. We are not tracking, for example, large repositories of unlabeled satellite data. We are also not tracking private data repositories; roughly, if access requires anything more than filling out a form that’s almost auto-approved, it doesn’t go on this list.
If you know of data sets not on this list, or if you own one of these data sets but can no longer maintain it and would like to transfer it over to LILA, email us!
The iNaturalist competition provides around 500k labeled handheld-camera photos of around 8k species, varying a bit from year to year. Data originate from iNaturalist, a citizen-science platform for wildlife observation. You can download a much larger – albeit much less curated – subset of iNaturalist data from GBIF.
Caltech-UCSD Birds 200 (CUB-200) is an image dataset with photos of 200 bird species (mostly North American), including species labels, bounding boxes, and coarse segmentation masks.
Images of approximately 120 flower species, with between 40 and 250 images of each.
37,322 images of 50 animal classes, with pre-extracted feature representations for each image.
Around 80 labeled examples each from around 25 individual chimps, with individual identifications.
Around 6k handheld-camera images and around 12k bounding boxes for 28 species.
Around 20k images of 120 dog breeds, with both class labels and bounding boxes. Conservation-related? Not exactly, but let’s face it, lots of us work on this kind of data because we like looking at pictures of animals.
Around 7500 images of pets in 37 classes, with class labels, bounding boxes, and segmentation masks. Again, maybe not squarely related to conservation or biology, but finding furry things with machine learning is finding furry things with machine learning, right?
Around 48k images of 400 species of birds, with gender and age labels in many cases.
Around 30k trees, imaged from aerial and street views, with location and species information.
~1k images of fish, with bounding boxes
100k camera trap images from California
~350 images labeled as ~50 individual cows
~200 camera trap images from Denison University Biological Reserve (unlabeled)
~35k camera trap images from MammalWeb, with species labels
~163k bounding boxes on ~35k images of fish and people on fishing vessels
Competitions are a great way to get started doing machine learning for environmental science, and each comes with a data set. Here are a few competitions that involve wildlife…
iNaturalist (handheld photos of animals)
iWildCam (camera trap images)
Amur Tiger Re-identification in the Wild (individual ID for tigers from video)
Whale Detection Challenge (bioacoustics)
NOAA Fisheries Steller Sea Lion Population Count (aerial images)
NOAA Right Whale Recognition (aerial images)
Humpback Whale Identification (individual ID from fluke photos)
Nature Conservancy Fisheries Monitoring (species ID from ship-deck photos)
ImageCLEF coral (coral localization and identification in images)
BirdCLEF (bird identification in audio recordings)
PlantCLEF (plant identification in images)
GeoLifeCLEF (species distribution estimation)
SeaCLEF (marine animal identification in images and video)
Snake Species ID Challenge (with ~100k images of ~4k snake species)
NIPS4B Multilabel Bird Species Classification (from audio recordings)
…that at least touch on environmental science.