There are lots of labeled data sets relevant to conservation that are not on LILA, of course, and rather than copying them all to LILA, we’ll use this post to track other data sets we know about. This is intended to track data sets that are nearly “machine-learning-ready”, i.e. an interesting set of labels that’s more or less attached to an interesting set of images/documents/etc. We are not tracking, for example, large repositories of unlabeled satellite data. We are also not tracking private data repositories; roughly, if access requires anything more than filling out a form that’s almost auto-approved, it doesn’t go on this list.
If you know of data sets not on this list, or if you own one of these data sets but can no longer maintain it and would like to transfer it over to LILA, email us!
The iNaturalist competition provides around 500k labeled handheld-camera photos of around 8k species, varying a bit from year to year. Data originate from iNaturalist, a citizen-science platform for wildlife observation. You can download a much larger – albeit much less curated – subset of iNaturalist data from GBIF.
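Bulk iNaturalist downloads from GBIF arrive as Darwin Core archives, where image URLs live in a tab-separated `multimedia.txt` keyed by `gbifID`. A minimal sketch of pulling those URLs out, assuming those column names (the tiny inline sample here is synthetic, not real GBIF data):

```python
import csv
import io

# Synthetic stand-in for the tab-separated multimedia.txt found in a
# GBIF Darwin Core archive; real exports have many more columns.
sample = (
    "gbifID\ttype\tidentifier\n"
    "1001\tStillImage\thttps://example.org/photo1.jpg\n"
    "1002\tStillImage\thttps://example.org/photo2.jpg\n"
)

def image_urls(handle):
    """Yield (gbifID, URL) pairs for still images in a multimedia table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row.get("type") == "StillImage":
            yield row["gbifID"], row["identifier"]

urls = dict(image_urls(io.StringIO(sample)))
```

In practice you would open the real `multimedia.txt` from the unzipped archive instead of the in-memory sample.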
Caltech-UCSD Birds 200 (CUB-200) is an image dataset with photos of 200 bird species (mostly North American), including species labels, bounding boxes, and coarse segmentation masks.
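CUB-200's annotations ship as whitespace-separated text files (in the CUB-200-2011 release, `images.txt` maps an image id to a relative path, and `bounding_boxes.txt` maps the same id to `x y width height`). A minimal sketch of joining the two, using short synthetic stand-ins for the files:

```python
import io

# Synthetic two-image stand-ins for CUB-200-2011's metadata files.
images_txt = (
    "1 001.Black_footed_Albatross/img1.jpg\n"
    "2 001.Black_footed_Albatross/img2.jpg\n"
)
boxes_txt = (
    "1 60.0 27.0 325.0 304.0\n"
    "2 14.0 112.0 388.0 186.0\n"
)

def load_cub_boxes(images_handle, boxes_handle):
    """Return {image_path: (x, y, w, h)} by joining the two files on image id."""
    paths = dict(line.split() for line in images_handle if line.strip())
    boxes = {}
    for line in boxes_handle:
        if line.strip():
            img_id, *coords = line.split()
            boxes[paths[img_id]] = tuple(float(c) for c in coords)
    return boxes

annotations = load_cub_boxes(io.StringIO(images_txt), io.StringIO(boxes_txt))
```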
Images of approximately 120 flower species, with between 40 and 250 images of each.
37,322 images of 50 animal classes with pre-extracted feature representations for each image.
Around 80 labeled examples each from around 25 individual chimps, with individual identifications.
Around 6k handheld-camera images and around 12k bounding boxes for 28 species.
Around 20k images of 120 dog breeds, with both class labels and bounding boxes. Conservation-related? Not exactly, but let’s face it, lots of us work on this kind of data because we like looking at pictures of animals.
Around 7500 images of pets in 37 classes, with class labels, bounding boxes, and segmentation masks. Again, maybe not squarely related to conservation or biology, but finding furry things with machine learning is finding furry things with machine learning, right?
Around 48k images of 400 species of birds, with gender and age labels in many cases.
~1k images of fish w/bounding boxes
100k camera trap images from California
~350 images labeled as ~50 individual cows
~200 camera trap images from Denison University Biological Reserve (unlabeled)
~35k camera trap images from MammalWeb, with species labels
~163k bounding boxes on ~35k images of fish and people on fishing vessels
800 images of fish in 12 classes
>2k images containing >15k annotated elephants
~200 images of bees with keypoint annotations
~76k thermal images of elephants, humans, and goats
~40k images with a mix of classification, segmentation, and counting labels
~3k bounding box annotations on RGB, hyperspectral, and lidar survey data
Around 30k trees, imaged from aerial and street views, with location and species information.
>500k Sentinel images with patch-level land cover labels
>27k Sentinel images with patch-level land cover labels
Segmentation labels and taxonomic identifiers for garbage.
Maintained by the CIMRS Bioacoustics Group
Maintained by Woods Hole
Maintained by OrcaSound
Maintained by the Kitzes Lab
The following table was curated by the Kitzes Lab at the University of Pittsburgh.
|Name|Minutes of audio|File length|Number of recordings|Number of species|Type|Label type: time/freq|Label type: classes|
|---|---|---|---|---|---|---|---|
|Animal Sound Archive| | | | |foreground species|entire file|species|
|BirdVox-70k|~3600|~10hr|6|NA|NFCs in context|time & frequency|presence of any bird call, not labeled to species|
|BirdVox-DCASE-20k|3333.3|10s|20000| |NFC|entire file|presence of any bird call, not labeled to species|
|CLO-43SD|at maximum: ~90|<1s|5428|43|single NFCs|entire file|species|
|CLO-WTSP|at maximum: ~278|<1s|16703|1|single NFCs|entire file|WTSP present/other species present/no NFCs present|
|CLO-SWTH|at maximum: ~298|<1s|179111|1|single NFCs|entire file|SWTH present/other species present/no NFCs present|
|BirdDB (paper)|689| |~153| |foreground species|time|species/call type for single species in each recording|
|Freefield 1010|1281.7|10s|7690|NA|soundscape|entire file|presence of any bird call, not labeled to species|
|NIPS4B|~24|various|687|87 (classes)|unsure; foreground species or soundscape|entire file|species (call and song are separate classes in some species)|
|BirdCLEF 2016+2017/2018 Xeno-Canto (training)|guess: 14,000|various|36,496|1500|foreground species|entire file|species|
|BirdCLEF 2016+2017/2018 Peru/Colombia (validation)|20|4min|5| |soundscape|time|species|
|BirdCLEF 2019 (training)| |various|~50,000|659|foreground species|entire file|species|
|BirdCLEF 2019 (validation)|~4345|to be split into 5s segments| | |soundscape|5s increments|all species present in 5s segment|
|warblrb10k|1333.3|10s|8000|NA|soundscape|entire file|presence of any bird call, not labeled to species|
|MLSP 2013|107.5|10s|645|19|soundscape|entire file|which species are in the file|
|Macaulay Master Set|guess: 2,000|various|4928|737|foreground species|entire file|foreground species|
|Xeno Canto ~600 US birds|guess: 18,000|various|77173|>600|foreground species|entire file|foreground species|
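Several of the datasets above (e.g. the BirdCLEF 2019 validation set) attach labels to fixed 5s increments of longer soundscape recordings. A minimal sketch of computing those window boundaries from a recording's duration (pure arithmetic, no audio library assumed):

```python
def fixed_windows(duration_s, window_s=5.0):
    """Return (start, end) times covering duration_s in window_s chunks;
    a shorter final window covers any remainder."""
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        start += window_s
    return windows

# A 12-second recording yields two full 5s windows plus a 2s remainder.
segments = fixed_windows(12.0)
```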
Competitions are a great way to get started doing machine learning for environmental science, and each comes with a data set. Here are a few competitions…
iNaturalist (handheld photos of animals)
iWildCam (camera trap images)
Amur Tiger Re-identification in the Wild (individual ID for tigers from video)
Whale Detection Challenge (bioacoustics)
NOAA Fisheries Steller Sea Lion Population Count (aerial images)
NOAA Right Whale Recognition (aerial images)
Humpback Whale Identification (individual ID from fluke photos)
Nature Conservancy Fisheries Monitoring (species ID from ship-deck photos)
ImageCLEF coral (coral localization and identification in images)
BirdCLEF (bird identification in audio recordings)
PlantCLEF (plant identification in images)
GeoLifeCLEF (species distribution estimation)
SeaCLEF (marine animal identification in images and video)
Snake Species ID Challenge (w/ ~100k images of ~4k snake species)
NIPS4B Multilabel Bird Species Classification (from audio recordings)
…that at least touch on environmental science.
Posted by Dan Morris.