This dataset contains high-resolution aerial imagery from the USDA NAIP program, high-resolution land cover labels from the Chesapeake Conservancy, low-resolution land cover labels from the USGS NLCD 2011 dataset, low-resolution multi-spectral imagery from Landsat 8, and high-resolution building footprint masks from Microsoft Bing, formatted to accelerate machine learning research into land cover mapping. The Chesapeake Conservancy spent over 10 months and $1.3 million creating a consistent six-class land cover dataset covering the Chesapeake Bay watershed. While the purpose of the mapping effort by the Chesapeake Conservancy was to create land cover data to be used in conservation efforts, the same data can be used to train machine learning models that can be applied over even wider areas.
The organization of this dataset (detailed below) lets users directly test questions of geographic generalization, i.e. how well machine learning models trained in one region perform when applied over wider areas. For example, this dataset can be used to estimate how well a model trained on data from Maryland generalizes over the remainder of the Chesapeake Bay watershed.
Python code for training and testing deep learning models (Keras/TensorFlow based) can be found in the accompanying GitHub repository:
Further developments in models and related tools can be found at:
If you use this data set, please cite the associated manuscript:
Robinson C, Hou L, Malkin K, Soobitsky R, Czawlytko J, Dilkina B, Jojic N. Large Scale High-Resolution Land Cover Mapping with Multi-Resolution Data. Proceedings of the 2019 Conference on Computer Vision and Pattern Recognition (CVPR 2019).
At the highest level this dataset is organized by tiles. A tile is a spatial area measuring roughly 6km x 7.5km (with definitions that roughly match up with USGS quarter quadrangles). Each tile comes with seven corresponding GeoTIFFs:
- NAIP 2013/2014 imagery ("_naip-new.tif" suffix)
- NAIP 2011/2012 imagery ("_naip-old.tif" suffix)
- Chesapeake Conservancy land cover labels ("_lc.tif" suffix)
- NLCD 2011 labels ("_nlcd.tif" suffix)
- Landsat 8 leaf-on composite ("_landsat-leaf-on.tif" suffix)
- Landsat 8 leaf-off composite ("_landsat-leaf-off.tif" suffix)
- Building footprint mask ("_buildings.tif" suffix)
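As a minimal sketch of the naming scheme above, the helper below builds the seven expected file names for one tile from a shared prefix. Only the suffixes come from the list; the function name `tile_files`, the layer keys, and the prefix `example_tile` are hypothetical and chosen for illustration.

```python
# Suffixes are taken from the list above; the dictionary keys are illustrative names.
TILE_SUFFIXES = {
    "naip_new": "_naip-new.tif",
    "naip_old": "_naip-old.tif",
    "land_cover": "_lc.tif",
    "nlcd": "_nlcd.tif",
    "landsat_leaf_on": "_landsat-leaf-on.tif",
    "landsat_leaf_off": "_landsat-leaf-off.tif",
    "buildings": "_buildings.tif",
}


def tile_files(prefix):
    """Return a dict mapping layer name -> file name for one tile prefix."""
    return {name: prefix + suffix for name, suffix in TILE_SUFFIXES.items()}


# "example_tile" is a made-up prefix, not a real tile name from the dataset.
for layer, filename in tile_files("example_tile").items():
    print(f"{layer:18s} {filename}")
```

Keeping the suffixes in one table makes it easy to check that all seven layers exist for a tile before loading any of them.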
These GeoTIFFs are all aligned and at a 1m spatial resolution. Here, the low-resolution NLCD labels (natively at a 30m spatial resolution) have been reprojected to 1m with nearest-neighbor upsampling, while the NAIP and high-resolution land cover labels are natively aligned at 1m.
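To illustrate what nearest-neighbor upsampling does to a coarse label grid, here is a toy sketch in plain Python. It only repeats cells by an integer factor and ignores map projection and grid alignment, so it is not the processing pipeline used to build the dataset; the function name and the sample values are assumptions.

```python
def upsample_nearest(grid, factor):
    """Repeat every cell of a 2D list `factor` times along both axes."""
    out = []
    for row in grid:
        # Stretch the row horizontally, then emit `factor` copies of it.
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Toy 2x2 grid of class codes (illustrative values, not real data).
coarse = [[11, 41],
          [21, 82]]

# A factor of 3 keeps the printout small; going from 30m to 1m would use 30.
fine = upsample_nearest(coarse, 3)
for row in fine:
    print(row)
```

Because values are copied rather than interpolated, no new class codes are introduced, which is why this kind of resampling is suitable for categorical labels.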
The Landsat 8 leaf-on and leaf-off composites are created from the median of the non-cloudy T1 surface reflectance pixels acquired between April 1 and September 30 (leaf-on) and between October 1 and March 31 (leaf-off), respectively, in the years 2013-2017. The final composites are upsampled to 1m spatial resolution. Finally, the building footprints have been rasterized to a 1m resolution from their native polygon format, also with nearest-neighbor sampling.
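The compositing rule can be sketched for a single pixel as follows. This is a simplified illustration under stated assumptions, not the code used to build the dataset: the tuple layout `(date, value, is_cloudy)`, the function names, and the reflectance values are all hypothetical, and the year filter is applied to the calendar year of each acquisition.

```python
from datetime import date
from statistics import median


def season(d):
    """Classify a date as 'leaf-on' (Apr 1-Sep 30) or 'leaf-off' (Oct 1-Mar 31)."""
    return "leaf-on" if 4 <= d.month <= 9 else "leaf-off"


def composite(observations, which):
    """Median of the cloud-free 2013-2017 values for one pixel and one season.

    `observations` is a list of (date, value, is_cloudy) tuples.
    Returns None when no usable observation exists.
    """
    values = [
        value
        for d, value, is_cloudy in observations
        if not is_cloudy and 2013 <= d.year <= 2017 and season(d) == which
    ]
    return median(values) if values else None


# Made-up observations for one pixel and one band.
obs = [
    (date(2014, 6, 1), 0.10, False),
    (date(2015, 7, 15), 0.30, False),
    (date(2016, 8, 2), 0.20, False),
    (date(2016, 8, 18), 0.90, True),   # cloudy, so it is ignored
    (date(2015, 1, 10), 0.05, False),  # winter acquisition, leaf-off
]

print(composite(obs, "leaf-on"))
print(composite(obs, "leaf-off"))
```

Using the median rather than the mean keeps a single unmasked cloud or shadow from skewing the composite value.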
There are 732 total tiles, 125 sampled uniformly from each of the following (state, year) pairs:
- Delaware 2013 (only 107 tiles)
- New York 2013
- Maryland 2013
- Pennsylvania 2013
- West Virginia 2014
- Virginia 2014
The ~125 tiles from each (state, year) pair are further split into 100 "train tiles" (except for Delaware, which has 82 train tiles), 5 "validation tiles", and 20 "test tiles". The filenames for each split are listed in the accompanying CSV files. For example, the filenames associated with the 125 tiles from West Virginia can be found in the following CSVs:
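The split sizes above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in this document; the names `PAIRS` and `split_sizes` are illustrative.

```python
PAIRS = [
    "Delaware 2013", "New York 2013", "Maryland 2013",
    "Pennsylvania 2013", "West Virginia 2014", "Virginia 2014",
]


def split_sizes(pair):
    """Tile counts per split; Delaware has fewer train tiles than the others."""
    train = 82 if pair == "Delaware 2013" else 100
    return {"train": train, "val": 5, "test": 20}


totals = {pair: sum(split_sizes(pair).values()) for pair in PAIRS}
grand_total = sum(totals.values())

for pair, n in totals.items():
    print(f"{pair:20s} {n}")
print("all tiles:", grand_total)
```

The per-pair totals come to 107 for Delaware and 125 for each other pair, which sums to the 732 tiles stated earlier.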
This dataset also includes 500 pre-generated patches from each training and validation tile. Here, a patch is defined as a random 256×256 (meter) crop from the tile’s extent. This results in 50,000 training patches and 2,500 validation patches per (state, year) pair. The filenames for each patch (and accompanying metadata) are also listed in a CSV for each (state, year). For example, the training and validation patches for West Virginia can be found in the following CSVs:
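For readers who want to draw additional patches, a random crop can be sketched as below. This is an assumption-laden illustration rather than the procedure used to generate the shipped patches: the tile size of 7500 by 6000 pixels is only an approximation of "roughly 6km x 7.5km" at 1m resolution, the orientation is assumed, and the function name and seed are arbitrary.

```python
import random

PATCH = 256  # patch side length in pixels (1 pixel = 1 m)


def random_patch_window(tile_height, tile_width, rng):
    """Pick a PATCH x PATCH window (top, left, bottom, right) inside the tile."""
    top = rng.randrange(tile_height - PATCH + 1)
    left = rng.randrange(tile_width - PATCH + 1)
    return top, left, top + PATCH, left + PATCH


rng = random.Random(0)  # fixed seed so the sketch is repeatable
windows = [random_patch_window(7500, 6000, rng) for _ in range(500)]

# Patch counts for a (state, year) pair with 100 train and 5 validation tiles.
train_patches = 100 * 500
val_patches = 5 * 500
print(len(windows), train_patches, val_patches)
```

Bounding the top-left corner by `tile_size - PATCH` guarantees that every window lies fully inside the tile.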
Furthermore, the spatial extent of each patch can be found in a similarly named GeoJSON file:
Each shape in the GeoJSON has a patch_id key that can be matched back to a row in the CSV mentioned above.
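A minimal sketch of that matching step, using only the standard library, might look like the following. It assumes that `patch_id` is stored under each feature's `properties` and that the CSV has a column with the same name; neither detail is confirmed here, and the sample CSV column `patch_fn`, the file names, and the coordinates are made up.

```python
import csv
import io
import json


def index_patches(csv_text, geojson_text, key="patch_id"):
    """Join CSV rows and GeoJSON geometries on a shared patch identifier."""
    rows = {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}
    joined = {}
    for feature in json.loads(geojson_text)["features"]:
        pid = str(feature["properties"][key])  # CSV values are read as strings
        if pid in rows:
            joined[pid] = (rows[pid], feature["geometry"])
    return joined


# Hypothetical inputs standing in for the real CSV and GeoJSON files.
sample_csv = "patch_id,patch_fn\n0,patch_a\n1,patch_b\n"
sample_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"patch_id": 1},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [256, 0], [256, 256], [0, 256], [0, 0]]]}},
        {"type": "Feature",
         "properties": {"patch_id": 0},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[300, 0], [556, 0], [556, 256], [300, 256], [300, 0]]]}},
    ],
})

joined = index_patches(sample_csv, sample_geojson)
for pid, (row, geometry) in sorted(joined.items()):
    print(pid, row["patch_fn"], geometry["type"])
```

Converting the identifier to a string on the GeoJSON side avoids silent mismatches when one file stores it as a number and the other as text.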
Each patch is a (29x256x256) tensor. The channels are described as follows:
- Channels 1-4 contain the R, G, B, and NIR bands respectively of the NAIP "new" imagery (from 2013/2014 NAIP). Values are uint8s (i.e. in the range [0, 255]).
- Channels 5-8 contain the R, G, B, and NIR bands respectively of the NAIP "old" imagery (from 2011/2012 NAIP).
- Channel 9 contains the high-resolution land cover labels:
  - 1 = water
  - 2 = tree canopy / forest
  - 3 = low vegetation / field
  - 4 = barren land
  - 5 = impervious (other)
  - 6 = impervious (road)
  - 15 = no data
- Channel 10 contains the low-resolution NLCD labels. Values match those described here. The values 0 and 255 indicate that no data is available.
- Channels 11-19 contain the 9 bands of our Landsat 8 surface reflectance leaf-on imagery. Bands are described here. We take B1, B2, B3, B4, B5, B6, B7, B10, B11. Values are float32.
- Channels 20-28 contain the 9 bands of our Landsat 8 surface reflectance leaf-off imagery.
- Channel 29 contains a building footprint mask generated from Bing Building Footprints.
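The channel layout above can be captured as a small lookup table. The sketch below converts the 1-based channel numbers in the list to 0-based Python slices and assumes a channel-first patch; the group names and the `split_channels` helper are hypothetical. It works on any sequence of 29 channels that supports slicing, such as a plain list or a NumPy array.

```python
# 1-based channel ranges from the list above, written as 0-based slices.
CHANNEL_GROUPS = {
    "naip_new": slice(0, 4),           # channels 1-4
    "naip_old": slice(4, 8),           # channels 5-8
    "land_cover": slice(8, 9),         # channel 9
    "nlcd": slice(9, 10),              # channel 10
    "landsat_leaf_on": slice(10, 19),  # channels 11-19
    "landsat_leaf_off": slice(19, 28), # channels 20-28
    "buildings": slice(28, 29),        # channel 29
}

LAND_COVER_CLASSES = {
    1: "water",
    2: "tree canopy / forest",
    3: "low vegetation / field",
    4: "barren land",
    5: "impervious (other)",
    6: "impervious (road)",
    15: "no data",
}

LANDSAT_BANDS = ["B1", "B2", "B3", "B4", "B5", "B6", "B7", "B10", "B11"]


def split_channels(patch):
    """Split a channel-first patch with 29 channels into named groups."""
    if len(patch) != 29:
        raise ValueError(f"expected 29 channels, got {len(patch)}")
    return {name: patch[window] for name, window in CHANNEL_GROUPS.items()}


# Stand-in patch: each "channel" is just its 1-based channel number.
groups = split_channels(list(range(1, 30)))
for name, channels in groups.items():
    print(f"{name:18s} {channels}")
```

With a real patch loaded as an array, `groups["land_cover"]` would hold the label raster, and `LAND_COVER_CLASSES` maps its integer values to class names.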
Contact
For questions about this dataset, contact firstname.lastname@example.org.
The organizations responsible for generating and funding this dataset make no representations of any kind, including, but not limited to, the warranties of merchantability or fitness for a particular use, nor are any such warranties to be implied with respect to the data. Although every effort has been made to ensure the accuracy of information, errors may be reflected in data supplied. The user must be aware of data conditions and bear responsibility for the appropriate use of the information with respect to possible errors, original map scale, collection methodology, currency of data, and other conditions. Credit should always be given to the data source when this data is transferred, altered, or used for analysis.
- United States Department of Agriculture. National Aerial Imagery Program. Online.
- Chesapeake Conservancy. Land cover data project. Online.
- Homer C, Dewitz J, Yang L, Jin S, Danielson P, Xian G, Coulston J, Herold N, Wickham J, Megown K. Completion of the 2011 National Land Cover Database for the conterminous United States – representing a decade of land cover change information. Photogrammetric Engineering & Remote Sensing. May 2015.
- United States Geological Survey. Landsat 8. Online.
- Microsoft. US Building Footprints. Online.
- Malkin K, Robinson C, Hou L, Soobitsky R, Czawlytko J, Samaras D, Saltz J, Joppa L, Jojic N. Label super-resolution networks. International Conference on Learning Representations (ICLR). 2019.
- Robinson C, Hou L, Malkin K, Soobitsky R, Czawlytko J, Dilkina B, Jojic N. Large Scale High-Resolution Land Cover Mapping with Multi-Resolution Data. Computer Vision and Pattern Recognition (CVPR). 2019.
- Robinson C, Ortiz A, Malkin K, Elias B, Peng A, Morris D, Dilkina B, Jojic N. Human-Machine Collaboration for Fast Land Cover Mapping. arXiv:1906.04176, June 2019.