Chesapeake Land Cover

Overview

This dataset contains high-resolution aerial imagery from the USDA NAIP program [1], high-resolution land cover labels from the Chesapeake Conservancy [2], and low-resolution land cover labels from the USGS NLCD 2011 dataset [3], all formatted to accelerate machine learning research into land cover mapping. The Chesapeake Conservancy spent over 10 months and $1.3 million creating a consistent six-class land cover dataset covering the Chesapeake Bay watershed. While the purpose of the Chesapeake Conservancy's mapping effort was to create land cover data for use in conservation efforts, the same data can be used to train machine learning models that can be applied over even wider areas.

The organization of this dataset (detailed below) lets users easily test questions related to this problem of geographic generalization, i.e. how to train machine learning models that can be applied over wider areas than they were trained on. For example, this dataset can be used to directly estimate how well a model trained on data from Maryland generalizes to the remainder of the Chesapeake Bay watershed.

Python code for training and testing deep learning models (Keras/TensorFlow based) can be found in the accompanying GitHub repository:

https://github.com/calebrob6/land-cover

Further developments in models and related tools can be found at:

https://github.com/Microsoft/landcover

Papers using a superset of this data include [4, 5]. Paper [6] uses data from the same sources.

Citation

If you use this dataset, please cite the associated manuscript:

Robinson C, Hou L, Malkin K, Soobitsky R, Czawlytko J, Dilkina B, Jojic N. Large Scale High-Resolution Land Cover Mapping with Multi-Resolution Data. Proceedings of the 2019 Conference on Computer Vision and Pattern Recognition (CVPR 2019).

Dataset organization

Tiles

At the highest level, this dataset is organized by tiles. A tile is a spatial area measuring roughly 6km x 7.5km (with definitions that roughly match up with USGS quarter quadrangles). Each tile comes with three corresponding GeoTIFFs: NAIP imagery, high-resolution land cover labels, and low-resolution land cover labels. These GeoTIFFs are all aligned and share a 1m spatial resolution. The low-resolution NLCD labels (natively at a 30m spatial resolution) have been reprojected to 1m with nearest-neighbor upsampling, while the NAIP imagery and high-resolution land cover labels are natively aligned at 1m.
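The effect of nearest-neighbor upsampling can be illustrated with a minimal NumPy sketch. This is not the pipeline used to build the dataset (which involves reprojection of georeferenced rasters); the function name `upsample_nearest` and the toy 2x2 grid are hypothetical, and the factor of 30 only reflects the 30m-to-1m ratio under the assumption that the two grids are already aligned.

```python
import numpy as np

def upsample_nearest(labels: np.ndarray, factor: int = 30) -> np.ndarray:
    """Repeat every pixel `factor` times along both axes (nearest-neighbor)."""
    return np.repeat(np.repeat(labels, factor, axis=0), factor, axis=1)

# Toy 2x2 grid of class codes standing in for four 30m NLCD pixels.
coarse = np.array([[11, 41],
                   [21, 82]], dtype=np.uint8)

# Each 30m pixel becomes a 30x30 block of identical 1m pixels.
fine = upsample_nearest(coarse)
print(fine.shape)
```

Because every 1m pixel simply copies the value of the 30m pixel that contains it, the upsampled labels carry no more spatial detail than the original NLCD layer.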

There are 300 total “tiles”, 50 sampled uniformly from each of the following (state, year) pairs:

  • Delaware 2013
  • New York 2013
  • Maryland 2013
  • Pennsylvania 2013
  • West Virginia 2014
  • Virginia 2014

The 50 tiles from each (state, year) pair are further split into 25 “train tiles”, 5 “validation tiles”, and 20 “test tiles”. The filenames for each split are listed in the accompanying CSV files. For example, the filenames associated with the 50 tiles from West Virginia can be found in the following CSVs:

  • wv_1m_2014_train_tiles.csv
  • wv_1m_2014_val_tiles.csv
  • wv_1m_2014_test_tiles.csv
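The naming pattern and split sizes above can be sketched in a few lines of Python. The helper `tile_csv_names` is hypothetical, and only the `wv` prefix is confirmed by the examples above; the prefixes used for the other states are not stated here and should be checked against the downloaded files.

```python
# Number of tiles per split for each (state, year) pair, as described above.
TILES_PER_SPLIT = {"train": 25, "val": 5, "test": 20}
NUM_STATE_YEAR_PAIRS = 6

def tile_csv_names(state: str, year: int) -> dict:
    """Build the tile CSV filename for each split (hypothetical helper)."""
    return {
        split: f"{state}_1m_{year}_{split}_tiles.csv"
        for split in TILES_PER_SPLIT
    }

names = tile_csv_names("wv", 2014)
tiles_per_pair = sum(TILES_PER_SPLIT.values())
total_tiles = tiles_per_pair * NUM_STATE_YEAR_PAIRS

print(names["train"])
print(tiles_per_pair, total_tiles)
```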

Patches

This dataset also includes 500 pre-generated patches from each training and validation tile. Here, a patch is defined as a random 240×240 (meter) crop from the tile’s extent. This results in 12,500 training patches and 2,500 validation patches per (state, year) pair. The filenames for each patch (and accompanying metadata) are also listed in a CSV for each (state, year). For example, the training and validation patches for West Virginia can be found in the following CSVs:

  • wv_1m_2014_train_patches.csv
  • wv_1m_2014_val_patches.csv
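The patch counts follow directly from the tile splits. The short calculation below is only a sanity check of that arithmetic; the variable names are illustrative, and the pixel count assumes the 1m resolution stated for the tiles.

```python
PATCHES_PER_TILE = 500
TRAIN_TILES_PER_PAIR = 25
VAL_TILES_PER_PAIR = 5
NUM_STATE_YEAR_PAIRS = 6
PATCH_SIZE_M = 240  # patch side length in meters (one pixel per meter)

train_per_pair = TRAIN_TILES_PER_PAIR * PATCHES_PER_TILE
val_per_pair = VAL_TILES_PER_PAIR * PATCHES_PER_TILE
total_patches = (train_per_pair + val_per_pair) * NUM_STATE_YEAR_PAIRS
pixels_per_patch = PATCH_SIZE_M * PATCH_SIZE_M

print(train_per_pair, val_per_pair, total_patches, pixels_per_patch)
```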

Furthermore, the spatial extent of each patch can be found in a similarly named GeoJSON file:

  • wv_1m_2014_train_patches.geojson
  • wv_1m_2014_val_patches.geojson

Each shape in the GeoJSON has a patch_id key that can be matched back to a row in the CSV mentioned above.
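A minimal sketch of matching GeoJSON shapes back to CSV rows, using only the Python standard library and small in-memory stand-ins for the real files. Several details are assumptions rather than documented facts: that the CSV has a column named `patch_id`, that the GeoJSON stores `patch_id` under each feature's `properties`, and the `patch_fn` column and toy coordinates, which are purely illustrative.

```python
import csv
import io
import json

# Toy stand-ins for the real files. Column names and the location of
# patch_id under "properties" are assumptions, not confirmed by the dataset.
csv_text = "patch_id,patch_fn\n0,example_patch_0\n1,example_patch_1\n"
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"patch_id": 1},
        "geometry": {
            "type": "Polygon",
            # Illustrative coordinates only, not real patch extents.
            "coordinates": [[[0, 0], [240, 0], [240, 240], [0, 240], [0, 0]]],
        },
    }],
})

# Index CSV rows by patch_id for constant-time lookup.
rows_by_id = {
    int(row["patch_id"]): row
    for row in csv.DictReader(io.StringIO(csv_text))
}

# Pair each GeoJSON shape with its CSV row, skipping unmatched shapes.
matched = []
for feature in json.loads(geojson_text)["features"]:
    patch_id = int(feature["properties"]["patch_id"])
    row = rows_by_id.get(patch_id)
    if row is not None:
        matched.append((row, feature["geometry"]))

print(len(matched))
```

With the real files, the two `io.StringIO` and `json.dumps` stand-ins would be replaced by opening the matching `*_patches.csv` and `*_patches.geojson` files.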

Each patch is a (6x240x240) tensor. The channels are described as follows:

  • Channels 1-4 contain the R, G, B, and NIR bands respectively of the NAIP imagery. Values are uint8s (i.e. in the range [0, 255]).
  • Channel 5 contains the high-resolution land cover labels:
    1 = water
    2 = tree canopy / forest
    3 = low vegetation / field
    4 = barren land
    5 = impervious (other)
    6 = impervious (road)
    15 = no data
  • Channel 6 contains the low-resolution NLCD labels. Values match those described here. The values 0 and 255 indicate that no data is available.
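The channel layout above can be sketched with NumPy on a synthetic patch. This is an illustration under stated assumptions, not the loading code from the accompanying repositories: the on-disk patch format is not described here, so the example builds a 6x240x240 array in memory, assumes the whole tensor is stored as uint8, and uses a hypothetical helper `split_patch`. Note that the 1-based channel numbers above become 0-based indices in the code.

```python
import numpy as np

# High-resolution class codes, as listed above.
HR_CLASSES = {
    1: "water",
    2: "tree canopy / forest",
    3: "low vegetation / field",
    4: "barren land",
    5: "impervious (other)",
    6: "impervious (road)",
    15: "no data",
}

def split_patch(patch: np.ndarray):
    """Split a (6, 240, 240) patch into NAIP, high-res, and NLCD arrays."""
    if patch.shape != (6, 240, 240):
        raise ValueError(f"unexpected patch shape: {patch.shape}")
    return patch[0:4], patch[4], patch[5]

# Synthetic patch standing in for a real one (assumed uint8 throughout).
rng = np.random.default_rng(0)
patch = np.zeros((6, 240, 240), dtype=np.uint8)
patch[0:4] = rng.integers(0, 256, size=(4, 240, 240), dtype=np.uint8)
patch[4] = 2           # label everything as tree canopy / forest
patch[4, :10, :] = 15  # mark a strip of rows as "no data"
patch[5] = 41          # an arbitrary NLCD-style code for illustration
patch[5, :, :5] = 255  # mark a strip of columns as NLCD "no data"

naip, hr, nlcd = split_patch(patch)

# Masks of pixels that carry usable labels.
hr_valid = hr != 15
nlcd_valid = (nlcd != 0) & (nlcd != 255)

# Per-class pixel counts over valid high-resolution labels.
codes, pixel_counts = np.unique(hr[hr_valid], return_counts=True)
counts = {HR_CLASSES[int(c)]: int(n) for c, n in zip(codes, pixel_counts)}
print(counts)
```

Masking the no-data values before computing losses or metrics avoids treating missing labels as a real class.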

Download links

Data (small) (88GB)

Having trouble downloading? Check out our FAQ.

Contact

For questions about this dataset, contact calebrob6+lcmcvpr2019@gmail.com.

References

  1. United States Department of Agriculture. National Agriculture Imagery Program. Online.
  2. Chesapeake Conservancy. Land cover data project. Online.
  3. Homer C, Dewitz J, Yang L, Jin S, Danielson P, Xian G, Coulston J, Herold N, Wickham J, Megown K. Completion of the 2011 National Land Cover Database for the conterminous United States–representing a decade of land cover change information. Photogrammetric Engineering & Remote Sensing. May 2015.
  4. Malkin K, Robinson C, Hou L, Soobitsky R, Czawlytko J, Samaras D, Saltz J, Joppa L, Jojic N. Label super-resolution networks. International Conference on Learning Representations (ICLR). 2019.
  5. Robinson C, Hou L, Malkin K, Soobitsky R, Czawlytko J, Dilkina B, Jojic N. Large Scale High-Resolution Land Cover Mapping with Multi-Resolution Data. Computer Vision and Pattern Recognition (CVPR). 2019.
  6. Robinson C, Ortiz A, Malkin K, Elias B, Peng A, Morris D, Dilkina B, Jojic N. Human-Machine Collaboration for Fast Land Cover Mapping. arXiv:1906.04176. June 2019.