Accessing images outside of giant zipfiles

Table of contents

  1. What’s wrong with giant zipfiles?
  2. Downloading a whole data set (without the giant zipfiles)
  3. Downloading just one folder from a data set
  4. Downloading a list of files from a data set
  5. Downloading with a GUI (using Azure Storage Explorer)
  6. Mounting a container/bucket with rclone

What’s wrong with giant zipfiles?

Many of our data sets are posted as big zipfiles, which are convenient in that they can be downloaded in your browser with no special tools. However, there are some major issues with giant zipfiles… most notably, you have to unzip them. This can take almost as long as downloading in some cases, and it means you have to have twice as much storage available as the data set really requires. Furthermore, if you only want part of a data set, e.g. one folder, or one species, you would still have to download and unzip the whole zipfile.

So we’re also providing unzipped copies of many of our data sets, to facilitate simpler or smaller downloads.

This page will give you a few ways to download images without dealing with giant zipfiles. Most datasets on LILA are available on Google Cloud Platform (GCP), Amazon Web Services (AWS), and Azure, and each cloud has a command-line tool for downloading files or folders.

  • For downloading data from GCP, you want gsutil.
  • For downloading data from AWS, you want aws s3.
  • For downloading data from Azure, you want AzCopy.
Below, we’ll provide examples of using gsutil/aws/azcopy to download whole data sets or single folders. Even if you are going to download zipfiles, if possible, we recommend using gsutil/aws/azcopy rather than your browser, for a more stable download experience.
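
For example, if a dataset page gives you a direct link to a zipfile, you can hand that same URL to azcopy instead of your browser. The URL below is just a placeholder; substitute the actual zipfile link from the dataset page:

  azcopy cp "https://lilablobssc.blob.core.windows.net/some-dataset/some-dataset.zip" "c:\blah\some-dataset.zip"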

These approaches will depend on having a URL for the base folder associated with the data set you want to access. Each dataset’s page lists the base folder for each cloud, and we have posted a list of base URLs for all the datasets that have unzipped images here; we’ll refer back to that list later.

Downloading a whole data set (without the giant zipfiles)

Let’s experiment with the Missouri Camera Traps data set. The data set page gives me GCP, AWS, and Azure URLs for this dataset:

  • GCP: gs://public-datasets-lila/missouricameratraps/images
  • AWS: s3://us-west-2.opendata.source.coop/agentmorris/lila-wildlife/missouricameratraps/images
  • Azure: https://lilablobssc.blob.core.windows.net/missouricameratraps/images
To download the entire data set to the folder c:\blah, I can do this with gsutil:

  gsutil -m cp -r "gs://public-datasets-lila/missouricameratraps/images" "c:\blah"

…or this with aws s3 cp:

  aws s3 cp --recursive "s3://us-west-2.opendata.source.coop/agentmorris/lila-wildlife/missouricameratraps/images" "c:\blah"

…or this with azcopy:

  azcopy cp "https://lilablobssc.blob.core.windows.net/missouricameratraps/images" "c:\blah" --recursive

Downloading just one folder from a data set

If I look at the metadata file for this data set, I see that there’s a folder called Set1/1.02-Agouti/SEQ75520 containing just one sequence of camera trap images. What if I want to download just that one folder to “c:\blah”? I can just append that folder name to the base URL for this dataset, like this with gsutil:

  gsutil -m cp -r "gs://public-datasets-lila/missouricameratraps/images/Set1/1.02-Agouti/SEQ75520" "c:\blah"

…or like this with aws s3:

  aws s3 cp --recursive "s3://us-west-2.opendata.source.coop/agentmorris/lila-wildlife/missouricameratraps/images/Set1/1.02-Agouti/SEQ75520" "c:\blah"

…or like this with azcopy:

  azcopy cp "https://lilablobssc.blob.core.windows.net/missouricameratraps/images/Set1/1.02-Agouti/SEQ75520" "c:\blah" --recursive

Downloading a list of files from a data set

If you want to download, e.g., all the images for a particular species from a data set, this is supported too, but it requires a little code. We have an example of how to do this here.
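
If it helps as a starting point, here’s a minimal sketch of that approach in Python. It assumes the dataset’s metadata follows the COCO Camera Traps convention used by most LILA datasets, that you’ve already downloaded the metadata file to missouricameratraps.json, and that you want every image annotated with a hypothetical category named “agouti”; the metadata filename, category name, and output folder below are all illustrative:

  # Minimal sketch: find all images for one species in a COCO Camera Traps
  # .json file, then download them over HTTP from the Azure base URL.
  # The metadata filename, category name, and output folder are illustrative.
  import json
  import os
  import urllib.request

  metadata_file = 'missouricameratraps.json'
  base_url = 'https://lilablobssc.blob.core.windows.net/missouricameratraps/images'
  output_dir = 'c:/blah'
  species = 'agouti'  # hypothetical category name

  with open(metadata_file, 'r') as f:
      d = json.load(f)

  # Map the species name to its category ID(s), then collect matching image IDs
  category_ids = {c['id'] for c in d['categories'] if c['name'] == species}
  image_ids = {a['image_id'] for a in d['annotations'] if a['category_id'] in category_ids}

  # Download each matching image, preserving the relative folder structure
  for im in d['images']:
      if im['id'] not in image_ids:
          continue
      rel_path = im['file_name'].replace('\\', '/')
      url = base_url + '/' + rel_path
      target = os.path.join(output_dir, rel_path)
      os.makedirs(os.path.dirname(target), exist_ok=True)
      urllib.request.urlretrieve(url, target)

For large subsets you’ll want to parallelize the downloads (or write the matching paths to a list and hand them to one of the command-line tools above), but the overall structure is the same.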

Mounting a container/bucket with rclone

If you are doing machine learning work with images that are available unzipped on LILA, you may not need to download them at all; instead, you can mount one or more storage containers on your compute nodes so they appear like local drives. If at all possible, set up your compute where the data is stored:

  • In the us-east4 or us-west1 GCP regions (if you’re working on GCP)
  • In the us-west-2 AWS region (if you’re working on AWS)
  • In the South Central US Azure region (if you’re working on Azure)

…although the steps below will work no matter where you set up your compute.

We recommend mounting a container with the open-source tool rclone, specifically the rclone mount command. The instructions here will assume that you are working on Linux, but rclone runs just about anywhere, and can mount just about anything. So if you’re on Windows, for example, only the paths in these instructions will change.

Regardless of whether you’re going to mount from GCP, AWS, or Azure…

First, download rclone using the instructions here. On Linux, this should be as easy as:

  curl https://rclone.org/install.sh | sudo bash

To mount the LILA GCP bucket…

  • Create the directory that you’d like to use as a mount point… this is where things are a little different on Windows, but on Linux, let’s just do:
      mkdir -p ~/lilatest-gcp
  • Run “rclone config”, then press “n” (for “new remote”) and choose a name. We won’t actually specify a bucket here; this remote will be for all public GCP data, so let’s call it “gcp-public”.
  • rclone will ask what kind of storage you want to access; choose “Google Cloud Storage” (not “Google Drive”).
  • Leave “client_id”, “client_secret”, “project_number”, “user_project”, and “service_account_file” blank.
  • When you get to the “anonymous” option, choose “true”.
  • Leave everything else blank; just keep pressing enter until you’ve passed the last question.
  • Press “q” to quit. You still haven’t specified the bucket name, but we’ll get there.
  • Mount the dataset by running:
      rclone mount gcp-public:public-datasets-lila ~/lilatest-gcp
  • The shell will appear to hang; that’s the expected behavior. If you now run this in another shell:
      ls ~/lilatest-gcp
    …you should see all the LILA datasets. You can use this folder just like it’s a folder on your local machine.
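
For reference, those interactive steps just write a remote definition into rclone’s config file (typically ~/.config/rclone/rclone.conf on Linux), so if you’d rather skip the wizard, you can add the equivalent entry yourself. Assuming the remote name “gcp-public” used above, the entry should look roughly like this:

  [gcp-public]
  type = google cloud storage
  anonymous = true

When you’re done, you can stop the mount by pressing Ctrl-C in the shell where rclone mount is running, or (on Linux) by running “fusermount -u ~/lilatest-gcp” from another shell.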