Accessing images outside of giant zipfiles

Table of contents

  1. What’s wrong with giant zipfiles?
  2. Downloading a whole data set (without the giant zipfiles)
  3. Downloading just one folder from a data set
  4. Downloading a list of files from a data set
  5. Downloading files with a graphical tool (Azure Storage Explorer)
  6. Mounting a container with rclone

What’s wrong with giant zipfiles?

Many of our data sets are posted as big zipfiles, which are convenient in that they can be downloaded in your browser with no special tools. However, there are some major issues with giant zipfiles… most notably, you have to unzip them. This can take almost as long as the download itself in some cases, and it means you need twice as much storage as the data set really requires. Furthermore, if you only want part of a data set, e.g. one folder or one species, you still have to download and unzip the whole zipfile.

So we’re also providing unzipped copies of many of our data sets, to facilitate simpler or smaller downloads.

This page will give you a few ways to download images without dealing with giant zipfiles. Most data sets on LILA are available on both Google Cloud Platform (GCP) and Azure, and each cloud has a command-line tool for downloading files or folders: for downloading data from GCP, you want gsutil; for downloading data from Azure, you want AzCopy. We will provide examples of using gsutil/AzCopy to download whole data sets or single folders. We also recommend using gsutil/AzCopy even if you want to download zipfiles.

These approaches will depend on having a URL for the base folder associated with the data set you want to access. We have posted a list of base URLs for all the data sets that have unzipped images here, and we’ll refer back to that list later.

Downloading a whole data set (without the giant zipfiles)

Let’s experiment with the Missouri Camera Traps data set. The data set page gives me both GCP and Azure URLs… the GCP URL is:

gs://public-datasets-lila/missouricameratraps/images

…and the Azure URL is:

https://lilablobssc.blob.core.windows.net/missouricameratraps/images

To download the entire data set to the folder c:\blah, I can do this with gsutil:

gsutil -m cp -r "gs://public-datasets-lila/missouricameratraps/images" "c:\blah"

…or this with azcopy:

azcopy cp "https://lilablobssc.blob.core.windows.net/missouricameratraps/images" "c:\blah" --recursive
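
Fair warning: whole data sets can be quite large once unzipped. If you want to check the total size before committing to a download, gsutil can total it up for you (-s prints just the summary, -h makes it human-readable):

gsutil du -sh "gs://public-datasets-lila/missouricameratraps/images"

If a long download gets interrupted, you can also re-run the gsutil cp command above with the -n (no-clobber) flag, which skips files you’ve already downloaded.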

Downloading just one folder from a data set

If I look at the metadata file for this data set, I see that there’s a folder called Set1/1.02-Agouti/SEQ75520 containing just one sequence of camera trap images. What if I want to download just that one folder to c:\blah? I can just stick that folder name onto the base URL for this data set, like this with gsutil:

gsutil -m cp -r "gs://public-datasets-lila/missouricameratraps/images/Set1/1.02-Agouti/SEQ75520" "c:\blah"

…or like this with azcopy:

azcopy cp "https://lilablobssc.blob.core.windows.net/missouricameratraps/images/Set1/1.02-Agouti/SEQ75520" "c:\blah" --recursive
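
If you’re not sure exactly which folder you want, you can list what’s under a prefix before downloading anything; both tools have a listing subcommand. With gsutil:

gsutil ls "gs://public-datasets-lila/missouricameratraps/images/Set1/1.02-Agouti"

…or with azcopy:

azcopy list "https://lilablobssc.blob.core.windows.net/missouricameratraps/images/Set1/1.02-Agouti"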

Downloading a list of files from a data set

If you want to download, e.g., all the images for a particular species from a data set, this is supported too, but it requires a little code. We have an example of how to do this here.
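
For what it’s worth, both command-line tools can also consume a file list directly, with no code at all. As a minimal sketch, assuming you’ve already built a hypothetical text file called species_files.txt: gsutil’s cp -I flag reads full gs:// URLs from stdin, one per line…

cat species_files.txt | gsutil -m cp -I "c:\blah"

…and azcopy’s --list-of-files flag takes a text file of paths relative to the source URL:

azcopy cp "https://lilablobssc.blob.core.windows.net/missouricameratraps/images" "c:\blah" --list-of-files species_files.txt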

Downloading files with a graphical tool (Azure Storage Explorer)

Most of the operations we describe above – downloading a whole data set, a single folder, or individual files – can also be done using the graphical tool Azure Storage Explorer. Each data set on LILA has its own container on Azure; let’s say we’re going to connect to the “SWG Camera Traps” data set. The documentation for that data set tells us that the container is:

https://lilablobssc.blob.core.windows.net/swg-camera-traps

First, download, install, and run Storage Explorer, then click the little connection icon in the upper-left corner.

Then click “blob container”, and when it asks “how will you connect to the blob container?”, click “anonymously (my blob container allows public access)”, then click “next”.

Under “display name”, put anything you want; this is just how the container will appear in your Storage Explorer window. Let’s go with “swg-camera-traps” for now. Paste the URL for this container into the URL box.

Click “next”, then click “connect”. A tab will open showing you the container.

When you start Storage Explorer in the future, you can get back to this tab by clicking the little Morse-code-looking icon in the upper-left, then expanding “local and attached”, then “storage accounts”, then “blob containers”, and you will see your container in the list of blob containers.

From here, using the GUI is like using a typical file explorer: you can browse folders, right-click to get information (e.g. size) about a file or folder, and download files or folders.

Mounting a container with rclone

If you are doing machine learning work with images that are available (unzipped) on LILA, you may not need to download them at all; instead, you can mount one or more storage containers on your compute nodes so they appear like drives. If at all possible, set up your compute where the data is stored: in the us-east4 or us-west1 GCP regions (if you’re working on GCP) or the South Central US Azure region (if you’re working on Azure), although the steps below will work no matter where you set up your compute.

We recommend mounting a container with the open-source tool rclone, specifically the rclone mount command. The instructions here will assume that you are working on Linux, but rclone runs just about anywhere, and can mount just about anything. So if you’re on Windows, for example, only the paths will change in these instructions.

Regardless of whether you’re going to mount from GCP or Azure…

First, download rclone using the instructions here. On Linux, this should be as easy as:

curl https://rclone.org/install.sh | sudo bash

To mount this data set from the GCP bucket…

  • Create the directory that you’d like to use as a mount point… this is where things are a little different on Windows, but on Linux, let’s just do:
    mkdir -p ~/lilatest/swg-camera-traps
  • Run “rclone config”, then press “n” (for “new remote”), and choose a name. We won’t actually specify a bucket here; this remote will be for all GCP public data, so let’s call it “gcppublic”. (If you’d rather skip the interactive prompts entirely, there’s a one-line alternative after this list.)
  • rclone will ask you what kind of storage you want to access; choose “Google Cloud Storage”.
  • Leave “client_id”, “client_secret”, “project_number”, and “service_account_file” blank.
  • When you get to the “anonymous” option, choose “true”.
  • Leave everything else blank; just keep pressing enter until you’ve passed the last question.
  • Press “q” to quit. You still haven’t specified the bucket name, but we’ll get there.
  • Mount the data set by running:
    rclone mount gcppublic:public-datasets-lila/swg-camera-traps ~/lilatest/swg-camera-traps
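
If you’d rather not step through the interactive prompts, recent versions of rclone can also create a remote non-interactively via the config create subcommand. This one-liner is a sketch of the equivalent setup (same backend type and anonymous option as the walkthrough above):

rclone config create gcppublic "google cloud storage" anonymous true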

To mount this data set from the Azure container…

  • Create the directory that you’d like to use as a mount point… this is where things are a little different on Windows, but on Linux, let’s just do:
    mkdir -p ~/lilatest/swg-camera-traps
  • Run “rclone config”, then press “n” (for “new remote”), and choose a name. We’ll use the “SWG Camera Traps” data set in this example, so let’s call our remote “swgcameratraps”. (As on the GCP side, there’s a non-interactive one-liner after this list.)
  • rclone will ask you what kind of storage you want to access; choose “Microsoft Azure Blob Storage”.
  • Leave “account”, “env_auth”, and “key” blank.
  • When you get to “sas_url”, enter the URL for the container you want to mount. The documentation for the SWG Camera Traps data set tells us that the container is:
    https://lilablobssc.blob.core.windows.net/swg-camera-traps
    …so enter that.
  • Leave everything else blank; just keep pressing enter until you’ve passed the last question.
  • Press “q” to quit; you’re done setting up the remote, but you still need to mount it.
  • Mount the container by running:
    rclone mount swgcameratraps:/ ~/lilatest/swg-camera-traps
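
As on the GCP side, here’s a sketch of the non-interactive equivalent using rclone’s config create subcommand (same azureblob backend and sas_url option as the walkthrough above):

rclone config create swgcameratraps azureblob sas_url "https://lilablobssc.blob.core.windows.net/swg-camera-traps"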

Regardless of whether you used GCP or Azure…

Now you’re good to go! That command mounts the container in the foreground, so the mount stays alive until you kill the command or close that shell. There are also options to run in the background (see the sketch at the end of this page), although it’s often convenient to run in the foreground and use screen or tmux to keep that shell alive. If you open a new shell now, you should be able to run:

ls ~/lilatest/swg-camera-traps

…to see the contents of the container. At this point, you can use paths from the container in your ML code as if it were a disk attached to your machine.
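
If you’d rather not dedicate a shell to the mount at all, rclone mount can daemonize itself with the --daemon flag (on Linux/macOS); adding --read-only is a sensible safeguard too, since you’ll only be reading from these public containers anyway:

rclone mount swgcameratraps:/ ~/lilatest/swg-camera-traps --read-only --daemon

When you’re done, you can unmount with “fusermount -u ~/lilatest/swg-camera-traps” (or plain “umount” on macOS).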