Accessing images outside of giant zipfiles

Table of contents

  1. What’s wrong with giant zipfiles?
  2. Downloading a whole data set (without the giant zipfiles)
  3. Downloading just one folder from a data set
  4. Downloading a list of files from a data set
  5. Downloading files with a graphical tool (Azure Storage Explorer)
  6. Mounting a container with rclone

What’s wrong with giant zipfiles?

Many of our data sets are posted as big zipfiles, which are convenient in that they can be downloaded in your browser with no special tools. However, there are some major issues with giant zipfiles… most notably, you have to unzip them. This can take almost as long as the download itself in some cases, and it means you need twice as much storage as the data set really requires. Furthermore, if you only want part of a data set, e.g. one folder or one species, you would still have to download and unzip the whole zipfile.

So we’re also providing unzipped copies of many of our data sets, to facilitate simpler or smaller downloads.

This page will give you a few ways to download images without dealing with giant zipfiles. For bulk downloads of folders or data sets, we recommend AzCopy, a command-line tool for downloading files from Azure storage that works on Linux, Windows, and Mac. We provide examples below of using AzCopy to download whole data sets or single folders. We recommend using AzCopy even if you do want to download zipfiles.

These approaches depend on having a URL for the base folder associated with the data set you want to access. We have posted a list of base URLs for all the data sets that have unzipped images here, and we’ll refer back to that list later.

For camera trap data sets, a copy is also hosted on Google Cloud Storage, so if you’re working on GCP, it’s much better to use those copies than to copy from Azure. This page is a little out of date in that we provide detailed instructions for using AzCopy to download images but haven’t yet included equivalent instructions for the GCS links; however, every camera trap data set also includes a “gs://” link, and everything this page describes doing with AzCopy can also be done with gsutil.
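
For example, if the “gs://” link for the Missouri Camera Traps images were gs://some-bucket/missouricameratraps/images (a hypothetical URL, just for illustration), the gsutil equivalent of the recursive AzCopy download shown below would be:

gsutil -m cp -r "gs://some-bucket/missouricameratraps/images" "c:\blah"

(The -m flag tells gsutil to run transfers in parallel, which helps a lot for large data sets.)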

Downloading a whole data set (without the giant zipfiles)

Let’s experiment with the Missouri Camera Traps data set. If I open the list of base URLs, I’ll see that the base URL for this data set is:

https://lilablobssc.blob.core.windows.net/missouricameratraps/images

To download the entire data set to the folder c:\blah, I can do this:

azcopy cp "https://lilablobssc.blob.core.windows.net/missouricameratraps/images" "c:\blah" --recursive

Downloading just one folder from a data set

If I look at the metadata file for this data set, I see that there’s a folder called Set1/1.02-Agouti/SEQ75520 containing just one sequence of camera trap images. What if I want to download just that one folder to c:\blah? I can just append that folder name to the base URL for this data set, like this:

azcopy cp "https://lilablobssc.blob.core.windows.net/missouricameratraps/images/Set1/1.02-Agouti/SEQ75520" "c:\blah" --recursive

Downloading a list of files from a data set

If you want to download, e.g., all the images for a particular species from a data set, this is supported too, but it requires a little code. We have an example of how to do this here.
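
As a rough illustration, here’s a minimal sketch of that kind of code, assuming you’ve already assembled a list of relative file paths (in practice you would read these from the data set’s metadata file; the paths below are hypothetical):

import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Base URL for the Missouri Camera Traps images, from the list of base URLs
base_url = 'https://lilablobssc.blob.core.windows.net/missouricameratraps/images'
output_dir = 'downloads'

# Hypothetical file list; in practice, read these paths from the metadata file
relative_paths = [
    'Set1/1.02-Agouti/SEQ75520/image_001.jpg',
    'Set1/1.02-Agouti/SEQ75520/image_002.jpg'
]

def download_file(relative_path):
    url = base_url + '/' + relative_path
    target = os.path.join(output_dir, relative_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    urllib.request.urlretrieve(url, target)

# Downloads are I/O-bound, so a thread pool parallelizes them nicely;
# wrapping the map in list() forces evaluation so errors surface here
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(download_file, relative_paths))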

Alternatively, there is not-officially-supported functionality in AzCopy for downloading a list of files; if you don’t mind using features that could theoretically cease to exist, this works quite well and will be faster than even a parallel Python loop. See AzCopy: Listing specific files to transfer.
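
At the time of writing, that functionality is exposed through AzCopy’s --list-of-files flag, which takes a text file containing one path per line, relative to the URL you’re copying from. A hypothetical invocation looks like this:

azcopy cp "https://lilablobssc.blob.core.windows.net/missouricameratraps/images" "c:\blah" --list-of-files "files.txt"

…where files.txt contains relative paths like Set1/1.02-Agouti/SEQ75520/image_001.jpg (again hypothetical), one per line.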

Downloading files with a graphical tool (Azure Storage Explorer)

Most of the operations we describe above – downloading a whole data set, downloading individual files, or downloading folders – can also be done using the graphical tool Azure Storage Explorer. Each data set on LILA has its own container on Azure; let’s say we’re going to connect to the “SWG Camera Traps” data set. The documentation for that data set tells us that the container is:

https://lilablobssc.blob.core.windows.net/swg-camera-traps

First download, install, and run Storage Explorer, then click the little connection icon in the upper-left corner:

Then click “blob container”, and when it asks “how will you connect to the blob container?”, click “anonymously (my blob container allows public access)”, then click “next”:

Under “display name”, put anything you want; this is just how the container will appear in your Storage Explorer window. Let’s go with “swg-camera-traps” for now. Paste the URL for this container into the URL box:

Click “next”, then click “connect”. A tab will open showing you the container:

When you start Storage Explorer in the future, you can get back to this tab by clicking the little Morse-code-looking icon in the upper-left, then expanding “local and attached”, then “storage accounts”, then “blob containers”, and you will see your container in the list of blob containers:

From here, using the GUI is like using a typical file explorer: you can browse folders, right-click to get information (e.g. size) about a file or folder, and download files or folders.

Mounting a container with rclone

If you are doing machine learning work with images that are available on LILA, and are already unzipped, you may not need to download them at all; instead, you can mount one or more storage containers on your compute nodes so they appear like drives. If at all possible, set up your compute in the South Central US Azure region (where all LILA data is stored), although the steps below will work no matter where you set up your compute.

We recommend mounting a container with the open-source tool rclone, specifically the rclone mount command. The instructions here will assume that you are working on Linux, but rclone runs just about anywhere, and can mount just about anything. So if you’re on Windows, for example, only the paths will change in these instructions.

  1. First, download rclone using the instructions here. On Linux, this should be as easy as “curl https://rclone.org/install.sh | sudo bash”.
  2. Run “rclone config”, then press “n” (for “new remote”), and choose a name. We’ll use the “SWG Camera Traps” data set in this example, so let’s call our mount “swgcameratraps”.
  3. rclone will ask you what kind of storage you want to access; enter “azureblob”.
  4. Leave “account”, “service principal file”, and “key” blank.
  5. When you get to “sas_url”, enter the URL for the container you want to mount. The documentation for the SWG Camera Traps data set tells us that the container for this data set is https://lilablobssc.blob.core.windows.net/swg-camera-traps, so enter that.
  6. Leave everything else blank, just keep pressing enter until you’ve passed the last question, and you’re looking at the list of remotes:
  7. Press “q” to quit; you’re done setting up the remote container (you should end up with a configuration like the sample shown after this list), but you still need to mount it.
  8. Create the directory that you’d like to use as a mount point… this is where things are a little different on Windows, but on Linux, let’s just do:
    mkdir ~/swgcameratraps
  9. Mount the container by running:
    rclone mount swgcameratraps:/ ~/swgcameratraps
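
For reference, after step 7, your rclone configuration file (typically ~/.config/rclone/rclone.conf on Linux) should contain something like this:

[swgcameratraps]
type = azureblob
sas_url = https://lilablobssc.blob.core.windows.net/swg-camera-traps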

And you’re good to go! The rclone mount command runs in the foreground, so the mount is alive until you kill that command or close that shell. There are also options to run in the background (e.g. the --daemon flag on Linux), although it’s often convenient to run in the foreground and use screen or tmux to keep that shell alive. If you open a new shell now, you should be able to run:

ls ~/swgcameratraps

…to see the contents of the container. At this point, you can use paths from the container in your ML code just as if the container were a disk attached to your computer.
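
For example, in Python, reading an image from the mounted container is just a normal file read; the path below is hypothetical, so substitute a real path from the data set’s metadata:

import os
from PIL import Image

# Hypothetical path within the mounted container; substitute a real image path
image_path = os.path.expanduser('~/swgcameratraps/some_folder/some_image.jpg')

# Opening the file works exactly as it would on a local disk
image = Image.open(image_path)
print(image.size)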