Downloading Multiple Full Datasets from Digital Coast


Recently I had a request for all of our bathymetric lidar-based DEM data for the Gulf of Mexico and the Atlantic seaboard. That’s a lot of data. Normally, if someone wants a piece of a dataset, perhaps with custom processing, they can use the Data Access Viewer (DAV). If they need a whole dataset, they can hit the bulk download link in the DAV or use the secret page of data links. With potentially dozens of datasets to grab, though, sitting there clicking links gets overwhelming. Today I’m going to talk about how to approach this type of task.

Before I do, I should probably answer the obvious question: why not just ship a hard drive? First, we’re really not staffed to handle hard drive requests, particularly if they require figuring out which datasets are needed in addition to babysitting and verifying transfers. Second, as I write this the office has been evacuated and we’re all teleworking, so plugging in a hard drive isn’t an option.

Solution outline

What we need to do is pretty straightforward in concept. We need to:

  1. Find the links to all the datasets we want
  2. For each dataset, download all the DEMs

This type of task is ripe for scripting in your favorite language. I’m assuming you’ve got enough scripting chops to handle that part once I lay out the steps.

Use the services

That first step is the one that is least documented. There are a couple of approaches. One way would be to take the secret page linked above, grab the page source, and scrape it for the information you want, since all the links and titles are there. Certainly doable, but I think there’s a better way.

As it happens, the DAV system uses map services to pass the information about datasets whenever you do a search. The main URL for the elevation data is https://maps.coast.noaa.gov/arcgis/rest/services/DAV/ElevationFootprints/MapServer, where you will see four entries under the Layers section. It doesn’t matter which layer you pick; they all have the same information. I’ll just look at the first layer. That page includes a long list of the fields you can get back in a query, and you can also make queries based on those fields. To play around with some of that capability, hit the Query link at the bottom of the page. In the end, we’ll build a URL and use it to get our data, but the form can help you see how to put that URL together.
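If you’d rather poke at the service from a script than from the browser, asking the layer endpoint for JSON gives you the same field list. A minimal sketch in Python, assuming the requests library is installed:

    import requests

    # Layer 0 of the DAV elevation footprints service; any of the four layers works.
    layer_url = ("https://maps.coast.noaa.gov/arcgis/rest/services/"
                 "DAV/ElevationFootprints/MapServer/0")

    # Asking for f=json returns the layer description instead of the HTML page.
    info = requests.get(layer_url, params={"f": "json"}).json()

    # Print the fields you can request or filter on in a query.
    for field in info.get("fields", []):
        print(field["name"], field.get("type", ""))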

[Screenshot: the query interface. The things I changed from the original state are circled in red.]

In the example query interface above, a few things are circled to show what I changed to get only the pieces I wanted. Figuring out what to put in the Out Fields and Where boxes requires a little insider knowledge, so let’s talk about that.

Map service insider knowledge

Our task is to find the DEMs that have bathymetric lidar data and get their links. Luckily, there is a field that helps us tell which things are DEMs: those have dataTypeId = 2. That’s not super obvious; you could figure it out by noticing that all the entries with DEM in the title have dataTypeId = 2, though it would have taken me a long time to spot that on my own.

The bathymetric part is a little harder as there is no field in the service that differentiates topography from bathymetry directly. However, all the datasets with bathymetry will either say topobathy or bathymetric in the name field somewhere.

Taking that info, we can construct the input for the Where box. Using standard SQL notation, the where clause is name like '%bathy%' and dataTypeId = 2. If we run the query with just that, we’ll get the names and geometries of about 80 records; however, we need the links, not the names and geometries. We get rid of the geometries by setting “Return Geometry” to false.

To get the links, we need some more insider knowledge. The link to the bulk download site for each dataset is stored in the field externalProviderLink. That really makes no sense as a name; it’s a historical anomaly caused by reuse of an existing field for a different purpose. Additionally, the link is preceded by a comma, which is annoying. For the “Out Fields”, put “externalProviderLink”.

Update July 22, 2020: Sorry, I pulled the rug out from under this a little and changed what’s in that field. As of this morning, it holds more info, with various links for the dataset in JSON format. The link you’ll want is the one with serviceID=46. In your scripting, you should be able to parse that JSON string and pull it out.
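As a rough sketch of that parsing step in Python: the exact layout of the JSON in externalProviderLink isn’t spelled out here, so treat the key names below (“links”, “serviceID”, “link”) as assumptions to check against a real record.

    import json

    def bulk_download_link(external_provider_link):
        # Assumed structure: {"links": [{"serviceID": 46, "link": "https://...", ...}, ...]}.
        # Inspect one record by hand and adjust the key names if they differ.
        info = json.loads(external_provider_link)
        for entry in info.get("links", []):
            if entry.get("serviceID") == 46:
                return entry.get("link")
        return None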

Geometry

While we can create a nice list of links now, they cover too much area. We only wanted the Gulf of Mexico and the Atlantic seaboard. If we were to draw a longitude/latitude box around the area, it would run from roughly -97, 24 in the lower left corner to -66, 45 in the upper right corner. For the Input Geometry, we can simply put -97,24,-66,45. We’ll also have to say what those coordinates represent by giving the Input Spatial Reference as 4269, which is geographic NAD83.

If we want to be a bit better in our geometry definition than a big box that happens to include the Great Lakes, we can define a polygon in JSON format instead. For example, the geometry could be:

{"rings":[[[-80.649,23.847],[-73.887,33.878],[-65.991,44.226],[-66.935,45.751],[ -73.249,42.909],[ -97.312,31.083],[-98.585,25.940],[ -80.649,23.847]]]}

along with geometryType=esriGeometryPolygon, and it would cut out the Great Lakes.
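If you’re scripting the query rather than using the form, the polygon goes into the geometry parameter as a JSON string, with geometryType telling the service how to read it. A sketch of just those parameters in Python:

    import json

    # Rough polygon hugging the Atlantic and Gulf coasts (same rings as above).
    polygon = {"rings": [[[-80.649, 23.847], [-73.887, 33.878], [-65.991, 44.226],
                          [-66.935, 45.751], [-73.249, 42.909], [-97.312, 31.083],
                          [-98.585, 25.940], [-80.649, 23.847]]]}

    geometry_params = {
        "geometry": json.dumps(polygon),
        "geometryType": "esriGeometryPolygon",
        "inSR": 4269,  # geographic NAD83
    }
    # These get merged with the rest of the query parameters (where, outFields, f=json, ...).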

URL for the list

If we run that query on the query page, we’ll see a big long URL in the address bar. The URL looks like this (using the bounding box, not the polygon): https://maps.coast.noaa.gov/arcgis/rest/services/DAV/ElevationFootprints/MapServer/0/query?where=name+like+%27%25bathy%25%27+and+dataTypeId+%3D+2&text=&objectIds=&time=&geometry=-97%2C24%2C-66%2C45&geometryType=esriGeometryEnvelope&inSR=4269&spatialRel=esriSpatialRelIntersects&relationParam=&outFields=externalProviderLink&returnGeometry=false&returnTrueCurves=false&maxAllowableOffset=&geometryPrecision=&outSR=&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&returnDistinctValues=false&resultOffset=&resultRecordCount=&f=html

A bunch of those settings are just empty and could be tossed out. For instance, “&outStatistics=” isn’t doing anything and you could delete it from the URL. The last thing we want to do is change the return format from html (which is giving us the form back each time) to json. At the end of the URL, change “f=html” to “f=json”. Once you do that, when you run it you’ll get something that looks more like:

[Screenshot: JSON output from the query URL.]
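If you’d rather skip the browser entirely, a script can build the same query and pull the links out of the JSON response. A minimal sketch, assuming Python and the requests library (the response layout is the standard ArcGIS REST one, with each record’s values under features and attributes):

    import requests

    query_url = ("https://maps.coast.noaa.gov/arcgis/rest/services/"
                 "DAV/ElevationFootprints/MapServer/0/query")

    params = {
        "where": "name like '%bathy%' and dataTypeId = 2",
        "geometry": "-97,24,-66,45",              # lower left and upper right corners
        "geometryType": "esriGeometryEnvelope",
        "inSR": 4269,                             # geographic NAD83
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "externalProviderLink",
        "returnGeometry": "false",
        "f": "json",                              # json instead of the html form
    }

    response = requests.get(query_url, params=params).json()

    # Each feature's attributes hold the externalProviderLink value we asked for.
    for feature in response.get("features", []):
        print(feature["attributes"]["externalProviderLink"])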

Downloading

We’ve got the link to the bulk download site for each dataset we want now, so how do we download? Luckily, I wrote a previous post that focused on downloading everything from a single dataset. If you’re using Linux and have wget, this is pretty easy and you can now script a loop over all those links and pull the data. Your computer and network will be busy for a long time. If you’re using Windows, it’s a little harder to get wget working, particularly a current version that knows the latest security protocols. You might want to use the uGet route instead.

For uGet, you’ll need the list of the URLs for all the files to be fetched. The list exists in a file for each dataset and there is a naming convention that lets you piece it together. Every download site link ends in a numeric ID, the same ID that’s in the ID field of the map service if you wanted to grab it. The list of URLs is just a file called urllistXXXX.txt under the main dataset link, where the XXXX is the ID for the dataset. So, if the main link is https://coast.noaa.gov/stuff/NOAA_topobathy_2015_1234, then the list of URLs is https://coast.noaa.gov/stuff/NOAA_topobathy_2015_1234/urllist1234.txt.
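Piecing that together in a script only takes a couple of lines. A small sketch, using the made-up example link from above:

    import re

    def urllist_for(dataset_link):
        # Grab the trailing numeric ID off the bulk download link and build
        # the urllistXXXX.txt URL from it.
        dataset_link = dataset_link.rstrip("/")
        match = re.search(r"(\d+)$", dataset_link)
        if match is None:
            raise ValueError("no trailing ID in " + dataset_link)
        return "{0}/urllist{1}.txt".format(dataset_link, match.group(1))

    print(urllist_for("https://coast.noaa.gov/stuff/NOAA_topobathy_2015_1234"))
    # https://coast.noaa.gov/stuff/NOAA_topobathy_2015_1234/urllist1234.txt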

Scripting

Naturally, to put this all together, you’re going to need to do some scripting. There are many scripting languages to choose from and I’m not really going to try to cover them. The main things you’ll want to do in your scripting are:

  1. Pull down the JSON-formatted list of links using the query URL. You could also just run it in the browser and save the output for your script to ingest.
  2. Read the JSON list and extract the list of dataset links.
  3. Decide on an output file structure so each dataset gets its own directory/folder, probably named after the last part of the link.
  4. If using Linux, run the appropriate wget command for each dataset (a rough sketch of this loop follows the list).
  5. If using Windows, pull down the URL list and run the appropriate uGet command for each dataset.
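Here’s what the Linux/wget flavor of that loop might look like in Python. It’s a sketch, not a drop-in script: the wget options are illustrative, so check the earlier post for the exact command that behaves well against the bulk download site.

    import os
    import subprocess

    def download_all(dataset_links, out_root="digital_coast"):
        for link in dataset_links:
            # Use the last part of the link as the per-dataset folder name.
            name = link.rstrip("/").rsplit("/", 1)[-1]
            out_dir = os.path.join(out_root, name)
            os.makedirs(out_dir, exist_ok=True)

            # Recursively grab everything under the dataset link. The options
            # here (-r recursive, -np no parent, -nd no nested directories)
            # are a starting point; adjust to taste.
            subprocess.run(["wget", "-r", "-np", "-nd", "-P", out_dir, link],
                           check=True)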

One last thing. While this example was targeted at the DEMs, you can do the same thing with the imagery and point clouds. For imagery, the dataTypeId is 3 and for lidar point clouds it’s 5.
