Last week Don wrote about a couple of data villains, and that prompted me to think about another villain that has been hiding in plain sight for years, but has only recently begun to strut its stuff. This villain has the potential to really wreak havoc on many customers’ plans and IT infrastructures if they don’t keep a careful watch over ever-increasing data volumes.
Don and I often like to tell people how back in the 90s, a really huge dataset wouldn’t quite fit onto one 1.44MB floppy disk (link provided for the youngsters out there). A statement that brought instant respect to a speaker at any GIS-related gathering was “My dataset is so big, I had to use PKZIP’s span mode to get it onto these 3 floppies”. Slowly, “big” datasets began to spill across Zip drive boundaries, but by the time they would overflow recordable CDs and DVDs, no one seemed to care. By then, the Data Volume Villain (DVV) was subjugated by fast networks, FTP sites, and most recently, Dropbox.
However, in the past few weeks I’ve noticed that the DVV has once again worked its way loose and is up to no good. First came word of a Swiss partner trying to make a DEM from over a billion LiDAR points living in CSV files. Next, reports of customers with thousands of Oracle tables slowing down certain workflows. Then, a LiDAR customer chose to send me a hard drive with a couple of 35+ gigabyte LAS files on it. The next day, another customer had to resort to shipping me 3 DVDs containing a few Revit and IFC building models. If the DVV’s goal is to slow progress, clog networks, and generally burn computing time, it is again enjoying success.
More Efficient File Formats?
So what is to be done? One approach is to use more efficient file formats. Certainly the DVV suffered a big setback in the raster world with formats like JPEG, JPEG2000, ECW, and MrSID. These formats have facilitated easy sharing and use of otherwise extremely inconvenient volumes of data. In the Point Cloud world, the lossless MrSID LiDAR Compressor impressively shrinks Point Clouds, creating files that are not only drastically smaller, but also easy to access.
Very recently, Martin Isenburg‘s impressive LASzip has been added into the open source libLAS codebase, and I’m sure it won’t be long until we see incredible shrinking LAS files being created and used by software packages (including my favorite one), and shared and archived by grateful users worldwide. Open source lossless compression of LiDAR files to 10-20% of their original size – put that in your pipe and smoke it, DVV.
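To make that 10-20% figure concrete, here is a back-of-envelope sketch of what LASzip-style lossless compression would mean for files like the ones mentioned above. The sizes and ratios are illustrative assumptions taken from this post, not measurements of any particular file:

```python
# Back-of-envelope: what compressing a LAS file to 10-20% of its
# original size (the lossless ratio cited above) means in practice.
def compressed_range_gb(original_gb, lo=0.10, hi=0.20):
    """Return the (best-case, worst-case) compressed size in GB,
    given a compression ratio range expressed as fractions."""
    return original_gb * lo, original_gb * hi

# One of those 35 GB LAS files that arrived on a hard drive:
low, high = compressed_range_gb(35.0)
print(f"A 35 GB LAS file shrinks to roughly {low:.1f}-{high:.1f} GB")
```

At that size, the compressed file slips comfortably onto a single DVD or through an overnight upload, which is exactly why the DVV should be worried.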
What about the Cloud?
Recently I read of some folks looking to the cloud to beat back their Geospatial DVV. Notably I don’t see any Canadian governments making similar plans – with recent usage-based billing policies, some have calculated that it is cheaper to buy a hard drive and courier it across the country than to move the data over the net. Even one of our American customers mentioned that the Comcast network traffic caps can put a bit of a chill into usage of the cloud for certain scenarios.
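The hard-drive-versus-network math behind that calculation is easy to reproduce. A minimal sketch, assuming illustrative dataset sizes and link speeds (the 35 GB figure is from the files mentioned earlier; the bandwidth numbers are hypothetical):

```python
# Back-of-envelope: how long does it take to move a dataset over the wire?
# Compare that against a courier, who delivers overnight regardless of size.
def transfer_hours(size_gb, mbps):
    """Hours to move size_gb (decimal gigabytes) over an mbps megabit/s link."""
    bits = size_gb * 1e9 * 8          # decimal GB -> bits
    return bits / (mbps * 1e6) / 3600 # bits / (bits per second) -> hours

# One 35 GB LAS file over a 10 Mbit/s uplink vs. a 100 Mbit/s one:
print(f"10 Mbit/s:  {transfer_hours(35, 10):.1f} hours")
print(f"100 Mbit/s: {transfer_hours(35, 100):.1f} hours")
```

Add usage-based billing or a monthly traffic cap on top of the raw hours, and the courier starts looking very competitive.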
In our experience, for the Geospatial DVV to be defeated by the Cloud, the apps will have to follow the data into the Cloud, and stay there – otherwise, the costs and time needed to move the raw data up and down are prohibitive. While reading Gene Roe’s blog, I came across this reference to a plan at the San Diego Supercomputer Center to do LiDAR processing in the cloud – does anyone know where that project stands now?
So, what about you? How has the DVV impacted your work in the past few years, and what have you done to keep it at bay? For us at Safe, it keeps us focused on performance, on making sure our software works for way more data than we think is reasonable, so that we can provide everyday spatial superheroes with a worthy tool to keep the DVV from dragging them down.
After all, it won’t be long until someone will have to use a 64-bit version of ZIP to span their data file onto 3 Blu-ray discs…