In this post I want to cover a format that was in FME back in the old Mapping File days but, for various reasons, never made the leap to Workbench. However, new technology in FME2012 has allowed us to bring that beautiful format back into the FME fold.
So ladies and gentlemen, may I present to you today’s article: The Return of the Column-Aligned Text Reader or – as we at Safe prefer – The CAT Came Back.
What is Column-Aligned Text?
I’m mildly amused by the number of plain text spatial datasets that still exist. There must be something comforting to humans about having a file you can open in Notepad and still decipher, although it would take a Matrix-like effort to open an OS NTF dataset and imagine the landscape it represents:
Anyway, for years FME has supported Comma-Separated Values (CSV) files, in which coordinate (or other) values are separated by a comma (or some other character), like this:
X,Y,Z
10.1,10.3,4.0
10.1,15.2,3.1
12.8,15.0,5.4
And now we support Column-Aligned Text files, where values are stored in fixed-length fields padded by spaces.
X     Y     Z
10.1  10.3  4.0
10.1  15.2  3.1
12.8  15.0  5.4
See how it all lines up nicely? You’d be surprised by the amount of data floating around in formats like this. Usually they are that much more complicated though, because they have headers and extra columns that may need to be dealt with or ignored. That’s partly why this format took so long to get reimplemented in Workbench.
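To make the difference from CSV concrete, here is a minimal Python sketch of reading such a file by character position rather than by delimiter. It is only an illustration of the idea, not how the FME reader is implemented; the `FIELDS` offsets and the sample text are assumptions chosen to fit each other.

```python
# Hypothetical field layout: (name, start, end) character offsets.
# These offsets are assumptions that match the sample text below.
FIELDS = [("X", 0, 6), ("Y", 6, 12), ("Z", 12, 17)]

SAMPLE = (
    "X     Y     Z\n"
    "10.1  10.3  4.0\n"
    "10.1  15.2  3.1\n"
    "12.8  15.0  5.4\n"
)


def read_cat(text, fields, skip_header=True):
    """Slice each line at fixed offsets; float() ignores the space padding."""
    lines = text.splitlines()
    if skip_header:
        lines = lines[1:]
    records = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        records.append(
            {name: float(line[start:end]) for name, start, end in fields}
        )
    return records


rows = read_cat(SAMPLE, FIELDS)
print(rows[0])
```

Note that nothing in the parsing depends on a separator character: only the offsets matter, which is why a header or an extra column that breaks the alignment needs special handling.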
The CAT Reader
So, FME now has a reader to handle CAT files. The secret to the reader is the new parameters GUI that allows the user to define what column sizes exist, and where.
It doesn’t look much different to the CSV reader dialog, but the big difference is that the fields can be resized by dragging on the divider line:
Now a user can manually define the field boundaries they require, which makes it much easier to handle fields of a known width but with no specific delimiter character.
You obviously have to do this when you create the workspace (or add the reader), because once it’s in Workbench you can’t really edit the chosen schema (well you can, but it’s easier not to).
One other useful capability is a format attribute (cat_line_number) that records the line number of each record:
That helps in the scenario where each feature consists of a number of records (lines). For example you might be able to identify a header record because its line number is a multiple of 5.
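As a rough illustration of that idea outside FME, the Python sketch below tags each line with a 1-based line number and marks every fifth line as a header. The function name and the `block_size` parameter are hypothetical; in a workspace the same test would be made against the cat_line_number attribute.

```python
def tag_records(lines, block_size=5):
    """Yield (line_number, kind, text) with 1-based line numbers.

    A line is treated as a header when its number is a multiple of
    block_size (an assumed layout, as in the example above).
    """
    for line_number, text in enumerate(lines, start=1):
        kind = "header" if line_number % block_size == 0 else "data"
        yield line_number, kind, text


tagged = list(tag_records([f"record {n}" for n in range(1, 11)]))
headers = [number for number, kind, _ in tagged if kind == "header"]
print(headers)
```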
Still, despite the ease of use, the most complex files are unlikely to give a perfect result. The almost infinite variety of CAT files makes it impossible to provide a complete solution. The above screenshot shows how the header does not conform to a fixed width in the way the coordinates do. However, you will get data in a form that is far more easily handled than anything any other reader could produce.
So, let me show you an example of this where I part-process the data with the CAT reader, and part-process it with the rest of the workspace.
This example came about when attempting Don’s challenge to find an unreadable XML file. In looking for such a dataset I came across an old file of data from my student days: a GENIO format dataset of a surveying project I undertook in Bradgate Park, Leicestershire.
I’d created the data in that format because I thought it was one I’d always be able to read back (take note and learn from that mistake, folks), and had made sporadic attempts to read it with FME’s TEXTFILE reader over the years.
Unfortunately for this occasion, GENIO format isn’t XML related (sorry Don). However, it is mostly column-aligned text (CAT), and since FME2012 had the new CAT format reader it seemed like synchronicity was knocking on my office door and I had to try it out.
GENIO format is basically a column-based format with a series of headers to define each feature:
001FORMAT(2(2F15.3,F9.3))
080WS,0.0,0.0,3.0,-1.0,-1.0
        170.036       1750.610 -999.000        173.044       1750.325 -999.000
        176.054       1750.089 -999.000        179.065       1749.928 -999.000
         -1.000         -1.000
Incidentally, the record shown above is a wall. 080 records define the feature type, and anything beginning with a ‘W’ (W1, W2, W3 … WS, etc.) will be a wall. Anything beginning with ‘P’ is a Point feature, so PT1, PT2, PT3, etc. are trees.
At the time I really liked GENIO format because of the simple feature type naming plus its flexibility in defining schema with the 001FORMAT command. Can you decipher how the 001 line works? Look how you can put multiple columns of coordinates in one row.
However, such flexibility has not helped in trying to develop a reader!
So, what I decided was to set the column widths to match “(2(2F15.3,F9.3))” and not worry about the headers since all the important parts of the header appear in the first 15 characters. The CAT reader was ideal for this, and the screenshots in the previous section showed how I set up the reader parameters.
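For anyone who wants to check their reading of the 001 line, here is a small Python sketch that expands the format description into column widths and slices a coordinate row with them. It assumes the description follows Fortran conventions (F15.3 is a 15-character field with 3 decimals, and a leading number is a repeat count) and handles only that subset; `expand_format` and `slice_record` are hypothetical helper names, not part of FME.

```python
import re

# Tokens: optional repeat count + "(", ")", optional count + Fw.d, or ","
TOKEN = re.compile(r"(\d*)\(|\)|(\d*)F(\d+)\.(\d+)|,")


def expand_format(spec):
    """Expand e.g. '(2(2F15.3,F9.3))' into a flat list of field widths."""
    stack = [[]]   # one list of widths per open group
    repeats = []   # repeat count of each open group
    pos = 0
    while pos < len(spec):
        match = TOKEN.match(spec, pos)
        if match is None:
            raise ValueError(f"unsupported format at position {pos}")
        token = match.group(0)
        if token.endswith("("):
            repeats.append(int(match.group(1) or 1))
            stack.append([])
        elif token == ")":
            group = stack.pop()
            stack[-1].extend(group * repeats.pop())
        elif token != ",":
            count = int(match.group(2) or 1)
            stack[-1].extend([int(match.group(3))] * count)
        pos = match.end()
    return stack[0]


def slice_record(line, widths):
    """Cut a line into consecutive fields; blank fields are skipped."""
    values, start = [], 0
    for width in widths:
        field = line[start:start + width].strip()
        if field:
            values.append(float(field))
        start += width
    return values


widths = expand_format("(2(2F15.3,F9.3))")
print(widths)  # [15, 15, 9, 15, 15, 9]

# A row written with the same layout as the sample record
row = "".join(
    f"{v:{w}.3f}"
    for v, w in zip([170.036, 1750.610, -999.0, 173.044, 1750.325, -999.0], widths)
)
values = slice_record(row, widths)
```

So one row carries two X, Y, Z triples, which is the "multiple columns of coordinates in one row" trick, and a short final row simply yields fewer values.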
After the reader, the first part of the workspace does the old trick of separating the headers out, recording the information as FME Variables, and passing that information on to the real features using a VariableRetriever transformer:
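The same trick can be sketched in a few lines of Python: remember the most recent header value and stamp it on every data record that follows. This is only an analogy for the variable-setting approach, and it assumes headers can be recognised by their 001 and 080 prefixes as in the sample record; the function name is hypothetical.

```python
def attach_feature_codes(lines):
    """Return (feature_code, line) pairs for the data lines only.

    Assumes '080' lines carry the feature code as their first
    comma-separated value and '001' lines only describe the layout.
    """
    current_code = None
    tagged = []
    for line in lines:
        if line.startswith("080"):
            # Remember the code, like storing it in a variable
            current_code = line[3:].split(",")[0].strip()
        elif line.startswith("001"):
            continue  # format description, not needed downstream
        else:
            # Data record: retrieve the remembered code
            tagged.append((current_code, line))
    return tagged


sample = [
    "001FORMAT(2(2F15.3,F9.3))",
    "080WS,0.0,0.0,3.0,-1.0,-1.0",
    "        170.036       1750.610 -999.000",
    "080PT1,0.0,0.0,3.0,-1.0,-1.0",  # made-up tree header for illustration
    "        200.000       1700.000 -999.000",
]
pairs = attach_feature_codes(sample)
```

The approach relies on records arriving in file order, which is also why the header values have to be set before the data records that use them.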
The second part then turns the records into the correct feature geometry, and outputs them to the correct feature type depending on the 080 record:
The most amusing part is needing to use the Chopper transformer on all tree features.
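A rough Python equivalent of that second part might look like the sketch below. It groups the flat list of values into X, Y, Z triples, builds a single line for wall codes, and splits point codes into one feature per vertex, which is my assumption about why the trees need chopping. The function names and the dictionary layout are hypothetical, and any values left over after the last complete triple are simply dropped.

```python
def to_triples(values):
    """Group a flat list into (x, y, z) tuples, dropping any remainder."""
    usable = len(values) - len(values) % 3
    return [tuple(values[i:i + 3]) for i in range(0, usable, 3)]


def build_features(code, values):
    """Turn one record's values into features, routed by feature code."""
    coords = to_triples(values)
    if code.startswith("P"):
        # Points: one feature per vertex (the chopping step)
        return [
            {"type": code, "geometry": "point", "coords": [c]} for c in coords
        ]
    if code.startswith("W"):
        # Walls: all vertices form a single line
        return [{"type": code, "geometry": "line", "coords": coords}]
    return [{"type": code, "geometry": "unknown", "coords": coords}]


wall_values = [
    170.036, 1750.610, -999.0, 173.044, 1750.325, -999.0,
    176.054, 1750.089, -999.0, 179.065, 1749.928, -999.0,
    -1.0, -1.0,
]
wall = build_features("WS", wall_values)
trees = build_features("PT1", [10.0, 20.0, -999.0, 30.0, 40.0, -999.0])
```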
After running the workspace I get this:
Ha! The first time in (mumble, mumble) years that I’ve seen the data. Now I think I’ll have to turn it into a 3D surface and/or overlay it in Google Earth to see how accurate my surveying skills were. Oh! Looks like I was using a local grid, not even aligned to north!
Never mind, I’ve stored the data safely in a Shape dataset, because I know that format will never die!
The excuse I used for creating this workspace (apart from being able to report how fantastic the CAT reader is) was to create a sample project for FME’ers wanting to get their FME Professional Certification.
So, after you have downloaded the workspace and data from our website, consider if you are working on any projects that have similar complexity and interest. If so, you could be our very next FME Certified Professional. The download includes a workspace with many annotations explaining why it is a good certification role model.
And for more details about certification, see the Safe Software website.
Well, I hope this workspace is of interest. I had fun creating it and I know other users have already been using the CAT reader and found it very useful.
By the way, if you aren’t in on the joke, “The Cat Came Back” is best known in Canada for the animated film version. You can find it on the website of the National Film Board of Canada. I created an FME version (in June when I guess my workload must have been at a low point) and the lyrics are pasted below.
The CAT Came Back Lyrics:
Now, old Mr. Johnson, had troubles he didn’t need.
He had some column data, that he couldn’t parse or read.
Flat file structure, without index or key.
One little dataset, how hard could it be?
But the CAT came back, the very next day
The CAT came back, he thought it was a goner
But the CAT came back, it just wouldn’t stay away
Well first Mr Johnson, tried to use his GIS
He worked for 40 hours, but he didn’t find no bliss
Used Ruby, C and Python, but none of those would do
So he opened a text editor, to fudge it like you do
But the CAT came back, the very next day
The CAT came back, he didn’t that foresee
Yes the CAT came back, should have used FME.
Should have used FMEeeeeeee!