Understanding XML: The Human’s Guide to Machine-Readable Data

Tiana Warner

July 13, 2016•5 min

Data formats are designed to be either machine readable (structured so computers can process it) or human readable (easy for humans to understand). Some formats, like XML and JSON, are...

Data formats are designed to be either machine readable (structured so computers can process it) or human readable (easy for humans to understand). Some formats, like XML and JSON, are allegedly both. Let’s call these ambivert formats. This means when you, a human, open the file, you can read what’s inside — but the data is still structured and therefore optimized for machine processing.

Machine vs human readable data — Left: Machine-readable data (SQLite). Middle: Human-readable data (PDF). Right: Allegedly both (XML).

Ambivert formats like XML, JSON, and GML are the language of the web, mobile, and REST APIs. The syntax is so powerful and versatile that it has become the basis for hundreds of formats, from desktop tools to data exchange over the web.

See: 7 Reasons to Love XML and JSON

The Problem with Ambivert Formats

Just because a format is classified as “human-readable” doesn’t mean it’s understandable. Yes, I can read this JSON snippet:

{“name”:”Electric Vehicle Charging Stations”,”type”:”FeatureCollection” ,”crs”:{
“type”:”name”,”properties”:{“name”:”EPSG:26910″}} ,”features”:[ {“type”:”Feature”
,”geometry”:{“type”:”Point”,”coordinates”:[492699.273158735,5452194.867862]}
,”properties”:{“LOT_OPERATOR”:”City of Vancouver”,”ADDRESS”:”6810 Main Street”}}

… but it’s not exactly understandable. If it was truly human-readable, the data would look like this:

This file contains 32 points representing electric vehicle charging stations
in Vancouver. The first station is located at latitude 49.22249551, longitude
-123.1002624, is operated by the City of Vancouver, and the address is
6810 Main Street. The second …

How can we quickly and easily understand data that’s optimized for machines? Beyond that, how can we work with it in advanced ways? Often, we need to use XML to exchange highly structured data with other applications it was not originally intended for.

Use XML they said It will be fun they said

Understanding Any XML-Based Data

The key is to leverage automation where you can. In other words, get a machine to parse the data before you try to gather useful information from it.

Raw XML vs Parsed in FME — Left: Raw XML. Right: Parsed by FME and shown as a map and a list of attribute information.

Automating the hard part — interpreting the syntax — frees you to work with the attributes like any other data type. Here are a few automations that make it easier to understand these formats. These are new features we added in FME 2016, with the goal that users shouldn’t need to know anything about XML, GML, or JSON to read these data sources (or to write to them, or modify, map, template, update, validate, or any other transformations).

Working with XML Hierarchies

Understanding XML starts with presenting the data’s hierarchical structure in a straightforward way. In FME, the tree navigation interface lets you choose which XML node you want to become a feature type (or layer or feature class). Then the fields are auto-generated. Using the data in your workflow becomes simple — for example to create geometry out of lat/long and georeference the data points.

Workbench2016-XMLElementSelectNew

Working with Nested JSON

JSON doesn’t have XML’s schema complexities, but it can be highly nested. This often means counting brackets and braces and trying to write complicated JSON queries. The solution: automatically flatten JSON into features and parse common forms into attributes. We introduced an automatic schema scan mode to FME’s JSON reader for this. From there, you can work with the features and attributes in a human-friendly interface. (And like XML hierarchies, we are also planning to add a tree navigation UI in the next FME release.)

Working with Schemaless GML/WFS

Often, your GML schema is in a separate .xsd file, and that doc is missing. Now what? When you don’t need to know the GML schema, don’t worry about it. We added an ignore schema mode to FME so you can read almost any GML. While a schema is still needed to write and validate GML, in cases where the exact schema isn’t critical (e.g. going from GML to GIS) it’s helpful to be able to ignore it.

The moral of the story: Make your computer do the hard part and interpret XML/JSON syntax for you. Once you’re able to understand what’s going on inside an XML-based dataset, you can better use it in your data integration and automation scenarios. You could even build your own web services!

If you want to learn more about working with XML, GML and JSON, we have lots of resources: