Follow these steps to transform raw data into a useful format that helps generate insight.
When we asked “What does data-wrangling mean to you?”, your answers included some great definitions and analogies:
- “Getting your data under control.”
- “Rolling up your sleeves to wrestle with data.”
- “Grouping data together and getting it moving in the same direction.”
- “Roping a bunch of things together.”
Formally, data wrangling (or “data munging”) is the process of transforming data from its raw form into a more useful format, making it more valuable for analytics and insights. Data tends to be messy—it’s scattered across different systems and formats, trapped in old programs or difficult formats, and it’s full of errors. The image of “wrangling” data is accurate, because managing different systems and transforming data into something useful can feel chaotic and disorganized.
- The above answers came from our webinar, Let’s Get Ready to Wrangle! 6 Steps Towards Data Quality Bliss. Watch the recording to see a live demo and dive into this topic in more detail.
When data is properly wrangled, businesses can generate insight and make better decisions. Let’s go through the six steps of data wrangling using FME, which will take data from a scattered mess to a valuable format ready for analytics.
- FME is the data integration platform with the best support for spatial data. It helps you spend less time fighting with your data and more time using it. Learn more
1. Data Discovery
What data do you have, and what do you need?
The first step of data wrangling is to gather data, inspect it, and understand it. For example, in the below dataset, we can see that the dataset includes inconsistent date formats, as well as some null values. It also contains location data, but the location isn’t very useful as-is, and would be better if we converted it into point geometry. So we know upfront that this data should be cleaned up.
Inspecting a dataset of lost and found pets in the FME Data Inspector. Watch the demo
Next, organize the raw data to make it more useful and ready for analytics. For example, in the above dataset with inconsistently formatted dates, FME’s DateTimeConverter can automatically format date-time values, making it easy to standardize all of the dates in the file (e.g. transform all of them into ISO dates).
Consider what other information would be useful to have in the dataset. For example, spatial location, satellite images, dates and times, a customer database, or historical data could be gathered from other sources and merged with the dataset.
In our example of lost and found pets, we already have a location column, but it’s not very useful. It would be better if the location were an actual point geometry. We can use a series of transformers to turn the lat-long values into point geometry, enabling us to view the data on a map.
The VertexCreator or the CoordinateExtractor can be useful for working with latitude and longitude values or XY coordinates. The below screenshot of FME Workbench shows a workflow that uses the CoordinateExtractor, with attributes containing coordinates on the left, and their respective point geometry on the right.
Data validation is an important part of any workflow. It’s always good practice to check for correctness, completeness, and compliance with standards. Check for null values, check the geometry, check the schema, check anything that could compromise the dataset’s quality and usefulness.
For example, in our lost and found pets dataset, we can identify null values and replace them with the text “Unknown”.
Useful FME transformers in this step include:
Conditional Values can also be useful for setting a field based on criteria.
After applying a set of validation rules, it’s time to clean up the data. This is where you can remove extra columns or attributes, and otherwise prepare the schema for the final step of the workflow. FME’s AttributeRemover can be useful here.
The final step in data wrangling is to share it in a way that makes it the most useful. This could be as simple as writing it to a shareable format and putting it into a folder, like a PDF, spreadsheet, or document. You could also send it to another system, like business intelligence software, for analysis.
Useful FME transformers and functionality for this step include:
- PDFStyler and PDFPageFormatter (Tutorial: Creating PDF Cartographic Output in FME)
- ExcelStyler (Tutorial: Getting Started with Excel in FME)
You could also share it via a web application or portal so that anyone can access the data in their preferred format. In the below example, we have created an FME Server Web App to allow people to search the dataset for a specific pet name, and get back a table and web map.
When you use these six steps to wrangle your data from messy to manageable, you can gain valuable insights. The transformers and functionality recommended above will help, and creating a web application in FME Server will allow you to create an interface for easy sharing in the final step. Watch our webinar, Let’s Get Ready to Wrangle! 6 Steps Towards Data Quality Bliss, to dive deeper into this topic and see a live demo.