Why Excel is not a database
Excel is not a database… Here’s why.
So it happened again. 16,000 COVID-19 tests in England got lost, and MS Excel gets the blame. The reason? It was being relied on for huge amounts of data, leading to big problems.
After more than 20 years as a consultant and system developer, I certainly have seen many creative and sometimes, quite frankly, life-threatening solutions where a spreadsheet has developed uncontrollably into a critical business or documentation system (and by “creative solutions”, I really mean horror stories).
Excel is a popular tool, and for a good reason. It’s flexible and often really quick to get started with to perform calculations and create diagrams. Also, with integration into Sharepoint, Excel can be set up quickly as a multi-user environment.
But, it certainly also has some shortcomings.
Some of these can be related to technical issues, as in the initial example of lost COVID-19 test data where apparently a text file was imported into Excel, but with more lines/records than could be handled in the spreadsheet resulting in the lost records.
Now, experienced and Excel-savvy users may very well be aware of these limitations, but when you are in a hurry (and in crisis-mode), this can easily be forgotten. And, it’s not easy to know the exact limitations you have at the time as this also can be version-dependent. Even if the limitation would have been discovered in time, some sort of workaround had to be figured out. Maybe splitting the spreadsheet into several sheets?
What is a “Real Database”?
In this case, it actually seems that Excel was used as a replacement for a “real database” and in many cases that may be a pretty bad idea. Certainly when you are handling more than a million records and most definitely when it is data related to serious health issues. It’s not surprising though that Excel can be assumed to be a real database: A spreadsheet looks very much like a database for a non-database user.
In terms of a “real database”, I think of systems like PostgreSQL, SQL Server, Oracle, or even more lightweight systems like SQLite. These systems focus on storing large amounts of data — with integrity. Meaning you can set up rules regarding your precious data. For the case of COVID-19, this could mean ensuring that the same test is not registered twice.
However, loading a “real database” with data can be daunting as there are so many different ways of importing data. And certainly automating the continuous update of the database can be a real challenge.
Excel and Databases Can Still Work Together
There are efficient ways of setting up automated workflows for these database scenarios. The best solution is often to use the right tools for the right purpose, and Excel can still be a part of that solution. A good example can be found at the European Environment Authority where these scenarios have been carefully thought through after many years of experience. They regularly collect environmental data from a great number of member states. Mr. Jan Bliki, Head of the group on data management, explains:
“Excel is still the easiest input tool among non-database users. It’s hard to replace if users are that used to that format. In our case, we ask to deliver multiple files using the same template: Excel files that only store a fraction of the data (for example, by hospital and by day). Then, we use a central service that brings all these daily results together into one single database using a single ETL tool. To make easy use of such a database, a BI tool could be linked so those same users would still have the ability to produce an overview of all results.”
Key Benefits of a Real Database
There are some really good points here:
- A single database helps preserve the integrity of your data.
- The automation capabilities of an ETL tool, such as FME, can include QA rules that send alerts when something suspicious happens. But, it also makes it easier to keep the database up to date.
- As Jan also mentions, using BI alongside a database is a great way to provide feedback and insights to users.
So how can an ETL tool like FME handle validation and QA in a process like this? Think of the database as a secure bank-vault. To get data in (and out), it has to have the right credentials. FME is the safety-system checking that your data brings the right credentials and, if not, warns you that something fishy is going on.
A few examples of checks you could include:
- Are there more than a certain number of rows in the Excel sheet? This could have helped in preventing the COVID-incident mentioned in the beginning.
- Are there duplicate records?
- Are data-values within reasonable ranges?
- Do all fields have values?
The list goes on. One of the greatest benefits of having an automated process is that you can constantly add more checks as the situation evolves! Making the safety-system stronger with time.
Don’t Blame Individual Tools
To sum this up, I honestly think that blaming the spreadsheet’s shortcomings for accidents, like the one with the COVID-19 tests, is far too easy and avoids the core problem. The tool has most likely been used in a way that it was not intended for. And most importantly: Any information owner should take measures to treat their data as the valuable resource it really is!
Want to learn more about Excel? Check out “5 Ways FME Can Improve Your Excel Data” webinar hosted by Safe Software co-founder Dale Lutz.