DM0 - Introduction to Data Management

Modern Data Reporting Tools

Most people are familiar with the productivity tools in Microsoft's Office product line. Two of the best known are:

  • Word - A document creation and processing tool which focuses on data processed in paragraphs.
  • Excel - A table creation tool which allows for quick data entry and for processing of the data that resides within the table grid.

These two tools are so widespread that it has become a running joke how many companies use Excel files as their database for storing large amounts of information. Though it is possible to do this, dedicated databases offer real benefits: fast sorting and querying of large datasets, safe concurrent access, and backup and recovery features that allow lost or corrupted data to be restored.

There are a few terms that would be good to define at the outset:

  • dataset - A collection of raw information, which has meaning when interpreted as a set.
  • database - Structured data. There may be multiple datasets stored within a database.

Using the above two definitions, an Excel spreadsheet can be considered a database. Spreadsheets are typically table-formatted datasets. The table is the structural component, providing meaning to the data in each cell. The data within the cells is the set. For the vast majority of simple data entry and processing in our daily lives, the spreadsheet is a sufficient tool. For example, tracking personal expenses on a monthly basis and categorizing the purchases can be done within Excel. This data is “low volume data” when compared with most datasets on a corporate scale, and the loss of this personal expense data, while annoying, will not (typically) be catastrophic to our personal finances.

Bottom Line - When the volume of data is sufficiently large, or the consequences of lost data are sufficiently high, it is prudent to explore alternative database options which offer scalability and concurrency to address the issues surrounding both the volume and the loss of data critical to business success.

Separation of tasks and responsibilities

If we assume for now that the spreadsheet-as-database method is not sufficient for our use case, we begin searching for alternative solutions. Most alternative solutions for data management break the problem down into two distinct parts.

  1. Data Structure and Storage.
  2. Data Viewing.

These two responsibilities are related, but are typically handled using different tools and different computer languages. Fundamentally, Data Structure and Storage is responsible for organizing data in a logical structure to allow for easy access and review. Data Viewing is responsible for accessing the data, filtering the relevant information, and formatting it for human consumption.
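
As a rough illustration of this split, the sketch below keeps the two responsibilities in separate functions. The expense records, the field names (date, category, amount), and the function names are all hypothetical and chosen only for this example. Python is used purely for illustration.

```python
import json

# --- Data Structure and Storage ---
# The record layout below is an assumption made for this sketch.
def store_records(records):
    """Serialize the records into a structured (JSON) text form."""
    return json.dumps(records)

def load_records(stored_text):
    """Read the structured text back into Python lists and dictionaries."""
    return json.loads(stored_text)

# --- Data Viewing ---
def view_by_category(records, category):
    """Filter the relevant records and format them for a human reader."""
    lines = []
    for record in records:
        if record["category"] == category:
            lines.append(f"{record['date']}  {record['amount']:>8.2f}")
    return "\n".join(lines)

# Hypothetical personal expense data.
expenses = [
    {"date": "2024-01-03", "category": "groceries", "amount": 54.20},
    {"date": "2024-01-05", "category": "fuel", "amount": 40.00},
    {"date": "2024-01-09", "category": "groceries", "amount": 23.75},
]

stored = store_records(expenses)
report = view_by_category(load_records(stored), "groceries")
print(report)
```

The storage functions know nothing about how the data will be displayed, and the viewing function does not care how the data was stored. Either side could be swapped out without touching the other.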

Data Structure and Storage

Structured data formats allow for easier computer processing. Depending on the size of the data and the consequences of its loss, there are various options to consider. For most simple applications, a Database Management System (DBMS) or Relational DBMS (RDBMS) is overkill. When the amount of data to be stored is small, it can easily be stored as plain text in a designated format. Some examples are:

  • Comma Separated Values ( CSV )
  • Extensible Markup Language ( XML )
  • JavaScript Object Notation ( JSON )

Using a standardized format allows for easy processing of data, because parsing libraries are widely available to do the heavy lifting.
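
To make this concrete, here is a minimal sketch that reads the same two records from each of the three formats using only parsers from Python's standard library. The sample data and field names (item, amount) are invented for this example, and the XML layout is just one of many possible ways to express the records.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical records, written in each of the three formats.
csv_text = "item,amount\ncoffee,3.50\nlunch,12.00\n"
json_text = '[{"item": "coffee", "amount": "3.50"}, {"item": "lunch", "amount": "12.00"}]'
xml_text = (
    "<expenses>"
    "<expense><item>coffee</item><amount>3.50</amount></expense>"
    "<expense><item>lunch</item><amount>12.00</amount></expense>"
    "</expenses>"
)

# CSV: the header row supplies the field names for each record.
from_csv = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

# JSON: the parser returns lists and dictionaries directly.
from_json = json.loads(json_text)

# XML: visit each <expense> element and read its child elements.
root = ET.fromstring(xml_text)
from_xml = [
    {child.tag: child.text for child in expense}
    for expense in root.findall("expense")
]

# All three parsers produce the same list of records.
print(from_csv == from_json == from_xml)
```

Note that CSV and this XML layout carry every value as text, so the amounts are kept as strings in the JSON sample as well to keep the comparison fair. Converting them to numbers would be a separate step.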

Data Viewing

The most fundamental and widely used tool for data viewing is the web browser. It is the application from which you are accessing the content of this blog. We use it for everything from searching online encyclopedias for information to watching our favorite videos on video sharing sites like YouTube.

Modern web browsers have come a long way from formatting the static content present in HTML files. They are sophisticated applications with the ability to process, filter, and format information before presenting it for human consumption. You will have noticed by now that I have emphasized humans many times in this article. This is because they are at the core of the issue. Sharing data between computer systems can in many ways be easier than presenting information to people. Computer-structured data can be manipulated and shared between software systems. If that data is presented to a human without special care regarding its formatting, it will be largely incoherent. Where computers are very capable of processing structured data, humans are not. Where humans are very capable of processing visualized data, computers are not (though there has been lots of interesting work to change this in the recent past).
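
The gap between the two audiences can be seen in a small sketch. The category totals below are made up, and the two formatting functions are hypothetical helpers written for this example. They turn a structure that is convenient for a program into a plain-text table and a simple bar chart that are easier on a human reader.

```python
# Hypothetical monthly totals per category; the values are made up for illustration.
totals = {"groceries": 320.0, "fuel": 85.0, "dining": 140.0}

def as_table(data):
    """Format the totals as an aligned plain-text table."""
    width = max(len("category"), max(len(name) for name in data))
    rows = [f"{'category'.ljust(width)}  {'total':>8}"]
    for name, value in data.items():
        rows.append(f"{name.ljust(width)}  {value:>8.2f}")
    return "\n".join(rows)

def as_bar_chart(data, scale=20.0):
    """Format the totals as a text bar chart, one '#' per `scale` units."""
    width = max(len(name) for name in data)
    return "\n".join(
        f"{name.ljust(width)} | {'#' * round(value / scale)}"
        for name, value in data.items()
    )

print(totals)  # the form a program works with
print(as_table(totals))  # the same data as a table
print(as_bar_chart(totals))  # the same data as a chart
```

The underlying data never changes between the three print calls. Only the presentation does, which is exactly the job of the viewing side.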

As a result, this series will focus on storing data in structured sets for simplified computer processing, and presenting visualizations such as tables and charts for human consumption.