Many people today are focused on the predictive power of AI, especially the machine learning techniques that process large amounts of data in order to identify patterns and produce complex insights. These techniques range from simple statistical methods such as linear regression to more complicated algorithms like ensemble methods (which combine multiple models) and the deep learning architectures built on neural networks.
The assumption underlying AI’s perceived predictive power is the existence of a perfectly structured set of data, ready to be processed by powerful machine learning algorithms. However, we must ask ourselves: Is this assumption realistic in real-life case studies? For example, if the goal is to understand consumer trends by processing publicly available data in blogs and social networks, is it straightforward to transform the data in such a way that AI can process them? Or if the objective is to predict healthcare costs, would it be trivial to aggregate and integrate medical insurance claims data? In most of these cases, the task of cleaning and filtering data coming from various data sources is daunting.
Let’s start by pointing out the three categories into which data can fall, based on how they are formatted: structured, semi-structured, and unstructured. Structured data (e.g., SQL tables) are expressed in the form of tables and traditionally live inside relational databases; semi-structured data (e.g., CSV, XML, JSON) lack a rigid structure but are self-descriptive, i.e., they contain descriptors/tags that explain every data element; and unstructured data, such as binary files (images, audio) or pure text (media articles, legal documents), do not belong to either of the previous categories.
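To make the three categories concrete, here is a minimal Python sketch (all values hypothetical) that touches one example of each: a relational row, a self-descriptive JSON record, and a raw text string.

```python
import json
import sqlite3

# Structured: rows in a relational table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (department TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('toys', 120.0)")
structured = conn.execute("SELECT * FROM sales").fetchall()

# Semi-structured: JSON is self-descriptive -- every value carries its own tag.
semi_structured = json.loads(
    '{"department": "toys", "amount": 120.0, "notes": ["rush order"]}'
)

# Unstructured: raw text (or binary) with no schema or tags at all.
unstructured = "Customer called about a delayed toy shipment..."

print(structured)                      # [('toys', 120.0)]
print(semi_structured["department"])   # toys
```

Note that the structured row is meaningless without its table schema, while the JSON record explains itself field by field; the raw text requires separate processing (e.g., natural language parsing) before any analysis.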
The main objective of data science is to analyze the data for better decision making. In most organizations, the primary data storage system for drawing insights is either a data warehouse or a data lake. A data warehouse is an On-Line Analytical Processing (OLAP) database management system that supports complex data queries (e.g., find the total sales for each department per quarter). These queries access large amounts of data, which are modeled in simpler structures (“schemas”) than in a traditional On-Line Transaction Processing (OLTP) database. A data lake is used if we need to store all types of data, without assuming any predefined data structure.
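The quarterly-sales query mentioned above can be expressed as a simple aggregation in SQL. Here is a sketch using Python's built-in sqlite3 module against a hypothetical fact_sales table (the table name and values are invented for the example):

```python
import sqlite3

# Hypothetical fact table, one row per individual sale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (department TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [("toys", "Q1", 100.0), ("toys", "Q1", 50.0), ("books", "Q2", 80.0)],
)

# The OLAP-style query from the text: total sales per department per quarter.
rows = conn.execute(
    """SELECT department, quarter, SUM(amount)
       FROM fact_sales
       GROUP BY department, quarter
       ORDER BY department, quarter"""
).fetchall()
print(rows)  # [('books', 'Q2', 80.0), ('toys', 'Q1', 150.0)]
```

In a production data warehouse the same GROUP BY would scan millions of rows, which is why OLAP systems favor simple, wide schemas over the highly normalized tables of OLTP databases.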
A data warehouse is primarily accessed by non-data scientists (business users), and the tools used for analyzing the data include SQL (Structured Query Language) coupled with Business Intelligence (BI) platforms. These tools do not provide the advanced data-mining capabilities of traditional data science tools like Python or R, so the insights drawn from data warehouses are limited. Additionally, data engineers spend a large portion of their time extracting, cleaning, and standardizing the data, building what are called ETL (Extract, Transform, and Load) pipelines, which must run before data are stored in the data warehouse. Applying the data transformations before the load step ensures that data can be made compliant with GDPR, HIPAA, CCPA, or other industry-specific security regulations.
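A toy ETL pipeline under these assumptions might look like the following Python sketch: the transform step cleans types and pseudonymizes an email field before anything reaches the warehouse (the record fields and the masking rule are hypothetical, chosen only to illustrate the transform-before-load ordering):

```python
import sqlite3

# Extract: raw records pulled from a hypothetical source system.
raw_records = [
    {"name": "  Alice ", "email": "alice@example.com", "spend": "120.5"},
    {"name": "Bob", "email": "BOB@EXAMPLE.COM", "spend": "80"},
]

def transform(record):
    """Clean and standardize -- and mask PII before it reaches the warehouse."""
    return (
        record["name"].strip(),                              # trim whitespace
        record["email"].lower().split("@")[0][:1] + "***",   # pseudonymize email
        float(record["spend"]),                              # enforce numeric type
    )

# Load: only the already-transformed rows land in the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (name TEXT, email_masked TEXT, spend REAL)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [transform(r) for r in raw_records],
)
rows = warehouse.execute("SELECT * FROM customers").fetchall()
print(rows)  # [('Alice', 'a***', 120.5), ('Bob', 'b***', 80.0)]
```

Because the raw email never reaches the warehouse, downstream consumers cannot leak it, which is the compliance argument for transforming before loading.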
The data lake has become the preferred storage choice for data scientists and the so-called single source of truth for organizations that want to become data driven. Data lakes enable the use of traditional machine learning tools (such as Python and R). One way to implement a data lake is via the Hadoop Distributed File System (HDFS), which is based on the design of the Google File System (GFS). HDFS scales storage capacity by splitting large files into blocks distributed across a large cluster of machines. Since it is a file system and not a database management system, it can easily store all types of unstructured data. Data lakes, unlike data warehouses, do not enforce a global schema on the data to be stored, so it makes more sense to transform the data after they are loaded into the data lake rather than before; hence ETL becomes ELT (Extract, Load, and Transform).
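The ELT ordering can be sketched in a few lines of Python, using a temporary directory as a stand-in for the lake (in practice this would be an HDFS or cloud-storage path): raw records are loaded untouched, and the type-casting transformation happens only later, at read time. All field names are hypothetical.

```python
import json
import os
import tempfile

lake = tempfile.mkdtemp()  # stand-in for a data lake directory

# Load first: raw records land in the lake exactly as extracted, no schema enforced.
raw = [{"user": "alice", "clicks": "3"}, {"user": "bob", "clicks": "7", "extra": True}]
for i, rec in enumerate(raw):
    with open(os.path.join(lake, f"event-{i}.json"), "w") as f:
        json.dump(rec, f)

# Transform later, at read time, and only for the analysis that needs it.
total = 0
for name in sorted(os.listdir(lake)):
    with open(os.path.join(lake, name)) as f:
        total += int(json.load(f)["clicks"])  # the type cast happens after the load
print(total)  # 10
```

Note that the second record carries an extra field the first one lacks; a schema-enforcing warehouse would reject or reshape it at load time, while the lake simply stores it.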
In our current big data era, distributed architectures over commodity hardware have become the norm, and the cloud has allowed the separation of data storage from the computational power needed to execute powerful machine learning algorithms. After all raw data are stored in a data lake, data transformations must be performed to prepare and standardize the data before AI is applied. Preferably, all data transformations and any computations done in a data lake environment should run in parallel, since this accelerates execution. MapReduce is one of the early software frameworks for processing large data sets in parallel; it splits every job into two steps: a Map function (map the data to key-value pairs) followed by a Reduce function (aggregate over the pairs that share the same key). A more recent parallel data processing framework, Apache Spark, leverages main memory (RAM), is significantly faster than MapReduce, and can also process real-time streaming data, such as data coming from sensors, mobile devices, or clickstreams from websites. Spark is agnostic to the persistent storage system, which can be HDFS or another file system, like the Apache Cassandra File System (CFS), which is HDFS compatible.
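The two-step pattern can be sketched in plain Python with a word count, the classic MapReduce example. This single-process sketch only shows the data flow; on a real cluster the map calls run in parallel across machines, and a shuffle step routes pairs with the same key to the same reducer.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit one (word, 1) key-value pair per word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the values of all pairs that share the same key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

documents = ["big data big insights", "data lake"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts)  # {'big': 2, 'data': 2, 'insights': 1, 'lake': 1}
```

Because each map call is independent and each reduce touches only one key's pairs, both phases parallelize naturally, which is exactly what frameworks like Hadoop MapReduce and Spark exploit.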
Storing raw data in a data lake has been one of the latest trends in the industry, but it has failed in many cases for several companies. Why? The continuous storage of large amounts of data without organizing and maintaining them properly leads to a state in which the data lake becomes a data swamp, where any data analysis is almost impossible. Data transformations are extremely difficult to implement and generalize in a data swamp, even for the best data engineers. It is almost impossible to resolve data ambiguity (whether text or images) among different data sources and understand the meaning of seemingly different data fields that come from tens or even hundreds of data warehouses or from the web. At a minimum, metadata available for all data must be stored in the data lake.
Finding the optimal data transformation for large amounts of data coming from hundreds or thousands of sources is not feasible. One way to tackle this problem is to have AI suggest several candidate data transformations and let human experts decide which transformation best fits the data. A good analogy for this solution is the way several AI platforms used for image recognition ask humans questions about the images so that they can improve their accuracy in recognizing patterns.
It is essential for data scientists and data engineers to understand that a data lake is neither a black box nor a panacea for all the limitations that data warehouses present. The term data lake is used loosely as a catch-all phrase for a centralized repository that stores everything. However, the data must still be organized in a way that enables useful insights to be drawn. Transformed data are visualized for humans, who become the decision makers regarding which data transformation methods are correct. Ideally, metadata in the form of a graph or a relational database should be stored in the data lake to accelerate data exploration. Given the complexities of maintaining a data lake, enterprises should gradually transition from the current state of multiple data warehouses and localized databases to one centralized repository where all data are stored. Enterprise data strategies should focus on the organization’s need to improve existing products and processes, and should thoroughly measure the effectiveness of the latest data architectures.