In the rapidly evolving landscape of Machine Learning and AI, companies are racing to deliver cutting-edge solutions to their customers. Amid this rapid evolution, however, a robust data universe characterized by high quality and integrity is indispensable. While much emphasis is placed on refining AI models, the significance of pristine datasets is often overshadowed.
This article explores some of the essential data engineering tools organizations need to improve data quality and to triage and analyze data for business-centric machine learning, analytics, reporting, and anomaly detection. To illustrate these tools and frameworks and their importance, consider a scenario from the fintech industry.
Imagine a customer support team that relies on a customer referral platform for sales and marketing leads. These representatives engage with customers over the phone, discussing various offers and programs. Recently, they have encountered instances where recommended phone numbers lead to inaccurate customer details, with no discernible pattern. This challenge not only underscores the importance of data integrity but also highlights the critical role of data engineers in resolving such issues. As stewards of the data universe, data engineering teams are primarily tasked with addressing these challenges, working closely with the sales team.
Please refer to the figure below: the sales team works with customers to ensure accurate data, while the left side represents the data engineering processes, where data is sourced from various systems, including file systems, APIs, and databases. Data engineers build and manage complex pipelines and workflows to consolidate this data into the final dataset used by customer support teams. Identifying the source of a data issue becomes challenging given the complexity and number of pipelines in an enterprise organization. Simple questions such as "Where are we sourcing this data?" and "What is broken in this data flow?" become daunting for data engineers when an organization maintains hundreds of pipelines.
To address this challenge, data engineers need robust tools and frameworks so they can respond in a timely manner to everything from simple customer support inquiries to critical leadership insights. These tools should make it possible to triage the data flow quickly, observe data values at each layer of the flow easily, and proactively validate data to prevent issues. At a basic level, the three tools/frameworks below add significant value in handling this challenge.
A data lineage tool captures the data flow from its origin, through its various transformations, to its final destination. It provides a clear map of where data comes from, how it is processed, and where it goes, helping data engineers quickly identify the lineage of the data being constructed.
A data-watching tool allows engineers to monitor data values in real time at different stages of the pipeline. It provides insight into data values, potential anomalies associated with them, and their trends, enabling prompt responses to irregularities and empowering even business users to get involved in triaging.
A data validation tool checks data at various points in the pipeline to ensure it meets predefined standards and rules. This proactive validation helps in catching and correcting data issues before they propagate through the system.
To dive deeper into each of these tools in the context of the challenge posed, we will consider a data structure with a defined workflow. In this case, we have a customer entity represented as a table whose attributes are fed from a file system and an API.
To simplify, consider a scenario where phone numbers are obtained through an API, while address details are sourced from a file system. To re-emphasize the original challenge: the phone number is missing in the final customer support platform. From a data triaging standpoint, a data engineer needs to trace the phone number attribute back through numerous data pipelines and tables, find its source, and understand its lineage.
In any data flow, at a given point in time, a set of data elements is persisted and ETL processes are applied to load the transformed data. To triage the data effectively and find its lineage, the following basic setups are required:
This involves creating a comprehensive map that links each data element to its respective source. This mapping ensures traceability and helps in understanding where each piece of data originates (see the sketch after this list).
As new workflows are introduced, the configuration should be flexible enough to incorporate these changes without disrupting existing processes. This extensibility is crucial for accommodating the dynamic nature of data pipelines.
Data sources can change over time, whether due to schema updates, new data sources, or modifications in data structure. The configuration must be adaptable to these changes to maintain accurate data lineage.
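As a minimal illustration of the element-to-source mapping referenced above, the sketch below keeps the configuration as plain Python data structures. The attribute names, source identifiers, and lookup helper are hypothetical; in practice this configuration would live in a versioned config store and grow as new workflows are added.

```python
# Hypothetical source-to-element mapping for the customer entity.
CUSTOMER_LINEAGE_CONFIG = {
    "phone_number": {
        "source": "customer_contact_api",   # upstream API feed
        "landing_table": "raw.customer_contacts",
        "transformations": ["stg.customer_contacts", "mart.customer_360"],
    },
    "address": {
        "source": "address_file_drop",      # nightly file-system drop
        "landing_table": "raw.customer_addresses",
        "transformations": ["stg.customer_addresses", "mart.customer_360"],
    },
}

def trace_element(attribute: str) -> list[str]:
    """Return the ordered hops an attribute takes from source to final table."""
    entry = CUSTOMER_LINEAGE_CONFIG[attribute]
    return [entry["source"], entry["landing_table"], *entry["transformations"]]

if __name__ == "__main__":
    # "Where are we sourcing the phone number from?"
    print(" -> ".join(trace_element("phone_number")))
```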
This lineage can mostly be inferred from code when it involves plain SQL, by referencing the code bases. It becomes more complex when languages such as Python or Scala are involved alongside SQL; in such cases, manual intervention is needed to maintain the configuration and identify the lineage, which can be done in a semi-automated fashion. This complexity arises from the diverse syntax and semantics of each language, which makes fully automated inference challenging.
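For the plain-SQL case, a rough sense of lineage can be scraped directly from the code base. The sketch below uses a simple regular expression to pull source and target tables out of a query; it is only a heuristic (a production setup would use a proper SQL parser), and the sample query is hypothetical.

```python
import re

# Hypothetical transformation script pulled from the code base.
SQL = """
INSERT INTO mart.customer_360
SELECT c.customer_id, c.phone_number, a.street, a.city
FROM stg.customer_contacts c
JOIN stg.customer_addresses a ON a.customer_id = c.customer_id
"""

def extract_sources(sql: str) -> set[str]:
    """Heuristically collect tables referenced after FROM or JOIN."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
    return set(pattern.findall(sql))

def extract_target(sql: str) -> str | None:
    """Heuristically find the table being written to."""
    match = re.search(r"\bINSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    return match.group(1) if match else None

if __name__ == "__main__":
    print(f"{extract_sources(SQL)} -> {extract_target(SQL)}")
```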
GraphQL can be utilized for maintaining data lineage by using nodes and edges to represent data elements and their relationships. This approach allows for a flexible and queryable schema that can easily adapt to changes and new requirements. By leveraging GraphQL, organizations can create a more interactive and efficient way to manage and visualize data lineage.
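As a sketch of what a queryable lineage graph could look like, the snippet below posts a GraphQL query to a hypothetical lineage service; the endpoint, the schema fields (node, upstream, sourceSystem), and the response shape are all assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical GraphQL endpoint exposing the lineage graph.
LINEAGE_ENDPOINT = "https://lineage.example.com/graphql"

# Assumed schema: each node is a data element or table; edges point upstream.
QUERY = """
query Upstream($name: String!) {
  node(name: $name) {
    name
    upstream {
      name
      sourceSystem
    }
  }
}
"""

def fetch_upstream(element: str) -> dict:
    """Ask the lineage service where a given data element comes from."""
    payload = json.dumps({"query": QUERY, "variables": {"name": element}}).encode()
    request = urllib.request.Request(
        LINEAGE_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    print(fetch_upstream("mart.customer_360.phone_number"))
```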
Several data lineage tools are available in the market, each offering unique features and capabilities: Alation, Edge, MANTA, Collibra, and Apache Atlas, and the individual cloud providers also offer their own cloud-based lineage services.
After identifying the source, we need the ability to see whether the phone number coming from that source is actually propagated through each transformation and load without its value changing. To observe this data matching, we need a simple, unified mechanism that brings the data together and presents it.
Let's dive into data watching.
The data-watching capability can be achieved by leveraging various database connectors to retrieve and cleanly present data from distinct data sources. In our example, the phone attribute value is properly ingested from the API into the table, but it is lost when writing to the front end. This is a classic case of data loss. With visibility into this process, data engineers can quickly address the issue.
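A minimal sketch of such a unified view follows: it reads the same customer's phone number at each stage over standard database connectors and flags where the value diverges. The connection strings, table names, and stage list are assumptions for illustration.

```python
from sqlalchemy import create_engine, text

# Hypothetical connection strings and tables for each layer of the pipeline.
STAGES = {
    "api_landing": ("postgresql://raw-db/customers", "raw.customer_contacts"),
    "warehouse": ("postgresql://dwh/analytics", "mart.customer_360"),
    "support_platform": ("postgresql://app-db/support", "app.customer_profile"),
}

def watch_phone_number(customer_id: int) -> None:
    """Show one customer's phone number at every stage and flag mismatches."""
    seen = {}
    for stage, (url, table) in STAGES.items():
        engine = create_engine(url)
        with engine.connect() as conn:
            row = conn.execute(
                text(f"SELECT phone_number FROM {table} WHERE customer_id = :id"),
                {"id": customer_id},
            ).fetchone()
        seen[stage] = row[0] if row else None
        print(f"{stage:18} {seen[stage]}")

    if len(set(seen.values())) > 1:
        print("Mismatch detected: the phone number is lost or altered in flight.")

if __name__ == "__main__":
    watch_phone_number(customer_id=12345)
```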
Below are the notable benefits of a unified data-watching approach.
The ability to track data sourcing, timeliness, and accuracy enhances confidence across the organization. So far, we have discussed data lineage and data watching as ways to understand data sourcing, track data at different ingestion and transformation points, and observe its value at each stage. There are no tools that offer data-watching capabilities on their own; these functionalities are often by-products of data discovery or data cataloging tools, so organizations need to develop unified platforms based on their specific requirements. Tools like Retool and DOMO can unify data into a single view, providing a consolidated, clear representation of the data flow.
In the next section, we will explore how to monitor data quality and notify teams of issues to prevent incorrect data from propagating to final systems. This proactive approach ensures data integrity and reliability, fostering trust and efficiency within the organization.
Data validation is a crucial process for ensuring the quality and integrity of data as it flows through various pipelines and systems. Regularly refreshed data needs to be validated to maintain its accuracy and reliability. Data validation can be performed using different methods and metrics that check for consistency, completeness, and correctness at various points in the flow.
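As a rough sketch of what such metric checks might look like on the customer table, the snippet below computes a completeness rate and a simple correctness (format) rate for the phone number column with pandas; the sample data, column names, phone pattern, and threshold are assumptions.

```python
import pandas as pd

# Hypothetical extract of the customer table at one stage of the pipeline.
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "phone_number": ["+14155550101", None, "+14155550103", "not-a-number"],
    }
)

# Completeness: share of rows where the phone number is populated.
completeness = customers["phone_number"].notna().mean()

# Correctness: share of populated rows matching a simple E.164-style pattern.
valid = customers["phone_number"].str.fullmatch(r"\+?\d{10,15}", na=False)
correctness = valid.sum() / customers["phone_number"].notna().sum()

print(f"completeness: {completeness:.0%}, correctness: {correctness:.0%}")

# A pipeline might fail fast when a metric drops below an agreed threshold.
assert completeness >= 0.75, "Too many missing phone numbers"
```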
Several libraries provide built-in functions and frameworks for performing data validation, making it easier for data engineers to implement these checks. The sample code below gives a sense of how such validation is implemented.
Implementing data validation involves setting up the necessary checks and rules using the chosen library or framework. Here's an example of how to implement basic data validation using Great Expectations:
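A minimal sketch follows, using the classic pandas-backed Great Expectations API (the exact calls vary by version); the sample data and phone pattern are assumptions carried over from the running example.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical slice of the final customer dataset to validate.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "phone_number": ["+14155550101", None, "+14155550103"],
    }
)

# Wrap the DataFrame so expectation methods become available
# (classic, pre-1.0 Great Expectations API).
dataset = ge.from_pandas(df)

# Proactive checks before the data reaches the support platform.
dataset.expect_column_values_to_not_be_null("phone_number")
dataset.expect_column_values_to_match_regex("phone_number", r"^\+?\d{10,15}$")

results = dataset.validate()
print("Validation passed!" if results.success else results)
```

Checks like these can run as a pipeline step so that a missing or malformed phone number fails the load instead of silently propagating to the customer support platform.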
In this article, we explored how to leverage data lineage, data watching, and data validation so that organizations can build a robust data management framework that ensures data integrity, enhances usability, and drives business success. These tools collectively help maintain high data quality, support complex analytics and machine learning initiatives, and provide a clear understanding of data assets across the organization.
In today's data-driven world, the ability to maintain accurate, reliable, and easily discoverable data is critical, and these tools enable organizations to fully leverage their data assets, drive innovation, and achieve their strategic objectives. Together with data cataloging and data discovery capabilities, these frameworks give business users broader visibility into the data, helping drive innovation from both the business and technical arenas.