Data Quality in Big Data: A Comprehensive Guide

The Evolving Landscape of Data Management
More than ten years ago, The Economist predicted an impending deluge of data. The modern data stack has since emerged as a response to that deluge, driven forward by Silicon Valley companies such as Snowflake, Databricks, and Confluent.
Today, any entrepreneur can stand up a data solution that scales with business growth within hours, simply by signing up for a service like BigQuery or Snowflake.
The Rise of Scalable Data Storage
The development of affordable, adaptable, and scalable data storage options was primarily a reaction to evolving requirements fueled by the substantial increase in data volume.
Globally, some 2.5 quintillion bytes of data are generated each day. (A quintillion is a one followed by eighteen zeros.)
Continued Data Growth and Emerging Challenges
The exponential growth of data persists throughout the 2020s, encompassing both its creation and storage. It is anticipated that the amount of data stored will continue to double at least every four years.
Despite advancements, a crucial component of modern data infrastructure remains underserved by solutions appropriate for the challenges of the big data era: the monitoring of data quality and data validation.
A Look Back and Forward
Let's examine the historical progression leading to the current state and explore the obstacles that lie ahead concerning data quality.
The Shift from Volume to Value in Big Data
The big data revolution was truly ignited in 2005 when Tim O’Reilly’s article, “What is Web 2.0?”, was published. It was also in this year that Roger Mougalas of O’Reilly coined the term “big data” as we understand it today – denoting datasets too substantial for conventional Business Intelligence (BI) tools to effectively handle.
Initially, the primary obstacle concerning data revolved around managing its sheer volume. Data infrastructure was costly and lacked flexibility, and cloud computing was still nascent, with AWS not launching publicly until 2006. Processing speed presented another significant hurdle; even moderate datasets could require extensive processing times before the 2012 release of Redshift, as noted by Tristan Handy of Fishtown Analytics.
Scaling relational databases and data warehouse appliances presented considerable difficulties. A decade ago, understanding customer behavior necessitated significant upfront investment in server hardware before data scientists could begin analysis. Due to the expense of data and its infrastructure, large-scale data storage and ingestion were largely limited to major corporations.

A pivotal change arrived with the introduction of Redshift by AWS in October 2012. This cloud-native, massively parallel processing (MPP) database offered a viable solution to the scaling problem at a fraction of the previous cost – approximately $100, roughly 1,000 times cheaper than traditional on-premise setups.
As Jamin Ball from Altimeter Capital explains, Redshift’s significance lay in being the first cloud-native OLAP warehouse, dramatically reducing the cost of OLAP database ownership. Analytical query processing speeds also saw a substantial increase. Furthermore, innovations like separating compute and storage (pioneered by Snowflake) allowed for independent scaling of these resources.
The consequence of these advancements was a dramatic increase in data collection and storage capabilities.
Widespread adoption of scalable, modern cloud data warehouses began after 2016. This period marked the full emergence of BigQuery and Snowflake as major players (BigQuery didn’t offer standard SQL until 2016, limiting early adoption, and Snowflake’s product wasn’t fully mature until 2017). These platforms not only addressed the volume challenge but also introduced a new paradigm, decoupling the costs associated with volume and compute.

This decoupling enabled companies to store larger datasets more affordably, paying only for the data they actively processed. This represents a shift from consumption-based costs for “storage with compute power” to consumption-based costs for both storage and compute independently.
Prior to the reduction in storage costs, organizations were more selective in their data collection efforts, prioritizing high-value information. Today, with inexpensive data storage and the rise of ELT (Extract, Load, Transform) following the advent of modern cloud data warehouses and lakes, storing vast amounts of data “just in case” is a feasible strategy.
With the challenges of speed and volume largely overcome, the focus now shifts to ensuring the quality of these large data volumes before they are put to use.
Data Failures Are Frequent, and Their Repercussions Can Be Substantial
The field of data quality struggles with consistent definitions. Similar to the discourse surrounding artificial intelligence, the concept of “data quality” elicits nearly as many interpretations as there are perspectives on the matter. In the current landscape of big data, data quality centers on the prevention and mitigation of data failures.
A data failure represents a broad concept encompassing scenarios where data within a stream or dataset deviates from anticipated behavior. Instances include notable variations in data ingestion speed (for data streams), discrepancies in the expected number of rows within a data table, alterations to the data schema, substantial shifts in actual data values, or changes in the relationships between features within a dataset.
It’s important to recognize that this definition of data failures is tied to expectations. Data failures can manifest even if newly collected data accurately reflects reality and is free of errors, provided it diverges from established expectations (upon which its utilization is based).
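To make this concrete, here is a minimal sketch of what expectation-based checks might look like, comparing a new batch of data against a recent reference window for the kinds of deviations described above. The pandas-based implementation, column handling, and tolerances are illustrative assumptions, not a prescribed approach.

```python
import pandas as pd

def detect_failures(new_batch: pd.DataFrame, reference: pd.DataFrame,
                    row_count_tolerance: float = 0.5,
                    mean_shift_sigmas: float = 4.0) -> list[str]:
    """Compare a new batch against expectations derived from a reference window."""
    failures = []

    # Schema change: columns added, removed, renamed, or reordered.
    if list(new_batch.columns) != list(reference.columns):
        failures.append("schema: column set or order changed")

    # Row-count deviation relative to the reference batch.
    expected, actual = len(reference), len(new_batch)
    if expected and abs(actual - expected) / expected > row_count_tolerance:
        failures.append(f"volume: {actual} rows where roughly {expected} were expected")

    # Substantial shift in numeric values, measured in reference standard deviations.
    shared_numeric = reference.select_dtypes("number").columns.intersection(new_batch.columns)
    for col in shared_numeric:
        mu, sigma = reference[col].mean(), reference[col].std()
        if sigma > 0 and abs(new_batch[col].mean() - mu) > mean_shift_sigmas * sigma:
            failures.append(f"distribution: mean of '{col}' shifted more than {mean_shift_sigmas} sigma")

    return failures
```

The key point is that every check encodes an expectation about the data, not a claim about reality; a batch can be perfectly accurate and still trip these checks if it breaks with history.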
During the COVID-19 pandemic, numerous examples of data failures emerged, despite the absence of underlying data errors. Many organizations operated machine learning models trained on historical data. Training these models inherently assumes that historical data is representative of future data inputs. Consequently, significant differences between new and historical data constitute a data failure, even if the new data is factually accurate.
A blog post from DoorDash illustrates the necessity of retraining their machine learning algorithms following the onset of the COVID-19 pandemic. Lockdowns and social restrictions led to increased food ordering and restaurant sign-ups on the platform.
These evolving consumption patterns were not reflected in the historical data used to train DoorDash’s demand prediction models, rendering them ineffective in the new environment. DoorDash explained, “ML models depend on patterns in historical data to make predictions, but life-as-usual data cannot project to once-in-a-lifetime events like a pandemic… the COVID-19 pandemic brought demand patterns higher and more volatile than ever before, necessitating the retraining of our prediction models to sustain performance.”
Drawing inspiration from former U.S. Secretary of Defense Donald Rumsfeld, a useful framework categorizes data failures based on organizational awareness and understanding. Plotting these categories on a 2x2 matrix, with awareness on the vertical axis and understanding on the horizontal, yields four quadrants: known knowns, known unknowns, unknown knowns, and unknown unknowns.
It’s crucial to note that what is known within one organization may be unknown in another, regarding both awareness and comprehension. Therefore, broadly classifying data failures as simply knowns or unknowns is an oversimplification, as this varies between organizations.

However, here are practical examples of what each category signifies:
Known knowns: These are data failures that an organization recognizes and understands, allowing for the implementation of rules to detect them. This work can be performed by data engineers, data scientists, or business professionals with relevant data expertise. Rules can govern the format or value range for data validity within a specific column.
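As a rough illustration, such hand-written rules might look like the snippet below, where the field names, pattern, and ranges are hypothetical examples rather than recommendations:

```python
import re

# Hypothetical known-known rules for a customer record: explicit, human-defined checks.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ALLOWED_COUNTRIES = {"SE", "NO", "DK", "FI"}

def validate_record(record: dict) -> list[str]:
    """Apply hand-written validity rules; returns the list of violated rules."""
    violations = []
    if not EMAIL_PATTERN.match(record.get("email", "")):
        violations.append("email: does not match the expected format")
    if not 0 <= record.get("age", -1) <= 120:
        violations.append("age: outside the plausible range 0-120")
    if record.get("country_code", "") not in ALLOWED_COUNTRIES:
        violations.append("country_code: not in the allowed set")
    return violations
```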
Known unknowns: These are data failures that a company is aware of but lacks the understanding or resources to address. Several years ago, I advised data teams within large Nordic enterprises. Data owners frequently expressed worry about data from certain sources, citing inconsistent and erroneous values. They stated, “We lack the ability to effectively monitor and proactively address these issues, as the errors vary over time.”
Unknown knowns: These are data failures that a company is unaware of, yet implicitly understands, potentially enabling resolution upon discovery. However, without prior awareness, proactive rule definition and testing are impossible.
For organizations utilizing analytical data, where a time gap exists between data generation and consumption, human involvement is common. Analysts can potentially identify data failures before consumption, converting an unknown known into a known known. However, as data volumes and complexity increase, failures are often overlooked even by experts, as humans are not optimized for high-dimensional pattern recognition or large-scale data processing.
Organizations with operational data use cases, where data is generated and consumed in near real-time, have limited opportunities for human intervention. Knowledge of a potential failure is insufficient, as failures must be addressed before data consumption, which occurs almost immediately after generation.
Statistics Sweden, responsible for official Swedish statistics, encountered an unknown known data failure in 2019. They identified flaws in workforce survey data collection, impacting calculations of Sweden’s GDP and unemployment rate, crucial for government, central bank, and commercial bank decision-making.
Notably, Statistics Sweden employs a high proportion of statisticians and data experts, dedicated to providing high-quality data. In September 2019, they reported a 7.1% unemployment rate, while the actual figure was closer to 6%.
The reported numbers were significant enough to prompt inquiry from the Swedish Minister of Finance, but the concerns were initially dismissed. The discrepancy was discovered months later, leading to recalculations of Sweden’s GDP for several years.
Furthermore, both the central bank and commercial banks had based interest rate decisions on the inaccurate unemployment rate increase. This was interpreted as an economic downturn, prompting banks to lower interest rates aggressively.
What caused this substantial error? The organization had outsourced data collection to a consultancy firm that did not adhere to proper procedures – and that even fabricated data, since doubled response rates translated into a 24x higher fee (a separate issue in itself).
Statistics Sweden employees discovered these practices by identifying discrepancies between internally collected data and the consultancy’s data. Control calls to survey respondents confirmed that they had not participated.
While employees possessed the knowledge that data must be collected properly, they were unaware of the incentives within the outsourcing contract and lacked adequate checks and balances to ensure data quality. Upon awareness, they implemented corrective measures.
Unknown unknowns: These data failures are the most challenging to detect, as organizations are neither aware of nor understand them, making discovery difficult.
Recently, travel and tourism company TUI experienced an unknown-unknown-driven data failure. They use average weights to estimate total passenger weight for flight calculations, employing different standards for men, women, and children. Total passenger weight is critical for determining takeoff thrust and fuel requirements.
A software update caused the system to misclassify passengers titled “Miss” as “children,” resulting in a 1,200 kg underestimation of total passenger weight for one flight. Pilots noted discrepancies between the estimated and load sheet weights, and an unusually high number of children on the load sheet.
However, they dismissed their suspicions and took off with insufficient thrust. Fortunately, despite several planes experiencing the same error before detection, none exceeded their weight safety margins, and no serious harm occurred.
This illustrates the risk of unknown unknown data failures: they can occur even when directly confronted, potentially going unnoticed.
Another example involves the peer-reviewed medical journal “The Lancet,” which published a paper in March 2021 claiming that Spain’s child mortality rate due to COVID-19 was up to four times higher than in countries like the U.S., U.K., Italy, Germany, France, and South Korea. The paper sparked public outcry in Spain, leading to scrutiny of the underlying data, which revealed an exaggeration of child deaths by almost a factor of eight.
The source of the data failure was traced to the Spanish government’s IT system, which could not handle three-digit numbers, causing ages to be truncated – for example, a 102-year-old was recorded as two years old. Highly trained researchers used this erroneous data without detection.
Instacart’s product availability predictions also experienced an unknown-unknown-driven data failure during real-time data consumption. Predictions inform customers about in-stock products. Prior to the pandemic, model accuracy was 93%, but dropped to 61% as shopping behavior drastically changed.
Machine learning models were trained on weeks of historical data, but the system could not accommodate the unexpected hoarding of toilet paper, hand sanitizers, and staples. Engineers only recognized the issue when model performance significantly deteriorated, resulting in lost sales and customer dissatisfaction. The team lacked awareness of the potential for behavior-driven data failures and the ability to recognize the shift in input data before consumption.
These examples represent only a fraction of data failures occurring in data-driven organizations. Many companies are reluctant to publicly discuss them due to embarrassment, despite potentially severe consequences.
The Necessity and Limitations of Observability
The demand for robust solutions to guarantee high data quality in the age of big data has spurred significant interest in the field, particularly in the last year. The term observability has become increasingly prevalent within the data quality landscape.
A growing number of data quality monitoring startups, originating from diverse backgrounds, are utilizing this term to characterize their offerings, often positioning their observability tools as pivotal for business advancement.
“Observability” originates from the DevOps realm – many startups are explicitly aiming to become the “Datadog of data.” This adoption is strategically driven by the expectation that it will facilitate smoother sales cycles and product integration, leveraging the familiarity of tech teams with the concept from application performance and cloud monitoring.
Understanding Data Observability
Within data quality, an observability-focused approach primarily involves monitoring the metadata of data pipelines. This ensures data remains current and complete, verifying that data is progressing through the pipeline as intended.
These solutions typically offer insights into data set schemas and error patterns, such as missing values, often coupled with data lineage tracking.
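A minimal sketch of what such metadata-level checks could look like, assuming a hypothetical load-timestamp column named `_loaded_at` and a pandas representation of the table; the staleness threshold and schema comparison are illustrative:

```python
from datetime import timedelta
import pandas as pd

def observability_report(table: pd.DataFrame, expected_schema: dict[str, str],
                         max_staleness: timedelta = timedelta(hours=6)) -> dict:
    """Metadata-only checks: is the table fresh, complete, and shaped as expected?"""
    last_load = pd.to_datetime(table["_loaded_at"], utc=True).max()
    return {
        "fresh": pd.Timestamp.now(tz="UTC") - last_load <= max_staleness,  # freshness
        "null_rate": table.isna().mean().round(3).to_dict(),               # completeness per column
        "schema_ok": {c: str(t) for c, t in table.dtypes.items()} == expected_schema,
    }
```

Note that none of these checks ever look at whether the values themselves make sense, which is exactly the gap discussed next.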
While data observability represents a valuable initial step in addressing the needs of data-driven organizations, it is not a comprehensive solution. It often overlooks critical issues residing within the data itself.
The Shortcomings of Relying Solely on Observability
Numerous data failures are undetectable through metadata analysis alone and can severely disrupt pipelines despite appearing normal from an observability perspective.
For instance, the data incidents experienced by Statistics Sweden, TUI, the Spanish COVID-19 data initiative, and Instacart would not have been identified by a data observability tool.
In each of these cases, the metadata – completeness, freshness, schema, and error distributions – presented no anomalies. Detection required direct examination of the data values themselves.
Essentially, observability constitutes a component of data quality monitoring, but it represents only a portion of the overall requirements.
Why Conventional Data Failure Detection Methods Are Ineffective
Let's first examine how data failures were detected when data volumes were small and the value of each data point was high, before discussing how to identify data failures in the age of big data. Three primary methods were typically employed:
Reactive Approach: Data failures were addressed only when they became problematic, such as when inaccuracies were discovered in dashboards, BI reports, or other data products by their users. This method is unsuitable for organizations relying on data for critical operations, as the damage is already incurred upon detection.
Manual Validation: Before the widespread role of data engineers, data scientists were tasked with manually verifying data quality. This involved analyzing new data, often column by column, and comparing it to historical data to identify significant deviations and implausible values. While more proactive than the reactive approach, this method lacks scalability.
Rule-Based Systems: Organizations with real-time data applications often relied on domain experts to establish rules within their data pipelines, based on acceptable thresholds. These rules aimed to ensure data quality as it was collected and consumed.
These rules were frequently integrated into extensive master data management systems. However, these rigid systems often hindered agility, as rules required modification when the data environment changed, such as during cloud migrations.
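In code, such a rule often amounts to little more than a batch-level guard like the sketch below; the column name, threshold, and print-based alerting stub are illustrative stand-ins:

```python
# Acceptable range defined by a domain expert; the column name is illustrative.
ORDER_AMOUNT_RANGE = (0.0, 10_000.0)

def guard_batch(batch: list[dict]) -> list[dict]:
    """Pass through rows that satisfy the expert-defined threshold; hold the rest back."""
    lo, hi = ORDER_AMOUNT_RANGE
    passed = [row for row in batch if lo <= row.get("order_amount", lo - 1) <= hi]
    rejected = len(batch) - len(passed)
    if rejected:
        print(f"ALERT: {rejected} rows violated the order_amount threshold")  # stand-in for real alerting
    return passed
```

The rigidity problem is visible even in this toy: every new source or changed business context means revisiting hard-coded thresholds like `ORDER_AMOUNT_RANGE`.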
A reactive stance on data quality is unacceptable when data supports high-value applications, a situation increasingly common in the big data era. Having a human review data or implementing fixed rules presents significant challenges.
Firstly, the sheer volume of data renders manual and rule-based methods impractical. The scale and speed of big data make human intervention difficult, and its diversity necessitates flexible approaches beyond rigid rules – the “three V’s of big data”. Our inherent limitations in recognizing patterns within complex datasets impede our ability to understand and identify various data failures, and to anticipate and codify them proactively.
A fourth “V” – veracity – has emerged to address the new contextual challenges surrounding data quality. Veracity refers to the accuracy and trustworthiness of data. Historically, data veracity was often overshadowed by volume, velocity, and variety.

However, this is shifting as only accurate and reliable data can deliver value from analytics and machine learning models. The focus on veracity will intensify with increasing data volumes, diversity, and sources, and is also being driven by emerging regulations like the EU AI Regulation.
Secondly, contemporary data infrastructures and the development of modern data pipelines have substantially increased the complexity of data processing, leading to more potential sources of data failures.
Thirdly, manually defining rules to monitor each data source has become a substantial undertaking, given the proliferation of sources. Thoroughly understanding data sources to define effective rules is time-consuming, and these rules require continuous updates as data evolves.
Finally, and perhaps most critically, traditional manual and rule-based monitoring methods are unable to detect unforeseen data failures – those an organization doesn't understand or isn't aware of. These unknown failures are often the most challenging to identify and the most costly to overlook.
Identifying Data Failures in the Big Data Landscape
Maintaining data quality in this landscape requires strategies that account for several evolving trends in how data is used, and that make quality management both scalable and proactive.
Contemporary data quality platforms must integrate statistical models and machine learning algorithms alongside traditional rule-based systems. This combination allows for the identification of unforeseen data failures in a manner that is both scalable and adaptable to evolving data characteristics.
Furthermore, these solutions should be capable of processing data in real-time as it moves through data pipelines. This is crucial for supporting operational applications where data is generated and utilized with minimal delay.
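As a rough sketch of the statistical side of this, the checks below compare new data against a trailing reference window instead of a fixed, hand-written threshold. The use of scipy's two-sample Kolmogorov-Smirnov test and the particular cut-offs are illustrative choices, not the only way to do it:

```python
import numpy as np
from scipy import stats

def distribution_shifted(reference_values: np.ndarray, new_values: np.ndarray,
                         alpha: float = 0.001) -> bool:
    """Two-sample KS test: flags a shift when the p-value falls below alpha."""
    result = stats.ks_2samp(reference_values, new_values)
    return result.pvalue < alpha

def volume_anomalous(daily_row_counts: list[int], todays_count: int,
                     z_limit: float = 4.0) -> bool:
    """Flags today's row count if it sits more than z_limit standard deviations from the mean."""
    mu, sigma = np.mean(daily_row_counts), np.std(daily_row_counts)
    return sigma > 0 and abs(todays_count - mu) / sigma > z_limit
```

Because the reference window rolls forward, the checks adapt as the data evolves rather than relying on thresholds that someone must remember to update.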
Essential Capabilities for Data Engineers
Data engineers require solutions that are straightforward to implement, avoiding complex procurement procedures, and offer customization options.
Data lineage functionality is also vital. It ensures that information regarding upstream data failures is communicated to downstream data teams in a timely and proactive fashion.
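One way to picture this is a lineage graph that can be walked to find every downstream consumer of a failing table. The sketch below uses a hypothetical, hard-coded graph purely for illustration; in practice lineage is usually extracted from pipeline metadata:

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables that consume it directly.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["orders_daily", "churn_features"],
    "orders_daily": ["revenue_dashboard"],
}

def downstream_of(table: str) -> set[str]:
    """Walk the lineage graph breadth-first to find every asset affected by a failure."""
    affected, queue = set(), deque([table])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# downstream_of("raw_orders") -> owners of stg_orders, orders_daily, churn_features
# and revenue_dashboard can be notified before they consume the broken data.
```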
Finally, non-intrusive deployment options are essential to guarantee that security remains a priority for all users.
Effective data quality monitoring requires a shift towards intelligent, adaptable, and readily deployable solutions that can handle the complexities of modern data environments.
Progressing Through Data Quality Maturity Levels
Through conversations with numerous data teams in recent years, a common pattern has emerged. Modern, data-focused organizations typically advance through distinct stages of maturity concerning the assurance of high data quality.
Level 1 — Foundational: At this initial stage, there are no dedicated processes or technologies implemented for monitoring or detecting data quality issues. Resolution of data quality problems often requires weeks or months, and is typically triggered by reports from end-users or customers after issues have already manifested.
Level 2 — Initial Implementation: Data monitoring dashboards are utilized to observe data patterns, enabling a more rapid response to emerging data quality concerns. However, pinpointing and resolving the underlying cause can still be a lengthy process, particularly within intricate data pipelines.
Level 3 — Intermediate: Systems based on predefined rules are employed to identify data quality problems, allowing for corrective action as they are detected. However, these systems lack the capability to proactively identify unforeseen data failures.
Level 4 — Sophisticated: Rule-based systems are enhanced with machine learning (ML) and statistical testing methodologies. This allows for the detection of both known and previously unknown data failures. Data lineage is leveraged to trace the origins of failures, simplifying error source identification.
Furthermore, this lineage functionality facilitates proactive notification of all downstream data teams reliant on the affected data.
Level 5 — Expert: Building upon ML algorithms, statistical tests, and rules, this level incorporates tools for real-time data operation and filtering. This enables the rectification of data failures while their root causes are being addressed, preventing contamination of the downstream data pipeline.
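A minimal sketch of the Level 5 idea, assuming an anomaly detector like the ones sketched earlier is already available: flagged records are quarantined in-stream instead of flowing on to consumers while the root cause is investigated. The function names are placeholders, not a reference design.

```python
def gate(stream, looks_anomalous, quarantine):
    """Filter a record stream in flight: quarantine suspicious records, pass the rest through."""
    for record in stream:
        if looks_anomalous(record):
            quarantine.append(record)   # held back for root-cause analysis or correction
        else:
            yield record                # clean records continue down the pipeline

# Usage sketch: clean_records = list(gate(incoming_records, detector, quarantine_list))
```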
Organizations at Levels 1 through 3 generally employ methodologies originating from the pre-big data era. While these approaches are relatively straightforward to implement, data-driven companies often quickly recognize their limitations in scalability and their inability to detect unexpected data failures.
Companies operating at Level 4 are capable of identifying unknown data failures in a scalable manner. Alerts triggered by failures in incoming data are more proactive than alerts raised only after downstream errors appear, but they remain insufficient for operational applications, because the data failures themselves aren't mitigated in real time.
Level 5 represents a truly proactive approach, combining monitoring with intervention strategies. Integrating Level 4 capabilities with automated data operations prevents the propagation of failures by filtering out erroneous data from the primary pipeline or implementing automated corrections as data flows through the system.
To Buy or To Build: A Data Quality Dilemma
Determining whether to purchase a data quality solution or construct one internally is contingent upon a company’s desired level of data maturity and the resources available to them. Generally, achieving a greater degree of maturity necessitates increased investment in both development and ongoing maintenance.
Organizations frequently possess the capacity to develop solutions for data quality Levels 1 through 3 utilizing existing tools such as Grafana, dbt, Great Expectations, and Deequ, or by building from scratch. However, creating bespoke solutions for Level 4 demands the application of statistical testing and machine learning algorithms with broad applicability to detect unforeseen data issues.
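For example, a Level 1-3 style check built on one of these tools might look like the snippet below, which uses the classic pandas-backed Great Expectations interface (method names differ between versions, and the file and column names are illustrative):

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so expectations can be declared directly on it.
orders = ge.from_pandas(pd.read_csv("orders.csv"))  # file name is illustrative

orders.expect_column_values_to_not_be_null("order_id")
result = orders.expect_column_values_to_be_between("order_amount", min_value=0, max_value=10_000)
print(result.success)  # each expectation reports whether the data met it
```

Checks of this kind are valuable, but they only catch the failures someone thought to write down, which is where the Level 4 requirements above come in.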
As data monitoring shifts its focus from specialized domain expertise to comprehensive, generalized pattern identification, companies specializing in data quality monitoring gain a significant advantage. This is because while domain knowledge is often central to a company’s core business, generalized pattern recognition typically is not.
Level 5 represents a fundamentally different set of performance demands. The initial levels – 1 through 4 – concentrate on identifying data failures in ways that allow human teams to prioritize responses. Developing increasingly adaptable and dynamic pattern recognition and rules requires a strong foundation in data, machine learning, statistics, and software engineering.
These skills remain crucial at Level 5, but real-time data correction for operational applications necessitates systems that perform at a level comparable to the organization’s ETL/ELT pipelines.
Experience suggests that companies often maintain do-it-yourself (DIY) solutions at Levels 1-3, where data quality monitoring is largely dependent on understanding the data itself. When scaling monitoring becomes a priority – often linked to scaling data pipelines – and progression towards Levels 4 or 5 is desired, exploration of external, best-of-breed solutions typically begins.
A limited number of large, data-centric companies have successfully built DIY solutions reaching Level 4. For instance, Uber’s engineering team recently detailed their proprietary system for data quality monitoring across infrastructure. This solution required a year of effort from five data scientists and engineers and will necessitate continuous investment for upkeep.
Such an investment represents a substantial commitment for most organizations, particularly given the scarcity of skilled data scientists and engineers. While Uber’s scale may justify a highly customized, in-house solution, this scenario is likely an exception rather than the rule.
Key Considerations
- Data Maturity Level: Higher levels require greater resources.
- Resource Availability: Access to skilled data scientists and engineers is critical.
- Scalability: External solutions often provide better scalability.
- Core Competencies: Focus on what your company does best.
Data quality is paramount for informed decision-making. Choosing between building and buying requires careful evaluation of these factors.