Data Warehouse Testing

When implementing an Extract, Transform and Load (ETL) system for business intelligence, one of the greatest risks is rushing a data warehouse into service without comprehensive testing. QualiTest’s ETL testing process ensures that data and systems are tested systematically for errors, bugs and inconsistencies before data is integrated.

What Is ETL Testing?
Unlike transactional databases (which are designed to make adding and modifying records easy), data warehouses are quite vulnerable while records are being added. New data is added in bulk through a structured process known as ETL (Extract, Transform, Load). To appreciate this structure, view it as an assembly line.
01
EXTRACT
(also called Data Staging or pre-Hadoop)
Extract involves grabbing components and copying them into your system. Input data is extracted from many sources, in multiple formats and with varying quality issues. Hopefully, it all works cleanly.
02
TRANSFORM
(in Hadoop pipelines, often run as MapReduce jobs)
Transform is the alteration of these components. A transformation process alters the extracted data as business needs require, and flags any record that cannot be transformed and therefore cannot be loaded. Hopefully, this cleansing works.
03
LOAD
(send to the repository; often just called Output)
Load involves the assembled end product. Lastly, the data loads into the data warehouse as records. Hopefully, nothing gets dropped or garbled.

Like real-life assembly lines, source materials and their quality concerns are known prior to the process, progress can be monitored mid-flow, and completed end products are inspected for defects. Good testing determines which step is problematic. Techniques include, but are not limited to, SQL querying, verbose logging, and comparing the pre-extract and post-load versions of cherry-picked sample records.
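
Comparing cherry-picked sample records before extraction and after loading can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the `expected_transform` rule (trim names, upper-case country codes) and the function names are hypothetical, not part of any QualiTest tool.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("etl_sample_check")


def expected_transform(source: dict) -> dict:
    """Hypothetical business rule: trim names, upper-case country codes."""
    return {
        "id": source["id"],
        "name": source["name"].strip(),
        "country": source["country"].upper(),
    }


def compare_sample(source_rows: list[dict], loaded_rows: list[dict]) -> list[str]:
    """Return one message per discrepancy between expected and loaded records."""
    loaded_by_id = {row["id"]: row for row in loaded_rows}
    problems = []
    for source in source_rows:
        expected = expected_transform(source)
        actual = loaded_by_id.get(expected["id"])
        if actual is None:
            problems.append(f"id={expected['id']}: not loaded")
            continue
        for field, want in expected.items():
            got = actual.get(field)
            if got != want:
                problems.append(
                    f"id={expected['id']}: {field} expected {want!r}, got {got!r}"
                )
    # Verbose logging makes it easier to see which record and field went wrong.
    for message in problems:
        log.warning(message)
    return problems


# Made-up sample data: pre-extract records and what ended up in the warehouse.
source_sample = [
    {"id": 1, "name": " Ada ", "country": "gb"},
    {"id": 2, "name": "Grace", "country": "us"},
    {"id": 3, "name": "Linus", "country": "fi"},
]
loaded_sample = [
    {"id": 1, "name": "Ada", "country": "GB"},
    {"id": 2, "name": "Gra", "country": "US"},  # truncated name
]

problems = compare_sample(source_sample, loaded_sample)
```

With this sample, the check reports a truncated name for record 2 and a missing record 3. A sample check like this finds symptoms; deciding whether extraction, transformation or loading caused them still requires inspecting each stage.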

Problems in the three-step ETL process risk mangling many records with similar properties. These systems are too big to allow inspecting every record and then repairing the problem ones, so it is vital to pinpoint and fix bugs before going live. ETL testing also involves two more phases. Before extraction, you must know the data sources to determine which test record formats you will need and what their boundary issues are. After loading, you must test the actual stored data by sampling, since bad data ruins good decision-making.

What Is Data Warehouse Testing?
Data warehouse testing concentrates on more than just the critical, potentially error-prone ETL stage, where bugs can pollute the whole system. Testing includes:
Reporting stage
Check for corruption
Security testing
Backup recovery
Scheduling software
Query performance (especially regarding scalability)
Data warehouse testing and ETL testing are generally treated as synonymous: ETL testing covers the whole warehouse, not just the ETL data-addition stage.

Instead of dissecting ETL, let’s talk more generally and identify real-life data warehouse scenarios that must be tested to ensure they work correctly:
You are creating a brand new data warehouse, and do not yet know whether or how anything works, including the Initial Load and the ETL scheduling.
You are migrating from one data warehouse to another, perhaps to escape a capacity limit (maybe with a move to the cloud) or to enjoy a faster platform. In addition to testing the new data warehouse, you need to confirm that the migration does not damage or lose data, and that all transformations happen cleanly.
Development has made code changes, and you need to check the new or altered features and regression test everything else.
You want to ensure that backup recovery works.
You want to know how scale affects your report-building speed.
You want to know the quality of the data that has passed through ETL.
If you are not confident in your system’s abilities, you need to bring in a professional.
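
For the migration scenario, one common way to confirm that no data was damaged or lost is to compare row counts and checksums between the old and new warehouses. The sketch below is a minimal illustration, assuming both tables can be read as sequences of tuples; `table_fingerprint` and the sample rows are hypothetical.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, digest); the digest does not depend on row order."""
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_hashes).encode("ascii")).hexdigest()
    return len(row_hashes), combined


# Made-up rows standing in for the old and the migrated table.
old_rows = [(1, "Alice", 120.50), (2, "Bob", 75.00)]
new_rows = [(2, "Bob", 75.00), (1, "Alice", 120.50)]  # same data, new order
damaged_rows = [(1, "Alice", 120.50), (2, "Bob", 7.50)]  # one value altered

migration_ok = table_fingerprint(old_rows) == table_fingerprint(new_rows)
damage_detected = table_fingerprint(old_rows) != table_fingerprint(damaged_rows)
```

Because this sketch hashes the Python `repr` of each row, both sides must deliver values with the same types (for example, not `75` on one platform and `75.0` on the other). A real comparison would normalize types and formats first.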

Risks in ETL and Data Warehousing
What can go wrong? Good question.
A good record may not get loaded.
A record may get truncated.
Cached data may bleed into the records that follow.
A record may be written multiple times.
A record may be transformed incorrectly.
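
Several of these risks can be checked with plain SQL queries. The following sketch uses Python's built-in `sqlite3` module with an in-memory database; the `staging` and `warehouse` tables, their columns and the sample rows are assumptions made for illustration, and the truncation check assumes the `name` field is loaded without transformation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (id INTEGER, name TEXT);    -- extracted source rows
    CREATE TABLE warehouse (id INTEGER, name TEXT);  -- rows after load
    INSERT INTO staging VALUES
        (1, 'Alice Johnson'), (2, 'Bob Smith'), (3, 'Carol Diaz');
    INSERT INTO warehouse VALUES
        (1, 'Alice Joh'), (2, 'Bob Smith'), (2, 'Bob Smith');
""")

# Good records that never reached the warehouse.
missing = [row[0] for row in conn.execute("""
    SELECT s.id FROM staging s
    LEFT JOIN warehouse w ON w.id = s.id
    WHERE w.id IS NULL
    ORDER BY s.id
""")]

# Records written more than once.
duplicated = [row[0] for row in conn.execute("""
    SELECT id FROM warehouse
    GROUP BY id HAVING COUNT(*) > 1
    ORDER BY id
""")]

# Records whose loaded value is a shorter prefix of the source value.
truncated = [row[0] for row in conn.execute("""
    SELECT DISTINCT s.id FROM staging s
    JOIN warehouse w ON w.id = s.id
    WHERE LENGTH(w.name) < LENGTH(s.name)
      AND SUBSTR(s.name, 1, LENGTH(w.name)) = w.name
    ORDER BY s.id
""")]

print("missing:", missing, "duplicated:", duplicated, "truncated:", truncated)
```

The same three queries could be pointed at a real staging area and warehouse, provided both are reachable from one SQL engine; otherwise the key lists would need to be exported and compared outside the database.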

A 2015 Experian Data Quality survey found that 92% of organizations suspect data quality issues and find data quality challenging, and that on average they suspect 26% of their total data may be inaccurate. Any of the above mistakes can lead to bad decisions, because the analytics apply good logic to bad query results. How bad? A targeted mailing may miss people in your intended target group and instead reach opponents of your cause. A recall notice may miss potential victims. Your audit may have blatant mistakes. Medical labs may notify the wrong patients, fail to notify the right patients, or report the wrong results. Allergen listings may be inaccurate. Claims may go unprocessed.

A record may also contain “dirty data”: bad values in a field that confuse queries and violate business rules, such as a NULL in a field that cannot be NULL, a third value in a field that is supposed to be boolean, or a VIN ending in letters. The extracted and loaded record counts may also disagree.
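
The dirty-data rules named above translate directly into automated checks. This is a minimal sketch, assuming records arrive as Python dictionaries; the field names (`customer_id`, `is_active`, `vin`), the sample values and both function names are hypothetical, and the VIN rule is simply the one stated above, not a full VIN validation.

```python
def find_dirty_fields(record: dict) -> list[str]:
    """Return the names of fields that break the example business rules."""
    dirty = []
    # Rule 1: a NULL in a field that cannot be NULL.
    if record.get("customer_id") is None:
        dirty.append("customer_id")
    # Rule 2: a boolean field must hold exactly True or False.
    if not isinstance(record.get("is_active"), bool):
        dirty.append("is_active")
    # Rule 3: a VIN must be present and must not end in a letter.
    vin = record.get("vin") or ""
    if not vin or vin[-1].isalpha():
        dirty.append("vin")
    return dirty


def counts_agree(extracted: int, loaded: int, rejected: int = 0) -> bool:
    """Every extracted record should be either loaded or flagged as rejected."""
    return extracted == loaded + rejected


# Made-up sample records.
records = [
    {"customer_id": 7, "is_active": True, "vin": "1HGCM82633A004352"},
    {"customer_id": None, "is_active": "maybe", "vin": "1HGCM82633A00435X"},
]
report = {index: find_dirty_fields(record) for index, record in enumerate(records)}
```

Here the first record passes all three rules and the second fails all of them. `counts_agree` treats records flagged as non-transformable as accounted for, so only an unexplained difference between the extracted and loaded counts is reported as a problem.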

The QualiTest Solution
Data warehouses live or die based on the quality of their analyzed data and ability to report.
We at QualiTest know the perils that data warehouses can face, and how to ensure that your system behaves as it should, so that you and your customers can succeed.
We take your business context and needs into account as part of the overall test design.
We leverage our deep understanding of ETL and technology to verify and validate the behind-the-scenes work, which is far more complex than merely storing and retrieving data, for both physical and cloud-based storage systems.
A complete system analysis, detailing your system’s structure and how we plan to test it (based on your specific needs), and ensuring that the smaller test environment properly identifies architecture concerns that may be more noticeable in your larger system
A comprehensive defect report detailing each bug found, including the exact ETL process where the problem first appeared
An ETL test accelerator tool, enabling us to rapidly increase test coverage while reducing test script preparation time
Support from a QualiTest senior test specialist to improve quality and drive continuous improvement
Data quality checking, based on analysis of the fields from your different data sources and of how the transformations implement changes in your business needs
Independent Verification and Validation
Expertise in Tools and Best Practices
Over 15 years of experience delivering QA solutions for ETL testing projects to both large and small companies