Tip: Automated Checks in Data Warehouse Testing

Over the last month or so, I have been assisting my manager in recruiting ‘technical’ testers. Apparently, testers with experience in testing Data Warehouses (for simplicity’s sake, let’s call them Data Warehouse testers) are rare. Going by the candidates who have sent in their CVs and/or interviewed with us, I can certainly say that good Data Warehouse testers are indeed few and far between.

One thing I noticed, in fact almost a common trait amongst the testers we interviewed, was that nearly all of them depended heavily on row counts to determine whether the data had reached its destination successfully. There is nothing wrong with counts as such, but I resent them being the only way of checking the data traversal.

I remember a tester, Peter Morgan, presenting his experience of Data Warehouse Testing at a BCS event a couple of years ago and, from what I can remember, his team had collaborated with the development team in automating a lot of checks – something I was very pleased to hear since I had been doing precisely that on a couple of projects myself. I also recollect reading something about his presentation but do not have the link to his interesting experience.

So, when I realised that there hasn’t been much (read that as ‘any’) technical content on my blog, and in the eternal hope that interested testers might benefit from it, or even add to my own learning by sharing their ideas, I saw an opportunity to write a quick post (or maybe a series of them) showing how I try to incorporate automated checks, using simple examples, to assist and supplement my testing activities when testing a data warehouse.

I hope these posts are not so extensive that they bore you to death or put you off this aspect of testing. I’m hoping to write them as individual items that are candidates for automated checks, rather than picking up eligible items in the order in which data warehouse projects progress.

Data Warehouse projects are ideal candidates for implementing automated regression checks, as much as possible. The data volumes are huge, typically over a couple of million rows of transactional data, and the bigger the customer’s business, the larger and more complex the data set is.

The most basic, yet the most important, of all checks is verifying whether the source data matches the target data. Between the source and target there are numerous stages that the data might go through, something known as the Extract-Transform-Load, or ETL, process. As a tester, I normally tend to mimic this ETL process, which allows me to check data to its lowest (or desired) granularity. And because we can automate it, my preference is to do these checks on 100% of the data and not just a sample set.

So, as a quick tip using a very simple example, let me show how I would try to reconcile data from the source table to the target table.

Let’s assume the following source table – table1

ID Val
1 10.10
2 10.12
3 12.12
4 13.44
5 1.66
7 1.66

that, after the ETL process, ends up in the target database table – table2 – as

ID Val
1 10.10
2 10.12
3 12.12
4 13.40
5 1.66
6 1.00
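
If you’d like to follow along, here is a minimal T-SQL sketch that creates and populates these two sample tables (assuming SQL Server; the names table1 and table2 simply mirror the example above):

-- sample tables used throughout this post
create table table1 (id int, val decimal(10, 2));
create table table2 (id int, val decimal(10, 2));

-- source data
insert into table1 (id, val) values
    (1, 10.10), (2, 10.12), (3, 12.12), (4, 13.44), (5, 1.66), (7, 1.66);

-- target data, as it arrived after the ETL
insert into table2 (id, val) values
    (1, 10.10), (2, 10.12), (3, 12.12), (4, 13.40), (5, 1.66), (6, 1.00);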

Whilst this is too simple an example to be realistic, it is enough to convey the idea behind the tip I want to share.

In most cases, the first check that follows the ETL is a row count, so a tester would write something like

select count(*) from table1;

to check the source row count, and

select count(*) from table2;

to check the target row count.

Now, both of these would return the value 6 as the output, which would indicate that total rows in = total rows out. But it doesn’t end here. It doesn’t even mean the ETL has worked.
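
Incidentally, a quick supplementary check that costs very little is to compare simple aggregates alongside the counts. This is just a sketch against the sample tables above (SUM works here because VAL is numeric), not a substitute for the reconciliation that follows:

select count(*) as 'Row Count', sum(val) as 'Total Val' from table1;   -- returns 6 and 49.10
select count(*) as 'Row Count', sum(val) as 'Total Val' from table2;   -- returns 6 and 48.40

The totals differ even though the counts match, so something is clearly off, but this still doesn’t tell us which rows are affected. So what next?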

The most common answer I have got is to do a minus or an except query. With this, a tester would check what is in one table that’s not in the other. So, it would be implemented as:

select id, val from table1
minus
select id, val from table2

or the following using T-SQL

select id, val from table1
except
select id, val from table2

The output would be:

ID Val
4 13.44
7 1.66

This only tells me what source data did not make it (or made it incorrectly) to the target. It does not tell me what it turned up as in the target table. I do the except the other way around to get this information. So, using T-SQL, it would be:

select id, val from table2
except 
select id, val from table1

which gives me

ID Val
4 13.40
6 1.00

Again, this tells me which items delivered into the target table did not match the source. It doesn’t tell me what the expected (or the actual source) data was that was transformed this way.

I would need to take both these individual sets of data and marry them up to see the whole picture of what exactly is different and where. The example we’ve used is a simple two-column one with just two ‘differences’. Scale this up to millions of rows in a table with tens of columns (typically anywhere from 5 to 30 columns, or more) and doing this manually soon becomes unwieldy.

I think there is a more efficient way to get all of this information in one go. Here’s how.

Let’s look at what we have. The tables have two columns, an integer ID and a corresponding decimal VAL.

I could use the ID field as an identifier and then use the VAL fields as expected and actual values to see how they match/differ.

In order to see both expected and actual values together, we need to combine the two sets of data into one, which I can achieve using a UNION. Also, because the VAL field is decimal, I can use 0 (zero) as a placeholder for the missing side without altering the original value when I sum the columns up later. So, I have:

select 
     t1.id    as id
   , t1.val   as expectedVal   -- the source value is what we expect to see
   , 0        as actualVal     -- placeholder; the actual value comes from the target rows
from table1 t1

union 

select 
     t2.id    as id
   , 0        as expectedVal   -- placeholder; the expected value comes from the source rows
   , t2.val   as actualVal     -- the target value is what actually arrived
from table2 t2

We can then sum up the expected and actual values per ID, like so:

select
     comb.id
   , sum(comb.expectedVal)  as 'Expected Value'
   , sum(comb.actualVal)    as 'Actual Value'
from (
        select 
             t1.id as id
           , t1.val as expectedVal
           , 0 as actualVal
        from table1 t1
        union
        select 
             t2.id as id
           , 0 as expectedVal
           , t2.val as actualVal
        from table2 t2
     ) comb
group by comb.id

This gives us the following output:

ID Expected Value Actual Value
1 10.10 10.10
2 10.12 10.12
3 12.12 12.12
4 13.44 13.40
5 1.66 1.66
6 0.00 1.00
7 1.66 0.00

Finally, we aren’t really interested in the ones that match. It’s the differences that we want to concentrate on, so just limit the output to the differences by adding the following condition at the end.

having sum(comb.expectedVal) <> sum(comb.actualVal)
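
For completeness, the assembled query with the filter appended reads:

select
     comb.id
   , sum(comb.expectedVal)  as 'Expected Value'
   , sum(comb.actualVal)    as 'Actual Value'
from (
        select 
             t1.id as id
           , t1.val as expectedVal
           , 0 as actualVal
        from table1 t1
        union
        select 
             t2.id as id
           , 0 as expectedVal
           , t2.val as actualVal
        from table2 t2
     ) comb
group by comb.id
having sum(comb.expectedVal) <> sum(comb.actualVal)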

Thus giving us the exceptions as:

ID Expected Value Actual Value
4 13.44 13.40
6 0.00 1.00
7 1.66 0.00

Now, that is far more valuable information for the tester and the developer, as it assists both in understanding the root cause of the discrepancy – and possibly fixing it at the earliest convenience.
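
And since the whole point is automation: one way to turn the exception query into a pass/fail regression check is to wrap it so that it fails loudly whenever any differences exist. This is only a sketch, assuming SQL Server and the sample tables above; in practice it would be plugged into whatever harness or scheduled job runs your checks:

if exists (
    -- the same reconciliation query as above, used purely as an existence test
    select comb.id
    from (
            select t1.id as id, t1.val as expectedVal, 0 as actualVal
            from table1 t1
            union
            select t2.id as id, 0 as expectedVal, t2.val as actualVal
            from table2 t2
         ) comb
    group by comb.id
    having sum(comb.expectedVal) <> sum(comb.actualVal)
)
begin
    -- fail the check so the test run reports it
    raiserror('Source to target reconciliation failed: differences found.', 16, 1);
end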

If you’ve reached this far, I thank you for your patience. I hope this has helped you.

I’d appreciate your feedback, comments, critique on this. Please share if you have found this useful!