Tip: Automated Checks in Data Warehouse

Over the last month or so, I have been assisting my manager in recruiting ‘technical’ testers. Apparently, testers with experience in testing Data Warehouse(s) (for simplicity’s sake lets call them Data Warehouse testers) are rare. Going by the candidates who have sent in their CV and/or interviewed with us, I can certainly say, good data warehouse testers are indeed few and far between.

One of the things I had noticed, in fact, almost trait amongst most testers we interviewed, was that nearly all of them depended heavily on row counts to determine whether the data had reached its destination successfully. There’s nothing wrong with the counts, but I resent it being the only way to check the data traversal.

I remember a tester, Peter Morgan presenting his experience of Data Warehouse Testing at a BCS event a couple of years ago and from what I can remember, his team had collaborated with the development team in automating a lot of checks – something I was very pleased to hear since I had been doing precisely that on a couple of projects myself. I also recollect reading something about his presentation but do not have the link to his interesting experience.

So, when I realised that there hasn’t been much (read that ‘any’) technical stuff on my blog, and in the eternal hope that this those interested testers might benefit from or even add to my own learning by sharing their ideas, there was an opportunity to write a (or may be a series of more than one) quick post that show how I try to incorporate automation checks, using simple examples, to assist me and supplement my testing activities when testing data warehouse.

I hope they are not too extensive to bore you to death or put you off in this aspect of testing. I’m hoping to write these posts as individual items that are candidates for automation check rather than picking up eligible items in order of how the data warehouse projects progress.

Data Warehouse projects are ideal candidates for implementing automated regression checks. As much as possible. The data volumes are huge, typically over a couple of millions of rows of transactional data. The bigger the business of the customer, the larger and complex the data set it.

The most basic, yet the most important of all checks is verifying whether the source data matches the target data. Between the source and target, there are numerous stages that a data might go through – something that the Extract-Transform-Load, or the ETL process. As a tester, I normally tend to mimic this ETL process that allows me to check data to its lowest (or desired) granularity. And because we can automate it, my preference is to do these checks on 100% of data and not just a sample set.

So, as a quick tip using a very simple example, let me show how I would try to reconcile data from source table to the target table.

Lets assume the following source table – table1

ID Val
1 10.10
2 10.12
3 12.12
4 13.44
5 1.66
7 1.66

that after the ETL process gets into the target database table – table2 – as

ID Val
1 10.10
2 10.12
3 12.12
4 13.40
5 1.66
6 1.00

Whilst these are too simple an example to be realistic, I just wanted to convey the idea behind the tip I wanted to share.

In most cases, the first check that follows the ETL is to do a count so a tester would write something like

select count(*) from table1;

to check the source row counts and

select count(*) from table2;

to check the target row count.

Now, both these would return the value 6 as the output – which would indicate that total rows in = total rows out. This doesn’t mean it ends here. It doesn’t even mean the ETL has worked. So what next?

The most common answer I have got is to do a minus or an except query. By this, a tester would check what is in one table thats not in the other. So, it would be implemented as:

select id, val from table1
select id, val from table2

or the following using T-SQL

select id, val from table1
select id, val from table2

the output would be

ID Val
4 13.44
7 1.66

This only tells me what the source data was that did not make it (or made it incorrectly) to the target. It does not tell me what it turned up into the target table as. I do the except other way around to get this information. So, using T-SQL it would be:

select id, val from table2
select id, val from table1

which give me

ID Val
4 13.40
6 1.00

Again, this tells me of those items that were delivered into the target table that did not match the source. It doesn’t tell me what the expected (or the actual source) data was that was transformed this way.

I would need to use both these individual set of data, marry them to see the whole picture of what exactly is different and where. The example we’ve used is a simple two column one with just two ‘differences’. Scale this up to millions of rows within a table that has tens of columns (typically, these could be anywhere from 5 to 30 columns. Or more.)

I think there is an efficient way to get this information in one go.  Here’s how.

Lets look at what we have. The tables have two columns, an integer ID and a corresponding decimal VAL.

I could use the ID field as an identifier and then use the VAL fields as expected and actual values to see how they match/differ.

In order to see both expected and actual values together, we need to combine the two sets of data into one. I can achieve this using the UNION. Also, because the VAL field is decimal, I could add them to a 0 (zero) without altering the original value. So, I have:

     t1.id    as id
   , t1.val   as expectedVal
   , 0        as actualVal
from @table1 t1


     t2.id    as id
   , 0        as expectedVal
   , t2.val   as actualVal
from @table2 t2

we can then sum up the expected and actual values, like so:

   , sum(comb.expectedVal)  as 'Expected Value'
   , sum(comb.actualVal)    as 'Actual Value'
from (
             t1.id as id
           , t1.val as expectedVal
           , 0 as actualVal
        from @table1 t1
             t2.id as id
           , 0 as expectedVal
           , t2.val as actualVal
        from @table2 t2
     ) comb group by comb.id

This gives us the following output:

ID Expected Value Actual Value
1 10.10 10.10
2 10.12 10.12
3 12.12 12.12
4 13.44 13.40
5 1.66 1.66
6 0.00 1.00
7 1.66 0.00

Finally, we aren’t really interested in the ones that match. It’s the differences that we want to concentrate on, so, just limit the output to the differences by adding the following condition at the end.

having sum(comb.expectedVal) <> sum(comb.actualVal)

Thus giving us the exceptions as:

ID Expected Value Actual Value
4 13.44 13.40
6 0.00 1.00
7 1.66 0.00

Now, that is more valuable information for the tester and developer that assist both in understanding the root cause of the discrepancy – and possibly fix it at the earliest convenience.

If you’ve reached this far, I thank you for your patience. I hope this has helped you.

I’d appreciate your feedback, comments, critique on this. Please share if you have found this useful!


How I learn from a 6 year old

This was long time due. Cold, flu, work, fever, work, more work, more flu, and much more work and a couple of drafts later here it is.

My ex-boss told me once, never discourage the enthusiasm and questions of children. I follow his advice. Religiously. Well, try my best to, and I think I do it well to a large extent.

I was fascinated by the idea of unschooling that was triggered by this video that I watched a year ago; and also by the buccaneer scholar, James Bach. It was also very fortunate to hear more from the man himself when I attended the Rapid Software Testing course that he was teaching a few weeks ago.

The six year old in question is my son who has a fascination for technology and is ever keen to see, work, explore and play with any gadgets he can lay his hands on. This was since he was about a year old or so.

The only thing he would talk when I used to visit his school during parents evening was about the over head projector, how it works, the computers children were allowed to use, what software it had, how it worked, why some things appeared different on the school pc than at home… the list is endless. One of the teachers even confessed that he knew every teacher’s password and they willingly allowed him to ensure every pc was logged in at the start of a session and correctly closed down at the end.

Whilst that made me proud, I attempted to know how it benefitted him, what the school did, if anything, that allowed him to progress and most importantly, me trying to understand how he learnt things.

The very first thing I noticed was, he wasn’t afraid of giving anything a go. He wasn’t worried about him not knowing anything. My parents have a completely opposite approach. They would be scared to use my laptop in fear of messing something up despite me insisting they work on it. It’s a different matter that my Dad has a sort of an “anti-midas” touch and I’ve seen things not working when he uses them but works perfectly fine if anyone else uses it!

I could see some sort of correlation between age and technology. Perhaps the older we get, we fail to catch up with the progressing technology and we form a shell around ourselves. We are more likely to enjoy our own comfort-zone. Most of us know that children are like sponge and they grasp things very quickly. We as parents have been able to throw more at our son and he has been able to do things faster than others at his age.

I have a tic-tac-toe game on my mobile phone. I gave my phone to him once to play this game and within 5 mins he showed me how the game had a delay before placing the next move and how he could just place three consecutive noughts or crosses in that time and win every time.

On the little big planet, how he could go to the main screen, select another user using a key combination and get that character/user in the game, how he can then use them to press a button that would allow him to progress ahead and move onto the next level! I was speechless when he showed me that.

The other day, despite me warning not to mess around with the TV menu – especially software updates, he showed me how he could create a list of favourite channels for everyone and how it would save us from going to menu, searching for our channel list and just use the up/down keys on remote. The fact that he made the excuse of “and it doesnt even mess the TV at all” was the best part of all!

Anyway; I got the chance of speaking with James during the course breaks on how I (and my son) would benefit from or go about encouraging his unschooling (defined by James as allowing yourselves/your child drive your/their own education according to the natural rythms of your/their own minds – which I read as their potential, their ability to learn – very rapidly in most cases). He encouraged me playing different games and observing how he does things, how he observes and learns from his experiences.

Over a weekend, after the course ended, father and son bonded over many things, including food – especially a breakfast at McDonalds – and a game.

The game was to see how an object behaved albeit with me having a slight advantage of having an identical looking one and having different properties.

The rate at which he formed his own testing heuristics/oracles were fascinating:

“Thats interesting, this bounces when I drop it, but does not when you try”
“I think there is a bouncy surface and a non-bouncy surface”
“The shiny side makes it bounce, the other side doesnt”
“I know its because one of the side is sticky and so it doesnt bounce”
“I think you cant get rid of the stickiness by wiping it off your trousers, lets try this paper towel”
“Let me see if it bounces on this table, this cushion, this floor, this food tray…”
“Let me see what happenes if I tried to throw it on the floor real hard”
“What if I dropped from a very high place, like if I stand on a chair, or even a building”
“My hands are warm so when I touch it warms and so it bounces, but yours are cold so it doesn’t when it gets cold. Let me see if it stops bouncing if I make it cold by placing this under the cold tropicana bottle”
“I tried to cool it down for 1 minute but it bounced back, let me try to make it colder by doing this for 2 minutes”
“Can I borrow your coffee mug so I can warm it to see if I can see any change”
“Why is it that I can press this part (of the object) but I was not able to do so last time”

This immediately reminded me of two things “rapid testing” and “context-driven testing”. He was applying his own heuristics to test an idea and rapidly swapping it for another when it failed.

“All heuristics/oracles are fallible” (James Bach / Michael Bolton, Context Driven Testing)

I think I’d prefer calling them “Fallicles” or even “Fallacles”(fallible+oracles; although happy to acknowledge anyone who has already coined this term, more importantly, in this context before! and will need evidence too!)

We carried on for a good 20 minutes and I could notice he never gave up; at least not until I had to stop. It was truly inspiring. He has such moments, but not always; but thats okay, I guess. His persistence with trying things out with whatever was available to him was commendable.

How many of us give up when we run into dead ends? I know I almost did this, this morning!

I decided to stop when I heard the last one, I said, “Do you think that Dad might be tricking you with yet another object like this one?” He said “No, that’s not possible.” I asked “Why?” and he innocently replied, “because I know you wont!”

Once we stopped, he asked me “Where did you get these from?” I answered “I got them from the best tester in the world. Do you know who that is?” Without a moment of hesitation, he replied “You!”.

Now there were two instances, that reminded me of James Lyndsay and his work on bias (which was also a topic of my previous post). There was faith, but there was also an element of biasness in my son’s honest reply.

I repeated what I said to James Bach when he asked me at the start of the RST course if I was a tester. I said, “I am, but not as good as I would like myself to be”.

James replied, “That’s good. That’s how you keep learning and strive for more.”

Agree. And the learning does not necessarily have to come from peers, sometimes children teach you a lot too!

Thanks for reading! Please do leave your comments, critiques, suggestions, tips ‘n tricks; and if you liked this post, please do share it with others…

we learn the hard way

i was really annoyed recently when a supposedly independent release wiped out my data entirely. when i say independent, it was supposed to be for a module that did not impact the one i was testing.

it did not take me time to realise that the release that was sent had a schema creation sql that truncated and rebuilt all the tables, including those that were being used by the previously delivered release. these tables did not undergo any change in the newer (independent) release and therefore should not have been delivered in the first place.

now the question is, should i blame myself for not checking the entire release contents or was i taking my testing tasks a bit too far?

i posed this questions to some of my colleagues and they were quick to blame the dev team for this. i was not entirely convinced, being the conscientious tester that i am. agreed, it does takeup my time, time that i have not quoted/costed/estimated i would take to carry out the ‘core’ testing activities; to do these checks, but it is certainly a time wisely ‘invested’ that causes us less grief further down the line.

a counter argument to support the critics would have been for the dev team to have documented this in a release note. not all testers (that i have worked with) are technical enough to understand such ddls; nor have many of them shown detailed interest in knowing exactly what the release contents are. i’ve argued with many testers who say “we only look at the release once its loaded/installed”.

a further argument to support them would be to question dev team on whether they had done any tests themselves and if they understood the implications thereof?

initial impact of this was i was annoyed that i will have to regression test previously delivered module and therefore add more time to my testing efforts; but when i thought about it, it made more sense for me to do that, giving me yet another opportunity to provide wider coverage than i did before, so better tests.

but on the whole, note to self: read not just the release notes, not just the sqls, but speak with dev team to see what the impacts are!

progressing on guestimates

guestimates: time quoted (by an inexperienced tester) as the estimated time taken to carry out a (set of) test(s) by an individual (or a set of individuals) .

a journey from plucking a number out of thin air to providing a right estimate takes time. this time is the ‘experience’ which allows a tester get away from ‘guestimating’ and get closer to ‘predicting’ time it takes to do a (set of) test(s).

i remember being asked “how long will it take you to test this module?” during one of my first employments. i naively replied back, having spoken to the most experienced guy around, “two days”. i did it in two days alright, but they were stretched, working days.

this continued for some time and in 8 out of 10 instances, i was being stretched. i was luckier in others where i could complete my work just in time. that was working hard. not smart!

it took me some time, questions, feedback, more questions and thus, experience, before i got myself into the act of providing a number that included various activities that went into me running a test including (but perhaps not limited to):

  1. understanding the requirement/specification and the functionality behind the item being tested
  2. (basic) understanding of the data required – both as a pre-requisite and operational
  3. writing a test case / scenario
  4. executing a test case / scenario
  5. writing a defect report, if it disagreed with its requirements
  6. retesting defect (if there was one to be raised)
  7. contingency (of any admin tasks, environment non-availability, release delays etc)
  8. regression testing (if there is a requirement for one)

i especially like to keep a note of how this has worked for me. i’ve noticed this model works well in most, if not all, cases.

during the last 6 odd months, this has proven accurate enough to be treated as a template for the cost model for testing activities and has even proven the historic estimate of 35-40% of dev time on testing wrong (atleast in my work place and for data warehouse testing that i do).

matthew heusser has some brilliant advice on software test estimation. series of about 7 items available on his blog.