Change data capture
How do you determine what has changed in the source system in order to move the data to the data warehouse or, in some cases, from the data warehouse to the data mart?
Basically, there are four primary techniques to capture data from the source systems:
1) Sequential processing
2) Date and time stamps
3) Triggers
4) The coined phrase "change data capture"
Sequential processing
One very common technique is to extract data from the source system and sequentially process it against either the data in the data warehouse or a current-state version of the data. This takes loads of CPU and disk I/O (you can do the math), regardless of whether you do it against sequential files or via lookups in a database. Nowadays, ETL tools use a hashing technique to identify changes in a table row: the ETL engine stores a hash of the entire row and afterwards compares the old hash with the new one. If the hash has changed, it means something in that row has changed.
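Here is a minimal Python sketch of that row-hash comparison. The dictionary-based rows, the "id" primary key, and the function names are assumptions made purely for illustration; a real ETL engine would keep the stored hashes in a persistent table rather than in memory.

    import hashlib

    def row_hash(row: dict) -> str:
        # Hash the full row content in a deterministic column order.
        payload = "|".join(f"{col}={row[col]}" for col in sorted(row))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def detect_changes(source_rows, stored_hashes):
        """Compare each source row's hash against the hash kept from the last load.

        source_rows:   iterable of dicts, each containing a primary key 'id'
        stored_hashes: dict mapping primary key -> hash from the previous run
        """
        new_rows, changed_rows = [], []
        for row in source_rows:
            key = row["id"]
            h = row_hash(row)
            if key not in stored_hashes:
                new_rows.append(row)          # never seen before
            elif stored_hashes[key] != h:
                changed_rows.append(row)      # same key, different content
            # equal hashes mean the row is unchanged and can be skipped
        return new_rows, changed_rows

The point of the hash is that you only store and compare one short value per row instead of re-reading and comparing every column, which is where the CPU and I/O savings come from.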
Date and time stamps
The second technique is to extract the information from the source system based on date and time stamps. One big gotcha is that you must trust the date and time stamps. I have found that, for the most part, these are updated properly, but even when they are updated, you must know whether it happens every time. For example, does the source system allow updates via utilities other than the applications written to maintain the date and time stamps? Are the maintenance programs as rigorous as the application in updating the date and time stamps?

Of course, your application must have these date and time stamps to begin with. If not, then you cannot use this technique.
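As a rough illustration, here is a Python/SQLite sketch of a timestamp-driven incremental extract. The customers table, its columns, and the watermark handling are invented for the example; the core idea is simply to select only rows whose last_modified is newer than the value saved from the previous run, and to trust that column to be maintained by every update path.

    import sqlite3

    def extract_since(conn: sqlite3.Connection, watermark: str):
        """Pull only rows touched since the last successful run.

        'watermark' is the newest last_modified value seen by the previous run,
        stored somewhere durable (a control table, a file, etc.).
        """
        conn.row_factory = sqlite3.Row
        cur = conn.execute(
            "SELECT id, name, email, last_modified FROM customers "
            "WHERE last_modified > ? ORDER BY last_modified",
            (watermark,),
        )
        rows = cur.fetchall()
        # Advance the watermark to the newest timestamp actually extracted,
        # so rows committed while the extract runs are picked up next time.
        new_watermark = rows[-1]["last_modified"] if rows else watermark
        return rows, new_watermark

Note that this only works if every program that touches the table, including ad hoc utilities and maintenance jobs, keeps last_modified up to date, which is exactly the gotcha described above.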
Triggers
Adding database triggers is commonly suggested as a way to get only the changed data: "I will put on database triggers that will write the data out to an 'Audit' table that will tell us whether the row has been changed and who did it." Sounds great! This can solve all our problems!
But wait… Wouldn't this slow down the transaction system? After all, when the application was written and its performance targets were determined, did anyone take the use of triggers into consideration? Probably not. This means that we have now doubled the I/O (do the math), which means that we are cutting the performance in half. If it takes two seconds to do the I/O, it will now take (at least!) four seconds. Is this reasonable?
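To make the idea concrete, here is a small Python/SQLite sketch of the trigger approach; the customers table, the audit columns, and the trigger name are all made up for the example. Notice how a single UPDATE now performs two writes, one to the base table and one to the audit table, which is exactly where the performance cost comes from.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);

        -- Audit table populated by the trigger: one extra write per change.
        CREATE TABLE customers_audit (
            customer_id INTEGER,
            changed_at  TEXT,
            old_email   TEXT,
            new_email   TEXT
        );

        CREATE TRIGGER trg_customers_update
        AFTER UPDATE ON customers
        BEGIN
            INSERT INTO customers_audit (customer_id, changed_at, old_email, new_email)
            VALUES (NEW.id, datetime('now'), OLD.email, NEW.email);
        END;
    """)

    conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'a@old.example')")
    # This single UPDATE now costs two writes: the customer row and the audit row.
    conn.execute("UPDATE customers SET email = 'a@new.example' WHERE id = 1")
    print(conn.execute("SELECT * FROM customers_audit").fetchall())

The ETL job can then read customers_audit to find exactly which rows changed, but the transaction system pays for that convenience on every update.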
Change data capture
There are products on the market that can read through journal or log files and come out with a clean version of the changed data. Sometimes the database vendors themselves have something like this. While this may be a strong candidate as the best technique, it has yet to filter down to many implementations.
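Conceptually, these tools turn the database log into a stream of change events that you replay against the target. The event format below (JSON lines carrying insert/update/delete operations) and the customers table are invented for illustration; every real product has its own output, but the shape of the work is roughly this:

    import json

    def apply_change_events(conn, event_file):
        """Replay insert/update/delete events captured from a database log.

        The JSON-lines format here is an assumption for the sketch; real
        log-based CDC tools (or the database vendor's own utilities) each
        emit their own format, but it boils down to events like these.
        """
        with open(event_file) as f:
            for line in f:
                event = json.loads(line)
                op, key, row = event["op"], event["key"], event.get("row")
                if op == "insert":
                    conn.execute(
                        "INSERT INTO customers (id, name, email) VALUES (?, ?, ?)",
                        (key, row["name"], row["email"]),
                    )
                elif op == "update":
                    conn.execute(
                        "UPDATE customers SET name = ?, email = ? WHERE id = ?",
                        (row["name"], row["email"], key),
                    )
                elif op == "delete":
                    conn.execute("DELETE FROM customers WHERE id = ?", (key,))
        conn.commit()

The attraction is that the transaction system pays nothing extra: the changes are read from log files the database was going to write anyway.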
So, capturing the data and performing incremental updates is not as easy as it sounds. :)