SCD Types
In his book The Data Warehouse Toolkit, Ralph Kimball brought us the idea of slowly changing dimensions. In fact he defined three types of slowly changing dimensions - called Type 1, Type 2, and Type 3.
Type 1
The Type 1 slowly changing dimension is simply overwriting the row in the dimension with the new data. There is no history kept. Doesn't that defeat the purpose of a subject-oriented, time variant, integrated,non-volatile data store? Yes, it does, but it is also the easiest type to implement and uses the least amount of storage. After all, we areused to reading rows and updating them - that is what transaction systems are all about. However, you may wish to consider this type when there is incorrect data in your data warehouse. For example, if you find that someone's National ID Number was incorrect, you may have aneed for a Type 1 change. Again, this will be based on user requirements. It may be that the user community wishes to keep an actual history of what really happened in the data, so you would use another method.
Type 1 Simply overwrites the data so it has no history of data.
Type 2
Type 2 slowly changing dimensions are basically a form of version control for the rows in the database. Instead of making changes to the row, we add a new row to the table.
Type 2 uses a unique key that provides for version control.
Type 2 uses a unique key that provides for version control.
Forexample, some designers will use a key and add a version to it.
The table might look like
Create table Member_Table (
memberKey Integer,
Version Integer,
memberName Character,
Gender Character,
MaritalStatus Character,
Etc.);
When Member_Table row changes, you make a change to the data, update theVersion and write the new row to the table. Then during population ofthe fact table, you can get the latest Version of the member and usethat.
As an example, let's say that Maryum Khan was your only member.
Your row would look like
memberKey Version memberName Gender MaritalStatus
1000 1 Maryum Khan Female Single
If Maryum were to get married, you would add a new row like
memberKey Version memberName Gender MaritalStatus
1000 1 Maryum Khan Female Single
1000 2 Maryum Jawaid Female Married
When querying the data, if you ask for memberKey = 1000, then you getboth rows. This assumes memberKey is not a surrogate key but a realvalue known to the user community (otherwise they would not know the keyvalue!).
If you ask for Gender = Female, then you get both rows. Ifyou ask for MaritalStatus = Married, then you only get one row (version2).
But if you wanted to do analysis of MaritalStatus (grouping sales byMaritalStatus, you would get both rows but they would be summed and grouped separately.
But if you wanted to do analysis of MaritalStatus (grouping sales byMaritalStatus, you would get both rows but they would be summed and grouped separately.
The primary key to the table would be memberKey and Version. I don'tparticularly care for this option because it requires two values to be added per dimension to the fact table.
One way to get around this is to combine memberKey and Version to asingle column. However, I believe that this makes it more difficult forthe Extraction, Transformation, and Load (ETL) process to accomplish it.
I believe that Type 2 is the most widely used form of slowly changingdimensions. It provides a true historic perspective.
Another way of implementing Type 2 is to use a surrogate key and thereal key.
For example,
Member_Table table might look like
Create table Member_Table (
Create table Member_Table (
memberSurrogateKey Integer,
memberApplicationKey Integer,
memberName Character,
Gender Character,
MaritalStatus Character,
Etc.);
In this case, the memberSurrogateKey could be the version number, butit is unique throughout all of Member_Table table, instead of within asingle memberApplicationKey.
We always know that the latestMember_Table is the one with the highest memberSurrogateKey.
If welooked at the previous example with Maryum as our only test case, this ishow the table would look.
memberSurrogateKey memberApplicationKey memberName Gender MaritalStatus
1 1000 Maryum Khan Female Single
2 1000 Maryum Jawaid Female Married
It doesn't look much different. Just the data moved from one column toanother. But if you add a second member in between the time Maryum gotmarried, then you could see the difference.
memberSurrogateKey memberApplicationKey memberName Gender MaritalStatus
1 1000 Maryum Khan Female Single
2 1001 Saira Bano Female Married
3 1000 Maryum Jawaid Female Married
Previous way
memberKey Version memberName Gender MaritalStatus
1000 1 Maryum Khan Female Single
1001 1 Saira Bano Female Married
1000 2 Maryum Jawaid Female Married
What happens is that the ETL process becomes much simpler and providesan easy way to do all of the same analysis we could previously whilekeeping the fact table smaller by having less columns in its' primarykey.
Another way is we will put dates on this table. I do itbased on the user community requirements. What I don't want to do is tomake the user community put date complexity in their query. But I may wish to have the dates for a better understanding of the data.
For example,
if Maryum didn't purchase anything for a year, then how would yoube able to know when she was married? You may have gotten notificationthat she was married, but the data connected to it with the fact doesn'tshow when. So I use the dates to be able to analyze data this way. Idon't want the user community using dates to determine which row toretrieve.
So my table might look like
So my table might look like
Create table Member_Table (
memberSurrogateKey Integer,
memberApplicationKey Integer,
memberName Character,
Gender Character,
MaritalStatus Character,
CreatedDate Date,
Etc.);
Again, I don't need CreatedDate to do my queries.
However, if I would like to see just what the current folks are doing, Ihave to do MAX in my query to determine the latest information. While that is not too hard to do, I prefer to give an easier way to my usercommunity. Therefore I add a CurrentIndicator column (a Type 1-like columnwithin a Type 2 table!) and it gets updated from "Y" to "N" in theprevious row whenever a new row is inserted.
Now my table might look like
Create table Member_Table (
memberSurrogateKey Integer,
memberApplicationKey Integer,
memberName Character,
Gender Character,
MaritalStatus Character,
CreatedDate Date,
CurrentIndicator Character,
Etc.);
Like I said, I believe that Type 2 is the most widely used form ofslowly changing dimensions. It provides a true historic perspective.There are many ways to implement this type and I think it is highlydependent on your user community and their ability to access data.
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
In Type 3 I will change my example, you will later know why.
In our example, we will have the following table:
MemberKey Name City
1001 Maryum Jawaid Lahore
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
MemberKey Name, Original City, Current City, Effective Date
After Maryum moved from Lahore to Quetta, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2008):
MemberKey Name Original_City Current_State Effective Date
1001 Maryum Jawaid Lahore Quetta 15-JAN-2008
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Maryum later moves to Karachi on December 15, 2008, the Quetta information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.
1 comment:
The reason was I dont Maryum Jawaid to get divorced :)
Post a Comment