In slowly changing (SCD) dimensions, type-2 attributes involves ending a dimension row when the attribute value changes with the current date and creating a new row starting from the current date. While this works great to capture history once a warehouse/cube is implemented, it does not address the situation of historical data if such tracking was already occurring. For most scenarios, the history can simply be loaded without the cube processing having the type-2 rules implemented to pre-populate the SCD start/end dates. However in some business such as health care, history may sometimes need to be “re-written”. For example, a claim may not get into the system until several days after it occurs and the claim would need to be effective for a point in time in history already loaded.
This challenge is highlighted in the below examples. A real-world example might be tracking what certifications were held by an individual at a given point in time. For example a software professional may have been an Oracle OCP from 6/1/2010 to 8/31/2010, and a PMP from 7/1 TO 12/31. To track the history of all certifications held by date ranges, 3 rows are needed:
– 6/1/2010 – 6/31/2010 : OCP
– 7/1/2010 – 8/31/2010 : OCP, PMP
– 9/1/2010 – 12/31/2011: PMP only
The spreadsheet illustrates the scenario where sets of data for a dimension row coming from 2 different tables have values effective for different historical periods – the dates associated with the 2 sets of values could be overlapping, equivalent, or covering (i.e. ‘A’ start date before ‘B’ start date with end date also greater than ‘B’ end date)
This is of interest for the scenario where customer already has SCD-type2 tracking in place for historical data across more than one tables which have values controlled by different effective date ranges and desires to consolidate to a single dimension. The goal is a single dimension from multiple source tables with TYPE-2 tracking using a single start/end date to capture history of all values associated to effective dates from the source tables including support for regenerating if the individual historical values or date ranges are changed.
Below is the SQL that demonstrates the solution approach I found for taking multiple histories with different start/end dates associated to historical values and getting them to a consolidated History input with type-2 SCD behavior that emulates what would have happened if SCD tracking had been in place incrementally over the date range of the historical data. Here are the steps:
1) Unpivot start/end dates through unions of the start/end dates from all historical tables into 2 columns
2) Create distinct ranges from the unpivoted dates by joining the distinct start dates to the next available end date
3) Join back to the source tables for all values effective for the segmented date ranges.
The SQL accomplishes the 3 items through views (unpivoted, range, history) to generate the required rows for the type-2 SCD dimension.
Although only 2 tables are involved, this should work the same way for additional tables, it just means adding additional unions in the unpivoted view and additional left joins to the history view.
CREATE TABLE [dbo].[t1](
[k] [int] NOT NULL,
[v] [int] NULL,
[s] [date] NULL,
[e] [date] NULL)
GO
CREATE TABLE [dbo].[t2](
[k] [int] NOT NULL,
[v] [int] NULL,
[s] [date] NULL,
[e] [date] NULL)
CREATE VIEW [dbo].[Unpivoted] AS
SELECT s, e FROM T1 UNION SELECT s, e FROM T2
GO
CREATE VIEW [dbo].[Ranges] AS
SELECT s, (SELECT MIN(e) FROM Unpivoted WHERE e > VS.s) AS e FROM (SELECT DISTINCT s FROM Unpivoted) AS VS
GO
CREATE VIEW [dbo].[History] AS
SELECT R.s, R.e, t1.k as T1_Key, t2.k as T2_Key, T1.v as T1_Value, T2.V as T2_Value FROM Ranges R
LEFT JOIN T1
ON T1.s BETWEEN R.s AND R.e
OR T1.e BETWEEN R.S and R.e
OR T1.s < R.s AND T1.e > R.e
LEFT JOIN T2
ON T2.s BETWEEN R.s AND R.e
OR T2.e BETWEEN R.s AND R.e
OR T2.s < R.s AND T2.e > R.e
INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (1, 0, CAST(0x01380B00 AS Date), CAST(0x78380B00 AS Date))
INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (2, 1, CAST(0x79380B00 AS Date), CAST(0xF3380B00 AS Date))
INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (3, 0, CAST(0xF4380B00 AS Date), CAST(0x11390B00 AS Date))
INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (4, 1, CAST(0x12390B00 AS Date), CAST(0x20390B00 AS Date))
INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (5, 0, CAST(0x21390B00 AS Date), CAST(0x30390B00 AS Date))
INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (6, 1, CAST(0x31390B00 AS Date), CAST(0x4E390B00 AS Date))
INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (1, 0, CAST(0x01380B00 AS Date), CAST(0x4A380B00 AS Date))
INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (2, 1, CAST(0x4B380B00 AS Date), CAST(0x87380B00 AS Date))
INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (3, 0, CAST(0x88380B00 AS Date), CAST(0xB5380B00 AS Date))
INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (4, 1, CAST(0xB6380B00 AS Date), CAST(0x11390B00 AS Date))
INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (5, 0, CAST(0x12390B00 AS Date), CAST(0x20390B00 AS Date))
INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (6, 1, CAST(0x21390B00 AS Date), CAST(0x4E390B00 AS Date))
Below are the results of running the SQL
/*————————
SELECT * FROM T1 – Test table 1
SELECT * FROM T2 – Test table 2
SELECT * FROM UnPivoted – All distinct start/end dates from all rows extracted into 2 columns
SELECT * FROM Ranges – All distinct start dates with the next end date
SELECT * FROM History – Join back to the source tables to get the historical keys and values
————————*/
k v s e
———– ———– ———- ———-
1 0 2014-01-01 2014-04-30
2 1 2014-05-01 2014-08-31
3 0 2014-09-01 2014-09-30
4 1 2014-10-01 2014-10-15
5 0 2014-10-16 2014-10-31
6 1 2014-11-01 2014-11-30
(6 row(s) affected)
k v s e
———– ———– ———- ———-
1 0 2014-01-01 2014-03-15
2 1 2014-03-16 2014-05-15
3 0 2014-05-16 2014-06-30
4 1 2014-07-01 2014-09-30
5 0 2014-10-01 2014-10-15
6 1 2014-10-16 2014-11-30
(6 row(s) affected)
s e
———- ———-
2014-01-01 2014-03-15
2014-01-01 2014-04-30
2014-03-16 2014-05-15
2014-05-01 2014-08-31
2014-05-16 2014-06-30
2014-07-01 2014-09-30
2014-09-01 2014-09-30
2014-10-01 2014-10-15
2014-10-16 2014-10-31
2014-10-16 2014-11-30
2014-11-01 2014-11-30
(11 row(s) affected)
s e
———- ———-
2014-01-01 2014-03-15
2014-03-16 2014-04-30
2014-05-01 2014-05-15
2014-05-16 2014-06-30
2014-07-01 2014-08-31
2014-09-01 2014-09-30
2014-10-01 2014-10-15
2014-10-16 2014-10-31
2014-11-01 2014-11-30
(9 row(s) affected)
s e T1_Key T2_Key T1_Value T2_Value
———- ———- ———– ———– ———– ———–
2014-01-01 2014-03-15 1 1 0 0
2014-03-16 2014-04-30 1 2 0 1
2014-05-01 2014-05-15 2 2 1 1
2014-05-16 2014-06-30 2 3 1 0
2014-07-01 2014-08-31 2 4 1 1
2014-09-01 2014-09-30 3 4 0 1
2014-10-01 2014-10-15 4 5 1 0
2014-10-16 2014-10-31 5 6 0 1
2014-11-01 2014-11-30 6 6 1 1
(9 row(s) affected)