Generating a Type-2 Dimension from Multiple Tables with Effective Dates

In slowly changing (SCD) dimensions, type-2 attributes involves ending a dimension row when the attribute value changes with the current date and creating a new row starting from the current date. While this works great to capture history once a warehouse/cube is implemented, it does not address the situation of historical data if such tracking was already occurring. For most scenarios, the history can simply be loaded without the cube processing having the type-2 rules implemented to pre-populate the SCD start/end dates. However in some business such as health care, history may sometimes need to be “re-written”. For example, a claim may not get into the system until several days after it occurs and the claim would need to be effective for a point in time in history already loaded.

This challenge is highlighted in the below examples. A real-world example might be tracking what certifications were held by an individual at a given point in time. For example a software professional may have been an Oracle OCP from 6/1/2010 to 8/31/2010, and a PMP from 7/1 TO 12/31. To track the history of all certifications held by date ranges, 3 rows are needed:

– 6/1/2010 – 6/31/2010 : OCP

– 7/1/2010 – 8/31/2010  : OCP, PMP

– 9/1/2010 – 12/31/2011: PMP only

The spreadsheet illustrates the scenario where sets of data for a dimension row coming from 2 different tables have values effective for different historical periods – the dates associated with the 2 sets of values could be overlapping, equivalent, or covering (i.e. ‘A’ start date before ‘B’ start date with end date also greater than ‘B’ end date)

This is of interest for the scenario where customer already has SCD-type2 tracking in place for historical data across more than one tables which have values controlled by different effective date ranges and desires to consolidate to a single dimension. The goal is a single dimension from multiple source tables with TYPE-2 tracking using a single start/end date to capture history of all values associated to effective dates from the source tables including support for regenerating if the individual historical values or date ranges are changed.

Below is the SQL that demonstrates the solution approach I found for taking multiple histories with different start/end dates associated to historical values and getting them to a consolidated History input with type-2 SCD behavior that emulates what would have happened if SCD tracking had been in place incrementally over the date range of the historical data. Here are the steps:

1) Unpivot start/end dates through unions of the start/end dates from all historical tables into 2 columns

2) Create distinct ranges from the unpivoted dates by joining the distinct start dates to the next available end date

3) Join back to the source tables for all values effective for the segmented date ranges.

The SQL accomplishes the 3 items through views (unpivoted, range, history) to generate the required rows for the type-2 SCD dimension.

Although only 2 tables are involved, this should work the same way for additional tables, it just means adding additional unions in the unpivoted view and additional left joins to the history view.

CREATE TABLE [dbo].[t1](

[k] [int] NOT NULL,

[v] [int] NULL,

[s] [date] NULL,

[e] [date] NULL)

GO

CREATE TABLE [dbo].[t2](

[k] [int] NOT NULL,

[v] [int] NULL,

[s] [date] NULL,

[e] [date] NULL)

CREATE VIEW [dbo].[Unpivoted] AS

SELECT s, e FROM T1 UNION SELECT s, e FROM T2

GO

CREATE VIEW [dbo].[Ranges] AS

SELECT s, (SELECT MIN(e) FROM Unpivoted WHERE e > VS.s) AS e FROM (SELECT DISTINCT s FROM Unpivoted) AS VS

GO

CREATE VIEW [dbo].[History] AS

SELECT R.s, R.e, t1.k as T1_Key, t2.k as T2_Key, T1.v as T1_Value, T2.V as T2_Value FROM Ranges R

LEFT JOIN T1

ON T1.s BETWEEN R.s AND R.e

OR T1.e BETWEEN R.S and R.e

OR T1.s < R.s AND T1.e > R.e

LEFT JOIN T2

ON T2.s BETWEEN R.s AND R.e

OR T2.e BETWEEN R.s AND R.e

OR T2.s < R.s AND T2.e > R.e

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (1, 0, CAST(0x01380B00 AS Date), CAST(0x78380B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (2, 1, CAST(0x79380B00 AS Date), CAST(0xF3380B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (3, 0, CAST(0xF4380B00 AS Date), CAST(0x11390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (4, 1, CAST(0x12390B00 AS Date), CAST(0x20390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (5, 0, CAST(0x21390B00 AS Date), CAST(0x30390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (6, 1, CAST(0x31390B00 AS Date), CAST(0x4E390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (1, 0, CAST(0x01380B00 AS Date), CAST(0x4A380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (2, 1, CAST(0x4B380B00 AS Date), CAST(0x87380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (3, 0, CAST(0x88380B00 AS Date), CAST(0xB5380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (4, 1, CAST(0xB6380B00 AS Date), CAST(0x11390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (5, 0, CAST(0x12390B00 AS Date), CAST(0x20390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (6, 1, CAST(0x21390B00 AS Date), CAST(0x4E390B00 AS Date))

Below are the results of running the SQL

/*————————

SELECT * FROM T1 – Test table 1

SELECT * FROM T2 – Test table 2

SELECT * FROM UnPivoted – All distinct start/end dates from all rows extracted into 2 columns

SELECT * FROM Ranges – All distinct start dates with the next end date

SELECT * FROM History – Join back to the source tables to get the historical keys and values

————————*/

k           v           s          e

———– ———– ———- ———-

1           0           2014-01-01 2014-04-30

2           1           2014-05-01 2014-08-31

3           0           2014-09-01 2014-09-30

4           1           2014-10-01 2014-10-15

5           0           2014-10-16 2014-10-31

6           1           2014-11-01 2014-11-30

(6 row(s) affected)

k           v           s          e

———– ———– ———- ———-

1           0           2014-01-01 2014-03-15

2           1           2014-03-16 2014-05-15

3           0           2014-05-16 2014-06-30

4           1           2014-07-01 2014-09-30

5           0           2014-10-01 2014-10-15

6           1           2014-10-16 2014-11-30

(6 row(s) affected)

s          e

———- ———-

2014-01-01 2014-03-15

2014-01-01 2014-04-30

2014-03-16 2014-05-15

2014-05-01 2014-08-31

2014-05-16 2014-06-30

2014-07-01 2014-09-30

2014-09-01 2014-09-30

2014-10-01 2014-10-15

2014-10-16 2014-10-31

2014-10-16 2014-11-30

2014-11-01 2014-11-30

(11 row(s) affected)

s          e

———- ———-

2014-01-01 2014-03-15

2014-03-16 2014-04-30

2014-05-01 2014-05-15

2014-05-16 2014-06-30

2014-07-01 2014-08-31

2014-09-01 2014-09-30

2014-10-01 2014-10-15

2014-10-16 2014-10-31

2014-11-01 2014-11-30

(9 row(s) affected)

s          e          T1_Key      T2_Key      T1_Value    T2_Value

———- ———- ———– ———– ———– ———–

2014-01-01 2014-03-15 1           1           0           0

2014-03-16 2014-04-30 1           2           0           1

2014-05-01 2014-05-15 2           2           1           1

2014-05-16 2014-06-30 2           3           1           0

2014-07-01 2014-08-31 2           4           1           1

2014-09-01 2014-09-30 3           4           0           1

2014-10-01 2014-10-15 4           5           1           0

2014-10-16 2014-10-31 5           6           0           1

2014-11-01 2014-11-30 6           6           1           1

(9 row(s) affected)

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s