My experience is that, unfortunately, far too many databases lack referential integrity in the schema. There are a few reasons for this, almost all of them bad. One of the worst is the “our application controls integrity” rationale, with a dialogue that typically goes something like:
Customer: “We control this in our application logic, so we don’t need to do it in the database.”
My response: So if your application does it so well, why can’t you also enable it in the database?
Usual customer response: “Well, that would break our application, because we temporarily allow inconsistencies for our application to work.”
My response: Is your application building the data into the tables in a single transaction so that nothing is committed to the database until all the operations needed to ensure integrity are completed?
Usual customer response: “No, that would be too slow” or “No, that is too complicated” or “No, that would cause locking issues” (concurrency problems) or “No, we need to be able to temporarily store the inconsistent data in the database”, etc.
My response: So that means if there is an unexpected error or a system problem in the application, then you are left with data integrity problems.
Customer response: “Uh, well, um, yes, I guess so.”
My purpose in this post, though, is not to delve into this design problem and why it is critical to enforce referential integrity in the database, but to discover the integrity that may already exist in the database. There is another motivator for this post as well: data often contains relationships that are not strictly needed for integrity, existing coincidentally or due to some unobvious correlation. Based on that, I think there is value in the tool even if you are working with a database that has strict referential integrity implemented.
Without further ado, below is the technique in two flavors: a single-column flavor, where we look for potential foreign keys and primary keys based on a single column, and a multi-column flavor for composite foreign keys/primary keys. Both techniques require dynamic SQL to implement; the design is to generate the SQL, execute it, and return the results back into a variable in the mainline code.
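To make the single-column flavor concrete, here is a minimal sketch of one check, using a hypothetical pair of columns (dbo.OrderDetail.ProductID as the candidate foreign key, dbo.Product.ProductID as the candidate primary key; the names are mine, purely for illustration). The generated SQL counts distinct values in the candidate foreign-key column that are missing from the candidate primary-key column, and sp_executesql returns the count into a variable in the mainline code:

```sql
-- Minimal sketch of the single-column foreign-key check, assuming a
-- hypothetical candidate pair: dbo.OrderDetail.ProductID (foreign key)
-- against dbo.Product.ProductID (primary key). Zero misses suggests a
-- potential foreign key relationship.
DECLARE @sql    nvarchar(max),
        @misses bigint;

SET @sql = N'SELECT @misses = COUNT(*)
             FROM (SELECT DISTINCT ProductID
                   FROM dbo.OrderDetail
                   WHERE ProductID IS NOT NULL) AS fk
             WHERE NOT EXISTS (SELECT 1 FROM dbo.Product AS pk
                               WHERE pk.ProductID = fk.ProductID);';

-- Execute the generated SQL and return the result into a variable
-- in the mainline code via an OUTPUT parameter.
EXEC sp_executesql @sql, N'@misses bigint OUTPUT', @misses = @misses OUTPUT;

IF @misses = 0
    PRINT 'dbo.OrderDetail.ProductID is a potential foreign key to dbo.Product.ProductID';
```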
The single-column version is pretty easy to write the SQL for, and I plan to do this in the next couple of weeks. The multi-column version is trickier, especially to do in an efficient manner that visits each combination only once. The best implementation is probably a recursive CTE to build that list (a sketch of the idea follows). I’m going to see if I can convince my adult son Blake, who is an excellent SQL developer and could write a book on recursive CTE queries, to take it on. He has already developed several for various applications that will make your hair stand on end.
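To give a feel for that approach, here is a hedged sketch of a recursive CTE that enumerates the multi-column combinations for one hypothetical table. It visits each combination exactly once by only extending a combination with columns later in ordinal position, so the same set of columns is never generated twice in a different order:

```sql
-- Sketch of enumerating multi-column combinations with a recursive CTE.
-- The table name is hypothetical. Extending only with columns "to the
-- right" of a combination's last member guarantees each combination is
-- produced exactly once.
WITH cols AS (
    SELECT COLUMN_NAME, ORDINAL_POSITION
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'OrderDetail'
),
combos AS (
    -- anchor: every single column is a combination of size 1
    SELECT CAST(COLUMN_NAME AS nvarchar(max)) AS ColumnList,
           ORDINAL_POSITION AS LastPosition,
           1 AS ComboSize
    FROM cols
    UNION ALL
    -- recursive step: extend each combination only with later columns
    SELECT c.ColumnList + N', ' + x.COLUMN_NAME,
           x.ORDINAL_POSITION,
           c.ComboSize + 1
    FROM combos AS c
    JOIN cols AS x ON x.ORDINAL_POSITION > c.LastPosition
    WHERE c.ComboSize < 3   -- practical cap on combination width
)
SELECT ColumnList, ComboSize
FROM combos
ORDER BY ComboSize, ColumnList;
```

Each ColumnList produced here would then be plugged into the same kind of generated distinct/match check used in the single-column flavor.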
The below is based on utilizing the standard information schema views in SQL Server to enumerate tables and columns (Information_Schema.Tables and Information_Schema.Columns). This could instead be driven by a table of suspected relationships to make the process more efficient, and the process could write the discovered relationships back to use as a baseline for maintenance of the relationship information.
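Here is a sketch of driving the candidate primary-key side of the discovery from those views. Every base-table column is tested with generated SQL; a column qualifies when COUNT(*) = COUNT(DISTINCT col), which for a non-empty table implies both uniqueness and no NULLs. The data-type exclusions and overall structure are my assumptions about how the mainline would look:

```sql
-- Sketch of enumerating columns from the information schema views and
-- testing each as a candidate single-column primary key.
DECLARE @schema sysname, @table sysname, @column sysname,
        @sql nvarchar(max), @isCandidate bit;

DECLARE col_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS AS c
    JOIN INFORMATION_SCHEMA.TABLES AS t
      ON t.TABLE_SCHEMA = c.TABLE_SCHEMA
     AND t.TABLE_NAME   = c.TABLE_NAME
    WHERE t.TABLE_TYPE = 'BASE TABLE'
      -- these types do not support COUNT(DISTINCT ...)
      AND c.DATA_TYPE NOT IN ('text', 'ntext', 'image', 'xml',
                              'geography', 'geometry');

OPEN col_cursor;
FETCH NEXT FROM col_cursor INTO @schema, @table, @column;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- COUNT(*) = COUNT(DISTINCT col) on a non-empty table implies the
    -- column is unique and has no NULLs
    SET @sql = N'SELECT @isCandidate = CASE WHEN COUNT(*) > 0 AND COUNT(*) = COUNT(DISTINCT '
             + QUOTENAME(@column) + N') THEN 1 ELSE 0 END FROM '
             + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N';';

    EXEC sp_executesql @sql, N'@isCandidate bit OUTPUT',
         @isCandidate = @isCandidate OUTPUT;

    IF @isCandidate = 1
        PRINT QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N'.'
            + QUOTENAME(@column) + N' is a candidate primary key';

    FETCH NEXT FROM col_cursor INTO @schema, @table, @column;
END;
CLOSE col_cursor;
DEALLOCATE col_cursor;
```

A table of suspected relationships would simply replace the cursor’s SELECT, narrowing the candidates the loop has to test.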
One other interesting item to implement would be a “fudge factor” when checking the relationships. Since this is, by definition, a process for databases that don’t have referential integrity, it is very likely that few exact relationships will be found, precisely because no referential integrity exists and the data is simply bad and inconsistent. By applying a fudge factor when comparing the distinct values in the candidate foreign-key column against the candidate primary-key column, we could identify potential data integrity problems: cases where the percentage of matches is high enough to indicate that something probably should be a foreign key, with the mismatches actually being data integrity errors.
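A sketch of that fudge factor, reusing the hypothetical column pair from above; the 95% threshold is an arbitrary value I chose for illustration:

```sql
-- Sketch of the fudge factor: instead of requiring zero misses, compute
-- the percentage of distinct candidate foreign-key values that match
-- the candidate primary key. Above a threshold, flag the pair as a
-- probable foreign key whose mismatches are likely integrity errors.
DECLARE @sql nvarchar(max), @pctMatch decimal(5, 2);

SET @sql = N'SELECT @pctMatch = 100.0
                  * SUM(CASE WHEN pk.ProductID IS NOT NULL THEN 1 ELSE 0 END)
                  / NULLIF(COUNT(*), 0)
             FROM (SELECT DISTINCT ProductID
                   FROM dbo.OrderDetail
                   WHERE ProductID IS NOT NULL) AS fk
             LEFT JOIN dbo.Product AS pk
               ON pk.ProductID = fk.ProductID;';

EXEC sp_executesql @sql, N'@pctMatch decimal(5,2) OUTPUT',
     @pctMatch = @pctMatch OUTPUT;

IF @pctMatch >= 95.0 AND @pctMatch < 100.0
    PRINT 'Probable foreign key with '
        + CAST(100.0 - @pctMatch AS varchar(10))
        + '% of values looking like data integrity errors';
```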
Note that there is a possibility of false positives, especially if the database is lightly populated: that is, finding something that looks like a foreign key because it happens to contain only values from the candidate primary key, when in reality it is just a coincidence of the data.