Referential Integrity Over Time-Varying Tables
In 1992, the University of Arizona had a big problem. Its data was managed by a suite of Cobol legacy systems, each with its own underlying DBMS or file structure. The Personnel Operating System (PSOSs), using the Transact DBMS, managed (and continue to manage) personnel records. A financial records system (FRS) using IDMS manages financial records, including purchasing, reorders, property management, fixed assets, and the ledger. Yet another system contains student records-the student information system (SIS), which uses VSAM. To obtain information from a specific system, you would ask someone in the Center for Computing and Information Technology (CCIT) who was familiar with that system to write a report program, a process that often took several weeks. Getting an integrated report across two or more of these databases was exceedingly difficult. About the only positive thing that you could say about this situation was that it guaranteed job security for CCIT personnel.
Cheryl Bach, an impassioned and imaginative seven-year veteran, had a better idea. She proposed that CCIT create a data warehouse in a relational DBMS, gathering cleansed and regularized information from all these systems into a single, consistent database. Cheryl, working with Htay Lay and later with Chris Janton, eventually defined some 300 tables comprising the integrated university information system (UIS). In day-to-day operations, some tables were copied over daily from the legacy systems; others were added each pay period, each month, or even each 12 months (for those tables summarizing the fiscal year).
Merging the data from each system into UIS required many technical decisions because the data generally wasn't stored in tabular format in the legacy systems. We will examine two UIS tables, the INCUMBENTS table and the POSITIONS table. The INCUMBENTS table contains 12 columns; the POSITIONS table contains 16 columns. Here we focus on the following columns:
INCUMBENTS (SSN, PCN) POSITIONS (PCN, JOB_TITLE_CODE1)
Each row of the first table provides information on a job assignment (identified by the position control number, or PCN, for a current employee), and each row of the third table describes a particular position (which can be associated with multiple job titles). The primary key of INCUMBENTS is (SSN, PCN), and the primary key of POSITIONS is PCN. INCUMBENTS.PCN is a foreign key to POSITIONS.PCN, which is the concern of this column.
Cheryl appended two columns to the INCUMBENTS table. The first column indicates when the information in the row became valid (that is, when the employee was assigned to that position). The second column indicates when the information was no longer valid (that is, when the employee was assigned to another position or left the University).
INCUMBENTS (SSN, PCN, START_DATE, END_DATE)
Such a table is called a valid-time table; it records the history of the modeled reality. The original table, without temporal support, is termed a snapshot table because it logically captures the state of the enterprise at a single point in time, similar to a photographic snapshot. Such tables are generally kept up to date and so capture reality "as of now."
Table 1 is an excerpt of the INCUMBENTS table. The first row specifies that employee 111-22-3333 (Bob) had position 90025 (senior vice president, research) starting at the beginning of 1996, and extending to June 1 of that year, when he transferred to position 120033. He stayed in that position for a total of four months, until October 1, when he transferred to position 137112, where he continues to this day (the special date Ф3000-01-01' denotes "currently valid"). Another employee, 444-33-2222, has remained in the same position since being hired the beginning of the following year.
Table 1 has several interesting features. Although Bob's SSN appears in three rows, he had only one position at any point in time. In the terminology of my June 1998 article in Database Programming & Design, SSN is a sequenced primary key of INCUMBENTS.
Additionally, neither employee has holes in his or her position history. (A hole represents times when there was no PCN associated with their SSN.) Disallowing holes may be appropriate or unacceptable, depending on the semantics of the application.
The third observation is that this table can be viewed as a compact representation of a sequence of snapshot tables, each valid on a particular day. The snapshot table valid on January 1, 1996, contains one row, (111-22-3333, 900225); the snapshot table valid on September 13, 1996, also contains one row (111-22-3333, 120033); the table valid on February 22, 1997, contains two rows: (111-222-3333, 137112) and (444-33-2222, 120033). This long sequence of snapshot tables is very efficiently encoded in these five rows by associating a period (two DATEs) with each row.
A referential integrity (RI) constraint specifies that the value of the specified column in every row of the referencing table appears as the value of a specified column in a row of the referenced column. The expression of such constraints depends heavily on whether the referencing and referenced tables are temporal tables. There are four cases, depending on whether the referencing and referenced tables are temporal.
Neither table is temporal. SQL's constructs are perfectly adequate if neither table is temporal. Assuming for the moment that the INCUMBENTS table has no timestamp columns, the fact that INCUMBENTS.PCN is a foreign key for POSITIONS.PCN is expressed as follows in this first code fragment:
ALTER TABLE INCUMBENTS ADD FOREIGN KEY (PCN) REFERENCES POSITIONS
Only the referencing table is temporal. If the referencing table is temporal but the referenced table is not, the same code works. Considering INCUMBENTS as a valid-time state table with columns START_DATE and END_DATE, the above code fragment works fine. Here the assumption is that the nontemporal table contains time-invariant data (that is, data that doesn't vary over time).
Both tables are temporal. If the referenced table is temporal, the situation is considerably more complex. We make POSITIONS temporal by adding START_DATE and END_DATE columns. It turns out that in the UIS position code, numbers are sometimes invalidated and occasionally reuse a PCN. However, the UIS only maintains the current state of the POSITIONS table, so when a PCN is reused, an old value will be matched <<again, passive>> incorrectly to a new job title code. You can avoid this situation by making POSITIONS temporal and defining appropriate referential integrity constraints:
POSITIONS(PCN, JOB_TITLE_CODE1, START_DATE, END_DATE)
As with duplicates, there are three types of referential integrity constraints on temporal tables: current, sequenced, and nonsequenced.
Current referential integrity ("the PCN of all current incumbents must be a current position") is straightforward and uses the trick of converting a predicate of the form "for all P" into "not exists (not P)."
In Listing 1, we extract the current state on which to apply the constraint. Current referential integrity can be satisfied today but violated tomorrow-even if no changes are made to either table.
My first code fragment works well for the nonsequenced referential integrity constraint ("for each value of INCUMBENTS.PCN, there existed at some, possibly different, time that value in POSITIONS.PCN"). The fact that POSITIONS is temporal means that past positions have become invalid. Nonsequenced referential integrity ignores this time-varying behavior, and is content to match current incumbents with out-of-date positions.
The temporal analogue of a nontemporal referential integrity constraint is the sequenced constraint: "At each point in time, each incumbent's PCN is valid at that time." This applies the intuition of referential integrity to time-varying information. As with duplicates, stating a sequenced referential integrity constraint in English is natural, but stating it in SQL is challenging.
The key is a sequenced foreign key if for all rows r in the referencing table,
The assertion in Listing 2 may be read as follows: The outermost NOT EXISTS states that no row I of INCUMBENTS fails the referential integrity test. The three predicates in the WHERE clause provide ways for row I to fail the test. First, the test fails if there is no row in POSITIONS valid at the start of row I's period of validity. Second, the test fails if there is no row in POSITIONS valid at the end of row I's period of validity. Third, the test fails if there is a "hole" during row I's period of validity, a time when no row of POSITIONS was valid. A hole exists if there is a row P that ends during row I's period of validity that is not "extended" (towards I.END_DATE) by another row.
CREATE ASSERTION INCUMBENTS_Sequenced_Referential_Integrity CHECK (NOT EXISTS ( SELECT * FROM INCUMBENTS AS I -- there was a row valid in INCUMBENTS.PCN when I started WHERE NOT EXISTS ( SELECT * FROM POSITIONS AS P WHERE I.PCN = P.PCN AND P.START_DATE <= I.START_DATE AND I.START_DATE < P.END_DATE) -- there was a row valid in INCUMBENTS.PCN when I ended OR NOT EXISTS ( SELECT * FROM POSITIONS AS P WHERE I.PCN = P.PCN AND P.START_DATE < I.END_DATE AND I.END_DATE <= P.END_DATE) -- there are no holes in INCUMBENTS.PCN during I's period of validity OR EXISTS ( SELECT * FROM POSITIONS AS P WHERE I.PCN = P.PCN AND I.START_DATE < P.END_DATE AND P.END_DATE < I.END_DATE AND NOT EXISTS ( SELECT * FROM POSITIONS AS P2 WHERE P2.PCN = P.PCN AND P2.START_DATE <= P.END_DATE AND P.END_DATE < P2.END_DATE))) )
Listing 2. A sequenced referential integrity constraint.
If we are assured that the histories in the referenced table are contiguous-that there are no holes in these histories-then sequenced referential integrity is easier to express, as are two assertions.
Requiring a contiguous history (see Listing 3) is a nonsequenced constraint. Unlike a sequenced constraint, which must be true independently at each point in time, a nonsequenced constraint requires examining the table at multiple points in time. The absence of a PCN on a particular day constitutes a hole in the history only if there is a PCN for this SSN present both before and after this date.
CREATE ASSERTION POSITIONS_Contiguous_History CHECK (NOT EXISTS ( SELECT * FROM POSITIONS AS P, POSITIONS AS P2 WHERE P.END_DATE < P2.START_DATE AND P.PCN = P2.PCN AND NOT EXISTS ( SELECT * FROM POSITIONS AS P3 WHERE P3.PCN = P.PCN AND (((P3.START_DATE <= P.END_DATE) AND (P.END_DATE < P3.END_DATE)) OR ((P3.START_DATE < P2.START_DATE) AND (P2.START_DATE <= P3.END_DATE))))) )
Listing 3. Ensuring there are no holes in the history.
Given that there are no holes, we can check for containment of the referencing period of validity by the contiguous history in the referenced table by simply checking the delimiting instants of the referencing period of validity (see Listing 4).
CREATE ASSERTION INCUMBENTS_Sequenced_Referential_Integrity CHECK (NOT EXISTS ( SELECT * FROM INCUMBENTS AS I WHERE NOT EXISTS ( SELECT * FROM POSITIONS AS P WHERE I.PCN = P.PCN AND P.START_DATE <= I.START_DATE AND I.START_DATE < P.END_DATE) OR NOT EXISTS ( SELECT * FROM POSITIONS AS P WHERE I.PCN = P.PCN AND P.START_DATE < I.END_DATE AND I.END_DATE <= P.END_DATE))) )
Listing 4. Sequenced referential integrity assuming a contiguous history.
Only the referenced table is temporal. The final case to consider (see Listing 5) is when the referencing table (here, INCUMBENTS) is a nontemporal table and the referenced table (here, POSITIONS) is a temporal table. The current constraint is easy to express. The nonsequenced constraint is, again, expressed as the first code fragment we discussed.
For the sequenced constraint, we need to be more precise on what "at each point in time" means for a nontemporal table. There are at least two reasonable interpretations. One is that the nontemporal table records current data, in which case a sequenced constraint is equivalent to a current constraint. A different interpretation is that the nontemporal table contains time-invariant data. In that case, the original constraint (in the first code fragment) works fine.
For the INCUMBENTS table, the most appropriate interpretation is that this table records current data, in which case the code in Listing 1 suffices.
CREATE ASSERTION INCUMBENTS_Current_Referential_Integrity CHECK (NOT EXISTS ( SELECT * FROM INCUMBENTS AS I WHERE I.END_DATE = DATE '3000-01-01' AND NOT EXISTS ( SELECT * FROM POSITIONS AS P WHERE I.PCN = P.PCN AND P.END_DATE = DATE '3000-01-01')) )
Listing 1. A current referential integrity constraint.
IBM DB2 does not support assertions nor a SELECT statement within a CHECK constraint. It uses triggers to implement these checks, which can significantly complicate matters. For example, you need four triggers to specify current referential integrity in DB2, INSERT and UPDATE triggers on INCUMBENTS and DELETE and UPDATE triggers on POSITIONS, instead of the single fairly simple CHECK constraint given earlier.
Microsoft Access does not support assertions. You can implement assertion checking in Access via Visual Basic functions that are activated by events provided in Access forms, in particular, the After Insert event.
Microsoft SQLServer, Sybase SQLServer, and UniSQL also do not support ASSERTIONs; you must implement them as triggers.
Oracle does not support assertions or complex CHECK constraints; you must use triggers to implement them. Another relevant limitation is that there can be only one INSERT/UPDATE trigger per table in Oracle, so multiple CHECK constraints or ASSERTIONs on a table must be merged into a single trigger.
Temporal Referential Integrity in SQL-92
The details of expressing a referential integrity constraint depend on whether the referencing and referenced tables are temporal tables. The six code fragments I offered illustrate the possibilities.
In many cases, a simple constraint suffices. However, when both tables are temporal, the most natural referential constraint is a sequenced one, and a complex assertion (Listing 2) or pair of assertions (Listings 3 and 4) is required. Table 2 summarizes the alternatives.
There are various kinds of referential integrity that can be expressed with the proposed constructs.
Case 1. Neither table is temporal. You can initially assume that neither the INCUMBENTS nor the POSITIONS table has temporal support. RI can then be expressed in SQL-92 as:
ALTER TABLE INCUMBENTS ADD FOREIGN KEY (PCN) REFERENCES POSITIONS
Case 2. Only the referencing table is temporal. We now add temporal support to the INCUMBENTS table using a proposed SQL3 construct:
ALTER TABLE INCUMBENTS ADD VALIDTIME PERIOD(DATE)
This associates a period with each row; you can think of the period as two delimiting DATEs.
The foreign key constraint still applies directly. It translates to a current constraint: "For each currently valid row of INCUMBENTS, the PCN is also in POSITIONS." This ensures temporal upward compatibility, guaranteeing that applications are not broken when temporal support is added to an underlying table.
Case 3. Both tables are temporal. We now add temporal support to the referenced table, POSITIONS:
ALTER TABLE POSITIONS ADD VALIDTIME PERIOD(DATE)
When applied to a table with temporal support, we interpret the foreign key specified in case 1 as a current foreign key. In SQL-92, current RI requires a 10-line assertion (Listing 1).
For a nonsequenced RI constraint ("for each value of INCUMBENTS.PCN, there existed at some, possibly different, time that value in POSITIONS.PCN"), you only have to specify that the constraint is nonsequenced:
ALTER TABLE INCUMBENTS ADD NONSEQUENCED VALIDTIME FOREIGN KEY (PCN) REFERENCES POSITIONS
The sequenced RI constraint, "at each point in time, each incumbent's PCN is valid at that time," is the most natural application of the non-temporal RI constraint to time-varying information. This required a complex, 29-line assertion (see Listing 2). Using the proposal constructs, only one additional keyword, VALIDTIME, is necessary to obtain a sequenced integrity constraint (query, view, modification, cursor):
ALTER TABLE INCUMBENTS ADD VALIDTIME FOREIGN KEY (PCN) REFERENCES POSITIONS
Case 4. Only the referenced table is temporal. Here we drop the temporal support on the referencing table:
The original RI constraint continues to apply, and is equivalent to the eight-line assertion in Listing 5, in which every PCN value in INCUMBENTS must also occur in the current state of the POSITIONS table.
CREATE ASSERTION INCUMBENTS_Current_Referential_Integrity CHECK (NOT EXISTS ( SELECT * FROM INCUMBENTS AS I WHERE NOT EXISTS ( SELECT * FROM POSITIONS AS P WHERE I.PCN = P.PCN AND P.END_DATE = DATE '3000-01-01')) )
Listing 5. Current referential integrity when only the referenced table is temporal.
In SQL-92, users must "roll their own" referential integrity assertions over time-varying tables. The SQL-92 FOREIGN KEY constraint simply doesn't work; users must instead write highly complex assertions. With the proposed additions to SQL3, the various forms of referential integrity are all simple variants of the nontemporal constraint.
R.T. Snodgrass, M.H. Boehlen, C.S. Jensen, and A. Steiner. "Adding Valid Time to SQL/Temporal." Change proposal, ANSI X3H2-96-501r2, ISO/IEC JTC 1/SC 21/WG 3 DBL-MAD-146r2, November 1996. At URL: ftp://ftp.cs.arizona.edu/tsql/tsql2/sql3/mad146.pdf
Rick Snodgrass is a professor of computer science at the University of Arizona. He chairs ACM SIGMOD, has written many papers and several books on temporal databases, and consults on designing and implementing time-varying databases. He is working with the ANSI and ISO SQL3 committees to add temporal support to that language. You can email Rick at email@example.com or visit his web page at www.cs.arizona.edu/people/rts.
This material is based on the forthcoming book, Developing Time-Oriented Applications in SQL, by Richard Thomas Snodgrass, copyright (c) 1999 Morgan Kaufmann Publishers, Inc., http://www.mkp.com/.
Украинская баннерная сеть