Change data capture

Change data capture (CDC) is a set of software design patterns used to determine the data that has changed in a database so that action can be taken using the changed data.

CDC solutions occur most often in data warehouse environments, since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be used in any database or data repository system.

CDC Solutions

CDC solutions can be created in a number of ways and in any one or a combination of system layers from application logic down to physical storage.

In a simplified CDC context, one computer system has data that is thought to have changed from a previous point in time, and a second computer system needs to take action based on that changed data. The former is the source, the latter is the target. It is possible that the source and target are the same system physically, but that does not change the design patterns logically.

It is possible and not uncommon for multiple CDC solutions to exist in a single system.

Timestamps on Rows

Tables whose changes must be captured are given a column that represents the time of last change. Names such as LAST_UPDATE are common. Any row in any table whose timestamp in that column is more recent than the last time data was captured is considered to have changed.
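
The timestamp comparison can be illustrated with a minimal sketch. It assumes an in-memory SQLite database, a hypothetical customer table, and ISO 8601 timestamp strings (which sort in chronological order when compared as text); none of these names come from a particular product.

```python
import sqlite3

# Hypothetical source table with a last_update column (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "Ada", "2005-05-20T09:00:00"),
        (2, "Grace", "2005-06-03T14:30:00"),
        (3, "Edsger", "2005-06-10T08:15:00"),
    ],
)


def capture_changes(conn, last_capture):
    """Return rows whose last_update is more recent than the previous capture."""
    return conn.execute(
        "SELECT id, name, last_update FROM customer "
        "WHERE last_update > ? ORDER BY id",
        (last_capture,),
    ).fetchall()


# Only rows 2 and 3 were modified after the previous capture time.
changed = capture_changes(conn, "2005-06-01T00:00:00")
```

In such a design the target would also need to remember the time of its last capture, so that the next run can pass it as last_capture.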

Version Numbers on Rows

Tables whose changes must be captured are given a column that contains a version number. Names such as VERSION_NUMBER are common. When data in a row changes, its version number is updated to the current version. A supporting construct, such as a reference table holding the current version, is needed. When a change capture occurs, all data with the latest version number is considered to have changed. When the change capture is complete, the reference table is updated with a new version number.

There are several major techniques for doing CDC with version numbers; the one described above is only one of them.
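
The reference-table variant described above might look like the following sketch. The product and capture_version tables and the helper functions are hypothetical, and integer version numbers are an assumption made for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reference table holding the current version (hypothetical name).
    CREATE TABLE capture_version (current_version INTEGER NOT NULL);
    INSERT INTO capture_version VALUES (1);

    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        price REAL,
        version_number INTEGER
    );
    INSERT INTO product VALUES (1, 9.99, 0), (2, 4.50, 0);
""")


def update_price(conn, product_id, price):
    """Change a row and stamp it with the current version."""
    conn.execute(
        "UPDATE product SET price = ?, "
        "version_number = (SELECT current_version FROM capture_version) "
        "WHERE id = ?",
        (price, product_id),
    )


def capture_changes(conn):
    """Return rows carrying the current version, then advance the version."""
    rows = conn.execute(
        "SELECT id, price FROM product "
        "WHERE version_number = (SELECT current_version FROM capture_version) "
        "ORDER BY id"
    ).fetchall()
    conn.execute("UPDATE capture_version SET current_version = current_version + 1")
    return rows


update_price(conn, 2, 5.25)
first = capture_changes(conn)   # picks up the row changed since the last capture
second = capture_changes(conn)  # nothing has changed under the new version
```

A real system would need to run the capture and the version bump atomically with respect to concurrent writers; this sketch uses a single connection and ignores that concern.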

Status indicators on Rows

This technique can be seen either as an alternative or a complement to timestamps and versioning. It can serve as an alternative if, for example, a status column is set up on a table row indicating that the row has changed (e.g. a boolean column that, when set to true, indicates that the row has changed). Otherwise, it can act as a complement to the previous methods, indicating that a row, despite having a new version number or a later date, still should not be updated on the target (e.g. the data may require human validation).
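
Used as an alternative to timestamps and versions, a status flag could be handled as in the sketch below. The orders table, its changed column, and the policy of clearing the flag after capture are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- changed = 1 means the row has been modified since the last capture.
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        amount REAL,
        changed INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO orders (id, amount, changed)
    VALUES (1, 10.0, 0), (2, 20.0, 1), (3, 30.0, 1);
""")


def capture_changes(conn):
    """Return flagged rows and clear the flag on exactly those rows."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE changed = 1 ORDER BY id"
    ).fetchall()
    conn.executemany(
        "UPDATE orders SET changed = 0 WHERE id = ?",
        [(row[0],) for row in rows],
    )
    return rows


first = capture_changes(conn)   # rows 2 and 3 are flagged
second = capture_changes(conn)  # flags were cleared by the first capture
```

Clearing the flag only for the captured ids, rather than for the whole table, avoids losing rows that are flagged between the read and the reset.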

Time/Version/Status on Rows

This is a combination of the three previously discussed methods. As noted, it is not uncommon to see multiple CDC solutions at work in a single system; however, the combination of time, version, and status is particularly powerful, and the three should be used together where possible. The three elements are not redundant or superfluous. Using them together allows for such logic as, "Capture all data for version 2.1 that changed between June 1, 2005 12:00 a.m. and July 1, 2005 12:00 a.m. where the status code indicates it is ready for production."
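
The quoted capture rule can be expressed as a single query. The sketch below assumes a hypothetical item table, a 'READY' status code standing for "ready for production", dates read as month/day/year, and a half-open time window (start inclusive, end exclusive).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        id INTEGER PRIMARY KEY,
        version_number TEXT,
        last_update TEXT,   -- ISO 8601 text, so string order is time order
        status TEXT
    );
    INSERT INTO item VALUES
        (1, '2.1', '2005-06-15T10:00:00', 'READY'),
        (2, '2.1', '2005-06-20T10:00:00', 'PENDING'),
        (3, '2.0', '2005-06-18T10:00:00', 'READY'),
        (4, '2.1', '2005-07-02T10:00:00', 'READY');
""")

# Version, time window, and status are applied together in one predicate.
rows = conn.execute(
    """
    SELECT id FROM item
    WHERE version_number = ?
      AND last_update >= ? AND last_update < ?
      AND status = ?
    ORDER BY id
    """,
    ("2.1", "2005-06-01T00:00:00", "2005-07-01T00:00:00", "READY"),
).fetchall()

ready_ids = [row[0] for row in rows]
```

Each of the other three rows fails exactly one condition (status, version, or time window), which illustrates why the three elements are not redundant.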

Triggers on Tables

Trigger-based capture may include a publish/subscribe pattern to communicate the changed data to multiple targets.
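
One possible shape for trigger-based capture is sketched below, using SQLite triggers that copy each change into a separate change table. The account and account_changes tables and the trigger names are hypothetical, and this is only one of several ways triggers could be used for CDC.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);

    -- Change table populated by the triggers below (hypothetical design).
    CREATE TABLE account_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        account_id INTEGER,
        operation TEXT,
        new_balance REAL
    );

    CREATE TRIGGER account_insert AFTER INSERT ON account
    BEGIN
        INSERT INTO account_changes (account_id, operation, new_balance)
        VALUES (NEW.id, 'INSERT', NEW.balance);
    END;

    CREATE TRIGGER account_update AFTER UPDATE ON account
    BEGIN
        INSERT INTO account_changes (account_id, operation, new_balance)
        VALUES (NEW.id, 'UPDATE', NEW.balance);
    END;
""")

# Ordinary writes to the source table; the triggers record them.
conn.execute("INSERT INTO account VALUES (1, 100.0)")
conn.execute("UPDATE account SET balance = 75.0 WHERE id = 1")

changes = conn.execute(
    "SELECT account_id, operation, new_balance "
    "FROM account_changes ORDER BY change_id"
).fetchall()
```

Under this assumed design, several targets could each read account_changes and remember the last change_id they processed, which is one simple way to approximate a publish/subscribe arrangement.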

Log Scanners on Databases


Replication on Databases


Replication on Storage


Comparison to Target


Full Rebuild of Target


Confounding Factors

As often occurs in complex domains, the final solution to a CDC problem will be a balance of many competing concerns.

Sub-optimal Source Schemas


Tracking the Capture


Push versus Pull


The Politics of Pulling


See also

Slowly Changing Dimension

External links