Change data capture
From Wikipedia, the free encyclopedia
Change data capture (CDC) is a set of software design patterns used to determine the data that has changed in a database so that action can be taken using the changed data.
CDC solutions occur most often in data warehouse environments since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data repository system.
CDC Solutions
CDC solutions can be created in a number of ways and in any one or a combination of system layers from application logic down to physical storage.
In a simplified CDC context, one computer system has data that is thought to have changed from a previous point in time, and a second computer system needs to take action based on that changed data. The former is the source, the latter is the target. It is possible that the source and target are the same system physically, but that does not change the design patterns logically.
It is possible and not uncommon for multiple CDC solutions to exist in a single system.
Timestamps on Rows
Tables whose changes must be captured are given a column that records the time of the last change; names such as LAST_UPDATE are common. Any row in any table whose timestamp in that column is more recent than the last time data was captured is considered to have changed.
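The timestamp comparison can be sketched as a single query. The following is a minimal illustration using Python's built-in sqlite3 module, not a definitive implementation; the customer table, its columns, and the capture_changes helper are hypothetical names chosen for the example, and timestamps are assumed to be stored as ISO 8601 text so that string comparison matches chronological order.

```python
import sqlite3

def capture_changes(conn, last_capture):
    """Return rows whose last_update is more recent than the previous capture."""
    cur = conn.execute(
        "SELECT id, name, last_update FROM customer "
        "WHERE last_update > ? ORDER BY last_update",
        (last_capture,),
    )
    return cur.fetchall()

# Hypothetical source table with a LAST_UPDATE-style column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "Ada", "2005-06-01 09:00:00"),
        (2, "Grace", "2005-06-03 14:30:00"),
        (3, "Edsger", "2005-06-05 08:15:00"),
    ],
)

# Time of the previous capture (assumed to be tracked by the capturing process).
last_capture = "2005-06-02 00:00:00"
changed = capture_changes(conn, last_capture)
print(changed)
```

In this sketch the capturing process would store the time of each capture and pass it in as last_capture on the next run.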
Version Numbers on Rows
Tables whose changes must be captured are given a column that contains a version number; names such as VERSION_NUMBER are common. When data in a row changes, its version number is set to the current version. A supporting construct, such as a reference table that holds the current version, is needed. When a change capture occurs, all rows carrying the current version number are considered to have changed. When the change capture is complete, the reference table is updated with a new version number.
There are three or four major techniques for doing CDC with version numbers; the one described above is just one of them.
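The reference-table technique described above might look like the following sketch, again using Python's sqlite3 module. The capture_version and product tables and the helper functions are hypothetical names introduced only for illustration, and incrementing the version by one after each capture is an assumption; the text only requires that a new version number be recorded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE capture_version (current_version INTEGER NOT NULL);
INSERT INTO capture_version VALUES (1);
CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    price REAL,
    version_number INTEGER
);
""")

def current_version(conn):
    """Read the current version from the reference table."""
    return conn.execute(
        "SELECT current_version FROM capture_version"
    ).fetchone()[0]

def write_product(conn, product_id, price):
    """Insert or overwrite a row, stamping it with the current version."""
    conn.execute(
        "INSERT OR REPLACE INTO product (id, price, version_number) "
        "VALUES (?, ?, ?)",
        (product_id, price, current_version(conn)),
    )

def capture_changes(conn):
    """Return rows stamped with the current version, then advance the version."""
    version = current_version(conn)
    rows = conn.execute(
        "SELECT id, price FROM product WHERE version_number = ? ORDER BY id",
        (version,),
    ).fetchall()
    # Assumption: the new version is simply the old one plus one.
    conn.execute("UPDATE capture_version SET current_version = current_version + 1")
    return version, rows

write_product(conn, 1, 9.99)
write_product(conn, 2, 4.50)
first = capture_changes(conn)    # both rows carry version 1

write_product(conn, 2, 5.00)     # only row 2 changes after the first capture
second = capture_changes(conn)
print(first, second)
```

Rows untouched since the previous capture keep their old version number, so the second capture skips them.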
Status Indicators on Rows
Time/Version/Status on Rows
This approach combines the three preceding methods. As noted, it is not uncommon to see multiple CDC solutions at work in a single system; the combination of time, version, and status is particularly powerful, and the three should be used together where possible. The three elements are not redundant. Using them together allows for logic such as, "Capture all data for version 2.1 that changed between 6/1/2005 12:00 a.m. and 7/1/2005 12:00 a.m. where the status code indicates it is ready for production."
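The quoted capture rule translates fairly directly into one query. The sketch below is illustrative only: the item table, its column names, and the READY status code are hypothetical, and the time window is assumed to be half-open (start inclusive, end exclusive), which the original wording does not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item ("
    "id INTEGER PRIMARY KEY, version_number TEXT, last_update TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?, ?)",
    [
        (1, "2.1", "2005-06-15 10:00:00", "READY"),  # satisfies all three conditions
        (2, "2.1", "2005-05-20 10:00:00", "READY"),  # changed before the window
        (3, "2.0", "2005-06-15 10:00:00", "READY"),  # different version
        (4, "2.1", "2005-06-20 10:00:00", "DRAFT"),  # not ready for production
        (5, "2.1", "2005-07-01 00:00:00", "READY"),  # exactly at the window end
    ],
)

def capture(conn, version, start, end, status):
    """Select rows by version, time window [start, end), and status."""
    return conn.execute(
        "SELECT id FROM item "
        "WHERE version_number = ? "
        "AND last_update >= ? AND last_update < ? "
        "AND status = ? "
        "ORDER BY id",
        (version, start, end, status),
    ).fetchall()

captured = capture(
    conn, "2.1", "2005-06-01 00:00:00", "2005-07-01 00:00:00", "READY"
)
print(captured)
```

Each condition filters on a different axis, which is why none of the three columns can stand in for the others.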
Triggers on Tables
Discussion needed. A trigger-based solution may include a publish/subscribe pattern to communicate the changed data to multiple targets.
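As one possible illustration of this approach, the sketch below uses SQLite triggers (through Python's sqlite3 module) to copy every insert, update, and delete on a source table into a change table, which several targets then read independently. The account and account_changes tables, the trigger names, and the per-subscriber offsets are all assumptions made for the example; this is a simple polling stand-in for publish/subscribe, not a description of any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);

-- Hypothetical change table that the triggers publish into.
CREATE TABLE account_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    account_id INTEGER,
    operation TEXT,
    new_balance REAL
);

CREATE TRIGGER account_ai AFTER INSERT ON account
BEGIN
    INSERT INTO account_changes (account_id, operation, new_balance)
    VALUES (NEW.id, 'INSERT', NEW.balance);
END;

CREATE TRIGGER account_au AFTER UPDATE ON account
BEGIN
    INSERT INTO account_changes (account_id, operation, new_balance)
    VALUES (NEW.id, 'UPDATE', NEW.balance);
END;

CREATE TRIGGER account_ad AFTER DELETE ON account
BEGIN
    INSERT INTO account_changes (account_id, operation, new_balance)
    VALUES (OLD.id, 'DELETE', NULL);
END;
""")

def read_changes(conn, after_change_id):
    """Return changes a subscriber has not yet seen, oldest first."""
    return conn.execute(
        "SELECT change_id, account_id, operation, new_balance "
        "FROM account_changes WHERE change_id > ? ORDER BY change_id",
        (after_change_id,),
    ).fetchall()

conn.execute("INSERT INTO account VALUES (1, 100.0)")
conn.execute("UPDATE account SET balance = 75.0 WHERE id = 1")
conn.execute("DELETE FROM account WHERE id = 1")

# Each target tracks the last change it consumed (assumed bookkeeping).
offsets = {"warehouse": 0, "audit": 2}
deliveries = {name: read_changes(conn, offset) for name, offset in offsets.items()}
print(deliveries)
```

Because each target keeps its own offset, one slow consumer does not hold back the others.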
Log Scanners on Databases
Replication on Databases
Replication on Storage
Comparison to Target
Full Rebuild of Target
Confounding Factors
As often occurs in complex domains, the final solution to a CDC problem will be a balance of many competing concerns.
Sub-optimal Source Schemas
Tracking the Capture
Push versus Pull
The Politics of Pulling
Software Products Supporting CDC
- DataMirror Transformation Server - http://www.datamirror.com/products/tserver
- GoldenGate GoldenGate TDM - http://www.goldengate.com
- IBM DataPropagator
- Informatica PowerExchange - http://www.informatica.com/products/powerexchange/
- Sunopsis