Duplicate code
Duplicate code is a computer programming term for a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. Duplicate code is generally considered undesirable for a number of reasons.[1] A minimum requirement is usually applied to the quantity of code that must appear in a sequence for it to be considered duplicate rather than coincidentally similar. Sequences of duplicate code are sometimes known as code clones or just clones; the automated process of finding duplications in source code is called clone detection.
The following are some of the ways in which two code sequences can be duplicates of each other:
- character-for-character identical
- character-for-character identical with white space characters and comments being ignored
- token-for-token identical
- token-for-token identical with occasional variation (i.e., insertion/deletion/modification of tokens)
- functionally identical
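As an illustration of the second category above, the following is a minimal sketch of comparing two snippets while ignoring white space and comments. The helper names strip_layout and same_ignoring_layout are hypothetical and chosen for this example only. The sketch assumes C-style comments, does not understand string literals, and truncates inputs to a fixed buffer, so it is not a substitute for a real clone detector.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy src into out, dropping white space and C-style comments.
 * Hypothetical helper: it does not understand string literals, so a
 * "//" inside a string would be treated as the start of a comment. */
void strip_layout(const char *src, char *out, size_t cap)
{
    size_t n = 0;
    if (cap == 0) {
        return;
    }
    while (*src != '\0' && n + 1 < cap) {
        if (src[0] == '/' && src[1] == '/') {
            /* Line comment: skip to the end of the line. */
            while (*src != '\0' && *src != '\n') {
                src++;
            }
        } else if (src[0] == '/' && src[1] == '*') {
            /* Block comment: skip to the closing delimiter, if any. */
            src += 2;
            while (*src != '\0' && !(src[0] == '*' && src[1] == '/')) {
                src++;
            }
            if (*src != '\0') {
                src += 2;
            }
        } else if (isspace((unsigned char)*src)) {
            src++;
        } else {
            out[n++] = *src++;
        }
    }
    out[n] = '\0';
}

/* Returns 1 if a and b are identical once white space and comments are
 * ignored, 0 otherwise. Assumption: inputs fit in 255 characters after
 * stripping; longer inputs are silently truncated. */
int same_ignoring_layout(const char *a, const char *b)
{
    char na[256];
    char nb[256];
    strip_layout(a, na, sizeof na);
    strip_layout(b, nb, sizeof nb);
    return strcmp(na, nb) == 0;
}
```

Because all white space is removed, this sketch would also treat "int x" and "intx" as equal; token-based comparison, the third category in the list, avoids that problem.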
How duplicates are created
There are a number of reasons why duplicate code may be created, including:
- Copy and paste programming, or scrounging, in which a section of code is copied "because it works". In most cases this operation involves slight modifications in the cloned code such as renaming variables or inserting/deleting code.
- Functionality that is very similar to that in another part of a program is required, and a developer independently writes code that is very similar to what exists elsewhere. Studies suggest that such independently rewritten code is typically not syntactically similar.[2]
- Plagiarism, where code is simply copied without permission or attribution.
Problems associated with duplicate code
Code duplication is generally considered a mark of poor or lazy programming style. Good coding style is generally associated with code reuse. It may be slightly faster to develop by duplicating code, because the developer need not be concerned with how the code is already used or how it may be used in the future. The difficulty is that original development is only a small fraction of a product's life cycle, and with code duplication the maintenance costs are much higher. Some of the specific problems include:
- Code bulk affects comprehension: Code duplication frequently creates long, repeated sections of code that differ in only a few lines or characters. The length of such routines can make it difficult to quickly understand them. This is in contrast to the "best practice" of code decomposition.
- Purpose masking: The repetition of largely identical code sections can conceal how they differ from one another, and therefore, what the specific purpose of each code section is. Often, the only difference is in a parameter value. The best practice in such cases is a reusable subroutine.
- Update anomalies: Duplicate code contradicts a fundamental principle of database theory that applies here: Avoid redundancy. Non-observance incurs update anomalies, which increase maintenance costs, in that any modification to a redundant piece of code must be made for each duplicate separately. At best, coding and testing time are multiplied by the number of duplications. At worst, some locations may be missed, so that, for example, bugs thought to be fixed persist in duplicated locations for months or years (this is also known as bug propagation). The best practice here is a code library.
- File and binary size: Unless external lossless compression is applied, the source file will take up more space on the computer. Duplications in the source code will also be found in the compiled binary. For domains with restricted execution platforms (like embedded systems), such an increased binary size may be prohibitive.
Detecting duplicate code
A number of different algorithms have been proposed to detect duplicate code. For example:
- Baker's algorithm.[3]
- Rabin–Karp string search algorithm.
- Using Abstract Syntax Trees.[4]
- Visual clone detection.[5]
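To give a flavour of the hashing approach, the following is a minimal sketch of a Rabin–Karp style rolling hash used to find a repeated, non-overlapping window of k characters in a text. It is an illustration under stated assumptions, not the method of any particular tool. The function name find_duplicate_window and the constant RK_BASE are hypothetical. It works on raw characters rather than tokens, and it compares window hashes pairwise instead of using a hash table, so it is quadratic in the text length.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RK_BASE 257u  /* hypothetical hash base chosen for this sketch */

/* Search text for two non-overlapping windows of k characters with the
 * same content. On success, stores the two start offsets in *first and
 * *second and returns 1; returns 0 if no such pair exists (or if memory
 * allocation fails). Hashes wrap around modulo 2^64. */
int find_duplicate_window(const char *text, size_t k,
                          size_t *first, size_t *second)
{
    size_t n = strlen(text);
    if (k == 0 || n < 2 * k) {
        return 0;
    }

    size_t windows = n - k + 1;
    uint64_t *hash = malloc(windows * sizeof *hash);
    if (hash == NULL) {
        return 0;
    }

    /* high = RK_BASE^(k-1), the weight of the character leaving the window. */
    uint64_t high = 1;
    for (size_t t = 1; t < k; t++) {
        high *= RK_BASE;
    }

    /* Hash of the first window, then roll it one character at a time. */
    uint64_t h = 0;
    for (size_t t = 0; t < k; t++) {
        h = h * RK_BASE + (unsigned char)text[t];
    }
    hash[0] = h;
    for (size_t i = 1; i < windows; i++) {
        h = (h - (unsigned char)text[i - 1] * high) * RK_BASE
            + (unsigned char)text[i + k - 1];
        hash[i] = h;
    }

    /* Equal hashes are only candidates; memcmp confirms a real match. */
    int found = 0;
    for (size_t i = 0; i + 2 * k <= n && !found; i++) {
        for (size_t j = i + k; j < windows; j++) {
            if (hash[i] == hash[j] && memcmp(text + i, text + j, k) == 0) {
                *first = i;
                *second = j;
                found = 1;
                break;
            }
        }
    }

    free(hash);
    return found;
}
```

The window length k plays the role of the minimum quantity of code that must match before two sequences are reported as duplicates.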
Example of functionally duplicate code
Consider the following code snippet, which calculates the averages of two arrays of four integers each:
    extern int array1[];
    extern int array2[];

    int sum1 = 0;
    int sum2 = 0;
    int average1 = 0;
    int average2 = 0;

    for (int i = 0; i < 4; i++) {
        sum1 += array1[i];
    }
    average1 = sum1/4;

    for (int i = 0; i < 4; i++) {
        sum2 += array2[i];
    }
    average2 = sum2/4;
The two loops can be rewritten as the single function:
    int calcAverage (int* Array_of_4) {
        int sum = 0;
        for (int i = 0; i < 4; i++) {
            sum += Array_of_4[i];
        }
        return sum/4;
    }
Using the above function will give source code that has no loop duplication:
    extern int array1[];
    extern int array2[];

    int average1 = calcAverage(array1);
    int average2 = calcAverage(array2);
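The function above still fixes the array length at 4. As a possible further step, the length can be passed as a parameter so that the same routine serves arrays of any size. The following is a minimal sketch; the name calcAverageN, the count parameter and the handling of an empty array are assumptions made for this example and are not part of the original snippet.

```c
#include <stddef.h>

/* Hypothetical generalization of calcAverage: the element count is a
 * parameter instead of being fixed at 4. Integer division truncates,
 * as in the original example. */
int calcAverageN(const int *values, size_t count)
{
    if (count == 0) {
        return 0;  /* assumption: an empty array averages to 0 */
    }
    int sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return sum / (int)count;
}
```

With this variant, the two calls would read calcAverageN(array1, 4) and calcAverageN(array2, 4).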
See also
- Abstraction principle (programming)
- Code smell
- Don't repeat yourself
- List of tools for static code analysis
- Redundant code
- Rule of three (computer programming)
References
- [1] Spinellis, Diomidis. "The Bad Code Spotter's Guide". InformIT.com. Retrieved 2008-06-06.
- [2] Elmar Juergens, Florian Deissenboeck, Benjamin Hummel. Code similarities beyond copy & paste.
- [3] Brenda S. Baker. A Program for Identifying Duplicated Code. Computing Science and Statistics, 24:49–57, 1992.
- [4] Ira D. Baxter, et al. Clone Detection Using Abstract Syntax Trees.
- [5] Matthias Rieger, Stephane Ducasse. Visual Detection of Duplicated Code.
External links
- The University of Alabama at Birmingham: Code Clones Literature
- Finding duplicate code in C#, VB.Net, ASPX, Ruby, Python, Java, C, C++, ActionScript, or XAML