Canonicalization

From Wikipedia, the free encyclopedia

Not to be confused with Canonization.

Canonicalization (abbreviated c14n) is the process of converting data that has more than one possible representation into a "standard" canonical representation. This can be done to compare different representations for equivalence, to count the number of distinct data structures (e.g., in combinatorics), to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

[edit] Examples

As an example, Wikipedia uses canonicalization in its processing of links between articles (see Wikipedia:Canonicalization). The first letter in the article name is capitalized, leading and trailing spaces are removed, and embedded whitespace is replaced by underscores. For example,

[[Egg_salad]]
[[egg salad]]
[[  egg_salad  ]]

all refer to the same article.

Here is an example of why canonicalization is important in a security context (in this case, a web server):

The rule is "Only execute files under the cgi directory (C:\inetpub\wwwroot\cgi-bin)". The rule is enforced by checking that the path starts with "C:\inetpub\wwwroot\cgi-bin\", and if it does, the file is executed.

Should "C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe" be executed? No, because this trick path goes back up the directory hierarchy, not staying within cgi-bin. Accepting it at face value would be an error due to failure to canonicalize the filename to a unique (simplest) representation, namely: C:\Windows\System32\cmd.exe, before doing the path check. This type of fault is called a directory traversal vulnerability.

Another example might be to convert Unicode-encoded strings to the simplest form possible. Since Unicode allows for an infinite number of ways of representing the same character, you should always reduce the string to the simplest possible form before doing any comparisons.

A more commonly used example is in boolean logic. It is theoretically possible (though not always feasible) to reduce any logic function to its conjunctive normal form, disjunctive normal form, or algebraic normal form.

[edit] See also

In other languages