In computer science, canonicalization (sometimes called standardization or normalization, and abbreviated c14n, where 14 is the number of letters between the "c" and the "n") is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.
Canonicalization of filenames is important for computer security. For example, a web server may have a security rule stating "only execute files under the cgi directory (C:\inetpub\wwwroot\cgi-bin)". The rule is enforced by checking that the path starts with "C:\inetpub\wwwroot\cgi-bin\", and if it does, the file is executed.
Should the file "C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe" be executed? No, because this trick path climbs back up the directory hierarchy (through use of the '..' path specifier) and does not stay within cgi-bin. Accepting it at face value would be an error: the server has failed to canonicalize the filename to its unique (simplest) representation, namely "C:\Windows\System32\cmd.exe", before doing the path check. This type of fault is called a directory traversal vulnerability.
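The check described above can be sketched in Python as follows. This is a minimal illustration, not a hardened implementation: the names CGI_ROOT, canonicalize and is_under_cgi_root are hypothetical, and ntpath.normpath only collapses '.' and '..' lexically, so it does not resolve symbolic links or other file-system aliases.

```python
import ntpath  # Windows path rules; importable on any platform

# Directory named in the example rule above.
CGI_ROOT = r"C:\inetpub\wwwroot\cgi-bin"

def canonicalize(path: str) -> str:
    """Collapse '.' and '..' segments lexically (no file-system access)."""
    return ntpath.normpath(path)

def is_under_cgi_root(path: str) -> bool:
    """Compare canonical forms, ignoring case as Windows paths do."""
    canonical = ntpath.normcase(canonicalize(path))
    root = ntpath.normcase(canonicalize(CGI_ROOT))
    # The trailing separator stops "cgi-bin-evil" from matching "cgi-bin".
    return canonical.startswith(root + "\\")

trick = r"C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe"
print(canonicalize(trick))        # C:\Windows\System32\cmd.exe
print(is_under_cgi_root(trick))   # False
```

The important design choice is the order of operations: the path is canonicalized first, and only the canonical form is compared against the allowed prefix.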
Variable-length encodings in the Unicode standard, in particular UTF-8, can yield more than one byte sequence for the same character if overlong forms are accepted.[1] The standard permits only the shortest form, but a lenient decoder may accept the longer ones as well. This makes string validation more complicated, since every byte sequence the decoder accepts for each character must be considered. A software implementation which does not consider all of them runs the risk of accepting strings considered invalid in the application design, which could cause bugs or allow attacks. The solution is to allow a single encoding for each character. Canonicalization is then the process of translating every string character to its single allowed encoding. An alternative is for software to determine whether a string is canonicalized, and then reject it if it is not. In this case, in a client/server context, the canonicalization would be the responsibility of the client.
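As an illustration of the reject-if-not-canonical approach, the following Python sketch relies on the fact that Python's built-in strict UTF-8 decoder refuses overlong forms. The function names are hypothetical, and lenient_decode_two_bytes is a deliberately sloppy decoder written only to show how the two-byte sequence C0 AF could be read as "/" by software that skips the shortest-form check.

```python
def is_canonical_utf8(data: bytes) -> bool:
    """True if data is valid, shortest-form UTF-8."""
    try:
        data.decode("utf-8")  # strict mode rejects overlong sequences
    except UnicodeDecodeError:
        return False
    return True

def lenient_decode_two_bytes(b1: int, b2: int) -> str:
    """Hypothetical sloppy decoder: extracts payload bits with no validation."""
    return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))

print(is_canonical_utf8(b"/"))               # True
print(is_canonical_utf8(b"\xc0\xaf"))        # False
print(lenient_decode_two_bytes(0xC0, 0xAF))  # /
```

A filter that scans raw bytes for "/" would miss C0 AF, while the sloppy decoder later turns it into a slash. Rejecting non-canonical input up front closes that gap.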
In web search and search engine optimization (SEO), URL canonicalization deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines, specifically in determining which URL should be shown in search results.[2]
Example:
All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.
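A rough sketch of URL canonicalization in Python is shown here. The normalization rules (lower-case scheme and host, drop the default port, drop the fragment, use "/" for an empty path) are assumptions chosen for illustration, as are the example.com URLs and the function name; each search engine applies its own rules, and this sketch ignores user info and IPv6 host literals.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports for the two schemes handled here.
DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize_url(url: str) -> str:
    """Map equivalent spellings of a URL to one form (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

variants = [
    "HTTP://Example.COM:80",
    "http://example.com/",
    "http://example.com/#top",
]
print({canonicalize_url(u) for u in variants})  # {'http://example.com/'}
```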
A Canonical XML document is by definition an XML document that is in XML canonical form, as defined by the Canonical XML specification. Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.
Simple example: Given two versions of the same XML:
Note the extra spaces in the samples. The canonicalized version of these two might be:
Note that the spaces are removed; this is one thing a canonicalizer does. A real canonicalizer may make other changes as well.
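The effect can be tried with the canonicalize function in Python's standard xml.etree.ElementTree module (available since Python 3.8), which implements Canonical XML 2.0. The two input strings below are made-up samples, not the ones from the original example; they differ only in spacing inside the tag, attribute order, quote style, and the use of an empty-element tag.

```python
from xml.etree.ElementTree import canonicalize

# Two hypothetical spellings of the same XML document.
version_a = """<data><item b="2"   a='1' /></data>"""
version_b = """<data><item a="1" b="2"></item></data>"""

canonical_a = canonicalize(version_a)
canonical_b = canonicalize(version_b)

print(canonical_a)                 # <data><item a="1" b="2"></item></data>
print(canonical_a == canonical_b)  # True
```

Once both documents are in canonical form, a plain string comparison is enough to decide that they are equivalent.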
A full summary of canonicalization changes is listed below:
Fixup of xml:base attributes is performed