Near Duplicate Algorithms

From Wikipedia, the free encyclopedia


A near duplicate algorithm is one that detects documents that are almost identical.


Overview

Detection of near duplicate documents is an important problem in many data mining and information filtering applications. The definition of what constitutes a duplicate, however, is not clear-cut. A duplicate can be defined as a document containing exactly the same terms in the same sequence, with or without formatting differences; in that case the documents either do not differ at all or differ only in formatting, and their content is exactly the same. This definition of duplicate is quite restrictive, but it is the most common approach used by content systems.[1][2] Duplicate detection is then achieved by calculating a hash value for each document and checking each document for duplication by looking up that value in an in-memory hash table or a persistent lookup system. Common hash functions used for this purpose include MD2, MD5, and SHA; they are suitable because they can be calculated over data of arbitrary length, are easy to compute, and have a very low probability of collision. They are, however, designed so that slightly different inputs produce completely different hash values, a property known as the avalanche effect, which is exactly what prevents this exact-hashing approach from detecting near duplicates.
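
This exact-duplicate scheme can be sketched in a few lines of Python. The normalization step below (lowercasing and collapsing whitespace) is only one illustrative way of ignoring formatting differences, not a canonicalization mandated by the cited systems.

  import hashlib
  import re

  def exact_signature(text: str) -> str:
      """Signature for exact-duplicate detection: normalize away formatting, then hash."""
      normalized = re.sub(r"\s+", " ", text.lower()).strip()
      return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

  def find_exact_duplicates(docs: dict) -> dict:
      """Group document ids by signature; any group with more than one id is a set of duplicates."""
      groups = {}
      for doc_id, text in docs.items():
          groups.setdefault(exact_signature(text), []).append(doc_id)
      return {sig: ids for sig, ids in groups.items() if len(ids) > 1}

  # The avalanche effect: a one-character difference yields a completely different signature.
  print(exact_signature("the quick brown fox") == exact_signature("the quick brown fox."))  # False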

While this approach is both fast and easy to implement, it is very brittle: any change in formatting (where formatting is preserved), any change in word order, or any slight variation in content produces a different document signature (hash). For example, a web page that displays the number of visitors alongside the content of the page will continually produce different signatures even though the document is otherwise the same. A news article with a typo in which the word "the" is repeated will produce a different signature once the typo is fixed, even though the content would be considered duplicated.[3] It is this similar, or near duplicate, document problem that the rest of this article focuses on.

Efficiency

Next we examine the efficiency issues involved in calculating the similarity of a document to a collection. The work can be partitioned into two categories: shingling techniques and similarity-measure calculations. Shingling techniques, such as COPS[4], KOALA[5], and DSC[6], take sets of contiguous terms, or shingles, from documents and compare the number of matching shingles, among other features.[7] Comparing these document subsets allows the algorithms to calculate a percentage of overlap between two documents using resemblance, defined as the number of shingles shared by two documents divided by the total number of distinct shingles in both. This type of approach relies on hash values for each document subsection or feature set and filters those hash values to reduce the number of comparisons the algorithm must perform; this filtration of features improves the runtime performance. Note that the simplest filter is a strictly syntactic one based on simple syntactic rules, and the trivial subset is the entire collection; we illustrate later why such a naive approach is not generally acceptable. In the shingling approaches, subdocuments rather than whole documents are compared, and thus each document may produce many potential duplicates. Returning many potential matches requires substantial user involvement to sort out the true duplicates, diluting the usefulness of these approaches.
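
A minimal sketch of shingling and resemblance follows; the shingle width of four terms is an arbitrary illustrative choice, and production systems compare hashed shingles rather than the shingles themselves.

  def shingles(text: str, w: int = 4) -> set:
      """Return the set of w-term contiguous shingles of a document."""
      terms = text.lower().split()
      return {tuple(terms[i:i + w]) for i in range(max(len(terms) - w + 1, 1))}

  def resemblance(a: str, b: str, w: int = 4) -> float:
      """Resemblance: shared shingles divided by the total number of distinct shingles in both documents."""
      sa, sb = shingles(a, w), shingles(b, w)
      union = sa | sb
      return len(sa & sb) / len(union) if union else 1.0

  # A one-word difference: the two sentences share 2 of their 10 distinct shingles (resemblance 0.2).
  print(resemblance("the quick brown fox jumps over the lazy dog",
                    "the quick brown fox leaps over the lazy dog"))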

To combat the inherent efficiency issues, several optimization techniques were proposed to reduce the number of comparisons made. One approach is to retain only a portion of the shingles: either frequently occurring shingles are removed[5] or only every 25th shingle is retained.[8] This reduction, however, hinders accuracy. Since no semantic premise is used to reduce the volume of data, a random degree of "fuzziness" is introduced into the matching process, and relatively dissimilar documents can be identified as potential duplicates. Even with the performance-improving techniques of removing shingles occurring in over 1,000 documents and keeping only every 25th shingle, the implementation of the DSC algorithm took 10 CPU days to process 30 million documents.[8]
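
One common way to realize the "every 25th shingle" heuristic is to keep only shingles whose hash value is 0 modulo 25, so that the same shingles are sampled from every document; the sketch below assumes that reading.

  import hashlib

  def sampled_shingle_hashes(text: str, w: int = 4, modulus: int = 25) -> set:
      """Keep only shingles whose hash value is 0 mod `modulus` -- roughly one shingle in 25."""
      terms = text.lower().split()
      kept = set()
      for i in range(max(len(terms) - w + 1, 1)):
          shingle = " ".join(terms[i:i + w])
          value = int.from_bytes(hashlib.sha1(shingle.encode("utf-8")).digest()[:8], "big")
          if value % modulus == 0:  # deterministic sampling: a given shingle is kept or dropped in every document alike
              kept.add(value)
      return kept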

The DSC algorithm has a more efficient alternative, DSC-SS, which uses super shingles. This algorithm combines several shingles into a single super shingle, so that a document is represented by a few super shingles rather than many shingles. Instead of measuring resemblance as a ratio of matching shingles, resemblance is defined as two documents sharing a single super shingle. This is much more efficient because it no longer requires a full count of all overlaps. The authors noted, however, that DSC-SS does "not work well for short documents", so no runtime results were reported.[8]
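
A rough sketch of the super-shingle idea follows: runs of sorted shingle hashes are concatenated and re-hashed, and two documents are declared near duplicates if they share any resulting super shingle. The group size of six and the grouping of sorted hashes are illustrative simplifications, not the published DSC-SS construction.

  import hashlib

  def super_shingles(shingle_hashes: set, group_size: int = 6) -> set:
      """Combine non-overlapping runs of `group_size` sorted shingle hashes into super shingles."""
      ordered = sorted(shingle_hashes)
      if not ordered:
          return set()
      supers = set()
      for i in range(0, len(ordered), group_size):
          block = ",".join(str(h) for h in ordered[i:i + group_size])
          supers.add(hashlib.sha1(block.encode("utf-8")).hexdigest())
      return supers

  def dsc_ss_match(a: set, b: set, group_size: int = 6) -> bool:
      """Two documents are considered near duplicates if they share at least one super shingle."""
      return bool(super_shingles(a, group_size) & super_shingles(b, group_size))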

Approaches that compute document-to-document similarity measures[9][10][11] are similar to document-clustering work[12] in that they use similarity computations to group potentially duplicate documents. All pairs of documents are compared, i.e., each document is compared to every other document and a similarity weight is calculated. A document-to-document similarity comparison is thus computationally prohibitive, given its theoretical O(d²) runtime, where d is the number of documents. In practice, these approaches only evaluate documents that share terms, so the actual runtime is data-dependent and difficult to predict accurately.

To further examine this class of duplicate or clustering approaches, we examine the mechanics of their term selection and weighting to cluster or compare documents for similarity. Typically, most of these document processing systems use an inverted index to efficiently retrieve documents containing a given index term. Various techniques exist that select which terms are retrieved from the inverted index, how these terms are analyzed and manipulated, and how a similarity measure is computed.[13][14][15] Independent of the particular techniques chosen, the final computed weight is then used to sort the documents retrieved.

The basic hypothesis of similarity-measure approaches to duplicate detection is that documents similar to a given document receive a similarity score comparable to that of the original document. The posting list of each term is retrieved from the inverted index: for a given document, each of its terms is used to search the collection, a final weight is computed for every document containing a matching term, and the highest-scoring document is the most similar. This approach of using each document as a query, and then clustering on the result sets, is computationally infeasible for large or dynamic collections, since each document must be queried against the entire collection. Sanderson and Cooper used a variation in which the query terms are selected with the Rocchio[16] relevance feedback algorithm[10][17], using IDF (inverse document frequency) to weight which terms of the original query/document are used. Each term queried against the inverted index must retrieve its full posting list for analysis; for large collections, where a common term may occur in millions of records, this is computationally expensive. With term-selection approaches the cost is significantly lower, but it still requires at least as many I/O operations as the number of terms selected by the relevance feedback algorithm.
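
A simplified sketch of the document-as-query idea: each term of the document retrieves its posting list from an inverted index and contributes an idf weight to every matching document's score. Term-frequency weighting and length normalization, which real similarity measures include, are omitted here for brevity.

  import math
  from collections import Counter, defaultdict

  def build_index(docs: dict) -> dict:
      """Inverted index: term -> {doc_id: term frequency}."""
      index = defaultdict(dict)
      for doc_id, text in docs.items():
          for term, tf in Counter(text.lower().split()).items():
              index[term][doc_id] = tf
      return index

  def most_similar(doc_id: str, docs: dict, index: dict) -> list:
      """Use the whole document as a query; accumulate idf-weighted scores over posting lists."""
      n_docs = len(docs)
      scores = defaultdict(float)
      for term in set(docs[doc_id].lower().split()):
          postings = index[term]
          idf = math.log(n_docs / len(postings))
          for other_id in postings:
              if other_id != doc_id:
                  scores[other_id] += idf  # simplified: idf-only accumulation
      return sorted(scores.items(), key=lambda kv: -kv[1])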

Kolcz proposed an optimization to the cosine measure in which terms that occur in more than 5% of the collection are assumed to occur in every document. This optimization reportedly produced results within 90% of the full cosine similarity while executing an order of magnitude faster on a 10-gigabyte collection.[18]
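
The effect of such a pruning step can be sketched by skipping the posting lists of very common terms in the simplified scorer above (reusing build_index and the same scoring loop); the 5% cutoff follows the reported work, while everything else remains the illustrative, non-cosine scoring shown earlier.

  import math
  from collections import defaultdict

  def most_similar_pruned(doc_id: str, docs: dict, index: dict, df_cutoff: float = 0.05) -> list:
      """Same scoring as most_similar() above, but terms occurring in more than df_cutoff
      of the collection are skipped entirely (treated as occurring in every document)."""
      n_docs = len(docs)
      scores = defaultdict(float)
      for term in set(docs[doc_id].lower().split()):
          postings = index[term]
          if len(postings) > df_cutoff * n_docs:
              continue  # very common term: its long posting list is never read
          idf = math.log(n_docs / len(postings))
          for other_id in postings:
              if other_id != doc_id:
                  scores[other_id] += idf
      return sorted(scores.items(), key=lambda kv: -kv[1])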

Fuzzy Hashing

All of the prior fuzzy approaches assume that documents are compared to one another via a computed similarity value; when that value exceeds some threshold, the documents are considered similar or duplicated. The shingling and similarity approaches violated the notion of a true difference metric once efficiency became a factor: resemblance and the other measures of similarity were no longer exact measurements but approximations of similarity.

Those efficiency motivations pushed a more recent trend in duplicate data detection: eliminating I/O costs while still finding documents with fuzzy variations as efficiently as possible. To meet these efficiency requirements, more recent de-duplication efforts have re-examined the original approach of hashing whole documents, striving to produce a single document representation that can be used to find similar documents. Gionis et al. provided a means for similarity search in high dimensions by hashing items to values whose comparison can be used to find similar items. While this approach is suitable for some DBMS problems, the very high dimensionality of text problems still lacked a viable solution.[19]

To find a solution for text similarity, I-Match was proposed to address many of these problems.[2] I-Match does not rely on strict parsing but instead uses collection statistics to identify which terms should be used as the basis for comparison. An inverse document frequency (idf) weight is determined for each term in the collection: for a term t, idf(t) = log(N/n), where N is the number of documents in the collection and n is the number of documents containing t. The use of idf collection statistics makes it possible to judge the usefulness of terms for duplicate document detection; it has previously been shown that terms with high collection frequencies often add little to the semantic content of a document.[20][21] I-Match hinges on the premise that removing very infrequent terms and very common terms results in good document representations for identifying duplicates. Pseudo-code for the algorithm is as follows; a short Python sketch of these steps appears after the list.

  1. Get document.
  2. Parse document into a token stream, removing format tags.
  3. Using term thresholds (idf), retain only significant tokens.
  4. Insert the retained tokens into a tree of unique tokens, ordered ascending by Unicode value.
  5. Loop through token tree and add each unique token to the SHA1[22] digest. Upon completion of token tree loop, a (doc_id, SHA1 Digest) tuple is defined.
  6. The tuple (doc_id, SHA1 Digest) is inserted into the storage data structure based on SHA1 Digest key.
  7. If there is a collision of digest values then the documents are similar.
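
A compact sketch of the pseudo-code above. The tokenizer and the idf band (1.5 to 10.0) are illustrative assumptions rather than the published parameters, and the idf table is assumed to be available already (the next paragraphs discuss how it can be obtained).

  import hashlib
  import re

  def imatch_signature(text: str, idf: dict, low: float = 1.5, high: float = 10.0) -> str:
      """Steps 2-5: tokenize, keep terms whose idf lies in [low, high], sort the unique
      retained tokens, and feed them to a SHA-1 digest."""
      tokens = re.findall(r"[a-z0-9]+", text.lower())                # step 2: token stream, markup dropped
      kept = {t for t in tokens if low <= idf.get(t, 0.0) <= high}   # step 3: significant tokens only
      digest = hashlib.sha1()
      for token in sorted(kept):                                     # step 4: unique tokens in ascending order
          digest.update(token.encode("utf-8"))                       # step 5: add each token to the digest
      return digest.hexdigest()

  def imatch_clusters(docs: dict, idf: dict) -> dict:
      """Steps 6-7: documents whose signatures collide are reported as near duplicates."""
      clusters = {}
      for doc_id, text in docs.items():
          clusters.setdefault(imatch_signature(text, idf), []).append(doc_id)
      return {sig: ids for sig, ids in clusters.items() if len(ids) > 1}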

The overall runtime of the I-Match approach is O(d log d) in the worst case, where all documents are duplicates of each other, and O(d) otherwise, where d is the number of documents in the collection. All similar documents must be grouped together; that is, the corresponding document identifiers must be stored as a group. In the most extreme case, all of the hash values are the same (all of the documents are similar to each other), and storing all of the document identifiers together in a data structure (a tree) requires O(d log d) time. Typically, however, the I-Match approach runs in O(d) time.

The calculation of idf values can be approached in either of two ways. The first is to use a training collection to produce a set of (term, idf) tuples before the de-duplication work occurs; since term idf weights change only slightly as collection sizes grow, this is an acceptable solution.[23] The second is to run two passes over the documents, where the first pass calculates the idf weights of the terms and the second pass finds duplicates with the I-Match algorithm. This approach increases the actual running time of the algorithm, but the theoretical complexity remains unchanged. Conrad examined these approaches on a dynamic collection, using only high-idf terms, and found the approach unstable when a dynamic vocabulary is used[24], which lends further evidence that the first approach may be the more applicable one for this problem.
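
The second, two-pass option can be sketched as a first pass that derives the idf table, after which the I-Match sketch above is applied unchanged; the tokenizer mirrors the one used there.

  import math
  import re
  from collections import Counter

  def compute_idf(docs: dict) -> dict:
      """First pass: document frequencies, then idf(t) = log(N / n_t)."""
      df = Counter()
      for text in docs.values():
          df.update(set(re.findall(r"[a-z0-9]+", text.lower())))
      n_docs = len(docs)
      return {term: math.log(n_docs / count) for term, count in df.items()}

  # Second pass: reuse the idf table with the I-Match sketch above.
  # idf = compute_idf(docs)
  # duplicates = imatch_clusters(docs, idf)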

The I-Match time complexity is comparable to that of the DSC-SS algorithm, which generates a single super shingle when the super shingle size is large enough to encompass the whole document and k super shingles otherwise, whereas I-Match always generates exactly one signature. Since k is a constant in the DSC-SS time complexity, the two algorithms are theoretically equivalent; in practice, however, I-Match is more efficient.

The real benefit of I-Match over DSC-SS, however, is not the timing improvement but the fact that small documents are not ignored. With DSC-SS, it is quite likely that for sufficiently small documents no shingles are identified for duplicate detection, so those short documents are not considered even though they may be duplicated. Given the wide variety of domains in which duplicate document detection may be used, e.g., document declassification[25] and email spam processing[26], neglecting short documents is a potentially serious issue.

While I-Match is very fast, it suffers from the same brittleness as the original hashing techniques: a slight variation in the retained features changes the hash. One recent enhancement to I-Match is the use of random variations of the lexicon within the feature (idf) range. These variations are used to produce multiple signatures for a document, each of which is considered a valid signature, and this modification reduces the brittleness of I-Match.[18] Kolcz showed that, using 10 random lexicons, this randomization approach improved the recall effectiveness of I-Match by 40-60% without hurting precision.
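
The lexicon-randomization idea can be sketched by perturbing the idf table and recomputing the signature several times, reusing imatch_signature from the sketch above. The perturbation here, randomly dropping a fixed 20% of the eligible lexicon, is an illustrative assumption; only the use of 10 random lexicons follows the reported setup.

  import random

  def randomized_signatures(text: str, idf: dict, k: int = 10, drop_fraction: float = 0.2,
                            seed: int = 0) -> set:
      """Original I-Match signature plus k signatures computed over randomly perturbed lexicons.
      The same seed must be shared across documents so the k lexicon variants line up."""
      rng = random.Random(seed)
      lexicon = sorted(t for t, w in idf.items() if 1.5 <= w <= 10.0)  # eligible terms (same idf band as above)
      signatures = {imatch_signature(text, idf)}
      for _ in range(k):
          dropped = set(rng.sample(lexicon, int(drop_fraction * len(lexicon))))
          perturbed = {t: w for t, w in idf.items() if t not in dropped}
          signatures.add(imatch_signature(text, perturbed))
      return signatures

  def randomized_match(sigs_a: set, sigs_b: set) -> bool:
      """Two documents match if any of their signatures collide."""
      return bool(sigs_a & sigs_b)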

Plagiarism and Clustering

Duplicate data detection can be considered a subset of problems such as plagiarism detection or clustering. Plagiarism detection, however, is closer in nature to information retrieval problems, in that the similarity of one document to another is based not on the entire document but on sub-document regions, where the sub-document representation can be considered a query against a collection of documents. Researchers have tried many approaches to plagiarism detection, from suffix trees[27][28][29][30] and shingle representations[4][31][9][32][33][34][35][36][37] to vector-space cosine approaches[11][38][39][35]. Those studies are beyond the scope of this article.

While there is a great breadth of work on document clustering, the focus here has been on duplicate detection approaches for finding similar clusters. Several researchers have examined using clustering approaches to find duplicate documents[40][41]; other text-clustering approaches use techniques such as nearest neighbor and a myriad of other methods beyond the scope of this article.

Finally, research on image duplicate detection is addressed in [42]. While these are also approaches to finding duplicate data, the techniques and issues are those of image processing rather than document processing and thus are not examined here.


References

  1. ^ Cho, J., N. Shivakumar, et al. (1999). Finding replicated Web collections. ACM SIGMOD.
  2. ^ a b Chowdhury, A., O.Frieder, et al. (2002). "Collection statistics for fast duplicate document detection." ACM Transaction on Information Systems 20(2): 171-191.
  3. ^ (2004). Reuters Services.
  4. ^ a b Brin, S., J. Davis, et al. (1995). Copy Detection Mechanisms for Digital Documents. Proceeding of SIGMOD '95.
  5. ^ a b Heintze, N. (1996). Scalable document fingerprinting. In Proc. USENIX Work-shop on Electronic Commerce.
  6. ^ Broder, A. (1998). On the resemblance and containment of documents. In Compression and Complexity of Sequences (SEQUENCES '97).
  7. ^ Finkel, R., A. Zaslavsky, et al. (2001). Signature extraction for overlap detection in documents. Twenty-Fifth Australasian Computer science conference.
  8. ^ a b c Broder, A., S. Glassman, et al. (1997). Syntactic clustering of the Web. In Proceedings of the 6th International Web Conference.
  9. ^ a b Hoad, T. and J. Zobel (2002). "Methods for identifying versioned and plagiarism documents." Journal of the American Society for Information Science and Technology.
  10. ^ a b Sanderson, M. (1997). "Duplicate detection in the Reuters collection." Technical Report (TR-1997-5) of the Department of Computing Science at the University of Glasgow, Glasgow, UK G12 8QQ.
  11. ^ a b Buckley, C., C. Cardie, et al. (2000). The Smart/Empire TIPSTER IR System. TIPSTER Phase III Proceedings, Morgan Kaufmann.
  12. ^ Salton, G., C. S. Yang, et al. (1975). A Vector-Space Model for Information Retrieval. Comm. of the ACM.
  13. ^ Kwok, K. L. (1996). A new method of weighting query terms for ad-hoc retrieval. Proc. 19th Annual Intl. ACM SIGIR Conf. on R&D in IR.
  14. ^ Singhal, A., C. Buckley, et al. (1996). Pivoted Document Length Normalization. Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
  15. ^ Robertson, S., S. Walker, et al. (1998). Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive. Proceedings of the 7th Text Retrieval Conference.
  16. ^ Rocchio, J. (1971). Relevance Feedback in Information Retrieval. In The Smart System - experiments in automatic document processing. Englewood Cliffs, NJ, Prentice Hall.
  17. ^ Cooper, J., A. Coden, et al. (2002). A novel method for detecting similar documents. Proceedings of the 35th Hawaii International Conference on System Sciences.
  18. ^ a b Kolcz, A., A. Chowdhury, et al. (2004). Improved stability of I-Match signatures via lexicon randomization, AOL.
  19. ^ Gionis, A., P. Indyk, et al. (1999). Similarity Search in High Dimensions via Hashing. Proceedings of the 25th International Conference on Very Large Databases (VLDB).
  20. ^ Grossman, D., D. Holmes, et al. (1995). A DBMS Approach to IR in TREC-4. Text REtrieval Conference (TREC-4).
  21. ^ Smeaton, A., F. Kelledy, et al. (1997). Ad Hoc Retrieval Using Thresholds, WSTs for French Monolingual Retrieval, Document-at-a-Glance for High Precision and Triphone Windows for Spoken Documents. Proceedings of the Sixth Text Retrieval Conference (TREC-6).
  22. ^ (1995). Secure Hash Standard, U.S. Department of Commerce/National Institute of Standards and Technology. FIPS PUB 180-1.
  23. ^ Frieder, O., A. Chowdhury, et al. (2000). "Efficiency Considerations for Scalable Information Retrieval Servers." Journal of Digital information.
  24. ^ Conrad, J., X. Guo, et al. (2003). Online duplicate document detection: signature reliability in a dynamic retrieval environment. Twelfth International Conference on Information and Knowledge Management CIKM 2003.
  25. ^ Scotti, R. and C. Lilly (1999). George Washington University Declassification Productivity Research Center, George Washington University.
  26. ^ Kolcz, A., A. Chowdhury, et al. (2003). On the Effects of Data Duplication on Classifier Accuracy. The Twentieth International Conference on Machine Learning (ICML-2003) Workshop on Learning from Imbalanced Data Sets II.
  27. ^ Monostori, K., A. Zaslavsky, et al. (1999). Parallel overlap and similarity detection in semi-structured document collections. In Proceedings of the 6th Annual Australasian Conference on Parallel And Real-Time Systems (PART-99).
  28. ^ Monostori, K., A. Zaslavsky, et al. (2000). Document overlap detection system for distributed digital libraries. In Proceedings of the ACM Digital Libraries 2000 (DL2000).
  29. ^ Monostori, K., A. Zaslavsky, et al. (2000). Match Detect Reveal: Finding overlapping and similar digital documents. Proceedings of the Information Resources Management Association International Conference (IRMA2000).
  30. ^ Monostori, K., A. Zaslavsky, et al. (2000). Parallel and distributed overlap detection on the Web. In Proceedings of the Workshop on Applied Parallel Computing (PARA2000).
  31. ^ Garcia-Molina, H., L. Gravano, et al. (1996). dSCAM: Finding document copies across multiple databases. Proceedings of the 4th International Conference on Parallel and Distributed Systems (PDIS 96).
  32. ^ Shivakumar, N. and H. Garcia-Molina (1996). Building a scalable and accurate copy detection mechanism. Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries.
  33. ^ Shivakumar, N. and H. Garcia-Molina (1998). Finding near-replicas of documents on the web. Proceedings of the Workshop on Web Databases (WebDB 98).
  34. ^ Manber, U. (1994). Finding similar files in a large file system. In 1994 Winter USENIX Technical Conference.
  35. ^ a b Si, A., H. Leong, et al. (1997). CHECK: a document plagiarism detection system. In Proceedings of ACM Symposium for Applied Computing.
  36. ^ Ottenstein, K. J. (1976). An algorithm approach to the detection and prevention of plagiarism, CSD-TR 200, Purdue University.
  37. ^ Schleimer, S., D. S. Wilkerson, et al. (2003). Winnowing: local algorithms for document fingerprinting. ACM SIGMOD.
  38. ^ Ilyinsky, S., M. Kuzmin, et al. (2002). An efficient method to detect duplicates of web documents with the use of inverted index. Proceedings of the Eleventh International World Wide Web Conference.
  39. ^ Wise, M. and J. Yap (1996). Improved detection of similarities in computer programs and other texts. In: Proceedings of the SIGCSE 96.
  40. ^ Haveliwala, T., A. Gionis, et al. (2000). Scalable techniques for clustering the web. Proceedings of WebDB 2000.
  41. ^ Fetterly, D., M. Manasse, et al. (2003). On the evolution of clusters of near-duplicate web pages. Proceedings of the 1st Latin American Web Congress.
  42. ^ Kjell, B., W. A. Woods, et al. (1994). "Discrimination of Authorship Using Visualization." Information Processing and Management 30(1): pp. 141-150.