An introduction to duplicate detection

Last edited by ImportBot
July 29, 2023 | History

As data volumes grow, data quality problems abound. One of the most intriguing is the presence of duplicates: multiple, differing representations of the same real-world object. Their effects are detrimental: bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, and catalogs are mailed several times to the same household. Detecting duplicates automatically is difficult for two reasons. First, duplicate representations are usually not identical but differ slightly in their values. Second, in principle every pair of records should be compared, which is infeasible for large volumes of data. This lecture closely examines the two main components for overcoming these difficulties: (i) Similarity measures are used to automatically identify duplicates when comparing two records. Well-chosen similarity measures improve the effectiveness of duplicate detection. (ii) Algorithms are developed to search very large volumes of data for duplicates. Well-designed algorithms improve the efficiency of duplicate detection. Finally, we discuss methods to evaluate the success of duplicate detection.
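
The two components named in the description can be illustrated with a minimal Python sketch. It is not taken from the book: the function names, the sample records, the threshold, and the choice of the first token as a blocking key are all assumptions made for illustration. It pairs a token-based Jaccard coefficient (one of the similarity functions listed in the table of contents) with blocking (one of the listed pairwise comparison algorithms) to avoid comparing every pair of records.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over lowercase, whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)


def naive_pairs(records):
    """Every unordered pair of record indices: n * (n - 1) / 2 comparisons."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key):
    """Only pairs of record indices whose records share a blocking key."""
    blocks = {}
    for i, record in enumerate(records):
        blocks.setdefault(key(record), []).append(i)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


def find_duplicates(records, key, threshold=0.6):
    """Candidate pairs from blocking whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in blocked_pairs(records, key)
        if jaccard(records[i], records[j]) >= threshold
    ]


def first_token(record: str) -> str:
    """Hypothetical blocking key: the first token, lowercased."""
    return record.lower().split()[0]


# Made-up sample data, for illustration only.
records = [
    "John A. Smith 12 Main Street",
    "John Smith 12 Main Street",
    "Jane Doe 7 Oak Avenue",
    "Jane Doe 7 Oak Ave",
]

print(len(naive_pairs(records)))              # 6 comparisons without blocking
print(blocked_pairs(records, first_token))    # [(0, 1), (2, 3)]
print(find_duplicates(records, first_token))  # [(0, 1), (2, 3)]
```

Blocking trades completeness for speed: with this key, a duplicate whose first token is misspelled would fall into a different block and never be compared. That is why the choice of key matters for effectiveness as well as efficiency.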

Publish Date
Language
English
Pages
77


Edition Availability
An Introduction to Duplicate Detection
2010, Springer Nature
in English
An introduction to duplicate detection
2010, Morgan & Claypool Publishers
electronic resource, in English


Book Details


Table of Contents

1. Data cleansing: introduction and motivation
  Data quality
    Data quality dimensions
    Data cleansing
  Causes for duplicates
    Intra-source duplicates
    Inter-source duplicates
  Use cases for duplicate detection
    Customer relationship management
    Scientific databases
    Data spaces and linked open data
  Lecture overview
2. Problem definition
  Formal definition
  Complexity analysis
  Data in complex relationships
    Data model
    Challenges of data with complex relationships
3. Similarity functions
  Token-based similarity
    Jaccard coefficient
    Cosine similarity using token frequency and inverse document frequency
    Similarity based on tokenization using q-grams
  Edit-based similarity
    Edit distance measures
    Jaro and Jaro-Winkler distance
  Hybrid functions
    Extended Jaccard similarity
    Monge-Elkan measure
    Soft TF/IDF
  Measures for data with complex relationships
  Other similarity measures
  Rule-based record comparison
    Equational theory
    Duplicate profiles
4. Duplicate detection algorithms
  Pairwise comparison algorithms
    Blocking
    Sorted-neighborhood
    Comparison
  Algorithms for data with complex relationships
    Hierarchical relationships
    Relationships forming a graph
  Clustering algorithms
    Clustering based on the duplicate pair graph
    Clustering adjusting to data & cluster characteristics
5. Evaluating detection success
  Precision and recall
  Data sets
    Real-world data sets
    Synthetic data sets
  Towards a duplicate detection benchmark
6. Conclusion and outlook
Bibliography
Authors' biographies

Edition Notes

Part of: Synthesis digital library of engineering and computer science.

Title from PDF t.p. (viewed on April 7, 2010).

Series from website.

Includes bibliographical references (p. 71-76).

Abstract freely available; full-text restricted to subscribers or individual document purchasers.

Also available in print.

Mode of access: World Wide Web.

System requirements: Adobe Acrobat Reader.

Published in
San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA)
Series
Synthesis lectures on data management, #3
Other Titles
Synthesis digital library of engineering and computer science.

Classifications

Dewey Decimal Class
005.7565
Library of Congress
QA76.9.D3 N285 2010, QA76.9.D3 N38 2010

The Physical Object

Format
Electronic resource
Number of pages
77

ID Numbers

Open Library
OL25556151M
Internet Archive
introductiontodu00naum
ISBN 13
9781608452217, 9781608452200
OCLC/WorldCat
401167979


History

July 29, 2023 Edited by ImportBot import existing book
March 7, 2023 Edited by MARC Bot import existing book
June 18, 2022 Edited by ImportBot import existing book
February 25, 2022 Edited by ImportBot import existing book
July 29, 2014 Created by ImportBot Imported from Internet Archive item record