An edition of An introduction to duplicate detection (2010)

An introduction to duplicate detection

by Felix Naumann


With the ever-increasing volume of data, data quality problems abound. Duplicates, that is, multiple yet different representations of the same real-world object, are one of the most intriguing data quality problems. The effects of such duplicates are detrimental: for instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, and catalogs are mailed multiple times to the same household. Automatically detecting duplicates is difficult. First, duplicate representations are usually not identical but differ slightly in their values. Second, in principle all pairs of records should be compared, which is infeasible for large volumes of data. This lecture closely examines the two main components for overcoming these difficulties: (i) Similarity measures are used to automatically identify duplicates when comparing two records. Well-chosen similarity measures improve the effectiveness of duplicate detection. (ii) Algorithms are developed to search very large volumes of data for duplicates. Well-designed algorithms improve the efficiency of duplicate detection. Finally, we discuss methods to evaluate the success of duplicate detection.
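The two difficulties named in the abstract can be illustrated with a minimal sketch. It is not taken from the book: the whitespace tokenizer, the 0.5 threshold, the sample records, and the function names `tokens`, `jaccard`, and `candidate_pairs` are all assumptions chosen for illustration. It uses the Jaccard coefficient (listed in the contents under token-based similarity) and a naive comparison of every pair of records.

```python
from itertools import combinations


def tokens(record: str) -> set[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(record.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient: |A & B| / |A | B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records, threshold=0.5):
    """Naively compare all n*(n-1)/2 pairs; keep those at or above threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if jaccard(a, b) >= threshold
    ]


# Hypothetical sample data: the first two records differ only in one token.
records = [
    "John Smith 12 Main Street Springfield",
    "Jon Smith 12 Main Street Springfield",
    "Alice Jones 7 Oak Avenue Portland",
]
print(candidate_pairs(records))  # → [(0, 1)]
```

The first two records share 5 of 7 distinct tokens, so their similarity is about 0.71 and they are reported as a candidate duplicate pair. The all-pairs loop is what becomes infeasible at scale: one million records would require roughly 5 × 10^11 comparisons, which is the motivation for the blocking and sorted-neighborhood algorithms listed in the contents.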

Publish Date: 2010
Publisher: Morgan & Claypool Publishers
Language: English


Editions (2)

Edition	Availability
1 An Introduction to Duplicate Detection. 2010, Springer Nature, in English. ISBN 3031018354, 9783031018350	Not in Library
2 An introduction to duplicate detection. 2010, Morgan & Claypool Publishers, electronic resource, in English. ISBN 1608452212, 9781608452217	Not in Library

Book Details

1. Data cleansing: introduction and motivation

Data quality

Data quality dimensions

Data cleansing

Causes for duplicates

Intra-source duplicates

Inter-source duplicates

Use cases for duplicate detection

Customer relationship management

Scientific databases

Data spaces and linked open data

Lecture overview

2. Problem definition

Formal definition

Complexity analysis

Data in complex relationships

Data model

Challenges of data with complex relationships

3. Similarity functions

Token-based similarity

Jaccard coefficient

Cosine similarity using token frequency and inverse document frequency

Similarity based on tokenization using q-grams

Edit-based similarity

Edit distance measures

Jaro and Jaro-Winkler distance

Hybrid functions

Extended Jaccard similarity

Monge-Elkan measure

Soft TF/IDF

Measures for data with complex relationships

Other similarity measures

Rule-based record comparison

Equational theory

Duplicate profiles

4. Duplicate detection algorithms

Pairwise comparison algorithms

Blocking

Sorted-neighborhood

Comparison

Algorithms for data with complex relationships

Hierarchical relationships

Relationships forming a graph

Clustering algorithms

Clustering based on the duplicate pair graph

Clustering adjusting to data & cluster characteristics

5. Evaluating detection success

Precision and recall

Data sets

Real-world data sets

Synthetic data sets

Towards a duplicate detection benchmark

6. Conclusion and outlook

Bibliography

Authors' biographies.

Edition Notes

Part of: Synthesis digital library of engineering and computer science.

Title from PDF t.p. (viewed on April 7, 2010).

Series from website.

Includes bibliographical references (p. 71-76).

Abstract freely available; full-text restricted to subscribers or individual document purchasers.

Also available in print.

Mode of access: World Wide Web.

System requirements: Adobe Acrobat Reader.

Published in: San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA)
Series: Synthesis lectures on data management -- # 3
Other Titles: Synthesis digital library of engineering and computer science.

Classifications

Dewey Decimal Class: 005.7565
Library of Congress: QA76.9.D3 N285 2010, QA76.9.D3 N38 2010

The Physical Object

Format: Electronic resource
Number of pages: 77

ID Numbers

Open Library: OL25556151M
Internet Archive: introductiontodu00naum
ISBN 13: 9781608452217, 9781608452200
OCLC/WorldCat: 401167979

Source records

Internet Archive item record
Better World Books record
Better World Books record
marc_nuls MARC record
Internet Archive item record


History

Created July 29, 2014
6 revisions


July 29, 2023	Edited by ImportBot	import existing book
March 7, 2023	Edited by MARC Bot	import existing book
June 18, 2022	Edited by ImportBot	import existing book
February 25, 2022	Edited by ImportBot	import existing book
July 29, 2014	Created by ImportBot	Imported from Internet Archive item record
