Lehti, Patrick (2006)
Unsupervised Duplicate Detection Using Sample Non-Duplicates.
Technische Universität Darmstadt
Ph.D. Thesis, Primary publication
Dissertation: thesis.pdf (PDF, 1MB, In Copyright)
Curriculum vitae (Lebenslauf): LebenslaufDiss.pdf (PDF, 5kB, In Copyright)
Item Type: Ph.D. Thesis
Type of entry: Primary publication
Title: Unsupervised Duplicate Detection Using Sample Non-Duplicates
Language: English
Referees: Neuhold, Prof. Dr. Erich ; Hofmann, Prof. Dr. Thomas
Advisors: Fankhauser, Dr. Peter
Date: 31 October 2006
Place of Publication: Darmstadt
Date of oral examination: 17 May 2006
Abstract: The problem of identifying objects in databases that refer to the same real-world entity is known, among other names, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, owing to errors and missing data. The traditional scenario for duplicate detection is the data warehouse, which is populated from several data sources; duplicate detection there is part of the data cleansing process that improves data quality for the warehouse. More recently, the problem also arises in application scenarios such as web portals, which offer users unified access to several data sources, or meta search engines, which distribute a search to several other resources and merge the individual results. In such scenarios no long and expensive data cleansing process can be carried out; good duplicate estimates must be available immediately. The most common approaches to duplicate detection use either rules or a weighted aggregation of similarity measures between the individual attributes of potential duplicates. However, choosing appropriate rules, similarity functions, weights, and thresholds requires a deep understanding of the application domain or a representative training set for supervised learning approaches, so these approaches entail significant costs. This thesis presents an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates and analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. To this end, a refinement of the classic Fellegi-Sunter model for record linkage is developed that uses these distributions to iteratively remove clear non-duplicates from the set of potential duplicates. Alternatively, machine learning methods such as Support Vector Machines are applied and compared with the refined Fellegi-Sunter model. The presented approach is not only able to align flat records but also makes use of related objects, which can significantly increase alignment accuracy, depending on the application. Evaluations show that the approach outperforms other unsupervised approaches and reaches almost the same accuracy as fully supervised, domain-dependent approaches.
Uncontrolled Keywords: Duplikaterkennung (duplicate detection), Datenbereinigung (data cleansing)
URN: urn:nbn:de:tuda-tuprints-7411
Classification DDC: 000 Generalities, computers, information > 004 Computer science
Divisions: 20 Department of Computer Science
Date Deposited: 17 Oct 2008 09:22
Last Modified: 08 Jul 2020 22:56
URI: https://tuprints.ulb.tu-darmstadt.de/id/eprint/741
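For orientation, here is a minimal sketch of the classic Fellegi-Sunter scoring scheme that the abstract refers to; the thesis's contribution is a refinement of this model in which the underlying distributions are estimated without labelled training data, which the sketch does not show. All field names, probability values, and thresholds below are invented placeholders, not values from the thesis.

```python
import math

# Illustrative only: classic Fellegi-Sunter match weights.
# M[f] = P(field f agrees | record pair is a duplicate)
# U[f] = P(field f agrees | record pair is a non-duplicate)
# Both tables and the thresholds below are made-up placeholders.
M = {"name": 0.95, "city": 0.80, "zip": 0.85}
U = {"name": 0.05, "city": 0.20, "zip": 0.10}

def match_weight(agreements):
    """Sum of log-likelihood ratios over the compared fields.

    `agreements` maps each field to True (values agree) or False.
    """
    w = 0.0
    for field, agrees in agreements.items():
        if agrees:
            w += math.log2(M[field] / U[field])
        else:
            w += math.log2((1 - M[field]) / (1 - U[field]))
    return w

# Pairs scoring above the upper threshold are declared duplicates, those
# below the lower threshold non-duplicates; the band in between is left
# as "possible duplicate" for further review.
UPPER, LOWER = 6.0, -2.0

def classify(agreements):
    w = match_weight(agreements)
    if w >= UPPER:
        return "duplicate"
    if w <= LOWER:
        return "non-duplicate"
    return "possible duplicate"

print(classify({"name": True, "city": True, "zip": True}))    # duplicate
print(classify({"name": False, "city": True, "zip": False}))  # non-duplicate
```

In the supervised setting, M and U come from a labelled training set; the unsupervised approach described in the abstract instead estimates the similarity-value distributions from the potential duplicates themselves and from sample non-duplicates, then iteratively prunes clear non-duplicates.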