Data received from multiple sources may contain duplicates that need to be removed. In many cases it is important to be able to consolidate information about entities (e.g., to construct more comprehensive sets of scientific data). How can we match entities and consolidate information about them across sources without revealing the origin of the sources or the real-world origin of the entities?
Record linkage is the identification of records that refer to the same real-world entity, and it is a key challenge in enabling data integration from heterogeneous data sources. What makes record linkage a problem in its own right (i.e., different from the duplicate elimination problem) is the fact that real-world data is “dirty”. If data were accurate, record linkage would be similar to duplicate elimination. Unfortunately, in real-world data, duplicate records may have different values in one or more fields (e.g., a misspelling produces multiple records for the same person).
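To make the difference between duplicate elimination and record linkage concrete, here is a minimal Python sketch. The record fields, the sample values, the 0.85 threshold, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; they are not taken from any particular record linkage system.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Return a similarity score in [0, 1] for two field values."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def exact_duplicates(r1, r2):
    """Duplicate elimination: records match only if every field is identical."""
    return r1 == r2


def likely_same_entity(r1, r2, threshold=0.85):
    """Record linkage: tolerate dirty fields by averaging per-field similarity.

    The threshold is an arbitrary illustrative value, not a recommended one.
    """
    scores = [field_similarity(r1[key], r2[key]) for key in r1]
    return sum(scores) / len(scores) >= threshold


# Hypothetical records: the same person, with a misspelled name in one source.
a = {"name": "Jonathan Smith", "city": "Lafayette"}
b = {"name": "Jonathon Smith", "city": "Lafayette"}
c = {"name": "Maria Garcia", "city": "Boston"}

print(exact_duplicates(a, b))    # exact comparison misses the misspelled record
print(likely_same_entity(a, b))  # approximate comparison links the two records
print(likely_same_entity(a, c))  # unrelated records stay unlinked
```

Exact comparison treats `a` and `b` as different records, while the approximate comparison links them. A real system would choose its similarity measure and threshold per field.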
Record linkage techniques can also be used to breach data confidentiality. In particular, a privacy-aware corporation will use anonymization techniques to protect its own data before sharing it with other businesses. A data intruder then tries to re-identify as many concealed records as possible by linking them against an external database (many such databases are now publicly available). Anonymization techniques must therefore account for the capabilities of record linkage techniques in order to preserve the privacy of the data.
On the other hand, businesses need to integrate their databases to perform data mining and analysis. Such data integration requires privacy-preserving record linkage, that is, record linkage in the presence of a privacy framework that ensures the data confidentiality of each business. Thus, we need solutions for the following problems:
• Privacy-preserving record linkage: discovering the records that represent the same real-world entity across two databases being integrated, each of which is protected (encrypted or anonymized). In other words, records are matched without their identities being revealed.
• Record-linkage-aware data protection: protecting the data before sharing it, using anonymization techniques that account for the possible use of record linkage with publicly available data to reveal the identity of the records.
• Online record linkage: linking records that arrive continuously in a stream. Real-time systems and sensor networks are two examples of applications that need online data analysis, cleaning, and mining.
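As a rough illustration of the first problem, privacy-preserving record linkage, the sketch below compares two values without exchanging them in the clear. It uses a technique that is not described in this post: each party encodes the character bigrams of a value into a Bloom-filter-style set of bit positions using a keyed hash, and only those encodings are compared with the Dice coefficient. The function names, the shared secret, and the parameters `size` and `num_hashes` are hypothetical. This is a sketch under those assumptions, not a vetted privacy protocol.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a padded, normalized string into its set of character bigrams."""
    text = "_" + value.strip().lower() + "_"
    return {text[i:i + 2] for i in range(len(text) - 1)}


def encode(value, secret, size=256, num_hashes=4):
    """Map each bigram to num_hashes bit positions using keyed hashes.

    Only the resulting set of positions would be shared, never the value.
    """
    bits = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            message = "{}:{}".format(seed, gram).encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient of two bit-position sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical secret agreed on by the two data owners beforehand.
SECRET = b"shared-linkage-key"

original = encode("Jonathan Smith", SECRET)
misspelled = encode("Jonathon Smith", SECRET)
unrelated = encode("Maria Garcia", SECRET)

print(dice(original, misspelled))  # high: most bigrams are shared
print(dice(original, unrelated))   # low: overlap comes only from hash collisions
```

Because similar strings share most of their bigrams, their encodings overlap heavily, so misspelled values can still be linked. Whether such an encoding is private enough depends on the threat model and would need separate analysis.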
Record linkage has been studied in various contexts and has been referred to by different names, such as the merge/purge problem. The record linkage problem can also be viewed as a pattern classification problem. In pattern classification problems, the goal is to correctly assign patterns to one of a finite number of classes. Similarly, the goal of the record linkage problem is to determine the matching status of a pair of records brought together for comparison. Machine learning methods, such as decision tree induction, neural networks, instance-based learning, and clustering, are widely used for pattern classification. Given a set of patterns, a machine learning method builds a decision model that can be used to predict the class of each unclassified pattern. TAILOR, an interactive record linkage toolbox, uses three classification models for record linkage based on induction and clustering.
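The classification view can be sketched in a few lines of Python. Each pair of records becomes a comparison vector of per-field similarities, and a one-level decision tree (a decision stump) is induced from a handful of labeled pairs. The schema, the training pairs, and the function names are hypothetical, and the stump is a deliberately simple stand-in for induction. It is not one of TAILOR's models.

```python
from difflib import SequenceMatcher

FIELDS = ("name", "city")  # hypothetical schema


def comparison_vector(r1, r2):
    """Turn a pair of records into one similarity score per field."""
    return [
        SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio()
        for f in FIELDS
    ]


def train_stump(vectors, labels):
    """Pick the (field index, threshold) with the fewest training errors."""
    best = None
    for i in range(len(vectors[0])):
        for threshold in sorted({v[i] for v in vectors}):
            errors = sum(
                (v[i] >= threshold) != label
                for v, label in zip(vectors, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, i, threshold)
    _, index, threshold = best
    return index, threshold


def predict(stump, vector):
    """Classify a comparison vector as a match (True) or a non-match (False)."""
    index, threshold = stump
    return vector[index] >= threshold


# Hypothetical labeled pairs: (record, record, is_match).
training_pairs = [
    ({"name": "Jonathan Smith", "city": "Lafayette"},
     {"name": "Jonathon Smith", "city": "Lafayette"}, True),
    ({"name": "Maria Garcia", "city": "Boston"},
     {"name": "Maria Garcia", "city": "Bostn"}, True),
    ({"name": "Jonathan Smith", "city": "Lafayette"},
     {"name": "Maria Garcia", "city": "Lafayette"}, False),
    ({"name": "Wei Chen", "city": "Boston"},
     {"name": "Maria Garcia", "city": "Boston"}, False),
]

vectors = [comparison_vector(r1, r2) for r1, r2, _ in training_pairs]
labels = [label for _, _, label in training_pairs]
stump = train_stump(vectors, labels)

# Classify a previously unseen pair of records.
new_pair = comparison_vector(
    {"name": "Wei Chen", "city": "Austin"},
    {"name": "Wei Chen", "city": "Austn"},
)
print(predict(stump, new_pair))
```

On this toy data the stump learns to split on name similarity, because two of the non-matching pairs have identical cities. With realistic data, a deeper tree or a clustering model would be induced from many more comparison vectors.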