Sunday, October 19, 2008

DATA INTEGRATION AND DATA MINING

Data integration and data mining are closely coupled. Integration is a necessary prerequisite for mining data collected from multiple sources. At the same time, data mining and machine learning techniques are used to enable automatic data integration. Several systems have been developed that use these techniques to help automate schema matching.

SemInt uses neural networks to determine match candidates. Similar attributes of the input schema are clustered, and the signatures of the cluster centers are used as training data. Matching is then done by feeding attributes from the second schema into the trained neural network.

LSD also uses machine learning techniques for schema matching, and it works in several phases. First, mappings for several sources are specified manually. Then source data is extracted (into XML) and training data is created for each base learner. Finally, the base learners and the meta-learner are trained. Further steps are carried out to refine the learned weights. The base learners are a nearest neighbor classification model and a Naïve Bayes learner. Existing work on privacy-preserving classification models is applicable here as well.

Artemis is another schema integration tool. It computes "affinities" in the range 0 to 1 between attributes, and schema integration is done by clustering attributes based on those affinities. Clearly, a lot of work in both privacy-preserving data mining and cryptography is relevant to the problem of privacy-preserving schema integration.
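The affinity-based clustering described for Artemis can be illustrated with a small sketch. This is not Artemis's actual algorithm: the affinity function below is a stand-in based on string similarity of attribute names, and `name_affinity`, `cluster_by_affinity`, and the 0.7 threshold are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_affinity(a: str, b: str) -> float:
    """Stand-in affinity in [0, 1]: string similarity of two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_by_affinity(attributes, threshold=0.7):
    """Single-link clustering: attributes linked by affinity >= threshold share a cluster."""
    parent = {attr: attr for attr in attributes}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(attributes, 2):
        if name_affinity(a, b) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for attr in attributes:
        clusters.setdefault(find(attr), []).append(attr)
    return sorted(sorted(group) for group in clusters.values())


# Attribute names pooled from two made-up schemas.
attributes = ["customer_name", "cust_name", "phone", "telephone", "zip"]
print(cluster_by_affinity(attributes))
```

Each resulting cluster is a candidate group of attributes to be merged in the integrated schema. A real tool would compute affinities from more than names (for example data types or structure), which this sketch deliberately leaves out.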
Record linkage also uses various machine learning techniques. It can be viewed as a pattern classification problem. In pattern classification, the goal is to correctly assign patterns to one of a finite number of classes. Similarly, the goal of record linkage is to determine the matching status of a pair of records brought together for comparison. Machine learning methods such as decision tree induction, neural networks, instance-based learning, and clustering are widely used for pattern classification. Given a set of patterns, a machine learning method builds a decision model that can be used to predict the class of each unclassified pattern. Again, prior privacy-preserving work is relevant. At the other end of the spectrum, privacy-preserving data mining assumes that data integration has already been done, even though integration is clearly not a solved problem.
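To make the classification view of record linkage concrete, here is a minimal sketch using instance-based learning (1-nearest-neighbor). Each record pair becomes a pattern of per-field similarity scores, and the pattern is assigned the class of its closest labelled example. The field names, the records, and the training patterns are all made up for illustration.

```python
from difflib import SequenceMatcher

FIELDS = ("name", "city")  # hypothetical record fields


def comparison_vector(rec_a, rec_b):
    """Turn a record pair into a pattern: one similarity score in [0, 1] per field."""
    return tuple(
        SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
        for f in FIELDS
    )


def classify_pair(pattern, training):
    """1-nearest-neighbor: return the label of the closest labelled pattern."""
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    _, label = min(training, key=lambda item: sq_dist(pattern, item[0]))
    return label


# Hand-labelled training patterns (illustrative values, not real data).
TRAINING = [
    ((1.0, 1.0), "match"),
    ((0.9, 0.8), "match"),
    ((0.3, 0.2), "non-match"),
    ((0.1, 0.9), "non-match"),  # same city, different person
]

a = {"name": "Jon Smith", "city": "Boston"}
b = {"name": "John Smith", "city": "Boston"}
c = {"name": "Maria Lopez", "city": "Boston"}

print(classify_pair(comparison_vector(a, b), TRAINING))
print(classify_pair(comparison_vector(a, c), TRAINING))
```

The same comparison vectors could be fed to a decision tree or a neural network instead; nearest neighbor is used here only because it fits in a few lines without external libraries.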
