Sunday, October 19, 2008

Privacy Preserving Data Integration And Mining: Querying Across Sources

Once semantic correspondences have been established, it is possible to query (e.g., with SQL queries) across the sources. How do we ensure that query results do not violate privacy policy? How do we query the sources such that only the results are disclosed? How can we prevent information from leaking across the answers to a set of queries?

Only a few general techniques exist today for querying datasets while preserving privacy: statistical databases, privacy-preserving join computation, and privacy-preserving top-K queries. In statistical databases, the goal is to allow users to ask aggregate queries over the database while hiding individual data items. There is a rich literature on this topic, including comprehensive surveys. Unfortunately, the main results are negative: while it is possible to preserve privacy for a single query, ensuring that a sequence of query results cannot be combined to disclose individual data is not practical. Privacy-preserving joins, and the more restricted privacy-preserving intersection size computation, have also been studied. Here, each of the two parties learns only the query's answer, and nothing else. These techniques, however, apply only to a specialized class of queries.
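To make intersection-size computation concrete, here is a minimal Python sketch of the classic commutative-encryption approach: each party raises hashed items to its own secret exponent, and since the two exponentiations commute, the doubly-encrypted values can be compared without revealing the items. The group parameters and helper names below are illustrative assumptions, not a vetted production construction.

```python
import hashlib
import math
import random

# Toy Diffie-Hellman-style commutative "encryption": E_k(x) = H(x)^k mod P.
# P is a Mersenne prime chosen for simplicity; real systems use vetted groups.
P = 2**127 - 1

def hash_to_group(item):
    """Map an item to a group element via SHA-256."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return pow(3, h % (P - 1), P)

def random_key():
    """Pick a secret exponent invertible modulo the group order."""
    while True:
        k = random.randrange(2, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def intersection_size(set_a, set_b):
    ka, kb = random_key(), random_key()
    # Each party encrypts its own items and sends them to the other, who
    # applies its key on top; double encryption is order-independent.
    a_double = {pow(pow(hash_to_group(x), ka, P), kb, P) for x in set_a}
    b_double = {pow(pow(hash_to_group(y), kb, P), ka, P) for y in set_b}
    # Only the count of common doubly-encrypted values is learned.
    return len(a_double & b_double)

print(intersection_size({"alice", "bob", "carol"}, {"bob", "carol", "dave"}))  # 2
```

In a real deployment the parties would also shuffle the encrypted sets before exchanging them, so that positions do not reveal which items matched.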
Privacy-preserving top-K queries have also recently been studied. Such a query returns just the closest K matches without revealing anything about why those matches are close, what the attribute values of the matching items are, or even which site the closest matches come from. This is accomplished efficiently through the use of an untrusted third party: a party that is not allowed to see private values, but is trusted not to collude with any site to violate privacy (see Figure 1).

In this method, each site finds its own top K and encrypts each result with the public key of the querying site. The parties then compare their top K with the top K of all other sites, except that the comparison protocol gives each site only a random share of each result, so neither party learns the outcome. The results from all sites are combined, scrambled, and given to the non-colluding untrusted site. This site can combine the random shares to obtain a comparison result for each pair, enabling it to sort the items and select the overall top K. The encrypted results corresponding to these K items are sent to the querying site. Each site learns nothing about the other sites (each comparison share it sees is indistinguishable from a randomly chosen bit). The untrusted site sees K·n encrypted results; it can totally order them, but since it knows nothing about what each result means or where it comes from, it learns nothing. The querying site sees only the final result.
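The protocol above can be simulated in a few lines of Python. This is only a toy sketch: the "encryption" is a placeholder wrapper, and the secure comparison is simulated by computing the bit in one place and splitting it into XOR shares, whereas in the real protocol no single function would see both inputs.

```python
import random
from itertools import combinations

def secure_compare(score_i, score_j):
    """Simulated secure comparison: returns two XOR shares of the bit
    (score_i > score_j). Each share alone is a uniformly random bit."""
    bit = 1 if score_i > score_j else 0
    share = random.getrandbits(1)
    return share, share ^ bit

def private_topk(sites, k):
    """sites: list of per-site result lists of (score, payload) pairs."""
    # 1. Each site keeps its local top k and "encrypts" the payload with
    #    the querier's public key (here just an opaque wrapper).
    pool = []
    for results in sites:
        for score, payload in sorted(results, reverse=True)[:k]:
            pool.append((score, ("enc", payload)))
    random.shuffle(pool)  # scrambled before reaching the untrusted site

    # 2. Pairwise comparisons; the untrusted site XORs the two shares to
    #    recover each comparison bit without learning any score.
    wins = [0] * len(pool)
    for i, j in combinations(range(len(pool)), 2):
        s1, s2 = secure_compare(pool[i][0], pool[j][0])
        if s1 ^ s2:
            wins[i] += 1
        else:
            wins[j] += 1

    # 3. The untrusted site orders the opaque ciphertexts by wins and
    #    forwards the top k to the querying site, which "decrypts" them.
    order = sorted(range(len(pool)), key=lambda i: wins[i], reverse=True)
    return [pool[i][1][1] for i in order[:k]]

sites = [[(9, "a9"), (1, "a1")], [(8, "b8"), (5, "b5")], [(7, "c7")]]
print(private_topk(sites, 2))  # ['a9', 'b8']
```

Note how the untrusted site works only with win counts and opaque ciphertexts: it can produce a total order without knowing which site contributed which result.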



For cohort studies, query criteria will combine attributes about individuals across data sources, and solutions for this have been developed. Privacy-preserving data mining also provides some building blocks. However, the issue of inference from multiple queries must still be resolved. Open issues include categorizing types of queries with respect to privacy policy, ensuring that query processing itself does not disclose information, and guarding against leakage from a set of queries. While there has been work in this area, many practical challenges remain.
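A two-query example makes the inference problem concrete: each query below is a permitted aggregate on its own, yet their difference discloses one individual's value. The table and attribute names are invented for illustration.

```python
# Invented toy table: (name, department, salary).
rows = [("alice", "eng", 120), ("bob", "eng", 95), ("carol", "hr", 80)]

def sum_query(pred):
    """The interface only ever returns aggregates, never a row."""
    return sum(salary for (name, dept, salary) in rows if pred(name, dept))

# Both queries are legitimate aggregates on their own...
q_all_eng = sum_query(lambda name, dept: dept == "eng")                      # 215
q_without = sum_query(lambda name, dept: dept == "eng" and name != "alice")  # 95

# ...but their difference discloses one individual's salary.
print(q_all_eng - q_without)  # 120
```

This is exactly the kind of leakage that guarding against a *set* of queries must prevent, and why auditing queries one at a time is insufficient.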
Another line of work finds matches to a query without revealing the query itself. In this case, both the query and the data are private; the only thing that can be revealed is which items match. In addition, the method allows checking for "forbidden" queries: even though the query is not revealed, it can be checked against combinations of query criteria that are not permitted.
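The policy half of that forbidden-query check can be sketched as a subset test over criteria sets. In the actual protocol the query stays encrypted and the comparison happens on encrypted criteria; here the criteria are in the clear purely to illustrate the logic, and the attribute names are made up for the example.

```python
# Hypothetical forbidden combinations: a query may not join a person's
# name with a diagnosis, and may never select on SSN at all.
FORBIDDEN = [frozenset({"name", "diagnosis"}), frozenset({"ssn"})]

def is_forbidden(query_criteria):
    """A query is rejected if it contains any forbidden combination."""
    q = frozenset(query_criteria)
    return any(combo <= q for combo in FORBIDDEN)

print(is_forbidden({"name", "zip", "diagnosis"}))  # True
print(is_forbidden({"zip", "diagnosis"}))          # False
```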
