Sunday, October 19, 2008

Privacy Preserving Data Integration And Mining : Privacy views

The database administrator defines what counts as private data by specifying a set of privacy views in a declarative language that extends SQL. Each view defines a set of private attributes and an owner to whom the data relates or belongs.
For example, the database administrator in a health care organization might define the following three privacy views:

PRIVACY-VIEW patientAddressDob
OWNER Patient.pid
SELECT Patient.address, Patient.dob
FROM Patient


PRIVACY-VIEW zipDisease
OWNER Patient.pid
SELECT Patient.address.zip, Disease.name
FROM Patient, Treatment, Disease
WHERE Patient.pid = Treatment.pid and Treatment.did = Disease.did


PRIVACY-VIEW physicianDisease
OWNER Patient.pid
SELECT Physician.name, Disease.name
FROM Patient, Treatment, Disease, Physician
WHERE Patient.pid = Treatment.pid and Treatment.did = Disease.did and Physician.id = Treatment.id

The first privacy view specifies that a patient’s address and dob (date of birth) are considered private data when they occur together. Whenever these two attributes appear together in a piece of data, e.g., data to be exchanged with a partner or integrated with other data, they are private. Notice that dob is not private by itself here, and neither is address (more on this below). Similar definitions can be given for patient name and other fields commonly referred to as “individually identifiable information”: sets of attributes that can be used to tie a tuple or a set of tuples in a data source to a specific real-world entity (e.g., a person).
Alternatively, administrators may choose to define database IDs or tuple IDs as private data, since both could be used to breach privacy over time. Database IDs identify the data source from which the data comes. While this is not necessarily an individual privacy issue, protecting the identity of the data source may be a prerequisite for organizations to participate in sharing. Tuple IDs identify tuples within a source. While this may not inherently violate privacy, it may enable tracking of tuples, which can violate privacy over time. In general, privacy views can be much more complex (e.g., by specifying associations between attributes from different tables).
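As a rough illustration of the "occurring together" rule, the following Python sketch treats each privacy view as a set of attribute names and flags a view only when all of its attributes are released together. The names `PRIVACY_VIEWS` and `violated_views` are hypothetical and introduced here for illustration. The check works on attribute names only: it ignores the join conditions of the views and does not model the fact that zip is part of address.

```python
# Hypothetical encoding of the three privacy views as attribute sets.
# Only the SELECT lists are modeled; the WHERE clauses are ignored.
PRIVACY_VIEWS = {
    "patientAddressDob": frozenset({"Patient.address", "Patient.dob"}),
    "zipDisease": frozenset({"Patient.address.zip", "Disease.name"}),
    "physicianDisease": frozenset({"Physician.name", "Disease.name"}),
}


def violated_views(released_attributes):
    """Return the names of views whose attributes ALL appear in the release.

    A single attribute of a view (e.g. dob alone) does not trigger it;
    only the full combination does.
    """
    released = set(released_attributes)
    return sorted(
        name for name, attrs in PRIVACY_VIEWS.items() if attrs <= released
    )


if __name__ == "__main__":
    print(violated_views(["Patient.dob"]))
    print(violated_views(["Patient.address", "Patient.dob"]))
```

Under these assumptions, releasing dob alone triggers no view, while releasing address and dob together triggers `patientAddressDob`.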
The second privacy view, zipDisease, is more subtle: it says that the patient’s zip code and disease together constitute private data. The zip code alone is not individually identifiable information; still, it is part of a person’s private data, and here the decision has been made to consider the association (zip, disease) as private. Notice also that the two attributes come from different tables. This example illustrates the power of privacy views: any combination of data can be declared private and given an owner.
The third privacy view specifies that even the association between physician names and diseases is to be considered private data, owned by the patient. This example illustrates the difficulty in defining ownership for private data. Suppose Dr. Johnson treats both patient “Smith” and patient “Brown” for Diabetes. Who owns the association (“Dr. Johnson”, “Diabetes”), Smith or Brown? We address this by adopting bag semantics, i.e., we consider two occurrences of the tuple (“Dr. Johnson”, “Diabetes”), one owned by Smith and the other by Brown.
Privacy views could be implemented by a privacy monitor that checks every data item retrieved from the database and detects whether it contains items that have been defined as private. There are two approaches: compile-time (based on query containment) and run-time (based on materializing the privacy views and building indices on the private attributes). Both approaches need to be investigated and their tradeoffs evaluated.
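The run-time approach with bag semantics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the toy `treatments` rows stand in for the joined Patient, Treatment, Disease, and Physician tables, "Dr. Lee" and "Asthma" are made-up values, and the function names `materialize_physician_disease`, `build_index`, and `owners_of` are hypothetical.

```python
from collections import defaultdict

# Hypothetical, already-joined toy data: (patient, physician, disease).
treatments = [
    ("Smith", "Dr. Johnson", "Diabetes"),
    ("Brown", "Dr. Johnson", "Diabetes"),
    ("Brown", "Dr. Lee", "Asthma"),
]


def materialize_physician_disease(rows):
    """Materialize the physicianDisease view under bag semantics.

    Duplicates are kept: each occurrence of a (physician, disease)
    tuple is stored together with its own owner.
    """
    return [(owner, (physician, disease)) for owner, physician, disease in rows]


def build_index(view_rows):
    """Index the materialized view on its private attributes."""
    index = defaultdict(list)
    for owner, private_tuple in view_rows:
        index[private_tuple].append(owner)
    return index


def owners_of(index, outgoing_tuple):
    """Return the owners of an outgoing tuple, or [] if it is not private."""
    return index.get(outgoing_tuple, [])


view = materialize_physician_disease(treatments)
index = build_index(view)

if __name__ == "__main__":
    # The same association has two owners, one per occurrence.
    print(owners_of(index, ("Dr. Johnson", "Diabetes")))
```

In this sketch a monitor would call `owners_of` on each tuple leaving the database. A non-empty result means the tuple matches a private association, and the list names every patient who owns an occurrence of it.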
