<<

Protecting when Disclosing Information k

and Its Enforcement through Generalization and Suppression

y

Latanya Sweeney Pierangela Samarati

Lab oratory for Computer Science Computer Science Lab oratory

Massachusetts Institute of SRI International

Cambridge MA USA Menlo Park CA USA

sweeneyaimitedu samaraticslsricom

Abstract

Todays global ly networked society places great demand on the dissemination and sharing of personspecic

Situations where aggregate statistical information was once the reporting norm now rely heavily on the

transfer of microscopical ly detailed transaction and encounter information This happens at a time when

more and more historical ly public information is also electronical ly available When these data are linked

together they provide an electronic shadow of a person or organization that is as identifying and personal

as a ngerprint even when the sources of the information contains no explicit identiers such as name

and phone number In order to protect the anonymity of individuals to whom released data refer data

holders often remove or encrypt explicit identiers such as names addresses and phone numbers However

other distinctive data which we term quasiidentiers often combine uniquely and can be linked to publicly

available information to reidentify individuals

In this paper we address the problem of releasing personspecic data while at the same time safeguarding

the anonymity of the individuals to whom the data refer The approach is based on the denition of k

anonymity A table provides k anonymity if attempts to link explicitly identifying information to its contents

ambiguously map the information to at least k entities We il lustrate how k anonymity can be provided by

using generalization and suppression techniques We introduce the concept of minimal generalization which

captures the property of the release process not to distort the data more than needed to achieve kanonymity

We il lustrate possible preference policies to choose among dierent minimal generalizations Final ly we

present an algorithm and experimental results when an implementation of the algorithm was used to produce

releases of real medical information We also report on the quality of the released data by measuring the

precision and completeness of the results for dierent values of k

The work of Pierangela Samarati was supp orted in part DARPARome Lab oratory under grant FC

and by the National Science Foundation under grant ECS The work of Latanya Sweeney was supp orted

by Medical Informatics Training Grant IT LM from the National Library of Medicine

y

On leave from Universita di Milano

Intro duction

In the age of the Internet and inexp ensive computing p ower so ciety has develop ed an insatiable app etite

for information all kinds of information for many new and often exciting uses Most actions in daily life

are recorded on some computer somewhere That information in turn is often shared exchanged and sold

Many p eople may not care that the lo cal gro cer keeps track of which items they purchase but shared

information can b e quite sensitive or damaging to individuals and organizations Improp er disclosure of

medical information nancial information or matters of national security can have alarming ramications

and many abuses have b een cited The ob jective is to release information freely but to do so in a way

that the identity of any individual contained in the data cannot b e recognized In this way information can

b e shared freely and used for many new purp oses

Sho ckingly there remains a common incorrect b elief that if the data lo oks anonymous it is anonymous

Data holders including government agencies often remove all explicit identiers such as name address

and phone numb er from data so that other information in the data can b e shared incorrectly b elieving

that the identities of individuals cannot b e determined On the contrary deidentifying data provides no

guarantee of anonymity Released information often contains other data such as birth date gender

and ZIP co de that in combination can b e linked to publicly available information to reidentify individuals

Most municipalities sell p opulation registers that include the identities of individuals along with basic demo

graphics examples include lo cal census data voter lists city directories and information from motor vehicle

agencies tax assessors and real estate agencies For example an electronic version of a citys voter list was

purchased for twenty dollars and used to show the ease of reidentifying medical records In addition to

names and addresses the voter list included the birth dates and genders of voters Of these had

unique birth dates were unique with resp ect to birth date and gender with resp ect to birth date

and a digit ZIP co de and were identiable with just the full p ostal co de and birth date These

results reveal how uniquely identifying combinations of basic demographic attributes such as ZIP co de date

of birth ethnicity gender and martial status can b e

To illustrate this problem Figure exemplies a table of released medical data deidentied by suppress

ing names and So cial Security Numb ers SSNs so as not to disclose the identities of individuals to whom

the data refer However values of other released attributes such as fZIP Date Of Birth Ethnicity Sex

Marital Statusg can also app ear in some external table jointly with the individual identity and can there

fore allow it to b e tracked As illustrated in Figure ZIP Date Of Birth and Sex can b e linked to the

Voter List to reveal the Name Address and City Likewise Ethnicity and Marital Status can b e linked

to other publicly available p opulation registers In the Medical Data table of Figure there is only one

female b orn on and living in the area From the uniqueness results mentioned previously

regarding an actual voter list more than of the voters could b e uniquely identied using just

these attributes This combination uniquely identies the corresp onding bulleted tuple in the released data

as p ertaining to Sue J Carlson Main Street Cambridge and therefore reveals she has rep orted

shortness of breath Notice the medical information is not assumed to b e publicly asso ciated with the

individuals and the desired protection is to release the medical information such that the identities of the

individuals cannot b e determined However the of the released characteristics for Sue J Carlson leads to

determine which medical data among those released are hers While this example demonstrated an exact

match in some cases released information can b e linked to a restrictive set of individuals to whom the

released information could refer

Several protection techniques have b een develop ed with resp ect to statistical databases such as scram

bling and swapping values and adding noise to the data in such a way as to maintain an overall statistical

prop erty of the result However many new uses of data including cost analysis and

retrosp ective often need accurate information within the tuple itself Two indep endently develop ed

systems have b een released which use suppression and generalization as techniques to provide disclosure con

Medical Data Released as Anonymous

SSN Name Ethnicity Date Of Birth Sex ZIP Marital Status Problem

asian female divorced hyp ertension

asian female divorced ob esity

asian male married chest pain

asian male married ob esity

black male married hyp ertension

black male married shortness of breath

black female married shortness of breath

black female married ob esity

white male single chest pain

white male single ob esity

white female widow shortness of breath

Voter List

Name Address City ZIP DOB Sex Party

Sue J Carlson Main St Cambridge female demo crat

Figure Reidentifying anonymous data by linking to external data

trol while maintaining the integrity of the values within each tuple namely Datay in the

and MuArgus in Europ e However no formal foundations or abstraction have b een provided for the

techniques employed by b oth Further approximations made by the systems can suer from drawbacks such

as generalizing data more than is needed like or not providing adequate protection like

In this pap er we provide a formal foundation for the anonymity problem against linking and for the

application of generalization and suppression towards its solution We intro duce the denition of quasi

identiers as attributes that can b e exploited for linking and of kanonymity as characterizing the degree

of protection of data with resp ect to inference by linking We show how k anonymity can b e ensured

in information releases by generalizing andor suppressing part of the data to b e disclosed Within this

framework we intro duce the concepts of generalized table and of minimal generalization Intuitively a

generalization is minimal if data are not generalized more than necessary to provide kanonymity Also

the denition of preferred generalization allows the user to select among p ossible minimal generalizations

those that satisfy particular conditions such as favoring certain attributes in the generalization pro cess We

present an algorithm to compute a preferred minimal generalization of a given table Finally we discuss

some exp erimental results derived from the application of our approach to a medical database containing

information on patients

The problem we consider diers from the traditional access control and from statistical database

problems Access control systems address the problem of controlling sp ecic access to data

with resp ect to rules stating whether a piece of data can or cannot b e released In our work it is not the

disclosure of the sp ecic piece of data to b e protected ie on which an access decision can b e taken but

rather the fact that the data refers to a particular entity Statistical database techniques address the problem

of pro ducing tabular data representing a summary of the information to b e queried Protection is enforced

in such a framework by ensuring that it is not p ossible for users to infer original individual data from the

pro duced summary In our approach instead we allow the release of generalized p ersonsp ecic data on

which users can pro duce summaries according to their needs The advantage with resp ect to precomputed

releasesp ecic statistics is an increased exibility and availability of information for the users This exibility

and availability has as a drawback from the enduser stand p oint a coarse granularity level of the data

This new typ e of declassication and release of information seems to b e required more and more in to days

emerging applications

The remainder of this pap er is organized as follows In Section we intro duce basic assumptions and

denitions In Section we discuss generalization to provide anonymity and in Section we continue the

discussion to include suppression In Section basic preference p olicies for cho osing among dierent minimal

generalizations are illustrated In Section we discuss an algorithmic implementation of our approach

Section rep orts some exp erimental results Section concludes the pap er

Assumptions and preliminary denitions

We consider the data holders table to b e a private table PT where each tuple refers to a dierent entity

individual organization and so on From the private table PT the data holder constructs a table which is

to b e an anonymous release of PT For the sake of simplicity we will subsequently refer to the privacy and re

identication of individuals in cases equally applicable to other entities We assume that all explicit identiers

eg names SSNs and addresses are either encrypted or suppressed and we therefore ignore them in the

remainder of this pap er Borrowing the terminology from we call the combination of characteristics on

which linking can b e enforced quasiidentiers Quasiidentiers must therefore b e protected They are

dened as follows

Denition Quasiidentier Let T A A be a table A quasiidentier of T is a set of at

1 n

tributes fA A g fA A g whose release must be control led

i j 1 n

Given a table T A A a subset of attributes fA A g fA A g and a tuple t T

1 n i j 1 n

tA A denotes the sequence of the values of A A in t T A A denotes the pro jection

i j i j i j

maintaining duplicate tuples of attributes A A in T Also QI denotes the set of quasiidentiers

i j

T

asso ciated with T and jT j denotes cardinality that is the numb er of tuples in T

Our goal is to allow the release of information in the table while ensuring the anonymity of the individuals

The anonymity constraint requires released information to indistinctly relate to at least a given numb er k of

individuals where k is typically set by the data holder as stated by the following requirement

Denition kanonymity requirement Each release of data must be such that every combination

of values of quasiidentiers can be indistinctly matched to at least k individuals

Adherence to the anonymity requirement necessitates knowing how many individuals each released tuple

matches This can b e done by explicitly linking the released data with externally available data This is

obviously an imp ossible task for the data holder Although we can assume that the data holder knows

which attributes may app ear in external tables and therefore what constitutes quasiidentiers the sp ecic

values of data in external knowledge cannot b e assumed The key to satisfying the k anonymity requirement

therefore is to translate the requirement in terms of the released data themselves In order to do that we

require the following assumption to hold

Assumption Al l attributes in table PT which are to be released and which are external ly available in

1

combination ie appearing together in an external table or in possible joins between external tables to a

data recipient are dened in a quasidentier associated with PT

Although this is not a trivial assumption its enforcement is p ossible The data holder estimates which

attributes might b e used to link with outside knowledge this of course forms the basis for a quasiidentier

While the exp ectation of this knowledge is somewhat reasonable for publicly available data we recognize that

there are far to o many sources of semi public and private information such as pharmacy records longitudinal

A universal relation combining external tables can b e imagined

studies nancial records survey resp onses o ccupational lists and memb ership lists to account a priori for

all linking p ossibilities Supp ose the choice of attributes for a quasiidentier is incorrect that is the

data holder misjudges which attributes are sensitive for linking In this case the released data may b e less

anonymous than what was required and as a result individuals may b e more easily identied Sweeney

examines this risk and shows that it cannot b e p erfectly resolved by the data holder since the data holder

cannot always know what each recipient of the data knows p oses solutions that reside in p olicies laws

and contracts In the remainder of this work we assume that prop er quasiidentiers have b een recognized

We intro duce the denition of k anonymity for a table as follows

Denition k anonymity Let T A A be a table and QI be the quasiidentiers associated

1 n

T

with it T is said to satisfy k anonymity i for each quasiidentier QI QI each sequence of values in

T

T QI appears at least with k occurrences in T QI

Under Assumption and under the hyp othesis that the privately stored table contains at most one

tuple for each identity to b e protected ie to whom a quasiidentier refers k anonymity of a released

table represents a sucient condition for the satisfaction of the k anonymity requirement In other words

a table satisfying Denition for a given k satises the k anonymity requirement for such a k Consider

a quasiidentier QI if Denition is satised each tuple in PTQI has at least k o ccurrences Since

the p opulation of the private table is a subset of the p opulation of the outside world there will b e at least

k individuals in the outside world matching these values Also since all attributes available outside in

combination are included in QI no additional attributes can b e joint to QI to reduce the cardinality of

0

such a set Note also that any subset of the attributes in QI will refer to k k individuals To illustrate

consider the situation exemplied in Figure but assume that the released data contained two o ccurrences

of the sequence white female widow Then at least two individuals matching such

o ccurrences will exist in the voter list or in the table combining the voter list with all other external tables

and it will not b e p ossible for the data recipient to determine which of the two medical records asso ciated

with these values of the quasiidentier b elong to which of the two individuals Since k anonymity of was

provided in the release each medical record could indistinctly b elong to at least two individuals

Given the assumption and denitions ab ove and given a private table PT to b e released we fo cus on

the problem of pro ducing a version of PT which satises k anonymity

Generalizing data

Our rst approach to providing k anonymity is based on the denition and use of generalization relationships

b etween domains and b etween values that attributes can assume

Generalization relationships

In a classical relational database system domains are used to describ e the set of values that attributes

assume For example there might b e a ZIP co de domain a number domain and a string domain We

extend this notion of a domain to make it easier to describ e how to generalize the values of an attribute In

the original database where every value is as sp ecic as p ossible every attribute is in the ground domain

For example is in the ground ZIP co de domain Z To achieve k anonymity we can make the ZIP

0

co de less informative We do this by saying that there is a more general less sp ecic domain that can b e

used to describ e ZIP co des Z in which the last digit has b een replaced by a There is also a mapping from

1

Z to Z such as This mapping b etween domains is stated by means of a generalization

0 1

relationship which represents a partial order on the set Dom of domains and which is required to

D

satisfy the following conditions each domain D has at most one direct generalized domain and all

i

Z f g

2

HY

H

H

H

person

E fperson g Z f; g

1 1

HY

H I I

H

H

asian black caucasian

E fasian; black; caucasian g Z f; ; ; g

0 0

VGH DGH VGH DGH

E E Z Z

0 0 0 0

released not

M fnot released g

2

HY

H

H

H

married never married released once not

married; never married g released g E fnot M fonce

1 1

Qk

I

Q

Q

married divorced widow single male female

M fmarried; divorced; widow; single g E fmale; female g

0 0

VGH DGH DGH VGH

G G M M

0 0 0 0

Figure Examples of domain and value generalization hierarchies

2

maximal elements of Dom are singleton The denition of this generalization implies the existence for each

domain D Dom of a hierarchy which we term the domain generalization hierarchy DGH Since

D

generalized values can b e used in place of more sp ecic ones it is imp ortant that all domains in a hierarchy

b e compatible Compatibility can b e ensured by using the same storage representation form for all domains

in a generalization hierarchy A value generalization relationship partial order is also dened which

V

asso ciates with each value v in a domain D a unique value in domain D direct generalization of D Such

i i j i

a relationship implies the existence for each domain D of a value generalization hierarchy VGH

D

Example Figure il lustrates an example of domain and value generalization hierarchies for domain

Z representing zipcodes of the Cambridge MA area E representing ethnicities M representing marital

0 0 0

status and G representing gender

0

In the remainder of this pap er we will often refer to a domain or value generalization hierarchy in terms

of the graph representing all and only the direct generalization relationships b etween the elements in it ie

implied generalization relationships do not app ear as arcs in the graph We will use the term hierarchy

interchangeably to denote either a partially ordered set or the graph representing the set and all the direct

generalization relationships b etween its elements We will explicitly refer to the ordered set or to the graph

when it is not otherwise clear from context

Also since we will b e dealing with sets of attributes it is useful to visualize the generalization relationship

and hierarchies in terms of tuples comp osed of elements of Dom or of their values Given a tuple DT

hD D i such that D Dom i n we dene the domain generalization hierarchy of DT as

1 n i

assuming that the Cartesian pro duct is ordered by imp osing co ordinate DGH DGH DGH

D DT D

n 1

wise order DGH denes a lattice whose minimal element is DT The generalization hierarchy of

DT

a domain tuple DT denes the dierent ways in which DT can b e generalized In particular each path

from DT to the unique maximal element of DGH in the graph describing DGH denes a p ossible

DT DT

alternative path that can b e followed in the generalization pro cess We refer to the set of no des in each of

such paths together with the generalization relationships b etween them as a generalization strategy for

where the domain generalization DGH Figure illustrates the domain generalization hierarchy DGH

DT E Z

0 0

hierarchies of E and Z are as illustrated in Figure

0 0

The motivation b ehind condition is to ensure that all values in each domain can b e eventually generalized to a

single value

hE ; Z i hE ; Z i hE ; Z i hE ; Z i

1 2 1 2 1 2 1 2

I

hE ; Z i hE ; Z i hE ; Z i hE ; Z i hE ; Z i

1 1 0 2 1 1 1 1 0 2

HY

H

H

H

hE ; Z i hE ; Z i hE ; Z i hE ; Z i hE ; Z i

0 1 0 1 1 0 1 0 0 1

I

hE ; Z i hE ; Z i hE ; Z i hE ; Z i

0 0 0 0 0 0 0 0

GS GS GS DGH

1 2 3

DT

Figure Domain generalization hierarchy DGH and strategies for DT hE Z i

DT 0 0

EthE ZIPZ EthE ZIPZ EthE ZIPZ EthE ZIPZ

EthE ZIPZ

1 0 1 1 0 2 0 1

0 0

p erson p erson asian asian

asian

p erson p erson asian asian

asian

p erson p erson asian asian

asian

p erson p erson asian asian

asian

p erson p erson black black

black

p erson p erson black black

black

p erson p erson black black

black

p erson p erson black black

black

p erson p erson white white

white

p erson p erson white white

white

p erson p erson white white

white

p erson p erson white white

white

GT GT GT GT

PT

[10] [11] [02] [01]

Figure Examples of generalized tables for PT

Generalized table and minimal generalization

Given a private table PT our rst approach to provide k anonymity consists of generalizing the values stored

in the table Intuitively attribute values stored in the private table can b e substituted up on release with

generalized values Since multiple values can map to a single generalized value generalization may decrease

the numb er of distinct tuples thereby p ossibly increasing the size of the clusters containing tuples with the

same values We p erform generalization at the attribute level Generalizing an attribute means substituting

its values with corresp onding values from a more general domain Generalization at the attribute level

ensures that all values of an attribute b elong to the same domain However as a result of the generalization

pro cess the domain of an attribute can change In the following dom A T denotes the domain of attribute

i

A in table T D dom A PT denotes the domain asso ciated with attribute A in the private table PT

i i i i

Denition Generalized Table Let T A A and T A A be two tables dened on the

i 1 n j 1 n

same set of attributes T is said to be a generalization of T written T T i

j i i j

jT j jT j

i j

z n dom A T domA T

z i D z j

It is possible to dene a bijective mapping between T and T that associates each tuples t and t such

i j i j

that t A t A

i z V j z

Denition states that a table T is a generalization of a table T dened on the same attributes

j i

i T and T have the same numb er of tuples the domain of each attribute in T is equal to or a

i j j

hE ; Z i

1 2

I

hE ; Z i hE ; Z i

1 1 0 2

H HY

H H

H H

H H

hE ; Z i hE ; Z i

1 0 0 1

I

hE ; Z i

0 0

DGH

hE Z i

Figure Hierarchy DGH and corresp onding lattice on distance vectors

hE Z i

generalization of the domain of the attribute in T and each tuple t in T has a corresp onding tuple t

i i i j

in T and vice versa such that the value for each attribute in t is equal to or a generalization of the value

j j

of the corresp onding attribute in t

i

Example Consider the table PT il lustrated in Figure and the domain and value generalization hierar

chies for E and Z il lustrated in Figure The remaining four tables in Figure are al l possible generalized

0 0

tables for PT but the topmost one generalizes each tuple to hperson i For the clarity of the example

each table reports the domain for each attribute in the table With respect to k anonymity GT satises

[01]

k anonymity for k GT satises k anonymity for k GT satises k anonymity for

[10] [02]

k and GT satises k anonymity for k

[11]

Given a table dierent p ossible generalizations exist Not all generalizations however can b e considered

equally satisfactory For instance the trivial generalization bringing each attribute to the highest p ossible

level of generalization thus collapsing all tuples in T to the same list of values provides k anonymity at

the price of a strong generalization of the data Such extreme generalization is not needed if a more sp ecic

table ie containing more sp ecic values exists which satises k anonymity This concept is captured by

the denition of k minimal generalization To intro duce it we rst intro duce the notion of distance vector

Denition Distance vector Let T A A and T A A be two tables such that T T

i 1 n j 1 n i j

The distance vector of T from T is the vector DV d d where each d is the length of the unique

j i ij 1 n z

path between D domA T and domA T in the domain generalization hierarchy DGH

z i z j D

Example Consider table PT and its generalized tables il lustrated in Figure The distance vectors

between PT and its dierent generalizations are the vectors appearing as a subscript of each table

0 0

0 0 0

Given two distance vectors DV d d and DV d d DV DV i d d for all i

1 n i

n

1

i

0 0 0

n DV DV i DV DV and DV DV A generalization hierarchy for a domain tuple can b e seen

as a hierarchy lattice on the corresp onding distance vectors For instance Figure illustrates the lattice

representing the relationship b etween the distance vectors corresp onding to the p ossible generalization of

hE Z i

0 0

We can now intro duce the denition of k minimal generalization

Denition k minimal generalization Let T and T be two tables such that T T T is said to

i j i j j

be a k minimal generalization of T i

i

T satises kanonymity

j

Ethn DOB Sex ZIP Status

asian female divorced

asian female divorced

asian male married

asian male married

black male married

black male married

black female married

black female married

white male single

white male single

white female widow

PT

Ethn DOB Sex ZIP Status Ethn DOB Sex ZIP Status

asian not rel not rel p ers female b een

asian not rel not rel p ers female b een

asian not rel not rel p ers male b een

asian not rel not rel p ers male b een

black not rel not rel p ers male b een

black not rel not rel p ers male b een

black not rel not rel p ers female b een

black not rel not rel p ers female b een

white not rel not rel p ers male never

white not rel not rel p ers male never

white not rel not rel p ers female b een

GT GT

[02122] [13011]

Figure An example of table PT and its minimal generalizations

T T T T satises k anonymity and DV DV

z i z z iz ij

Intuitively a generalization T is minimal i there do es not exist another generalization T satisfying k

j z

anonymity which is dominated by T in the domain generalization hierarchy of hD D i or equivalently

j 1 n

in the corresp onding lattice of distance vectors If this were the case T would itself b e a generalization for

j

T Note also that a table can b e a minimal generalization of itself if the table already achieved kanonymity

z

Example Consider table PT and its generalized tables il lustrated in Figure Assume QI Eth ZIP

to be a quasiidentier It is easy to see that for k there exist two k minimal generalizations which are

GT and GT Table GT which satises the anonymity requirements is not minimal since it is a

[10] [01] [02]

generalization of GT Analogously GT cannot be minimal being a generalization of both GT and

[01] [11] [10]

GT There are also only two k minimal generalized tables for k which are GT and GT

[01] [10] [02]

Note that since k anonymity requires the existence of k o ccurrences for each sequence of values only for

quasiidentiers for every minimal generalization T DV d for all attributes A which do not b elong

j ij z z

to any quasiidentier

Suppressing data

In Section we discussed how given a private table PT a generalized table can b e pro duced which releases

a more general version of the data in PT and which satises a k anonymity constraint Generalization has

the advantage of allowing release of all the single tuples in the table although in a more general form Here

we illustrate a complementary approach to providing k anonymity which is suppression Suppressing means

to remove data from the table so that they are not released and as a disclosure control technique is not

new We apply suppression at the tuple level that is a tuple can b e suppressed only in its entirety

Suppression is used to mo derate the generalization pro cess when a limited numb er of outliers that is

Ethn DOB Sex ZIP Status

Ethn DOB Sex ZIP Status

asian female divorced

asian female divorced

asian female divorced

asian female divorced

asian male married

asian male married

asian male married

asian male married

black male married

black male married

black male married

black male married

black female married

black female married

black female married

black female married

white male single

white male single

white male single

white male single

GT

PT

[02000]

Figure An example of table PT and its minimal generalization

EthE ZIPZ EthE ZIPZ EthE ZIPZ EthE ZIPZ

EthE ZIPZ

1 0 0 1 0 2 1 1

0 0

p erson asian asian p erson

asian

p erson asian asian p erson

asian

p erson asian asian p erson

asian

p erson asian asian p erson

asian

p erson black black p erson

black

p erson black black p erson

black

p erson black black p erson

black

p erson white white p erson

white

GT GT GT GT

PT

[10] [01] [02] [11]

Figure Examples of generalized tables for PT

tuples with less that k o ccurrences would force a great amount of generalization To clarify consider the

table illustrated in Figure whose pro jection on the considered quasiidentier is illustrated in Figure

and supp ose k anonymity with k is to b e provided Attribute Date of Birth has a domain date with

the following generalizations from the sp ecic date mmddyy to the month mmyy to the year yy to a

3

year interval eg to a year interval eg to a year interval and so on It is easy

to see that the presence of the last tuple in the table necessitates for this requirement to b e satised two

steps of generalization on Date of Birth one step of generalization on Zip Code one step of generalization

on Marital Status and either one further step on Sex Zip Code and Marital Status or alternatively

on Ethnicity and Date of Birth The two p ossible minimal generalizations are as illustrated in Figure

In practice in b oth cases almost all the attributes must b e generalized It can b e easily seen at the same

time that had this last tuple not b een present k anonymity could have b een simply achieved by two steps of

generalization on attribute Date of Birth as illustrated in Figure Suppressing the tuple would in this

case p ermit enforcement of less generalization

In illustrating how suppression interplays with generalization to provide k anonymity we b egin by re

stating the denition of generalized table as follows

Denition Generalized Table with suppression Let T A A and T A A be two

i 1 n j 1 n

tables dened on the same set of attributes T is said to be a generalization of T written T T i

j i i j

jT j jT j

j i

z n dom A T domA T

z i D z j

It is possible to dene an injective mapping between T and T that associates tuples t T and t T

i j i i j j

such that t A t A

i z V j z

Note that although generalization may seem to change the format of the data compatibility can b e assured by

using the same representation form For instance the month can b e represented always as a sp ecic day This is

actually the trick that we used in our application of generalization

EthE ZIPZ EthE ZIPZ EthE ZIPZ EthE ZIPZ

EthE ZIPZ

1 0 0 1 0 2 1 1

0 0

p erson asian asian p erson

asian

p erson asian asian p erson

asian

p erson asian asian p erson

asian

p erson asian asian p erson

asian

p erson black p erson

black

black black p erson

black

p erson black black p erson

black

p erson p erson

white

GT GT GT GT

PT

[10] [01] [02] [11]

Figure Examples of generalized tables for PT

The denition ab ove diers from Denition since it allows tuples app earing in T not to have any

i

corresp onding generalized tuple in T Intuitively tuples in T not having any corresp ondent in T are tuples

j i j

which have b een suppressed

Denition allows any amount of suppression in a generalized table Obviously we are not interested

in tables that suppress more tuples than necessary to achieve k anonymity at a given level of generalization

This is captured by the following denition

Denition Minimal required suppression Let T be a table and T a generalization of T satis

i j i

fying kanonymity T is said to enforce minimal required suppression i T such that T T DV

j z i z iz

DV jT j jT j and T satises k anonymity

ij j z z

Example Consider the table PT and its generalizations il lustrated in Figure The tuples written in

bold face and marked with double lines in each table are the tuples that must be suppressed to achieve k

anonymity of Suppression of a subset of them would not reach the required anonymity Suppression of

any superset would be unnecessary not satisfying minimal required suppression

Allowing tuples to b e suppressed typically aords more tables p er level of generalization It is trivial to

prove however that for each p ossible distance vector the generalized table satisfying a k anonymity con

straint by enforcing minimal suppression is unique This table is obtained by rst applying the generalization

describ ed by the distance vector and then removing al l and only the tuples that app ear with fewer than k

o ccurrences

In the remainder of this pap er we assume the condition stated in Denition to b e satised that is all

generalizations that we consider enforce minimal required suppression Hence in the following within the

context of a k anonymity constraint when referring to the generalization at a given distance vector we will

intend the unique generalization for that distance vector which satises the k anonymity constraint enforcing

minimal required suppression To illustrate consider the table PT in Figure with resp ect to k anonymity

with k we would refer to its generalizations as illustrated in Figure Note that for sake of clarity we

have left an empty row to corresp ond to each removed tuple

Generalization and suppression are two dierent approaches to obtaining from a given table a table

which satises k anonymity It is trivial to note that the two approaches pro duce the b est results when jointly

applied For instance we have already noticed how with resp ect to the table in Figure generalization

alone is unsatisfactory see Figure Suppression alone on the other side would require suppression of all

tuples in the table Joint application of the two techniques allows instead the release of a table like the

one in Figure The question is therefore whether it is b etter to generalize at the cost of less precision

in the data or to suppress at the cost of completeness From observations of reallife applications and

requirements we assume the following We consider an acceptable suppression threshold MaxSup as

sp ecied stating the maximum numb er of suppressed tuples that is considered acceptable Within this

acceptable threshold suppression is considered preferable to generalization in other words it is b etter to

suppress more tuples than to enforce more generalization The reason for this is that suppression aects

single tuples whereas generalization mo dies all values asso ciated with an attribute thus aecting all tuples

in the table Tables which enforce suppression b eyond MaxSup are considered unacceptable

Given these assumptions we can now restate the denition of k minimal generalization taking suppression

into consideration

Denition k minimal generalization with suppression Let T and T be two tables such that

i j

T T and let MaxSup be the specied threshold of acceptable suppression T is said to be a k minimal

i j j

generalization of a table T i

i

T satises kanonymity

j

jT j jT j MaxSup

i j

T T T T satises conditions and and DV DV

z i z z iz ij

Intuitively generalization T is k minimal i it satises k anonymity it do es not enforce more suppression

j

than it is allowed and there do es not exist another generalization satisfying these conditions with a distance

vector smaller than that of T nor do es there exist another table with the same level of generalization

j

satisfying these conditions with less suppression

Example Consider the private table PT il lustrated in Figure and suppose k anonymity with k

is required The possible generalizations but the topmost one col lapsing every tuple to hperson i are

il lustrated in Figure Depending on the acceptable suppression threshold the fol lowing generalizations are

considered minimal

MaxSup GT GT GT or GT suppress more tuple than it is al lowed GT is not

[11] [10] [01] [02] [12]

minimal because of GT

[11]

MaxSup GT and GT GT suppresses more tuple than it is al lowed GT is not minimal

[10] [02] [01] [11]

because of GT and GT is not minimal because of GT and GT

[10] [12] [10] [02]

MaxSup GT and GT GT is not minimal because of GT GT and GT are not

[10] [01] [02] [01] [11] [12]

minimal because of GT and GT

[10] [01]

Preferences

It is clear from Section that there may b e more than one minimal generalization for a given table suppres

sion threshold and k anonymity constraint This is completely legitimate since the denition of minimal

only captures the concept that the least amount of generalization and suppression necessary to achieve k

anonymity is enforced However multiple solutions may exist which satisfy this condition Which of the

solutions is to b e preferred dep ends on sub jective measures and preferences of the data recipient For in

stance dep ending on the use of the released data it may b e preferable to generalize some attributes instead

of others We outline here some simple preference p olicies that can b e applied in cho osing a preferred min

imal generalization To do that we rst intro duce two distance measures dened b etween tables absolute

and relative distance Let T A A b e a table and T A A b e one of its generalizations with

i 1 n j 1 n

distance vector DV d d The absolute distance of T from T written Absdist is the sum of

ij 1 n j i ij

P

n

d The relative distance of T from T written the distances for each attribute Formally Absdist

i j i ij

i=1

Reldist is the sum of the relative distance for each attribute where the relative distance of each attribute

ij

P

n

d

z

is obtained by dividing the distance over the total height of the hierarchy Formally Reldist

ij

z =1

h

z

where h is the height of the domain generalization hierarchy of dom A T

z z i

Given those distance measures we can outline the following basic preference p olicies

Minimum absolute distance prefers the generalizations that has a smaller absolute distance that is

with a smaller total numb er of generalization steps regardless of the hierarchies on which they have

b een taken

Minimum relative distance prefers the generalizations that has a smaller relative distance that is

that minimizes the total numb er of relative steps that is considered with resp ect to the height of the

hierarchy on which they are taken

Maximum distribution prefers the generalizations that contains the greatest numb er of distinct tuples

Minimum suppression prefers the generalizations that suppresses less that is that contains the greater

numb er of tuples

Example Consider Example Suppose MaxSup Minimal generalizations are GT and GT

[10] [02]

Under minimum absolute distance GT is preferred Under minimum relative distance maximum distribu

[10]

tion and minimum suppression policies the two generalizations are equal ly preferable Suppose MaxSup

Minimal generalizations are GT and GT Under the minimum absolute distance policy the two gen

[10] [01]

eralizations are equal ly preferable Under the minimum suppression policy GT is preferred Under the

[10]

minimum relative distance and the maximum distribution policies GT is preferred

[01]

The list ab ove is obviously not complete and there remain additional preference p olicies that could b e

applied the b est one to use of course dep ends on the sp ecic use for the released data Examination of an

exhaustive set of p ossible p olicies is outside the scop e of this pap er The choice of a sp ecic preference p olicy

is done by the requester at the time of access Dierent preference p olicies can b e applied to dierent

quasiidentiers in the same released data

Computing a preferred generalization

We have dened the concept of preferred k minimal generalization corresp onding to a given private table

Here we illustrate an approach to computing such a generalization Before discussing the algorithm we make

some observations clarifying the problem of nding a minimal generalization and its complexity We use the

term outlier to refer to a tuple with fewer than k o ccurrences where k is the anonymity constraint required

First of all given that the k anonymity prop erty is required only for attributes in quasiidentiers

we consider the generalization of each sp ecic quasiidentier within table PT indep endently Instead of

considering the whole table PT to b e generalized we consider its pro jection PTQI keeping duplicates on

the attributes of a quasiidentier QI The generalized table PT is obtained by enforcing generalization for

each quasiidentier QI QI The correctness of the combination of the generalizations indep endently

PT

pro duced for each quasiidentier is ensured by the fact that the denition of a generalized table requires

4

corresp ondence of values across whole tuples and by the fact that the quasiidentiers of a table are disjoint

In Section we illustrated the concepts of a generalization hierarchy and strategies for a domain tuple

Given a quasiidentier QI A A the corresp onding domain hierarchy on DT hD D i

1 n 1 n

pictures all the p ossible generalizations and their relationships Each path strategy in it denes a dierent

way in which generalization can b e applied With resp ect to a strategy we could dene the concept of

lo cal minimal generalization as the generalization that is minimal with resp ect to the set of generalizations

in the strategy intuitively the rst found in the path from the b ottom element DT to the top element

Each k minimal generalization is lo cally minimal with resp ect to some strategy as stated by the following

theorem

This last constraint can b e removed provided that generalization of nondisjoint quasiidentiers b e executed

serially

Theorem Let T A A PTQI be the table to be generalized and let DT hD D i be the

1 n 1 n

tuple where D dom A T z n be a table to be generalized Every k minimal generalization of

z z

T is a local minimal generalization for some strategy of DGH

i DT

Proofsketch By contradiction Supp ose T is k minimal but is not lo cally minimal with resp ect to

j

any strategy Then there exists a strategy containing T such that there exists another generalization T

j z

dominated by T in this strategy which satises k anonymity by suppressing no more tuples than what is

j

allowed Hence T satises conditions and of Denition Moreover since T is dominated by T

z z j

DV DV Hence T cannot b e minimal which contradicts the assumption 2

iz ij j

Since strategies are not disjoint the converse is not necessarily true that is a lo cal minimal generalization

with resp ect to a strategy may not corresp ond to a k minimal generalization

From Theorem following each generalization strategy from the domain tuple to the maximal element

of the hierarchy would then reveal all the lo cal minimal generalizations from which the k minimal general

izations can b e selected and an eventual preferred generalization chosen The consideration of preferences

implies that we cannot stop the search at the rst generalization found that is known to b e k minimal

However this pro cess is much to o costly b ecause of the high numb er of strategies which should b e followed

(h ++h )!

1 n

It can b e proved that the numb er of dierent strategies for a domain tuple DT hD D i is

1 n

h !h !

1 n

where each h is the length of the path from D to the top domain in DGH

i i D

i

In the implementation of our approach we have realized an algorithm that computes a preferred gen

eralization without needing to follow all the strategies and computing the generalizations The algorithm

makes use of the concept of distance vector b etween tuples Let T b e a table and x y T two tuples such

0 00 00 00 0 0

i where each v v is a value in domain D The distance vector v i and y hv v that x hv

i

n 1 n 1

i i

0 00

b etween x and y is the vector V d d where d is the length of the paths from v and v to their

xy 1 n i

i i

For instance with reference to the closest common ancestor in the value generalization hierarchy VGH

D

i

PT illustrated in Figure the distance b etween hasiani and hblacki is Intuitively the

distance b etween two tuples x and y in table T is the distance vector b etween T and the table T with

i i j

T T where the domains of the attribute in T are the most sp ecic domains for which x and y generalize

i j j

to the same tuple t

The following theorem states the relationship b etween distance vectors b etween tuples in a table and a

minimal generalization for the table

Theorem Let T A A PTQI and T be two tables such that T T If T is k minimal then

i 1 n j i j j

DV V for some tuples x y in T such that either x or y has a number of occurrences smal ler than k

ij xy i

Proofsketch By contradiction Supp ose that a k minimal generalization T exists such that DV

j ij

do es not satisfy the condition ab ove Let DV d d Consider a strategy containing a generalization

ij 1 n

with that distance vector there will b e more than one of such strategies and which one is considered is not

imp ortant Consider the dierent generalization steps executed according to the strategy from the b ottom

going up arriving at the generalization corresp onding to T Since no outlier is at exact distance d d

j 1 n

from any tuple no outlier is merged with any tuple at the last step of generalization considered Then the

generalization directly b elow T in the strategy satises the same k anonymity constraint as T with the same

j i

amount of suppression Also by denition of strategy DV DV Then by Denition T cannot b e

iz ij j

minimal which contradicts the assumption 2

According to Theorem the distance vector of a minimal generalization falls within the set of the

vectors b etween the outliers and other tuples in the table This prop erty is exploited by the generalization

algorithm to reduce the numb er of generalizations to b e considered

The algorithm works as follows Let PTQI b e the pro jection of PT over quasiidentier QI First all

distinct tuples in PTQI are determined together with the numb er of their o ccurrences Then the distance

vectors b etween each outlier and every tuple in the table is computed Then a DAG with as no des all

distance vectors found is constructed There is an arc from each vector to all the smallest vector dominating

it in the set Intuitively the DAG corresp onds to a summary of the strategies to b e considered not

all strategies may b e represented and not all generalizations of a strategy may b e present Each path in

the DAG is then followed from the b ottom up until a minimal lo cal generalization is found The algorithm

determines if a generalization is lo cally minimal simply by controlling how the o ccurrences of the tuples would

combine on the basis of the distance table constructed at the b eginning without actually p erforming the

generalization When a lo cal generalization is found another path is followed As paths may b e not disjoint

the algorithm keeps track of generalizations that have b een considered so as to stop on a path when it runs

into another path on which a lo cal minimum has already b een found Once all p ossible paths have b een

examined the evaluation of the distance vectors allows the determination of the generalizations among those

found which are k minimal Among them a preferred generalization to b e computed is then determined on

the basis of the distance vectors and of how the o ccurrences of tuples would combine

The characteristics that reduce the computation cost are therefore that the computation of the

distance vectors b etween tuples greatly reduces the numb er of generalizations to b e considered gener

alizations are not actually computed but foreseen by lo oking at how the o ccurrences of the tuples would

combine the fact that the algorithm keeps track of evaluated generalizations allows it to stop evaluation

on a path whenever it crosses a path already evaluated

The correctness of the algorithm descends directly from Theorems and

The necessary and sucient condition for a table T to satisfy k anonymity is that the cardinality of the

table b e at least k and only in this case therefore is the algorithm applied This is stated by the following

theorem

Theorem Let T be a table MaxSup jT j be the acceptable suppression threshold and k be a natural

number If jT j k then there exists at least a k minimal generalization for T If jT j k there are no

nonempty k minimal generalizations for T

Proofsketch Supp ose jT j k Consider the generalization generalizing each tuple to the topmost

p ossible domain Since maximal elements of Dom are singleton all values of an attribute collapse to the same

value Hence the generalization will contain jT j o ccurrences of the same tuple Since jT j k it satises

k anonymity Supp ose jT j k no generalization can satisfy k anonymity which can b e reached only by

suppressing all the tuples in T 2

Application of the approach some exp erimental results

We constructed a computer program that pro duces tables adhering to k minimal generalizations given sp ecic

thresholds of suppression The program was written in C using ODBC to interface with an SQL server

which in turn accessed a medical database Our goal was to mo del an actual release and to measure the quality

of the released data Most states have legislative mandates to collect medical data from hospitals so we

collapsed the original medical database into a single table consistent with the format and primary attributes

the National Asso ciation of Health Data Organizations recommends that state agencies collect Each

tuple represents one patient and each patient is unique The data contained medical records for patients

Figure itemizes the attributes used the table is considered deidentied b ecause it contains no explicit

identifying information such as name or address As discussed earlier ZIP co de date of birth and gender can

b e linked to p opulation registers that are publicly available in order to reidentify patients Therefore

the quasiidentier QI fZIP birth date gender ethnicityg was considered Each tuple within QI was

found to b e unique

The top table in Figure is a sample of the original data and the lower table illustrates a k minimal

generalization of that table given a threshold of suppression The ZIP eld has b een generalized to the

Attribute distinct values min frequency max frequency median frequency comments

ZIP

Birth year yr range

Gender

Ethnicity

Table Distribution of values in the table considered in the exp eriment

rst digits and date of birth to the year The tuple with the unusual ZIP co de of has b een

suppressed The recipient of the data is informed of the levels of generalizations and how many tuples were

suppressed Note The default value for month is January and for day is the st when dates are generalized

This is done for practical considerations that preserve the data typ e originally assigned to the attribute see

Section

Table itemizes the basic distribution of values within the attributes ZIP co des were stored in the

full digit form with a generalization hierarchy replacing rightmost digits with of levels Birth dates

were generalized rst to the month then year year year year and year p erio ds A twolevel

hierarchy was considered for gender and ethnicity see Figure The pro duct of the numb er of p ossible

domains for each attribute gives the total numb er of p ossible generalizations which is

The program constructed a clique where each no de was a tuple and the edges were weighted by distance

vectors b etween adjacent tuples Reading these vectors from the clique the program generated a set of

generalizations to consider There were generalizations read from the clique discarding or For

our tests we used values of k to b e and a maximum suppression threshold of or tuples

Figure shows the relationship b etween suppression and generalization within the program in a practical

and realistic application We measure the loss of due to suppression as the ratio of the numb er of

tuples suppressed divided by the total numb er of tuples in the original data We dene the inverse measure

of completeness to determine how much of the data remains computed as one minus the loss due to

suppression Generalization also reduces the quality of the data since generalized values are less precise We

measure the loss due to generalization as the ratio of the level of generalization divided by the total height

of the generalization hierarchy We term precision as the amount of sp ecicity remaining in the data

computed as one minus the loss due to generalization

In Charts A and B of Figure we compare the data quality loss as the k anonymity requirement

increases Losses are rep orted for b oth generalization and suppression for each attribute as if it were solely

resp onsible for achieving the k anonymity requirement By doing so we characterize the distribution and

nature of values found in these attributes Given the distribution of males and females in the data

the gender attribute itself can achieve these values of k so we see no loss due to generalization or suppression

On the other hand there were of distinct birth dates Clearly date of birth and ZIP co de are the

most discriminating values so it is not surprising that they must b e generalized more than other attributes

The at lines on these curves indicate values b eing somewhat clustered

Charts C and D of Figure rep ort completeness and precision measurements for the minimal

generalizations found Basically generalizations that satisfy smaller values of k app ear further to the right

in chart C and those generalizations that achieve larger values of k are leftmost This results from the

observation that the larger the value for k the more generalization may b e required resulting of course in

a loss of precision It is also not surprising that completeness remains ab ove b ecause our suppression

threshold during these tests was Though not shown in the charts it can easily b e understo o d that

raising the suppression threshold typically improves precision since more values can b e suppressed to achieve

k Clearly generalization is exp ensive to the quality of the data since it is p erformed across the entire

attribute every tuple is aected On the other hand it remains semantically more useful to have a value

Figure Example of current release practice and minimally generalized equivalent

Figure Exp erimental results based on medical records

present even if it is a less precise one than not having any value at all as is the result of suppression

From these exp eriments it is clear that the techniques of generalization and suppression can b e used in

practical applications Of course protecting against linking involves a loss of data quality in the attributes

that comprise the quasiidentier though we have shown that the loss is not severe These techniques are

clearly most eective when the primary attributes required by the recipient are not the same as the quasi

identier that can b e used for linking In the sample medical data shown earlier researchers computer

scientists health economists and others value the information that is not included in the quasiidentier in

order to develop diagnostic to ols p erform retrosp ective research and assess hospital costs

Conclusions

We have presented an approach to disclosing entitysp ecic information such that the released table cannot

b e reliably linked to external tables The anonymity requirement is expressed by sp ecifying a quasiidentier

and a minimum numb er k of duplicates of each released tuple with resp ect to the attributes of the quasi

identier The anonymity requirement is achieved by generalizing and p ossibly suppressing information

up on release We have given the notion of minimal generalization capturing the prop erty that informa

tion is not generalized more than it is needed to achieve the anonymity requirement We have discussed

p ossible preference p olicies to cho ose b etween dierent minimal generalizations and an algorithm to com

pute a preferred minimal generalization Finally we have illustrated the results of some exp eriments from

the application of our approach to the release of a medical database containing information regarding

patients

This work represents only a rst step toward the denition of a complete framework for information

disclosure control Many problems are still op en From a mo deling p oint of view the denition of quasi

identiers and of an appropriate size of k must b e addressed The quality of generalized data is b est when

the attributes most imp ortant to the recipient do not b elong to any quasiidentier For publicuse les this

may b e acceptable but determining the quality and usefulness in other settings must b e further researched

From the technical p oint of view future work should include the investigation of an ecient algorithm

to enforce the prop osed techniques and the consideration of sp ecic queries of multiple releases over time

and of data up dating which may allow inference attacks

Acknowledgments

We thank Steve Dawson at SRI for discussions and supp ort Rema Padman at CMU for discussions on

metrics and Dr Lee Mann of Inova Health Systems Lexical Technology Inc and Dr Fred Chu for

making medical data available to validate our approaches We also thank Sylvia Barrett and Henry Leitner

of Harvard University for their supp ort

References

NR Adam and JC Wortman Securitycontrol metho ds for statistical databases A comparative

study ACM Computing Surveys

Ross Anderson A security p olicy mo del for clinical information systems In Proc of the IEEE

Symposium on Security and Privacy pages Oakland CA May

Silvana Castano Maria Grazia Fugini Giancarlo Martella and Pierangela Samarati Database Security

Addison Wesley

PC Chu Cell suppression metho dology The imp ortance of suppressing marginal totals IEEE Trans

on Know ledge Data Systems JulyAugust

LH Cox Suppression metho dology in statistical disclosure analysis In ASA Proceedings of Social

Statistics Section pages

Tore Dalenius Finding a needle in a haystack or identifying anonymous census record Journal of

Ocial Statistics

BA Davey and HA Priestley Introduction to Lattices and Order Cambridge University Press

Dorothy E Denning Cryptography and AddisonWesley

Dan Guseld A little knowledge go es a long way Faster detection of compromised data in D tables

In Proc of the IEEE Symposium on Security and Privacy pages Oakland CA May

J Hale and S Shenoi Catalytic inference analysis Detecting inference threats due to knowledge

discovery In Proc of the IEEE Symposium on Security and Privacy pages Oakland

CA May

A Hundep o ol and L Willenb org and Argus for statistical disclosure control In Third

International Seminar on Statistical Condentiality Bled

Ram Kumar Ensuring data security in interrelated tabular data In Proc of the IEEE Symposium on

Security and Privacy pages Oakland CA May

Teresa Lunt Aggregation and inference Facts and fallacies In Proc of the IEEE Symposium on

Security and Privacy pages Oakland CA May

National Asso ciation of Health Data Organizations Falls Church A Guide to StateLevel Ambulatory

Care Data Col lection Activities Octob er

P Samarati and L Sweeney Generalizing data to provide anonymity when dislosing information In

Proc of the ACM SIGACTSIGMODSIGART Symposium on Principles of Database Systems

PODS Seattle USA June

Latanya Sweeney Computational disclosure control for medical micro data In Record Linkage Workshop

Bureau of the Census Washington DC

Latanya Sweeney Guaranteeing anonymity when sharing medical data the Datay system In Proc

Journal of the American Medical Informatics Association Washington DC Hanley Belfus Inc

Latanya Sweeney Weaving technology and p olicy together to maintain condentiality Journal of Law

Medicine Ethics

Rein Turn issues for the s In Proc of the IEEE Symposium on Security and

Privacy pages Oakland CA May

Jerey D Ullman Principles of Databases and Know ledgeBase Systems volume I Computer Science

Press

L Willenb org and T De Waal Statistical disclosure control in practice New York SpringerVerlag

L Willenb org and T De Waal Statistical Disclosure Control in Practice SpringerVerlag

Beverly Wo o dward The computerbased patient record condentiality The New England Journal of

Medicine