
Week 3: Privacy-preserving Data Publishing II


🎯 Learning Goals

By the end of this week, you should understand:


📖 Theoretical Content

Limitations of k-Anonymity

While k-anonymity prevents record linkage, it has several critical vulnerabilities:

1. Homogeneity Attack (Lack of Diversity)

2. Background Knowledge Attack

3. Skewness Attack

l-Diversity Model

Definition: An equivalence class satisfies l-diversity if it contains at least l “well-represented” values for each sensitive attribute.

Types of l-Diversity:

1. Distinct l-Diversity

2. Entropy l-Diversity

3. Recursive (c,l)-Diversity
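As a rough illustration of the recursive variant, the sketch below assumes the usual formulation: with the sensitive-value counts of a class sorted in descending order as r1 ≥ r2 ≥ ... ≥ rm, the class qualifies when r1 < c × (rl + ... + rm). The function name and sample data are hypothetical.

```python
from collections import Counter

def satisfies_recursive_cl_diversity(sensitive_values, c, l):
    """Check r1 < c * (r_l + ... + r_m) for one equivalence class.

    Assumed formulation: counts are sorted in descending order, so r1 is
    the count of the most frequent sensitive value.
    """
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(counts) < l:
        return False  # fewer than l distinct values can never qualify
    tail = sum(counts[l - 1:])  # r_l + r_(l+1) + ... + r_m
    return counts[0] < c * tail

# Hypothetical class: 5 records with A, 3 with B, 2 with C
balanced = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
# Hypothetical skewed class: 99 records with A, 1 with B
skewed = ["A"] * 99 + ["B"]

print(satisfies_recursive_cl_diversity(balanced, c=2, l=2))  # True: 5 < 2 * (3 + 2)
print(satisfies_recursive_cl_diversity(skewed, c=2, l=2))    # False: 99 is not < 2 * 1
```

A smaller c makes the requirement stricter, because the most frequent value is allowed less dominance over the rest.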

t-Closeness Model

Definition: An equivalence class satisfies t-closeness if the distance between the distribution of sensitive attributes in the class and the distribution in the entire dataset is ≤ t.

Key Concepts:

Advantages:

Challenges:

Advanced Anonymization Techniques

1. Anatomy

2. Permutation

3. (α,k)-Anonymity
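For (α,k)-anonymity, a minimal sketch under one common reading of the model: every equivalence class holds at least k records, and no sensitive value accounts for more than a fraction α of a class. The function name and the sample classes are hypothetical.

```python
from collections import Counter

def satisfies_alpha_k_anonymity(classes, alpha, k):
    """classes: one list of sensitive values per equivalence class.

    Assumed reading: each class needs at least k records, and the most
    frequent sensitive value may cover at most a fraction alpha of it.
    """
    for values in classes:
        if len(values) < k:
            return False
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) > alpha:
            return False
    return True

# Hypothetical classes, each represented only by its sensitive values
homogeneous = [["Asthma", "Asthma"], ["Diabetes", "Heart Disease"]]
mixed = [["Asthma", "Diabetes"], ["Asthma", "Heart Disease"]]

print(satisfies_alpha_k_anonymity(homogeneous, alpha=0.5, k=2))  # False
print(satisfies_alpha_k_anonymity(mixed, alpha=0.5, k=2))        # True
```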


🔍 Detailed Explanations

Understanding Homogeneity Attacks

Scenario: Medical dataset with k=3 anonymity

Vulnerable Equivalence Class:

Age Group | Gender | ZIP Area | Disease
20-30     | Male   | 1300*    | HIV
20-30     | Male   | 1300*    | HIV  
20-30     | Male   | 1300*    | HIV

Attack: The attacker does not need to identify the target's record. Knowing only that the target is a male aged 20-30 from ZIP 1300* in this dataset is enough to conclude that he has HIV.

l-Diversity Solution (l=2):

Age Group | Gender | ZIP Area | Disease
20-30     | Male   | 1300*    | HIV
20-30     | Male   | 1300*    | Diabetes
20-30     | Male   | 1300*    | Flu

The group now contains three distinct diseases, which satisfies l=2 (and even l=3), so an attacker can no longer infer the disease with certainty.
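The check behind this example can be sketched in a few lines of Python. The function names are illustrative, and each equivalence class is represented only by its list of sensitive values.

```python
def distinct_values(sensitive_values):
    """Number of distinct sensitive values in one equivalence class."""
    return len(set(sensitive_values))

def satisfies_distinct_l_diversity(sensitive_values, l):
    """True if the class contains at least l distinct sensitive values."""
    return distinct_values(sensitive_values) >= l

# Disease columns of the two equivalence classes shown above
vulnerable = ["HIV", "HIV", "HIV"]
diverse = ["HIV", "Diabetes", "Flu"]

print(satisfies_distinct_l_diversity(vulnerable, 2))  # False
print(satisfies_distinct_l_diversity(diverse, 2))     # True
```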

Background Knowledge Attack Example

Published k-Anonymous Data:

Age Group | Gender | ZIP   | Disease
20-25     | Female | 130** | Heart Disease
20-25     | Female | 130** | Diabetes
20-25     | Female | 130** | Flu

Attacker’s Background Knowledge:

Attack: The attacker knows Alice is in this group and uses background knowledge to rule out diabetes. That leaves a 50% chance each for heart disease and flu, up from about 33% per disease, which is a significant privacy loss.
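To make the effect of elimination concrete, the following sketch computes the attacker's belief over the sensitive values of a class before and after some values are ruled out. It assumes the attacker treats every remaining record as equally likely; the `posterior` helper is hypothetical.

```python
from collections import Counter
from fractions import Fraction

def posterior(sensitive_values, ruled_out=()):
    """Attacker's belief over the sensitive values of one equivalence class
    after discarding the values excluded by background knowledge."""
    remaining = [v for v in sensitive_values if v not in ruled_out]
    counts = Counter(remaining)
    total = sum(counts.values())
    return {value: Fraction(n, total) for value, n in counts.items()}

# Disease column of the published group above
group = ["Heart Disease", "Diabetes", "Flu"]

print(posterior(group))                          # each disease: 1/3
print(posterior(group, ruled_out={"Diabetes"}))  # heart disease and flu: 1/2 each
```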

Entropy l-Diversity Calculation

Example Equivalence Class:

Entropy Calculation:

Entropy = -(0.5×log₂(0.5) + 0.25×log₂(0.25) + 0.25×log₂(0.25))
        = -(0.5×(-1) + 0.25×(-2) + 0.25×(-2))
        = -(-0.5 - 0.5 - 0.5)
        = 1.5

For l=2, the threshold is log₂(2) = 1. Since 1.5 > 1, this class satisfies entropy 2-diversity.
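The same calculation can be reproduced with a short Python sketch. The value labels "A", "B" and "C" are placeholders chosen to give the 50% / 25% / 25% split used in the calculation, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy_bits(sensitive_values):
    """Shannon entropy (base 2) of the sensitive values in one class."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def satisfies_entropy_l_diversity(sensitive_values, l):
    """Entropy l-diversity: entropy of the class must be at least log2(l).
    The small tolerance guards against floating-point rounding."""
    return entropy_bits(sensitive_values) >= math.log2(l) - 1e-12

example = ["A", "A", "B", "C"]   # placeholder labels: 50% / 25% / 25%
skewed = ["A"] * 99 + ["B"]      # 99 records with one value, 1 with another

print(entropy_bits(example))                      # 1.5
print(satisfies_entropy_l_diversity(example, 2))  # True
print(satisfies_entropy_l_diversity(skewed, 2))   # False
```

The skewed class has two distinct values, so it passes distinct 2-diversity, yet its entropy is far below 1 bit and it fails the entropy test.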


💡 Practical Examples

Example 1: Implementing l-Diversity

Original Data:

Patient | Age | Gender | ZIP   | Salary | Disease
A       | 23  | M      | 02139 | 45K    | Asthma
B       | 24  | M      | 02139 | 47K    | Asthma
C       | 25  | F      | 02142 | 55K    | Diabetes
D       | 26  | F      | 02142 | 57K    | Heart Disease

After 2-Anonymity (from Week 2):

Group | Age   | Gender | ZIP   | Salary | Disease
1     | 20-25 | M      | 0213* | 46K    | Asthma
1     | 20-25 | M      | 0213* | 46K    | Asthma
2     | 25-30 | F      | 0214* | 56K    | Diabetes
2     | 25-30 | F      | 0214* | 56K    | Heart Disease

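Problem: Group 1 is vulnerable to a homogeneity attack, because both of its records have Asthma.

2-Diversity Solution: Regroup the records so that each group mixes patients with different diseases (A with C, B with D). This requires coarser quasi-identifier values: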

Group | Age   | Gender | ZIP   | Salary   | Disease
1     | 20-30 | *      | 021** | 45-55K   | Asthma
1     | 20-30 | *      | 021** | 45-55K   | Diabetes
2     | 20-30 | *      | 021** | 47-57K   | Asthma
2     | 20-30 | *      | 021** | 47-57K   | Heart Disease
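Both published tables can be checked mechanically. The sketch below groups rows by their quasi-identifier values and reports the smallest class size (k) and the smallest number of distinct diseases per class (l). Treating Age, Gender, ZIP and Salary together as the quasi-identifier is an assumption made for this illustration, and the helper names are hypothetical.

```python
from collections import defaultdict

COLUMNS = ("Age", "Gender", "ZIP", "Salary", "Disease")
QUASI_IDENTIFIERS = ("Age", "Gender", "ZIP", "Salary")  # assumption for this sketch

def make_rows(*records):
    """Turn tuples of cell values into dictionaries keyed by column name."""
    return [dict(zip(COLUMNS, record)) for record in records]

def k_and_l(rows, quasi_identifiers, sensitive):
    """Return (k, l): the smallest equivalence-class size and the smallest
    number of distinct sensitive values found in any class."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row[sensitive])
    k = min(len(values) for values in classes.values())
    l = min(len(set(values)) for values in classes.values())
    return k, l

two_anonymous = make_rows(
    ("20-25", "M", "0213*", "46K", "Asthma"),
    ("20-25", "M", "0213*", "46K", "Asthma"),
    ("25-30", "F", "0214*", "56K", "Diabetes"),
    ("25-30", "F", "0214*", "56K", "Heart Disease"),
)
two_diverse = make_rows(
    ("20-30", "*", "021**", "45-55K", "Asthma"),
    ("20-30", "*", "021**", "45-55K", "Diabetes"),
    ("20-30", "*", "021**", "47-57K", "Asthma"),
    ("20-30", "*", "021**", "47-57K", "Heart Disease"),
)

print(k_and_l(two_anonymous, QUASI_IDENTIFIERS, "Disease"))  # (2, 1)
print(k_and_l(two_diverse, QUASI_IDENTIFIERS, "Disease"))    # (2, 2)
```

Both tables are 2-anonymous, but only the regrouped one reaches l = 2.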

Example 2: t-Closeness Analysis

Global Disease Distribution:

Equivalence Class Distribution:

Earth Mover’s Distance Calculation:

If t = 0.3, this class violates t-closeness (0.4 > 0.3)
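For readers who want to experiment with such a check, here is a minimal sketch of the Earth Mover's Distance for an ordered sensitive attribute, where the ground distance between positions i and j is |i - j|. The distributions are hypothetical (they are not the figures of this example), and imposing an order on categorical values such as diseases is itself an assumption.

```python
from itertools import accumulate

def ordered_emd(class_dist, global_dist, normalize=False):
    """EMD between two distributions over the same ordered values.

    With ground distance |i - j|, the minimal transport cost equals the sum
    of absolute cumulative differences. If normalize is True, the result is
    divided by (m - 1) so that it lies between 0 and 1.
    """
    if len(class_dist) != len(global_dist):
        raise ValueError("distributions must cover the same values")
    diffs = [p - q for p, q in zip(class_dist, global_dist)]
    total = sum(abs(c) for c in accumulate(diffs))
    if normalize and len(diffs) > 1:
        total /= len(diffs) - 1
    return total

# Hypothetical distributions over three ordered values
global_dist = [0.4, 0.3, 0.3]
class_dist = [0.7, 0.2, 0.1]
t = 0.3

distance = ordered_emd(class_dist, global_dist)
print(round(distance, 3))  # 0.5
print(distance <= t)       # False: this class would violate 0.3-closeness
```

Whether the distance is normalized matters when comparing it with a threshold t, so the same convention should be used for both.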

Example 3: Utility Impact Comparison

Dataset: 10,000 patient records

Anonymization Results:

Method          | Groups | Avg Group Size | Info Loss | Privacy Level
k=5             | 500    | 20            | Low       | Basic
(5,2)-diversity | 800    | 12.5          | Medium    | Better
0.2-closeness   | 1200   | 8.3           | High      | Strongest

Analysis:


❓ Self-Assessment Questions

Question 1: Explain the homogeneity attack and how l-diversity addresses it.

Answer:

Homogeneity Attack: Occurs when all records in a k-anonymous equivalence class have the same sensitive attribute value. Even though individual records cannot be re-identified, an attacker can infer the sensitive value for anyone in that group.

Example: If all records in a group have "Cancer" as the disease, knowing someone is in that group reveals their disease.

l-Diversity Solution: Requires each equivalence class to have at least l distinct values for sensitive attributes. This ensures that even if someone is known to be in a particular group, there are multiple possible sensitive values, preventing certain inference. For l=3, each group must have at least 3 different diseases, so no single disease can be inferred with certainty.

Question 2: What is the difference between distinct l-diversity and entropy l-diversity?

Answer:

Distinct l-Diversity:
- Simply requires at least l different sensitive values in each equivalence class
- Doesn't consider the distribution of these values
- Vulnerable to skewed distributions (e.g., 99 records with Disease A, 1 with Disease B satisfies 2-diversity but offers little protection)

Entropy l-Diversity:
- Uses information entropy to measure diversity: Entropy ≥ log(l)
- Considers both the number of distinct values AND their distribution
- Provides better protection against probabilistic inference
- More robust against skewness attacks
- Example: A group with uniform distribution of l values has maximum entropy = log(l)

Entropy l-diversity is stronger because it prevents attackers from making confident probabilistic inferences based on value frequencies.

Question 3: A dataset has the global distribution: Disease A (60%), Disease B (30%), Disease C (10%). An equivalence class has: Disease A (80%), Disease B (20%), Disease C (0%). Calculate the Earth Mover's Distance.

Answer:

Earth Mover's Distance (EMD) Calculation: We need to transform the class distribution to match the global distribution.

Current: A(80%), B(20%), C(0%)
Target: A(60%), B(30%), C(10%)

A holds 20% more mass than the target, while B and C are each 10% short, so only 20% in total has to move.

Transformations needed:
1. Move 10% from A to B: 10% × distance(A,B) = 10% × 1 = 0.10
2. Move 10% from A to C: 10% × distance(A,C) = 10% × 2 = 0.20

Total EMD = 0.10 + 0.20 = 0.30

Note: This assumes the diseases are ordered A, B, C with ground distances adjacent=1 and non-adjacent=2; distances may instead be based on semantic similarity. If the ground distances are normalized so that the largest is 1 (dividing by 2 here), the result is 0.15. An EMD of 0.30 on the unnormalized scale indicates a substantial deviation from the global distribution.

Question 4: Why might t-closeness be too restrictive for practical data publishing?

Answer:

t-Closeness Limitations:
1. Over-Suppression: Requires distributions to closely match global patterns, often leading to excessive generalization or record suppression
2. Utility Loss: Many real-world analysis tasks require preserving local patterns that t-closeness deliberately obscures
3. Implementation Complexity: Calculating Earth Mover's Distance and choosing appropriate t values is computationally expensive and requires domain expertise
4. Semantic Issues: May not make sense for all attribute types (e.g., forcing uniform distribution of rare diseases)
5. Group Size Requirements: Often requires very large equivalence classes to achieve acceptable t values, reducing dataset granularity
6. Analysis Limitations: Prevents legitimate research that depends on understanding sub-population differences

Example: Medical research studying disease prevalence in specific demographics becomes impossible if all groups must mirror global disease distributions.

Question 5: Design a scenario where k-anonymity provides sufficient privacy protection and another where l-diversity is necessary.

Answer:

Scenario 1 - k-Anonymity Sufficient:

Context: Survey data about shopping preferences
Data: Age, Gender, ZIP → Favorite Store (Amazon, Walmart, Target, Best Buy, etc.)

Why k-anonymity works:
- Many diverse store preferences exist
- No sensitive health/financial information
- Natural diversity in consumer choices
- Low stakes if preference is inferred

k=5 provides adequate protection for this commercial use case.

Scenario 2 - l-Diversity Necessary:

Context: Medical insurance claims data
Data: Age, Gender, ZIP → Disease (HIV, Cancer, Depression, Common Cold, etc.)

Why l-diversity needed:
- High sensitivity of medical information
- Risk of discrimination/stigma
- Potential for homogeneous groups (e.g., all cancer patients in oncology clinic area)
- Background knowledge attacks possible (family/friends know general health status)

l=3 diversity ensures each group has at least 3 different conditions, preventing certain medical inference even with background knowledge.

📚 Additional Resources

Core Papers

Implementation Guides

Tools and Software

