
Week 4: Privacy-preserving Data Publishing III

⬅️ Week 3 | Main | Week 5 ➡️

🎯 Learning Goals

By the end of this week, you should understand:

- The definition of ε-differential privacy and how it differs from syntactic anonymization models such as k-anonymity
- Global sensitivity and how it calibrates noise for count, sum, and average queries
- The Laplace and Gaussian mechanisms and when each is appropriate
- Privacy budgets, sequential composition, and strategies for allocating ε across queries


📖 Theoretical Content

Introduction to Differential Privacy

Motivation: Traditional anonymization methods (k-anonymity, l-diversity, t-closeness) are vulnerable to auxiliary information attacks. Differential privacy provides mathematical guarantees regardless of what background knowledge an attacker might have.

Core Philosophy: The presence or absence of any individual’s data should not significantly affect the output of any analysis.

Formal Definition (ε-Differential Privacy): A randomized algorithm M satisfies ε-differential privacy if for all datasets D₁ and D₂ that differ by at most one record, and for all possible outputs S:

Pr[M(D₁) ∈ S] ≤ exp(ε) × Pr[M(D₂) ∈ S]

Key Concepts

1. Privacy Parameter (ε): Controls the strength of the guarantee. Smaller ε forces the output distributions on adjacent datasets to be closer together, meaning stronger privacy and more noise.

2. Adjacent Datasets: Two datasets D₁ and D₂ that differ in at most one record, i.e., one individual's data is added or removed.

3. Randomized Mechanism: The algorithm M must be randomized; a deterministic algorithm cannot satisfy the definition except in trivial cases.

Global Sensitivity

Definition: The maximum change in query output when one record is added or removed from any dataset.

For Count Queries: GS = 1 (adding or removing one record changes the count by at most 1)

For Sum Queries: GS = the maximum possible individual contribution (values must be clipped to a known bound)

For Average Queries: More complex; a common approach is to compute a noisy sum and a noisy count separately and divide (see the sketch after this list)
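
In practice, the sum sensitivity only holds if each individual's contribution is clipped to a known bound before the query is computed. Below is a minimal sketch of that preprocessing step, assuming NumPy and an illustrative bound of 50,000 (not a value prescribed by the course material):

import numpy as np

def clip_values(values, lower, upper):
    # Clamp every contribution into [lower, upper]; with lower = 0, adding or
    # removing one record changes the sum by at most `upper`.
    return np.clip(values, lower, upper)

incomes = np.array([42_000, 310_000, 87_500, 5_000])
clipped = clip_values(incomes, 0, 50_000)  # sum sensitivity is now 50,000
print(clipped.sum())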

The Laplace Mechanism

For Numeric Queries: Add noise drawn from a Laplace distribution with scale b = GS/ε

Laplace Distribution: Density f(x | b) = (1/(2b)) · exp(−|x|/b), centered at 0, with variance 2b² (standard deviation √2 · b)

Algorithm:

  1. Compute true query result f(D)
  2. Generate noise η ~ Laplace(GS/ε)
  3. Return f(D) + η

Example - Count Query:
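
A minimal sketch of the algorithm above applied to a count query, assuming NumPy for noise generation; the data and threshold are illustrative:

import numpy as np

def laplace_count(values, threshold, epsilon):
    # A count query has global sensitivity 1, so the noise scale is 1/epsilon.
    true_count = sum(1 for v in values if v > threshold)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 35, 47, 52, 61, 68, 71]
print(laplace_count(ages, threshold=50, epsilon=0.5))  # true count 4 plus Laplace(scale=2) noise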

The Gaussian Mechanism

For (ε, δ)-Differential Privacy: A relaxed version where privacy can fail with probability δ

Noise Scale: σ = √(2 ln(1.25/δ)) × GS/ε

When to Use:

- When (ε, δ)-differential privacy is acceptable, i.e., a very small failure probability δ can be tolerated
- When many queries must be composed: Gaussian noise composes more gracefully under advanced composition
- When sensitivity is naturally measured in the L2 norm, e.g., for vector-valued or high-dimensional queries
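
A minimal sketch of the Gaussian mechanism using the noise-scale formula above (valid for ε ≤ 1); the query value, sensitivity, and parameters are illustrative assumptions:

import numpy as np

def gaussian_mechanism(true_value, sensitivity, epsilon, delta):
    # sigma = sqrt(2 ln(1.25/delta)) * GS / epsilon, as given above.
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / epsilon
    return true_value + np.random.normal(loc=0.0, scale=sigma)

print(gaussian_mechanism(true_value=847, sensitivity=1, epsilon=0.5, delta=1e-5))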

Privacy Budget and Composition

Sequential Composition: If you run k algorithms with privacy parameters ε₁, ε₂, …, εₖ, the total privacy cost is: ε_total = ε₁ + ε₂ + … + εₖ

Privacy Budget Management: Fix a total budget ε_total in advance, deduct the ε spent by every query, and refuse further queries once the budget is exhausted.

Advanced Composition: For k mechanisms that are each (ε, δ)-differentially private, the composition satisfies (ε′, kδ + δ′)-differential privacy with ε′ = ε·√(2k ln(1/δ′)) + kε(e^ε − 1), so for small ε the privacy loss grows roughly with √k instead of k.
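
A minimal sketch of sequential-composition bookkeeping; the PrivacyBudget class is an illustrative helper, not part of any particular library:

class PrivacyBudget:
    """Tracks cumulative epsilon under sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse the query if it would exceed the total budget.
        if self.spent + epsilon > self.total:
            raise RuntimeError("Privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.3)   # e.g., a first study
budget.charge(0.4)   # a second study
budget.charge(0.3)   # a third study
# budget.charge(0.1) would now raise: the budget of 1.0 is fully spent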


🔍 Detailed Explanations

Understanding the ε Parameter

Intuitive Interpretation: Smaller ε means an observer learns less about any single individual from the output; larger ε permits more accurate answers but weaker privacy. Values roughly between 0.1 and a few units of ε are common in practice.

Mathematical Meaning: exp(ε) represents the maximum ratio between probabilities of any outcome for adjacent datasets.

Example with ε = 1: For any output, the probability of observing it changes by at most a factor of e ≈ 2.72 depending on whether a particular individual's record is present, so the output provides only limited evidence about that individual's participation.
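
A small numeric check of this guarantee for the Laplace mechanism on a count query; the true counts and observed output below are illustrative:

import numpy as np

def laplace_density(x, loc, scale):
    # Density of the Laplace distribution used by the mechanism.
    return np.exp(-abs(x - loc) / scale) / (2 * scale)

epsilon = 1.0            # count query, sensitivity 1, noise scale 1/epsilon
output = 846.3           # some observed noisy answer
with_person = laplace_density(output, loc=847, scale=1 / epsilon)
without_person = laplace_density(output, loc=846, scale=1 / epsilon)
print(with_person / without_person)  # always between exp(-1) and exp(1) ≈ 2.72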

Calibrating Noise for Different Queries

Count Query Example:

import numpy as np

def dp_count(data, predicate, epsilon):
    # Count queries have global sensitivity 1, so the Laplace scale is 1/epsilon.
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(0, 1 / epsilon)
    return true_count + noise

Sum Query Example:

def dp_sum(data, attribute, max_value, epsilon):
    # Clip each contribution to [0, max_value] so the sensitivity bound holds.
    true_sum = sum(min(max(record[attribute], 0), max_value) for record in data)
    noise = np.random.laplace(0, max_value / epsilon)  # GS = max_value
    return true_sum + noise

Average Query Example:

def dp_average(data, attribute, max_value, epsilon):
    # Split the budget between the sum and the count (sequential composition).
    dp_sum_result = dp_sum(data, attribute, max_value, epsilon / 2)
    dp_count_result = dp_count(data, lambda x: True, epsilon / 2)
    # Note: for small datasets the noisy count can be close to zero,
    # which makes the ratio unstable.
    return dp_sum_result / dp_count_result
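
A brief usage sketch of the three helpers above; the records and values are illustrative, and with such a tiny dataset the noise will dominate the results:

records = [{"age": 34}, {"age": 51}, {"age": 29}, {"age": 62}]

noisy_count = dp_count(records, lambda r: r["age"] >= 50, epsilon=0.5)
noisy_avg = dp_average(records, "age", max_value=120, epsilon=0.5)
print(noisy_count, noisy_avg)  # noisy versions of 2 and 44.0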

Privacy Budget Allocation Strategies

Uniform Allocation: Split the total budget evenly, ε_i = ε_total / k for each of the k planned queries. Simple, but every query receives the same amount of noise regardless of its importance.

Adaptive Allocation: Give more budget to the queries whose accuracy matters most and less to exploratory or low-stakes queries, while keeping the sum within ε_total.

Hierarchical Allocation: Organize related queries (e.g., range or histogram queries) in a tree and split the budget across levels, so that many related queries can be answered from a small number of noisy aggregates. (See the sketch below for the first two strategies.)
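
A minimal sketch contrasting uniform and weighted (adaptive) allocation; the weights are illustrative assumptions:

def uniform_allocation(total_epsilon, num_queries):
    # Every query receives the same share of the budget.
    return [total_epsilon / num_queries] * num_queries

def weighted_allocation(total_epsilon, weights):
    # Queries with larger weights receive a larger share of the budget.
    total_weight = sum(weights)
    return [total_epsilon * w / total_weight for w in weights]

print(uniform_allocation(1.0, 10))             # ten queries at epsilon = 0.1 each
print(weighted_allocation(1.0, [3, 3, 1, 1]))  # [0.375, 0.375, 0.125, 0.125]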


💡 Practical Examples

Example 1: Census Data Release

Scenario: Releasing population statistics while protecting individual privacy

Traditional Approach:

Query: "How many people in ZIP 02139 earn >$100K?"
Answer: 847 people

Differential Privacy Approach (ε = 1):

import numpy as np

true_count = 847
sensitivity = 1  # Adding/removing one person changes the count by 1
epsilon = 1.0
noise = np.random.laplace(0, sensitivity / epsilon)  # scale = 1
noisy_count = true_count + noise

# Possible outputs: 846.3, 849.7, 845.1, etc.

Privacy Guarantee: Whether or not any specific individual is included in the dataset, the probability of any particular answer changes by at most a factor of e ≈ 2.72.

Example 2: Medical Research Query

Scenario: Researcher wants to know average age of patients with diabetes

Setup: Ages are assumed to lie in [0, 120], and the researcher has a total privacy budget of ε = 0.5, split evenly between the count and the sum (0.25 each).

Implementation:

# Step 1: Count diabetic patients (epsilon = 0.25)
diabetic_count = dp_count(patients,
                          lambda p: p["diagnosis"] == "diabetes",
                          epsilon=0.25)

# Step 2: Sum ages of diabetic patients (epsilon = 0.25)
diabetic_patients = [p for p in patients if p["diagnosis"] == "diabetes"]
diabetic_age_sum = dp_sum(diabetic_patients,
                          "age",
                          max_value=120,
                          epsilon=0.25)

# Step 3: Compute the average (total budget spent: 0.5)
average_age = diabetic_age_sum / diabetic_count

Example 3: Location Analytics

Scenario: City planning department analyzing foot traffic

Query: “How many people visited downtown area between 2-4 PM?”

Challenges: The same person may appear in several time slots, and every published count consumes part of the privacy budget, so the budget has to be divided across the day.

Solution: Split the total budget evenly across the time slots and answer each slot with its own noisy count:

epsilon_total = 1.0
time_slots = 12  # two-hour periods in a day
epsilon_per_slot = epsilon_total / time_slots  # sequential composition across slots

for slot in range(time_slots):
    traffic_count = dp_count(location_data,
                             lambda record: is_downtown(record, slot),
                             epsilon_per_slot)
    publish_result(slot, traffic_count)

❓ Self-Assessment Questions

Question 1: What is the fundamental difference between differential privacy and traditional anonymization methods like k-anonymity?

**Answer:**

**Traditional Methods (k-anonymity, l-diversity, t-closeness):**
- Syntactic privacy: modify the data structure to prevent re-identification
- Vulnerable to auxiliary information attacks
- No mathematical guarantees about privacy protection
- Protection depends on assumptions about attacker knowledge

**Differential Privacy:**
- Semantic privacy: provides mathematical guarantees regardless of auxiliary information
- Privacy protection doesn't depend on what the attacker knows
- Quantifiable privacy loss through the ε parameter
- Robust against any background knowledge attack
- Trade-off: requires adding random noise, reducing utility

**Key Insight:** Differential privacy protects against the **worst-case** attacker with unlimited background knowledge, while traditional methods protect against specific attack models.
Question 2: Calculate the noise scale for a sum query where individual contributions are bounded by 50,000 and ε = 0.5. What's the standard deviation of the added noise?

**Answer:**

**Given:**
- Global Sensitivity (GS) = 50,000 (max individual contribution)
- ε = 0.5

**Laplace Mechanism:**
- Noise scale: b = GS/ε = 50,000/0.5 = 100,000
- Standard deviation: σ = √2 × b = √2 × 100,000 ≈ 141,421

**Interpretation:** The added noise has a standard deviation of about 141,421, which is quite large compared to typical individual contributions. This illustrates the trade-off between privacy (small ε) and utility: stronger privacy requires more noise. For context, roughly 76% of Laplace noise values will be within ±141,421 of zero, and about 94% within ±282,842.
Question 3: You have a privacy budget of ε = 1 and need to answer 10 queries. How would you allocate the budget, and what are the implications?

**Answer:**

**Uniform Allocation:**
- Each query gets ε/10 = 0.1
- Very strong privacy per query but significant noise
- Total privacy cost: 10 × 0.1 = 1.0

**Non-Uniform Allocation Examples:**
1. **Priority-based:**
   - Important queries: ε = 0.25 each (2 queries = 0.5)
   - Regular queries: ε = 0.0625 each (8 queries = 0.5)
   - Total: 0.5 + 0.5 = 1.0
2. **Hierarchical:**
   - Use tree-based mechanisms for related queries
   - Can answer more queries with the same total budget

**Implications:**
- **High noise:** Each query has substantial noise with ε = 0.1
- **Limited utility:** Individual query results may be unreliable
- **No additional queries:** Budget exhausted after the planned queries
- **Strategic planning:** Need to prioritize the most important analyses

**Alternative:** Consider using (ε, δ)-differential privacy with δ = 10⁻⁵ to reduce noise while maintaining strong privacy.
Question 4: Explain why differential privacy is "composable" and why this matters for practical deployments.

**Answer:**

**Composability** means that when multiple differentially private algorithms are applied to the same dataset, the total privacy loss can be calculated and bounded.

**Sequential Composition:**
- Run k algorithms with privacy parameters ε₁, ε₂, ..., εₖ
- Total privacy cost: ε_total = ε₁ + ε₂ + ... + εₖ
- Privacy degrades additively

**Why This Matters:**
1. **Long-term Privacy Accounting:** Organizations can track cumulative privacy loss over time
2. **Budget Management:** Can allocate privacy budget across different users/queries
3. **Formal Guarantees:** Mathematical proof that privacy doesn't degrade faster than a linear rate
4. **System Design:** Enables building complex privacy-preserving systems with known guarantees

**Example:** A hospital database used for research:
- Year 1: ε₁ = 0.3 for a cancer study
- Year 2: ε₂ = 0.4 for a diabetes study
- Year 3: ε₃ = 0.3 for a heart disease study
- Total privacy loss: ε_total = 1.0

After this, the dataset has "exhausted" its privacy budget and no more queries should be allowed without additional privacy protections.
Question 5: Compare the privacy-utility trade-offs of ε = 0.1, ε = 1, and ε = 10 for a count query with 1000 true results.

**Answer:**

**For a Count Query (GS = 1):**

**ε = 0.1 (Very Strong Privacy):**
- Noise scale: b = 1/0.1 = 10
- Standard deviation: ≈ 14.14
- Typical results: 1000 ± 30 (95% confidence)
- **Privacy:** Excellent - outcomes nearly identical with/without any individual
- **Utility:** Poor - high relative noise (±3%)

**ε = 1 (Balanced):**
- Noise scale: b = 1/1 = 1
- Standard deviation: ≈ 1.41
- Typical results: 1000 ± 3 (95% confidence)
- **Privacy:** Good - max probability ratio of e ≈ 2.72
- **Utility:** Good - low relative noise (±0.3%)

**ε = 10 (Weak Privacy):**
- Noise scale: b = 1/10 = 0.1
- Standard deviation: ≈ 0.14
- Typical results: 1000 ± 0.3 (95% confidence)
- **Privacy:** Weak - max probability ratio of e¹⁰ ≈ 22,026
- **Utility:** Excellent - minimal noise (±0.03%)

**Recommendation:** ε = 1 offers the best balance for most applications.

📚 Additional Resources

Foundational Papers

Practical Guides

Tools and Libraries

Real-World Applications


⬅️ Week 3 | Main | Week 5 ➡️