Week 4: Privacy-preserving Data Publishing III
🎯 Learning Goals
By the end of this week, you should understand:
- Differential privacy: fundamental concepts and mathematical foundations
- Privacy budget and composition theorems
- Mechanisms for achieving differential privacy (Laplace, Gaussian)
- Practical applications and limitations of differential privacy
📖 Theoretical Content
Introduction to Differential Privacy
Motivation: Traditional anonymization methods (k-anonymity, l-diversity, t-closeness) are vulnerable to auxiliary information attacks. Differential privacy provides mathematical guarantees regardless of what background knowledge an attacker might have.
Core Philosophy: The presence or absence of any individual’s data should not significantly affect the output of any analysis.
Formal Definition (ε-Differential Privacy): A randomized algorithm M satisfies ε-differential privacy if for all datasets D₁ and D₂ that differ by at most one record, and for all possible outputs S:
Pr[M(D₁) ∈ S] ≤ exp(ε) × Pr[M(D₂) ∈ S]
Key Concepts
1. Privacy Parameter (ε)
- Small ε: Strong privacy protection (output distributions on adjacent datasets are nearly identical)
- Large ε: Weaker privacy but higher utility
- Common values: ε ∈ [0.01, 10]
- Rule of thumb: ε ≤ 1 provides meaningful privacy
2. Adjacent Datasets
- Two datasets that differ in exactly one individual’s record
- Captures the worst-case privacy loss from participation
- Includes both addition and removal of records
3. Randomized Mechanism
- Must add carefully calibrated noise
- Noise amount depends on ε and query sensitivity
- Larger ε allows less noise (better utility)
Global Sensitivity
Definition: The maximum change in query output when one record is added or removed from any dataset.
For Count Queries: GS = 1
- Adding/removing one person changes count by at most 1
For Sum Queries: GS = max possible individual contribution
- If salary range is [0, 200K], then GS = 200,000
For Average Queries: More complex
- Depends on dataset size and value range
- Often requires bounding (clipping) techniques, as sketched below
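In practice, sensitivity is made known by clipping each individual's contribution to a fixed range before querying. A minimal sketch, assuming NumPy; the salary figures and bounds are illustrative assumptions, not real data:

```python
import numpy as np

def clip_contributions(values, lo, hi):
    """Bound each individual's contribution to [lo, hi], so a sum over
    the clipped values has a known global sensitivity of max(|lo|, |hi|)."""
    return np.clip(values, lo, hi)

# Illustrative data (assumed, not from a real dataset):
salaries = np.array([45_000, 310_000, 72_000])
clipped = clip_contributions(salaries, 0, 200_000)  # GS of the sum is now 200,000
```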
The Laplace Mechanism
For Numeric Queries: Add noise drawn from Laplace distribution with scale b = GS/ε
Laplace Distribution:
- PDF: f(x) = (1/(2b)) × exp(−|x|/b)
- Mean: 0 (noise is unbiased)
- Variance: 2b²
Algorithm (a minimal sketch follows below):
1. Compute the true query result f(D)
2. Sample noise η ~ Laplace(0, GS/ε)
3. Return f(D) + η
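A minimal implementation sketch of these steps, assuming NumPy (`laplace_mechanism` is an illustrative helper name, not a standard API):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Return an ε-differentially private estimate of true_value.

    The scale b = sensitivity / epsilon calibrates the noise so that
    output distributions on adjacent datasets stay within a factor
    of exp(epsilon) of each other.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise
```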
Example - Count Query:
- True count: 1,247 people
- ε = 1, GS = 1
- Noise scale: b = 1/1 = 1
- Sample noise: η = 2.3
- Output: 1,247 + 2.3 = 1,249.3
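Using the sketch above, this worked example becomes a one-liner (the sampled noise, and hence the output, differs on every run):

```python
noisy_count = laplace_mechanism(true_value=1247, sensitivity=1, epsilon=1.0)
# e.g. 1249.3; rounding the result is harmless, since post-processing
# never weakens a differential privacy guarantee
```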
The Gaussian Mechanism
For (ε, δ)-Differential Privacy: A relaxation in which the pure ε guarantee may fail with probability at most δ
Noise Scale: σ = √(2 ln(1.25/δ)) × GS/ε (this classical calibration holds for ε < 1)
When to Use:
- When pure ε-DP is too restrictive
- For complex queries requiring less noise
- δ is typically very small (e.g., 10⁻⁵)
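A corresponding sketch for the Gaussian mechanism, again assuming NumPy and using the classical calibration above (which holds for ε < 1):

```python
import numpy as np

def gaussian_mechanism(true_value, sensitivity, epsilon, delta):
    """Return an (ε, δ)-differentially private estimate of true_value."""
    # Classical calibration: sigma = sqrt(2 ln(1.25/delta)) * GS / epsilon
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / epsilon
    return true_value + np.random.normal(loc=0.0, scale=sigma)
```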
Privacy Budget and Composition
Sequential Composition: If you run k algorithms with privacy parameters ε₁, ε₂, …, εₖ, the total privacy cost is: ε_total = ε₁ + ε₂ + … + εₖ
Privacy Budget Management:
- Start with total budget (e.g., ε = 1)
- Allocate portions to different queries
- Once budget is exhausted, no more queries allowed
- Critical for long-term data usage
Other Composition Properties:
- Parallel composition: Queries over disjoint subsets of the data each cost only their own ε; the total cost is the maximum εᵢ, not the sum
- Group privacy: The guarantee degrades gracefully for correlated individuals: a group of k individuals is protected at level kε
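Sequential composition suggests a simple accounting pattern. A minimal sketch (`PrivacyBudget` is an illustrative class, not a library API):

```python
class PrivacyBudget:
    """Track cumulative privacy loss under sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Deduct epsilon from the budget; refuse once it would be exceeded."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.3)  # first analysis
budget.charge(0.4)  # second analysis; 0.3 remains for future queries
```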
🔍 Detailed Explanations
Understanding the ε Parameter
Intuitive Interpretation:
- ε = 0: Perfect privacy (but useless utility)
- ε = 0.1: Very strong privacy, significant noise
- ε = 1: Reasonable privacy-utility balance
- ε = 10: Weak privacy, minimal noise
Mathematical Meaning: exp(ε) represents the maximum ratio between probabilities of any outcome for adjacent datasets.
Example with ε = 1:
- exp(1) ≈ 2.72
- Any outcome can be at most 2.72× more likely with/without your data
- Provides plausible deniability
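This bound can be checked empirically. A small simulation, assuming NumPy: run a noisy count (true value 100) on two adjacent datasets many times and compare how often each exceeds a threshold; the estimated ratio sits near exp(1) ≈ 2.72, the worst case for ε = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, trials = 1.0, 100_000

# Adjacent datasets: the true counts differ by exactly one record.
with_me = 100 + rng.laplace(0, 1 / epsilon, trials)    # individual included
without_me = 99 + rng.laplace(0, 1 / epsilon, trials)  # individual removed

p_with = np.mean(with_me > 100.5)
p_without = np.mean(without_me > 100.5)
print(p_with / p_without)  # ≈ e ≈ 2.72, up to sampling error
```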
Calibrating Noise for Different Queries
Count Query Example:
```python
import numpy as np

def dp_count(data, predicate, epsilon):
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(0, 1 / epsilon)  # GS = 1 for a count
    return true_count + noise
```
Sum Query Example:
```python
def dp_sum(data, attribute, max_value, epsilon):
    # Assumes record[attribute] has been bounded to [0, max_value]
    true_sum = sum(record[attribute] for record in data)
    noise = np.random.laplace(0, max_value / epsilon)  # GS = max_value
    return true_sum + noise
```
Average Query Example:
```python
def dp_average(data, attribute, max_value, epsilon):
    # Split the budget: epsilon/2 for the sum, epsilon/2 for the count
    # (sequential composition gives a total cost of epsilon).
    dp_sum_result = dp_sum(data, attribute, max_value, epsilon / 2)
    dp_count_result = dp_count(data, lambda record: True, epsilon / 2)
    # Real code would clamp the noisy count away from zero before dividing.
    return dp_sum_result / dp_count_result
```
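A quick usage sketch tying these helpers together; the records here are made-up illustrative data:

```python
records = [{"age": 34}, {"age": 51}, {"age": 29}]
print(dp_count(records, lambda r: r["age"] >= 30, epsilon=0.5))
# With so few records the noisy count in dp_average can land near zero,
# which is why real code clamps it before dividing.
print(dp_average(records, "age", max_value=120, epsilon=0.5))
```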
Privacy Budget Allocation Strategies
Uniform Allocation:
- Divide budget equally among all planned queries
- Simple but may not be optimal
Adaptive Allocation:
- Allocate more budget to more important queries
- Requires knowing query priorities in advance
Hierarchical Allocation:
- Use tree structure for range queries
- Efficient for many related queries
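A minimal sketch contrasting the first two strategies (the priority weights are illustrative assumptions):

```python
def uniform_allocation(total_epsilon, num_queries):
    """Split the budget evenly across all planned queries."""
    return [total_epsilon / num_queries] * num_queries

def weighted_allocation(total_epsilon, weights):
    """Give each query a budget share proportional to its priority weight."""
    total_weight = sum(weights)
    return [total_epsilon * w / total_weight for w in weights]

print(uniform_allocation(1.0, 10))             # ten queries at ε = 0.1 each
print(weighted_allocation(1.0, [3, 3, 3, 1]))  # ε = 0.3, 0.3, 0.3, 0.1
```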
💡 Practical Examples
Example 1: Census Data Release
Scenario: Releasing population statistics while protecting individual privacy
Traditional Approach:
Query: "How many people in ZIP 02139 earn >$100K?"
Answer: 847 people
Differential Privacy Approach (ε = 1):
```python
import numpy as np

true_count = 847
sensitivity = 1  # Adding/removing one person changes the count by at most 1
epsilon = 1.0
noise = np.random.laplace(0, sensitivity / epsilon)  # scale b = 1
noisy_count = true_count + noise
# Possible outputs: 846.3, 849.7, 845.1, etc.
```
Privacy Guarantee: Whether or not any specific individual is included in the dataset, the probability of any particular answer changes by at most a factor of e^ε = e ≈ 2.72 (since ε = 1).
Example 2: Medical Research Query
Scenario: Researcher wants to know average age of patients with diabetes
Setup:
- Dataset: 10,000 patient records
- Age range: [0, 120] years
- ε = 0.5 (strong privacy)
Implementation:
```python
# `patients` is the list of patient records from the scenario above.

# Step 1: Count diabetic patients (spends epsilon = 0.25)
diabetic_count = dp_count(patients,
                          lambda p: p["diagnosis"] == "diabetes",
                          epsilon=0.25)

# Step 2: Sum ages of diabetic patients (spends epsilon = 0.25)
diabetic_patients = [p for p in patients if p["diagnosis"] == "diabetes"]
diabetic_age_sum = dp_sum(diabetic_patients,
                          "age",
                          max_value=120,  # ages bounded to [0, 120]
                          epsilon=0.25)

# Step 3: Compute the average (post-processing, costs no extra budget)
average_age = diabetic_age_sum / diabetic_count
```
Example 3: Location Analytics
Scenario: City planning department analyzing foot traffic
Query: “How many people visited downtown area between 2-4 PM?”
Challenges:
- Location data is highly sensitive
- Need temporal analysis
- Multiple related queries planned
Solution:
```python
# location_data, is_downtown, and publish_result are placeholders
# from the scenario above.
epsilon_total = 1.0
time_slots = 12  # 2-hour periods in a day
epsilon_per_slot = epsilon_total / time_slots

# Sequential composition: the twelve queries together spend epsilon_total.
for slot in range(time_slots):
    traffic_count = dp_count(location_data,
                             lambda record: is_downtown(record, slot),
                             epsilon_per_slot)
    publish_result(slot, traffic_count)
```
❓ Self-Assessment Questions
Question 1: What is the fundamental difference between differential privacy and traditional anonymization methods like k-anonymity?

**Answer:**

**Traditional Methods (k-anonymity, l-diversity, t-closeness):**
- Syntactic privacy: modify the data's structure to prevent re-identification
- Vulnerable to auxiliary-information attacks
- No mathematical guarantees about privacy protection
- Protection depends on assumptions about attacker knowledge

**Differential Privacy:**
- Semantic privacy: provides mathematical guarantees regardless of auxiliary information
- Privacy protection doesn't depend on what the attacker knows
- Quantifiable privacy loss through the ε parameter
- Robust against any background-knowledge attack
- Trade-off: requires adding random noise, reducing utility

**Key Insight:** Differential privacy protects against the **worst-case** attacker with unlimited background knowledge, while traditional methods protect against specific attack models.

Question 2: Calculate the noise scale for a sum query where individual contributions are bounded by 50,000 and ε = 0.5. What's the standard deviation of the added noise?

**Answer:**

**Given:**
- Global sensitivity GS = 50,000 (max individual contribution)
- ε = 0.5

**Laplace Mechanism:**
- Noise scale: b = GS/ε = 50,000/0.5 = 100,000
- Standard deviation: σ = √2 × b = √2 × 100,000 ≈ 141,421

**Interpretation:** The added noise has a standard deviation of about 141,421, which is large compared to typical individual contributions. This illustrates the trade-off between privacy (small ε) and utility: stronger privacy requires more noise. For context, about 76% of Laplace noise values fall within ±141,421 of zero and about 94% within ±282,842 (Laplace tails are heavier than Gaussian, so the familiar 68/95 rule does not apply).

Question 3: You have a privacy budget of ε = 1 and need to answer 10 queries. How would you allocate the budget, and what are the implications?

**Answer:**

**Uniform Allocation:**
- Each query gets ε/10 = 0.1
- Very strong privacy per query but significant noise
- Total privacy cost: 10 × 0.1 = 1.0

**Non-Uniform Allocation Examples:**
1. **Priority-based:**
   - Important queries: ε = 0.3 each (2 queries)
   - Regular queries: ε = 0.05 each (8 queries)
   - Total: 2 × 0.3 + 8 × 0.05 = 1.0
2. **Hierarchical:**
   - Use tree-based mechanisms for related queries
   - Can answer more queries with the same total budget

**Implications:**
- **High noise:** Each query has substantial noise at ε = 0.1
- **Limited utility:** Individual query results may be unreliable
- **No additional queries:** Budget exhausted after the planned queries
- **Strategic planning:** Need to prioritize the most important analyses

**Alternative:** Consider using (ε, δ)-differential privacy with δ = 10⁻⁵ to reduce noise while maintaining strong privacy.

Question 4: Explain why differential privacy is "composable" and why this matters for practical deployments.

**Answer:**

**Composability** means that when multiple differentially private algorithms are applied to the same dataset, the total privacy loss can be calculated and bounded.

**Sequential Composition:**
- Run k algorithms with privacy parameters ε₁, ε₂, ..., εₖ
- Total privacy cost: ε_total = ε₁ + ε₂ + ... + εₖ
- Privacy degrades additively

**Why This Matters:**
1. **Long-term privacy accounting:** Organizations can track cumulative privacy loss over time
2. **Budget management:** The privacy budget can be allocated across different users and queries
3. **Formal guarantees:** Mathematical proof that privacy loss grows at most linearly in the number of queries
4. **System design:** Enables building complex privacy-preserving systems with known guarantees

**Example:** A hospital database used for research:
- Year 1: ε₁ = 0.3 for a cancer study
- Year 2: ε₂ = 0.4 for a diabetes study
- Year 3: ε₃ = 0.3 for a heart-disease study
- Total privacy loss: ε_total = 1.0

After this, the dataset has "exhausted" its privacy budget and no more queries should be allowed without additional privacy protections.

Question 5: Compare the privacy-utility trade-offs of ε = 0.1, ε = 1, and ε = 10 for a count query with 1000 true results.

**Answer:**

**For a count query (GS = 1):**

**ε = 0.1 (very strong privacy):**
- Noise scale: b = 1/0.1 = 10
- Standard deviation: ≈ 14.14
- Typical results: 1000 ± 30 (95% of draws)
- **Privacy:** Excellent; outcomes nearly identical with or without any individual
- **Utility:** Poor; high relative noise (±3%)

**ε = 1 (balanced):**
- Noise scale: b = 1/1 = 1
- Standard deviation: ≈ 1.41
- Typical results: 1000 ± 3 (95% of draws)
- **Privacy:** Good; max probability ratio of e ≈ 2.72
- **Utility:** Good; low relative noise (±0.3%)

**ε = 10 (weak privacy):**
- Noise scale: b = 1/10 = 0.1
- Standard deviation: ≈ 0.14
- Typical results: 1000 ± 0.3 (95% of draws)
- **Privacy:** Weak; max probability ratio of e¹⁰ ≈ 22,026
- **Utility:** Excellent; minimal noise (±0.03%)

**Recommendation:** ε = 1 offers the best balance for most applications.

📚 Additional Resources
Foundational Papers
- Dwork, C. (2006). “Differential Privacy”
- Dwork, C. et al. (2014). “The Algorithmic Foundations of Differential Privacy”
Practical Guides
- Wood, A. et al. (2018). “Differential Privacy: A Primer for a Non-Technical Audience”
- Hsu, J. et al. (2014). “Differential Privacy: An Economic Method for Choosing Epsilon”
Tools and Libraries
- Google’s Differential Privacy Library: Open-source implementations
- OpenDP: Modular differential privacy library
- IBM’s Diffprivlib: Python library for differential privacy
- Microsoft’s SmartNoise: Differential privacy platform
Real-World Applications
- U.S. Census 2020: First major use of differential privacy for official statistics
- Apple’s Local Differential Privacy: Used in iOS for telemetry
- Google’s RAPPOR: Privacy-preserving analytics