
# Week 8: SMC Application to Private Data Mining

⬅️ Week 6 | Main | Week 10 ➡️

## 🎯 Learning Goals

By the end of this week, you should understand:

- How SMC enables collaborative data mining (classification, clustering, association rules) without revealing private datasets
- How classic algorithms such as Naive Bayes and k-means are adapted to the SMC setting
- The privacy and efficiency trade-offs between federated learning and SMC-based machine learning
- The main challenges of applying SMC to machine learning and the common workarounds

## 📖 Theoretical Content

### Introduction to Private Data Mining

**The Problem:** Multiple parties want to collaboratively perform data mining tasks (classification, clustering, association rules) without revealing their private datasets to each other.

**Example Scenarios:**

- Hospitals jointly training a diagnosis model without sharing patient records (Example 1 below)
- Competing retailers mining common purchasing patterns without exposing transaction data (Example 2)
- Banks building a shared fraud-detection model without revealing customer data (Example 3)

**Privacy Requirements:**

1. **Input Privacy:** Raw data remains confidential
2. **Computation Privacy:** Intermediate results are not disclosed
3. **Output Privacy:** Only agreed-upon results are revealed
4. **Pattern Privacy:** Sensitive patterns within data are protected

### Private Classification

**Problem Setup:**

- Each party holds a labeled dataset over a shared feature space (horizontally partitioned data: same features, different records)
- The parties want to jointly train a classifier without revealing their raw records
- Only the final model (or its predictions) is disclosed

**Naive Bayes Classification:** SMC-friendly because training reduces to counting feature and class occurrences, and per-party counts can be combined with a secure sum:

```python
class PrivateNaiveBayes:
    def __init__(self, num_parties):
        self.parties = num_parties
        self.feature_counts = {}
        self.class_counts = {}
        
    def secure_training(self, local_datasets):
        # Step 1: Each party computes local statistics
        local_stats = []
        for party_data in local_datasets:
            stats = self.compute_local_statistics(party_data)
            local_stats.append(stats)
            
        # Step 2: Securely aggregate statistics using SMC
        global_feature_counts = self.secure_sum(
            [stats['feature_counts'] for stats in local_stats]
        )
        global_class_counts = self.secure_sum(
            [stats['class_counts'] for stats in local_stats]
        )
        
        # Step 3: Compute probabilities from aggregated counts
        self.feature_probabilities = self.compute_probabilities(
            global_feature_counts, global_class_counts
        )
        
        return self.feature_probabilities
```
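
The protocol above treats `secure_sum` as a black box. Below is a minimal single-process sketch of that primitive using additive secret sharing over a public prime field; the modulus, `share`, and all names are illustrative, and a real deployment would send each party's j-th share to party j over the network so that no single party ever sees another's raw value.

```python
import secrets

PRIME = 2**61 - 1  # public modulus, large enough to avoid wrap-around

def share(value, num_parties):
    # Split an integer into additive shares that sum to value mod PRIME
    shares = [secrets.randbelow(PRIME) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(values):
    # Each party shares its value; every party sums the shares it holds
    # locally, and only the grand total is reconstructed
    n = len(values)
    all_shares = [share(v, n) for v in values]
    partial_sums = [sum(s[j] for s in all_shares) % PRIME for j in range(n)]
    return sum(partial_sums) % PRIME

print(secure_sum([10, 20, 30]))  # 60: the total, with no addend revealed
```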

### Private Clustering

**K-Means Clustering:** an iterative algorithm well suited to SMC, since each iteration needs only local cluster assignments plus securely aggregated centroid averages:

```python
class PrivateKMeans:
    def __init__(self, k, max_iterations=100):
        self.k = k
        self.max_iterations = max_iterations
        
    def secure_clustering(self, distributed_datasets):
        # Step 1: Initialize centroids (can be done publicly)
        centroids = self.initialize_centroids(self.k)
        
        for iteration in range(self.max_iterations):
            # Step 2: Assign points to clusters (locally)
            assignments = []
            for dataset in distributed_datasets:
                local_assignments = self.assign_to_clusters(dataset, centroids)
                assignments.append(local_assignments)
                
            # Step 3: Securely compute new centroids
            new_centroids = self.secure_update_centroids(
                distributed_datasets, assignments
            )
            
            # Step 4: Check convergence
            if self.has_converged(centroids, new_centroids):
                break
                
            centroids = new_centroids
            
        return centroids, assignments
```
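
Step 3 is the only step that needs cryptography: new centroids are per-cluster averages across all parties' points. A minimal sketch of `secure_update_centroids` on top of the `secure_sum` primitive sketched earlier, assuming each local dataset is a NumPy array; the fixed-point `SCALE` constant is an assumption, and revealing per-cluster counts is a deliberate, agreed-upon leakage here:

```python
import numpy as np

SCALE = 2**16  # fixed-point scale: secure_sum works on integers

def secure_update_centroids(distributed_datasets, assignments, k):
    # Each party contributes only per-cluster coordinate sums and counts;
    # secure_sum aggregates them, so individual points stay private
    dim = distributed_datasets[0].shape[1]
    centroids = np.zeros((k, dim))
    for c in range(k):
        local_counts, local_sums = [], []
        for data, assigned in zip(distributed_datasets, assignments):
            mask = np.asarray(assigned) == c
            local_counts.append(int(mask.sum()))
            # encode local coordinate sums as integers (assumes nonnegative
            # features; signed values would need an extra decoding step)
            local_sums.append(np.round(data[mask].sum(axis=0) * SCALE).astype(int))
        total = secure_sum(local_counts)
        if total > 0:
            agg = [secure_sum([int(s[d]) for s in local_sums]) for d in range(dim)]
            centroids[c] = np.array(agg, dtype=float) / (SCALE * total)
    return centroids
```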

### Federated Learning vs SMC

**Federated Learning:**

- Raw data never leaves the participants; only model updates (gradients or weights) are shared with a central aggregator
- Efficient and scalable, but the shared updates can leak information about the training data

**SMC for Machine Learning:**

- All intermediate computation, including model updates, remains cryptographically protected
- Only the final agreed-upon result is revealed, at a substantial cost in computation and communication

**Comparison:**

| Aspect | Federated Learning | SMC-based ML |
|--------|--------------------|--------------|
| Privacy | Model updates visible | All computation private |
| Efficiency | High | Lower (crypto overhead) |
| Communication | Model-size dependent | Data-size dependent |
| Trust | Requires trusted aggregator | Fully distributed |
| Inference Attacks | Vulnerable | Resistant |
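
To make the contrast concrete, here is a minimal FedAvg-style aggregation round; the `local_update` callback and all names are illustrative, and weights are assumed to be NumPy arrays. Note that the server sees every client's update in the clear, which is precisely what SMC-based training avoids:

```python
import numpy as np

def fedavg_round(global_weights, client_datasets, local_update):
    # Each client trains locally and returns its updated weights;
    # the server receives these updates unprotected
    updates, sizes = [], []
    for data in client_datasets:
        updates.append(local_update(global_weights.copy(), data))
        sizes.append(len(data))
    # Server averages the updates, weighted by dataset size
    fractions = np.asarray(sizes, dtype=float) / sum(sizes)
    return sum(f * u for f, u in zip(fractions, updates))
```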

## 🔍 Detailed Explanations

### Challenges in SMC for Machine Learning

**1. Non-linear Operations:** Activation functions such as ReLU, sigmoid, and tanh are cheap in the clear but expensive under SMC; the usual workarounds are polynomial or piecewise-linear approximations, or evaluating them inside garbled circuits.

**2. Floating-Point Arithmetic:** SMC protocols typically operate over integers, so real-valued computation is emulated with fixed-point arithmetic and scaling factors (see the sketch after this list).

**3. Iterative Algorithms:** Training loops require many rounds, and every secure iteration adds communication; mitigations include reducing iteration counts, batching operations, and optimizing communication patterns.

**4. Model Size and Complexity:** Large models mean millions of parameters to process securely; model compression, pruning, and layer-wise computation keep costs manageable.
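
A minimal sketch of the fixed-point trick from challenge 2 (the `SCALE` constant and helper names are illustrative):

```python
SCALE = 2**16  # 16 fractional bits of precision

def encode(x: float) -> int:
    return round(x * SCALE)

def decode(x: int) -> float:
    return x / SCALE

def fxp_mul(a: int, b: int) -> int:
    # the product of two scaled values carries a factor of SCALE**2,
    # so rescale once after multiplying
    return (a * b) // SCALE

a, b = encode(1.5), encode(-2.25)
print(decode(fxp_mul(a, b)))  # -3.375, recovered from integer-only math
```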

### Privacy-Preserving Neural Networks

**Solutions for Deep Learning:**

**1. Linear Approximations:**

```python
def secure_relu_approximation(x):
    # Exact identity: ReLU(x) = 0.5 * (x + |x|); under SMC the absolute
    # value (or ReLU itself) is replaced by a low-degree polynomial
    return secure_polynomial_evaluation(x, relu_coefficients)
```
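
As a quick plausibility check on such approximations, one can fit a low-degree polynomial to ReLU on a bounded interval and inspect the error; NumPy's `polyfit` stands in here for whatever offline fitting a real protocol would use:

```python
import numpy as np

# Fit a degree-2 polynomial to ReLU on [-4, 4]; under SMC, evaluating it
# needs only additions and multiplications on secret-shared values
xs = np.linspace(-4, 4, 401)
relu = np.maximum(xs, 0.0)
coeffs = np.polyfit(xs, relu, deg=2)
print("max error on [-4, 4]:", np.abs(np.polyval(coeffs, xs) - relu).max())
```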

**2. Secret Sharing for Linear Operations:**

```python
class SecretSharedLayer:
    def __init__(self, weights, bias):
        # Layer parameters are split into additive shares held by the parties
        self.weights = secret_share_matrix(weights)
        self.bias = secret_share_vector(bias)

    def forward(self, input_shares):
        # Share-by-share multiplication requires an interactive protocol
        # (e.g., Beaver multiplication triples); additions are purely local
        output_shares = secure_matrix_multiply(input_shares, self.weights)
        output_shares = secure_vector_add(output_shares, self.bias)
        return output_shares
```

**3. Hybrid Approaches:**

```python
def hybrid_nn_protocol(layers, input_shares):
    # Thread the activations through the network, choosing the cheapest
    # SMC technique for each layer type
    result = input_shares
    for layer in layers:
        if layer.type == "linear":
            result = secret_sharing_compute(layer, result)
        elif layer.type == "activation":
            result = garbled_circuits_compute(layer, result)
        elif layer.type == "pooling":
            result = homomorphic_compute(layer, result)
    return result
```

## 💡 Practical Examples

### Example 1: Private Medical Diagnosis

**Scenario:** Three hospitals want to build a joint diabetes prediction model.

```python
class PrivateMedicalDiagnosis:
    def __init__(self, hospitals):
        self.hospitals = hospitals
        self.model = None
        
    def collaborative_training(self):
        # Step 1: Agree on feature set and preprocessing
        common_features = self.negotiate_features()
        
        # Step 2: Locally preprocess data
        preprocessed_data = []
        for hospital in self.hospitals:
            local_data = hospital.preprocess_data(common_features)
            preprocessed_data.append(local_data)
            
        # Step 3: Securely train logistic regression
        self.model = self.secure_logistic_regression(preprocessed_data)
        
        return self.model
        
    def secure_prediction(self, patient_features):
        # Each hospital can use the model for local predictions
        # without revealing individual patient data
        encrypted_features = encrypt_features(patient_features)
        encrypted_prediction = self.model.predict(encrypted_features)
        return decrypt_prediction(encrypted_prediction)
```
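
The `secure_prediction` step above leaves the encrypted scoring abstract. One concrete pattern for a linear model such as logistic regression: additively share the patient's features between two non-colluding servers; because the trained weights are public to the servers, each server can score its share locally, and the partial results sum to the true (scaled) logit. A minimal sketch, where the modulus, scale, and all names are illustrative:

```python
import secrets

PRIME = 2**61 - 1  # public modulus
SCALE = 2**10      # fixed-point scale

def encode(vec):
    return [round(x * SCALE) % PRIME for x in vec]

def share(vec, n):
    # Additive shares of an encoded vector for n non-colluding servers
    shares = [[secrets.randbelow(PRIME) for _ in vec] for _ in range(n - 1)]
    last = [(x - sum(col)) % PRIME for x, col in zip(vec, zip(*shares))]
    return shares + [last]

def secure_linear_score(weights, feature_shares):
    # Each server computes a local dot product with the *public* weights;
    # by linearity the partial results sum to the true (scaled) score
    w = encode(weights)
    partials = [sum(wi * si for wi, si in zip(w, s)) % PRIME
                for s in feature_shares]
    total = sum(partials) % PRIME
    if total > PRIME // 2:        # map back to a signed value
        total -= PRIME
    return total / SCALE**2       # undo both fixed-point scalings

weights = [0.8, -1.2, 0.3]        # public model coefficients
patient = [1.0, 0.5, 2.0]         # private patient features
print(secure_linear_score(weights, share(encode(patient), n=2)))  # ~0.8
```

The sigmoid would then be applied to the opened logit, or approximated securely if even the score itself must stay hidden.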

### Example 2: Private Market Basket Analysis

**Scenario:** Competing retailers want to find common purchasing patterns.

```python
class PrivateMarketBasketAnalysis:
    def __init__(self, retailers, min_support=0.1):
        self.retailers = retailers
        self.min_support = min_support
        
    def find_association_rules(self):
        # Step 1: Standardize product catalogs
        unified_catalog = self.create_unified_catalog()
        
        # Step 2: Convert transactions to unified format
        standardized_transactions = []
        for retailer in self.retailers:
            transactions = retailer.get_transactions(unified_catalog)
            standardized_transactions.append(transactions)
            
        # Step 3: Securely mine frequent itemsets
        frequent_itemsets = self.secure_apriori(standardized_transactions)
        
        # Step 4: Generate association rules
        rules = self.generate_rules(frequent_itemsets)
        
        return rules
```
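
Inside `secure_apriori`, the core primitive is computing an itemset's global support without exposing per-retailer counts. A sketch reusing the `secure_sum` primitive from earlier; all names are illustrative, and revealing the global support is the agreed-upon output:

```python
def secure_support(itemset, party_transactions):
    # Each retailer counts matching transactions locally; only the
    # aggregated counts are combined via secure_sum
    itemset = set(itemset)
    local_hits = [sum(itemset <= set(t) for t in txns)
                  for txns in party_transactions]
    local_sizes = [len(txns) for txns in party_transactions]
    return secure_sum(local_hits) / secure_sum(local_sizes)

parties = [
    [{"milk", "bread"}, {"milk", "eggs"}],
    [{"bread"}, {"milk", "bread", "eggs"}],
]
print(secure_support({"milk", "bread"}, parties))  # 2 / 4 = 0.5
```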

### Example 3: Private Credit Scoring

**Scenario:** Banks collaborate on fraud detection without sharing customer data.

```python
class PrivateCreditScoring:
    def __init__(self, banks):
        self.banks = banks
        self.fraud_model = None
        
    def train_fraud_detection_model(self):
        # Step 1: Standardize feature representations
        feature_schema = self.agree_on_features()
        
        # Step 2: Each bank prepares local data
        local_datasets = []
        for bank in self.banks:
            local_data = bank.prepare_training_data(feature_schema)
            local_datasets.append(local_data)
            
        # Step 3: Securely train ensemble model
        self.fraud_model = self.secure_ensemble_training(local_datasets)
        
        return self.fraud_model
        
    def secure_prediction(self, transaction_features):
        # Encrypt transaction features
        encrypted_features = encrypt_transaction(transaction_features)
        
        # Each base model makes encrypted prediction
        encrypted_predictions = []
        for model in self.fraud_model.base_models:
            pred = model.secure_predict(encrypted_features)
            encrypted_predictions.append(pred)
            
        # Securely aggregate predictions (majority vote)
        final_prediction = self.secure_majority_vote(encrypted_predictions)
        
        return decrypt_prediction(final_prediction)
```
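
Finally, a sketch of `secure_majority_vote` on top of the same `secure_sum` primitive; here the vote tally is opened and compared publicly, whereas a fully private variant would replace the final comparison with a secure comparison protocol:

```python
def secure_majority_vote(binary_votes):
    # Aggregate the base models' 0/1 fraud votes; only the tally is opened
    tally = secure_sum(binary_votes)
    return 1 if 2 * tally > len(binary_votes) else 0

print(secure_majority_vote([1, 0, 1]))  # 1 -> transaction flagged as fraud
```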

## ❓ Self-Assessment Questions

**Question 1:** What are the main challenges in applying SMC to machine learning algorithms? How do these differ from traditional centralized ML?

<details>
<summary>Click to reveal answer</summary>

**Answer:**

**Main SMC-ML Challenges:**

**1. Non-linear Operations:**
- **Problem:** Activation functions (ReLU, sigmoid, tanh) are expensive in SMC
- **Traditional ML:** Direct computation of non-linear functions
- **SMC Solution:** Polynomial approximations, garbled circuits, or piecewise linear approximations

**2. Floating-Point Arithmetic:**
- **Problem:** SMC typically works with integers; floating-point is complex
- **Traditional ML:** Native floating-point support
- **SMC Solution:** Fixed-point arithmetic with scaling factors

**3. Iterative Algorithms:**
- **Problem:** Many ML algorithms require many iterations with communication overhead
- **Traditional ML:** Fast local iterations
- **SMC Solution:** Reduce iterations, batch operations, optimize communication

**4. Gradient Computation:**
- **Problem:** Backpropagation requires division and complex derivatives
- **Traditional ML:** Automatic differentiation
- **SMC Solution:** Approximate gradients, secure division protocols

**5. Model Size and Complexity:**
- **Problem:** Large neural networks have millions of parameters
- **Traditional ML:** Handles arbitrary model sizes
- **SMC Solution:** Model compression, layer-wise computation, pruning

**6. Performance Trade-offs:**
- **Problem:** 100-1000x slower than traditional ML
- **Traditional ML:** Optimized for speed
- **SMC Solution:** Hybrid approaches, precomputation, specialized hardware

**Key Differences:**
- **Computation:** Cryptographic operations vs. direct arithmetic
- **Communication:** Constant interaction vs. no communication
- **Precision:** Limited precision vs. full floating-point
- **Scalability:** Limited by crypto overhead vs. data/model size
- **Privacy:** Strong guarantees vs. no privacy protection

</details>

**Question 2:** Compare the privacy guarantees and efficiency trade-offs between federated learning and SMC-based machine learning.

<details>
<summary>Click to reveal answer</summary>

**Answer:**

**Privacy Guarantees Comparison:**

**Federated Learning:**
- **What's Protected:** Raw data never leaves local devices
- **What's Exposed:** Model updates (gradients, weights) are shared
- **Attack Vulnerability:**
  - Gradient inversion attacks can reconstruct training data
  - Model inversion attacks can extract sensitive features
  - Membership inference attacks can determine if specific data was used
- **Privacy Level:** Moderate (better than centralized, worse than SMC)

**SMC-based ML:**
- **What's Protected:** All intermediate computations and model updates
- **What's Exposed:** Only final agreed-upon results
- **Attack Resistance:**
  - Strong protection against gradient-based attacks
  - Prevents model inversion attacks
  - Resistant to membership inference
- **Privacy Level:** High (cryptographic guarantees)

**Efficiency Comparison:**

| Metric | Federated Learning | SMC-based ML |
|--------|-------------------|--------------|
| **Computation** | ~1.5-3x overhead | ~100-1000x overhead |
| **Communication** | Model size per round | Depends on data/computation |
| **Rounds** | 10-100 rounds typical | Fewer rounds but heavier |
| **Scalability** | 100s-1000s participants | Typically <10 participants |
| **Training Time** | Hours to days | Days to weeks |

**When to Choose Each:**
- **Federated Learning:** Large-scale deployment, moderate privacy needs, performance critical
- **SMC-based ML:** High-value data, strong privacy requirements, small number of parties
- **Hybrid:** Balance privacy and efficiency across different computation phases

</details>

**Question 3:** Design a secure protocol for collaborative outlier detection across multiple datasets. What are the key privacy and utility considerations?

<details>
<summary>Click to reveal answer</summary>

**Answer:**

**Secure Collaborative Outlier Detection Protocol:**

**Problem Setup:**
- Multiple organizations have datasets with similar structure
- Want to detect outliers that appear anomalous across all datasets
- Each organization's data must remain private
- Goal: identify global outliers without revealing local patterns

**Protocol Design:**

**Phase 1: Secure Statistic Computation**

```python
class SecureOutlierDetection:
    def __init__(self, participants, threshold=2.0):
        self.participants = participants
        self.threshold = threshold  # standard deviations for outlier flagging

    def compute_global_statistics(self):
        # Step 1: Each party computes local statistics
        local_stats = []
        for participant in self.participants:
            stats = participant.compute_local_stats()  # mean, variance, count
            local_stats.append(stats)

        # Step 2: Securely aggregate statistics
        total_count = secure_sum([s['count'] for s in local_stats])

        # Secure computation of the global mean
        weighted_means = []
        for stats in local_stats:
            weighted_means.append(secure_multiply(stats['mean'], stats['count']))
        global_mean = secure_divide(secure_sum(weighted_means), total_count)

        # Global variance via Var(X) = E[X^2] - E[X]^2, where each party's
        # second moment is var_i + mean_i^2, aggregated like the mean
        second_moments = []
        for stats in local_stats:
            local_sq = secure_add(stats['variance'],
                                  secure_multiply(stats['mean'], stats['mean']))
            second_moments.append(secure_multiply(local_sq, stats['count']))
        global_second_moment = secure_divide(secure_sum(second_moments),
                                             total_count)
        global_variance = secure_subtract(
            global_second_moment,
            secure_multiply(global_mean, global_mean))

        return global_mean, global_variance
```

**Privacy Considerations:**
1. **Statistical Privacy:** Add noise to protect individual contributions
2. **Pattern Privacy:** Group similar outliers to prevent identification
3. **Individual vs. Global Privacy:** Different guarantees for records vs. organizations

**Utility Considerations:**
1. **Detection Accuracy:** Balance privacy noise with detection effectiveness
2. **Statistical Utility:** Preserve meaningful statistical properties
3. **Real-time vs. Batch:** Different latency and privacy requirements

</details>

**Question 4:** How would you handle the challenge of training deep neural networks using SMC? What are the bottlenecks and potential solutions?

<details>
<summary>Click to reveal answer</summary>

**Answer:**

**Deep Neural Network SMC Challenges:**

**1. Scale and Complexity:**
- **Problem:** Modern DNNs have millions/billions of parameters
- **SMC Impact:** Each parameter requires secure computation
- **Bottleneck:** Memory and computation grow linearly with model size

**2. Non-linear Activations:**
- **Problem:** ReLU, sigmoid, tanh are expensive in SMC
- **Traditional Cost:** O(1) per activation
- **SMC Cost:** O(log n) for comparison-based methods or O(n) for polynomial approximation

**Solutions and Optimizations:**

**1. Activation Function Approximations:**

```python
def secure_relu_polynomial(self, x):
    # ReLU(x) = 0.5 * (x + |x|); the absolute value is approximated
    # by a low-degree polynomial under SMC
    abs_x = self.secure_absolute_value_approx(x)
    return self.secure_multiply(0.5, self.secure_add(x, abs_x))
```

**2. Hybrid Computation Approaches:**

```python
class HybridSecureDNN:
    def __init__(self, public_layers, private_layers):
        self.public_layers = public_layers    # can be computed in the clear
        self.private_layers = private_layers  # must be computed securely

    def hybrid_training(self, data_shares):
        # Phase 1: Public feature extraction
        public_features = self.extract_public_features(data_shares)

        # Phase 2: Secure private computation
        private_features = self.secure_private_layers(public_features)

        # Phase 3: Public classification head
        final_output = self.public_classification(private_features)

        return final_output
```

**3. Model Architecture Optimizations:**
- Minimize non-linear activations
- Use wider layers instead of deeper networks
- Avoid normalization layers when possible
- Knowledge distillation from larger teacher models

**Current Limitations:**
- Deep networks with >1M parameters are impractical with current SMC
- Training time scales poorly with model complexity
- Limited to simple architectures and approximations

</details>

**Question 5:** Explain the relationship between federated learning and differential privacy. How can they be combined for stronger privacy guarantees?

<details>
<summary>Click to reveal answer</summary>

**Answer:**

**Federated Learning Privacy Vulnerabilities:**
- **Gradient Leakage:** Model updates can leak information about training data
- **Inference Attacks:** Adversaries can infer membership, attributes, or even reconstruct data
- **Honest-but-Curious Servers:** Central aggregator sees all model updates

**How Differential Privacy Strengthens Federated Learning:**

**1. Local Differential Privacy (LDP) in FL:**

```python
class DPFederatedLearning:
    def __init__(self, epsilon_local, epsilon_global):
        self.epsilon_local = epsilon_local    # local DP budget per client
        self.epsilon_global = epsilon_global  # global DP budget for aggregation

    def client_update_with_ldp(self, local_model, local_data, learning_rate):
        # Standard gradient computation
        gradients = self.compute_gradients(local_model, local_data)

        # Add noise to gradients (local DP)
        noisy_gradients = []
        for grad in gradients:
            sensitivity = self.compute_gradient_sensitivity(grad)
            noise = self.laplace_noise(sensitivity / self.epsilon_local)
            noisy_gradients.append(grad + noise)

        # Apply the noisy gradients locally before sharing the update
        updated_model = self.apply_gradients(local_model, noisy_gradients,
                                             learning_rate)
        return updated_model, noisy_gradients
```

**2. Advanced Composition:**
- **Moments Accountant:** Provides tighter bounds than basic composition
- **Rényi Differential Privacy:** Better composition for iterative algorithms
- **Adaptive Noise Scheduling:** Allocate privacy budget optimally across rounds

**3. Client-Level Differential Privacy:**

```python
class ClientLevelDP:
    def __init__(self, epsilon_per_client, max_participation):
        self.epsilon_per_client = epsilon_per_client
        self.max_participation = max_participation
        self.participation_counts = {}  # rounds each client has joined so far

    def allocate_privacy_budget(self, client_id, round_number):
        used = self.participation_counts.get(client_id, 0)
        remaining_rounds = self.max_participation - used
        round_epsilon = self.epsilon_per_client / remaining_rounds
        return round_epsilon
```

**Benefits of DP-FL Combination:**
- **Formal Privacy Guarantees:** Mathematical bounds on privacy loss
- **Defense Against Inference Attacks:** Protection beyond just gradient hiding
- **Flexible Privacy-Utility Trade-offs:** Tunable parameters for different needs
- **Composition Properties:** Can reason about privacy across multiple training runs

**Challenges:**
- **Utility Degradation:** Additional noise reduces model accuracy
- **Complex Parameter Tuning:** Many hyperparameters to optimize
- **Implementation Complexity:** Requires expertise in both FL and DP

</details>

⬅️ Week 6 | Main | Week 10 ➡️