Week 8: SMC Application to Private Data Mining
🎯 Learning Goals
By the end of this week, you should understand:
- How to apply secure multi-party computation (SMC) protocols to machine learning and data mining tasks
- Privacy-preserving classification, clustering, and regression algorithms
- Federated learning and its relationship to SMC
- Practical challenges in deploying private machine learning systems
📖 Theoretical Content
Introduction to Private Data Mining
The Problem: Multiple parties want to collaboratively perform data mining tasks (classification, clustering, association rule mining) without revealing their private datasets to each other.
Example Scenarios:
- Medical Research: Hospitals collaborate on disease prediction models without sharing patient records
- Financial Fraud Detection: Banks jointly detect fraud patterns without revealing customer data
- Marketing Analytics: Companies analyze market trends without exposing customer databases
- Genomic Research: Research institutions study genetic patterns while protecting individual privacy
Privacy Requirements:
- Input Privacy: Raw data remains confidential
- Computation Privacy: Intermediate results are not disclosed
- Output Privacy: Only agreed-upon results are revealed
- Pattern Privacy: Sensitive patterns within data are protected
Private Classification
Problem Setup:
- Party A has training data (X_A, y_A)
- Party B has training data (X_B, y_B)
- Goal: Train classifier on combined data without data sharing
- Result: Both parties get the trained model
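Naive Bayes Classification: SMC-friendly because its sufficient statistics (per-class and per-feature counts) are additive across parties, so training on horizontally partitioned data reduces to secure sums: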
class PrivateNaiveBayes:
    def __init__(self, num_parties):
        self.parties = num_parties
        self.feature_counts = {}
        self.class_counts = {}

    def secure_training(self, local_datasets):
        # Step 1: Each party computes local statistics
        local_stats = []
        for party_data in local_datasets:
            stats = self.compute_local_statistics(party_data)
            local_stats.append(stats)
        # Step 2: Securely aggregate statistics using SMC
        global_feature_counts = self.secure_sum(
            [stats['feature_counts'] for stats in local_stats]
        )
        global_class_counts = self.secure_sum(
            [stats['class_counts'] for stats in local_stats]
        )
        # Step 3: Compute probabilities from aggregated counts
        self.feature_probabilities = self.compute_probabilities(
            global_feature_counts, global_class_counts
        )
        return self.feature_probabilities
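The `secure_sum` step above is left abstract. A minimal sketch of one way to realize it, assuming additive secret sharing over a public modulus and simulating all parties in a single process, might look like this. The names `share`, `secure_sum`, and `MODULUS`, as well as the example counts, are illustrative assumptions rather than part of the course code.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; must exceed any true total (assumed value)


def share(value, num_parties):
    """Split a non-negative integer into additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(local_counts):
    """Sum one count per party without any party seeing another's count.

    Simulated in one process: in a deployment, row i of `distributed`
    would be sent over private channels, one share per peer.
    """
    n = len(local_counts)
    distributed = [share(count, n) for count in local_counts]
    # Party j adds up the j-th share it received from every party.
    partial_sums = [sum(row[j] for row in distributed) % MODULUS for j in range(n)]
    # Only the partial sums are published; together they reveal just the total.
    return sum(partial_sums) % MODULUS


# Three parties' local counts of records in one class (made-up numbers).
positive_counts = [120, 75, 310]
total = secure_sum(positive_counts)
```

Each individual share is uniformly random, so a party learns nothing from the shares it receives unless all other parties collude against it. This sketch assumes semi-honest parties that follow the protocol.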
Private Clustering
K-Means Clustering: An iterative algorithm that adapts well to SMC, because each centroid update needs only per-cluster sums and counts aggregated across parties:
class PrivateKMeans:
    def __init__(self, k, max_iterations=100):
        self.k = k
        self.max_iterations = max_iterations

    def secure_clustering(self, distributed_datasets):
        # Step 1: Initialize centroids (can be done publicly)
        centroids = self.initialize_centroids(self.k)
        assignments = []  # stays empty only if max_iterations is 0
        for iteration in range(self.max_iterations):
            # Step 2: Assign points to clusters (locally)
            assignments = []
            for dataset in distributed_datasets:
                local_assignments = self.assign_to_clusters(dataset, centroids)
                assignments.append(local_assignments)
            # Step 3: Securely compute new centroids
            new_centroids = self.secure_update_centroids(
                distributed_datasets, assignments
            )
            # Step 4: Check convergence, keeping the latest centroids either way
            converged = self.has_converged(centroids, new_centroids)
            centroids = new_centroids
            if converged:
                break
        return centroids, assignments
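The centroid update in Step 3 can be sketched as follows, assuming each party contributes only per-cluster coordinate sums and point counts. The function names and sample points are hypothetical, and `aggregate` is a plain sum standing in for a secure-sum protocol, so this shows the data flow rather than the cryptography.

```python
def local_cluster_stats(points, assignments, k, dim):
    """Per-cluster coordinate sums and point counts for one party's data."""
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, cluster in zip(points, assignments):
        counts[cluster] += 1
        for d in range(dim):
            sums[cluster][d] += point[d]
    return sums, counts


def aggregate(values):
    """Stand-in for a secure sum; a real protocol would not expose the addends."""
    return sum(values)


def update_centroids(all_points, all_assignments, old_centroids):
    k, dim = len(old_centroids), len(old_centroids[0])
    stats = [
        local_cluster_stats(points, assigned, k, dim)
        for points, assigned in zip(all_points, all_assignments)
    ]
    new_centroids = []
    for c in range(k):
        count = aggregate([counts[c] for _, counts in stats])
        if count == 0:
            new_centroids.append(list(old_centroids[c]))  # keep empty clusters in place
            continue
        new_centroids.append(
            [aggregate([sums[c][d] for sums, _ in stats]) / count for d in range(dim)]
        )
    return new_centroids


party_a = [[0.0, 0.0], [2.0, 0.0]]
party_b = [[1.0, 3.0], [10.0, 10.0]]
centroids = update_centroids(
    [party_a, party_b],
    [[0, 0], [0, 1]],          # cluster index of each point, per party
    [[0.0, 0.0], [5.0, 5.0]],  # previous centroids
)
```

Note that this design reveals the centroids after every iteration, which is an intermediate result; whether that leakage is acceptable depends on the privacy requirements agreed on by the parties.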
Federated Learning vs SMC
Federated Learning:
- Participants train local models and share model updates
- Central server aggregates updates to create global model
- Keeps raw data local, but shared model updates can leak information through inference attacks such as gradient inversion and membership inference
SMC for Machine Learning:
- No model updates are shared in the clear
- All computations done under encryption or secret sharing
- Stronger privacy guarantees but higher computational and communication cost
Comparison:
Aspect | Federated Learning | SMC-based ML |
---|---|---|
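Privacy | Raw data stays local; model updates visible to the aggregator | Inputs and intermediate values stay hidden |
Efficiency | High | Lower (cryptographic overhead) |
Communication | Scales with model size per round | Scales with the size of the computation (data, model, and protocol rounds) |
Trust | Aggregator must be trusted not to inspect updates | No single trusted party; security holds while enough parties do not collude |
Inference Attacks | Updates and final model can leak training data | Intermediate leakage removed; the revealed output can still leak |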
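The two approaches can be combined: federated learning can use an SMC-style secure aggregation step so that the server sees only the sum of the updates. A minimal sketch with pairwise cancelling masks follows, assuming updates are already quantized to integers. The function names are illustrative, and the seeded `random.Random` is a non-cryptographic stand-in for the pairwise key agreement a real protocol would use.

```python
import random

MODULUS = 2**32  # updates are assumed to be pre-quantized to integers mod 2**32


def masked_updates(updates, seed=0):
    """Return each client's update with pairwise masks that cancel in the sum.

    `updates` is a list of equal-length integer vectors, one per client.
    """
    rng = random.Random(seed)  # NOT cryptographically secure; for illustration only
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.randrange(MODULUS) for _ in range(dim)]
            for d in range(dim):
                masked[i][d] = (masked[i][d] + mask[d]) % MODULUS  # client i adds
                masked[j][d] = (masked[j][d] - mask[d]) % MODULUS  # client j subtracts
    return masked


def aggregate(masked):
    """Server-side sum; every mask appears once with + and once with -."""
    dim = len(masked[0])
    return [sum(vec[d] for vec in masked) % MODULUS for d in range(dim)]


client_updates = [[3, 10, 7], [1, 0, 5], [4, 2, 2]]
masked = masked_updates(client_updates)
total = aggregate(masked)
```

The server recovers the element-wise total `[8, 12, 14]` while each masked vector on its own looks random. Handling clients that drop out mid-round needs extra machinery that this sketch omits.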
🔍 Detailed Explanations
Challenges in SMC for Machine Learning
1. Non-linear Operations:
- Problem: Activation functions (ReLU, sigmoid, tanh) are expensive in SMC
- Solution: Polynomial approximations or garbled circuits
2. Floating-Point Arithmetic:
- Problem: SMC protocols typically operate on integers in a finite field or ring, not on floating-point numbers
- Solution: Fixed-point arithmetic with scaling factors
3. Iterative Algorithms:
- Problem: Many ML algorithms require iterations with communication overhead
- Solution: Reduce iterations, batch operations, optimize communication
4. Model Size and Complexity:
- Problem: Large neural networks have millions of parameters
- Solution: Model compression, layer-wise computation, pruning
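The fixed-point solution to challenge 2 can be illustrated in plaintext. The sketch below assumes 16 fractional bits and a 64-bit ring; both values, and the function names, are illustrative choices. It shows only the encoding and rescaling logic; in an actual SMC protocol the truncation after multiplication is itself a sub-protocol on shares.

```python
SCALE_BITS = 16           # fractional bits (assumed precision)
SCALE = 1 << SCALE_BITS
MODULUS = 1 << 64         # ring size (assumed); values wrap modulo 2**64


def encode(x):
    """Map a real number to a ring element; negatives wrap to the upper half."""
    return round(x * SCALE) % MODULUS


def to_signed(v):
    """Interpret the upper half of the ring as negative numbers."""
    return v - MODULUS if v >= MODULUS // 2 else v


def decode(v):
    """Inverse of encode, up to rounding error of at most 1 / (2 * SCALE)."""
    return to_signed(v) / SCALE


def fixed_add(a, b):
    """Addition needs no rescaling: both operands carry the same scale."""
    return (a + b) % MODULUS


def fixed_mul(a, b):
    """Multiply, then drop SCALE_BITS so the result carries a single scale factor."""
    return ((to_signed(a) * to_signed(b)) >> SCALE_BITS) % MODULUS


product = decode(fixed_mul(encode(1.5), encode(-2.25)))
total = decode(fixed_add(encode(0.1), encode(0.2)))
```

Here `product` is exact because both operands are representable with 16 fractional bits, while `total` differs from 0.3 by a small rounding error introduced by `encode`.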
Privacy-Preserving Neural Networks
Solutions for Deep Learning:
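1. Polynomial Approximations:
def secure_relu_approximation(x):
    # ReLU(x) = 0.5x + 0.5|x| holds exactly; the SMC-unfriendly part is |x|,
    # which is replaced by a low-degree polynomial over a bounded input range.
    # relu_coefficients and secure_polynomial_evaluation are placeholders.
    return secure_polynomial_evaluation(x, relu_coefficients)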
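As a concrete plaintext illustration, the sketch below uses the quadratic least-squares fit of |x| on [-1, 1], which is 3/16 + (15/16)x^2, and evaluates the resulting ReLU polynomial with Horner's rule. The input range, the degree, and the names `RELU_COEFFICIENTS` and `evaluate_polynomial` are assumptions made for this example; inputs outside [-1, 1] would need rescaling first.

```python
# Coefficients, lowest degree first. From |x| ~ 3/16 + (15/16) x^2 on [-1, 1]:
# ReLU(x) = 0.5x + 0.5|x| ~ 3/32 + x/2 + (15/32) x^2.
RELU_COEFFICIENTS = [3 / 32, 1 / 2, 15 / 32]


def evaluate_polynomial(x, coefficients):
    """Horner's rule: only additions and multiplications, both SMC-friendly."""
    result = 0.0
    for coefficient in reversed(coefficients):
        result = result * x + coefficient
    return result


def relu(x):
    return max(0.0, x)


# Measure the worst-case error on a grid over the assumed input range.
grid = [i / 100 for i in range(-100, 101)]
max_error = max(
    abs(evaluate_polynomial(x, RELU_COEFFICIENTS) - relu(x)) for x in grid
)
```

The error is largest at x = 0, where the polynomial returns 3/32 instead of 0. A higher degree tightens the fit but costs more secure multiplications per activation.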
2. Secret Sharing for Linear Operations:
class SecretSharedLayer:
    def __init__(self, weights, bias):
        self.weights = secret_share_matrix(weights)
        self.bias = secret_share_vector(bias)

    def forward(self, input_shares):
        # Matrix multiplication under secret sharing
        output_shares = secure_matrix_multiply(input_shares, self.weights)
        output_shares = secure_vector_add(output_shares, self.bias)
        return output_shares
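When both the inputs and the weights are secret-shared, each product needs an interactive multiplication. One standard technique is Beaver triples; the sketch below assumes two parties, additive sharing modulo a public prime, and a trusted dealer for the triples, all simulated in one process. The function names are illustrative and are not the `secure_matrix_multiply` used above.

```python
import secrets

P = 2**61 - 1  # public prime modulus (assumed)


def share(value):
    """Two-party additive sharing of `value` modulo P."""
    r = secrets.randbelow(P)
    return [r, (value - r) % P]


def reveal(shares):
    return sum(shares) % P


def beaver_multiply(x_shares, y_shares):
    """Multiply two shared values using one precomputed triple (a, b, a*b)."""
    # Offline phase: a dealer (hypothetical here) hands out shares of a random triple.
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % P)

    # Online phase: open d = x - a and e = y - b. Because a and b are
    # uniformly random, d and e reveal nothing about x and y.
    d = reveal([(x - ai) % P for x, ai in zip(x_shares, a_sh)])
    e = reveal([(y - bi) % P for y, bi in zip(y_shares, b_sh)])

    # x*y = c + d*b + e*a + d*e, and every term is linear in the shares.
    z_shares = [(c + d * bi + e * ai) % P for c, ai, bi in zip(c_sh, a_sh, b_sh)]
    z_shares[0] = (z_shares[0] + d * e) % P  # the public term is added once
    return z_shares


def secure_dot(x_shared, w_shared):
    """Dot product of two shared vectors: one Beaver multiplication per entry."""
    totals = [0, 0]
    for xs, ws in zip(x_shared, w_shared):
        zs = beaver_multiply(xs, ws)
        totals = [(t + z) % P for t, z in zip(totals, zs)]
    return totals


inputs = [share(v) for v in [2, 3, 4]]
weights = [share(v) for v in [5, 6, 7]]
result = reveal(secure_dot(inputs, weights))
```

A matrix multiplication is a collection of such dot products, and the bias addition is local because adding shares requires no communication.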
3. Hybrid Approaches:
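def hybrid_nn_protocol(layers, input_shares):
    # Feed each layer's output into the next, choosing a protocol per layer type.
    # The three *_compute helpers are placeholders; a real system also has to
    # convert values between the representations each protocol uses.
    result = input_shares
    for layer in layers:
        if layer.type == "linear":
            result = secret_sharing_compute(layer, result)
        elif layer.type == "activation":
            result = garbled_circuits_compute(layer, result)
        elif layer.type == "pooling":
            result = homomorphic_compute(layer, result)
        else:
            raise ValueError(f"Unsupported layer type: {layer.type}")
    return result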
💡 Practical Examples
Example 1: Private Medical Diagnosis
Scenario: Three hospitals want to build a joint diabetes prediction model
class PrivateMedicalDiagnosis:
    def __init__(self, hospitals):
        self.hospitals = hospitals
        self.model = None

    def collaborative_training(self):
        # Step 1: Agree on feature set and preprocessing
        common_features = self.negotiate_features()
        # Step 2: Locally preprocess data
        preprocessed_data = []
        for hospital in self.hospitals:
            local_data = hospital.preprocess_data(common_features)
            preprocessed_data.append(local_data)
        # Step 3: Securely train logistic regression
        self.model = self.secure_logistic_regression(preprocessed_data)
        return self.model

    def secure_prediction(self, patient_features):
        # Each hospital can use the model for local predictions
        # without revealing individual patient data
        encrypted_features = encrypt_features(patient_features)
        encrypted_prediction = self.model.predict(encrypted_features)
        return decrypt_prediction(encrypted_prediction)
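The training step can be sketched as gradient descent in which each hospital computes a gradient on its own records and only the aggregated gradient is used for the update. This is one plausible reading of `secure_logistic_regression`, not a confirmed design: `aggregate` is a plain sum standing in for a secure sum, the data is made up (a bias term plus one standardized feature), and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(weights, rows, labels):
    """Sum of per-example logistic-loss gradients on one hospital's data."""
    grad = [0.0] * len(weights)
    for x, y in zip(rows, labels):
        error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi
    return grad


def aggregate(vectors):
    """Stand-in for a secure sum: only the element-wise total should be revealed."""
    return [sum(column) for column in zip(*vectors)]


def train(datasets, num_features, learning_rate=0.5, epochs=50):
    weights = [0.0] * num_features
    total_rows = sum(len(rows) for rows, _ in datasets)
    for _ in range(epochs):
        grads = [local_gradient(weights, rows, labels) for rows, labels in datasets]
        global_grad = aggregate(grads)
        weights = [
            w - learning_rate * g / total_rows for w, g in zip(weights, global_grad)
        ]
    return weights


def predict(weights, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


# Each hospital holds (rows, labels); rows are [bias, standardized feature].
hospitals = [
    ([[1.0, -1.0], [1.0, 1.0]], [0, 1]),
    ([[1.0, -0.5], [1.0, 0.5]], [0, 1]),
    ([[1.0, -0.8], [1.0, 0.8]], [0, 1]),
]
weights = train(hospitals, num_features=2)
high_risk = predict(weights, [1.0, 0.9])
low_risk = predict(weights, [1.0, -0.9])
```

Revealing the aggregated gradient in every epoch is weaker than running the whole training loop under SMC, where the sigmoid would also have to be approximated; the sketch trades that protection for simplicity.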
Example 2: Private Market Basket Analysis
Scenario: Competing retailers want to find common purchasing patterns
class PrivateMarketBasketAnalysis:
    def __init__(self, retailers, min_support=0.1):
        self.retailers = retailers
        self.min_support = min_support

    def find_association_rules(self):
        # Step 1: Standardize product catalogs
        unified_catalog = self.create_unified_catalog()
        # Step 2: Convert transactions to unified format
        standardized_transactions = []
        for retailer in self.retailers:
            transactions = retailer.get_transactions(unified_catalog)
            standardized_transactions.append(transactions)
        # Step 3: Securely mine frequent itemsets
        frequent_itemsets = self.secure_apriori(standardized_transactions)
        # Step 4: Generate association rules
        rules = self.generate_rules(frequent_itemsets)
        return rules
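The core of Step 3 is support counting: each retailer counts candidate itemsets locally and only the global totals are compared with the threshold. A minimal sketch of one Apriori pass over candidate pairs follows. The function names and baskets are made up, and `aggregate` is a plain sum standing in for a secure-sum protocol.

```python
from itertools import combinations


def local_support_counts(transactions, candidates):
    """Count, for one retailer, how many baskets contain each candidate itemset."""
    return [
        sum(1 for basket in transactions if itemset <= basket)
        for itemset in candidates
    ]


def aggregate(values):
    """Stand-in for a secure sum over the retailers' private counts."""
    return sum(values)


def frequent_itemsets(retailer_transactions, candidates, min_support):
    local = [local_support_counts(t, candidates) for t in retailer_transactions]
    total_baskets = aggregate([len(t) for t in retailer_transactions])
    frequent = []
    for index, itemset in enumerate(candidates):
        global_count = aggregate([counts[index] for counts in local])
        if global_count / total_baskets >= min_support:
            frequent.append(itemset)
    return frequent


retailer_a = [frozenset({"bread", "milk"}), frozenset({"bread", "eggs"})]
retailer_b = [frozenset({"bread", "milk", "eggs"}), frozenset({"milk"})]
items = ["bread", "eggs", "milk"]
pair_candidates = [frozenset(pair) for pair in combinations(items, 2)]
result = frequent_itemsets([retailer_a, retailer_b], pair_candidates, min_support=0.5)
```

With these baskets, the pairs bread+eggs and bread+milk each appear in 2 of 4 baskets and pass the 0.5 threshold, while eggs+milk appears in only 1. A stricter protocol would compare the shared count with the threshold under SMC, so that not even the global count is revealed.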
Example 3: Private Credit Scoring
Scenario: Banks collaborate on fraud detection without sharing customer data
class PrivateCreditScoring:
    def __init__(self, banks):
        self.banks = banks
        self.fraud_model = None

    def train_fraud_detection_model(self):
        # Step 1: Standardize feature representations
        feature_schema = self.agree_on_features()
        # Step 2: Each bank prepares local data
        local_datasets = []
        for bank in self.banks:
            local_data = bank.prepare_training_data(feature_schema)
            local_datasets.append(local_data)
        # Step 3: Securely train ensemble model
        self.fraud_model = self.secure_ensemble_training(local_datasets)
        return self.fraud_model

    def secure_prediction(self, transaction_features):
        # Encrypt transaction features
        encrypted_features = encrypt_transaction(transaction_features)
        # Each base model makes encrypted prediction
        encrypted_predictions = []
        for model in self.fraud_model.base_models:
            pred = model.secure_predict(encrypted_features)
            encrypted_predictions.append(pred)
        # Securely aggregate predictions (majority vote)
        final_prediction = self.secure_majority_vote(encrypted_predictions)
        return decrypt_prediction(final_prediction)
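The majority vote can be sketched with additively shared 0/1 votes, assuming each base model's owner shares its vote among all participants and only the tally is opened. The names `share_vote` and `majority_vote` are hypothetical and the parties are simulated in one process. Opening the tally reveals how many models voted "fraud", which is more than the majority bit; hiding that as well would require a secure comparison.

```python
import secrets

MODULUS = 2**31 - 1  # public modulus, far larger than the number of voters (assumed)


def share_vote(vote, num_parties):
    """Additively share a 0/1 vote so that no single share reveals it."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    shares.append((vote - sum(shares)) % MODULUS)
    return shares


def majority_vote(votes):
    """Return 1 if strictly more than half of the 0/1 votes are 1, else 0."""
    n = len(votes)
    shared = [share_vote(v, n) for v in votes]
    # Party j sums the j-th share of every vote; individual votes stay hidden.
    tally_shares = [sum(row[j] for row in shared) % MODULUS for j in range(n)]
    tally = sum(tally_shares) % MODULUS  # opening the tally leaks the vote count
    return 1 if 2 * tally > n else 0


# Three base models flag a transaction as fraud (1) or not (0).
decision = majority_vote([1, 1, 0])
```

Ties resolve to 0 (not fraud) in this sketch; an odd number of base models avoids the question.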