GitHub - autrin/k-anonymity_l-diversity: A k-anonymity and l-diversity problem with the Adult data


.gitignore
Index
README
adult.data
adult.names
adult.test
anonymity.ipynb
entropy.ipynb
generalized_data_gt_50k_4_2_4_5.csv
generalized_data_gt_50k_adjusted_for_lDiversity.csv
generalized_data_le_50k_2_3_4_5.csv
hierarchy.txt
old.adult.names


# Part 1:
1. This part is about designing and implementing a heuristic algorithm that enforces (k1, k2)-anonymity on the Adult dataset from the UCI Machine Learning Repository. The key challenge is anonymizing the dataset to protect users' privacy, with the required level depending on their salary, while preserving as much of the dataset's utility as possible. The two levels of anonymity are:

k1 = 10 for users with salaries ≤ 50K (stronger privacy).
k2 = 5 for users with salaries > 50K (less strict privacy).

2. Quasi-identifiers (QIs):
The four attributes I need to anonymize through generalization or suppression are:

Age: Numerical attribute.
Education: Categorical attribute with multiple levels (e.g., Bachelors, HS-grad).
Marital-Status: Categorical attribute with different marital statuses.
Race: Categorical attribute with 5 distinct values (White, Asian-Pac-Islander, etc.).

Sensitive Attribute:
The occupation is treated as sensitive and must remain in the dataset.
However, I need to ensure that the anonymized data prevents users from being easily identified based on this attribute.

3. For each QI, I need to define generalization hierarchies:

Age: Group into non-overlapping ranges (e.g., 20-29, 30-39), or broader ranges if necessary.
Education: Consider collapsing similar education levels (e.g., grouping 'Bachelors' and 'Masters' into 'Higher Education').
Marital-Status: Consider merging some categories like 'Married-civ-spouse' and 'Married-AF-spouse.'
Race: Group smaller races into an "Other" category, if necessary.
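
The hierarchies above can be sketched as small functions that map a value and a generalization level to a coarser value. This is only an illustration: the level numbering, the bin widths, and the `EDUCATION_LEVEL_1` mapping are assumptions made for the example, and the repository's `hierarchy.txt` may define different levels.

```python
# Hypothetical hierarchy fragments, for illustration only.
EDUCATION_LEVEL_1 = {
    "Bachelors": "Higher Education",
    "Masters": "Higher Education",
    "Doctorate": "Higher Education",
    "HS-grad": "Secondary",
    "11th": "Secondary",
}


def generalize_age(age: int, level: int) -> str:
    """Level 0 keeps the exact age, levels 1-3 use 10/20/40-year bins, higher levels give '*'."""
    widths = {1: 10, 2: 20, 3: 40}
    if level == 0:
        return str(age)
    if level not in widths:
        return "*"  # top of the hierarchy: value fully suppressed
    width = widths[level]
    low = (age // width) * width
    return f"{low}-{low + width - 1}"  # non-overlapping, inclusive range


def generalize_education(value: str, level: int) -> str:
    """Level 0 keeps the value, level 1 collapses similar degrees, higher levels give '*'."""
    if level == 0:
        return value
    if level == 1:
        return EDUCATION_LEVEL_1.get(value, "Other")
    return "*"
```

For example, `generalize_age(37, 1)` gives `"30-39"` and `generalize_age(37, 2)` gives `"20-39"`.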

4. My algorithm needs to:

Determine k1 or k2 based on the salary of the individual.
Generalize or suppress QIs to ensure that each equivalence class (a set of records that are indistinguishable) contains at least k1 or k2 individuals.
Minimize utility loss: This is the critical part. If I generalize too much (e.g., turning 'Age' into a large range like 20-50), I lose precision, but if I generalize too little, I may not meet the required anonymity level.
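
A minimal sketch of the anonymity check implied by these requirements follows. It assumes records are dictionaries with a `salary` field holding `<=50K` or `>50K` (possibly with the trailing period used in `adult.test`), and that the two salary groups are checked separately; the function names are hypothetical, not taken from the notebooks.

```python
from collections import Counter

K1, K2 = 10, 5  # k1 for salary <= 50K, k2 for salary > 50K


def normalize_salary(label: str) -> str:
    """Strip whitespace and a trailing period so '>50K.' and ' >50K' both become '>50K'."""
    return label.strip().rstrip(".")


def required_k(salary: str) -> int:
    """Pick the anonymity level from the salary class."""
    return K2 if normalize_salary(salary) == ">50K" else K1


def violating_classes(records, qi_names):
    """Return {(salary, qi_values): size} for equivalence classes smaller than their k."""
    sizes = Counter(
        (normalize_salary(r["salary"]), tuple(r[q] for q in qi_names))
        for r in records
    )
    return {key: n for key, n in sizes.items() if n < required_k(key[0])}
```

An empty result means every equivalence class meets its k; any returned class still needs more generalization or has to be suppressed.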

I can extend an existing algorithm like Datafly (which repeatedly generalizes the QI with the most distinct values) or μ-Argus, or I can develop a custom heuristic that balances privacy and utility.
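
As a rough illustration of the Datafly idea, the sketch below keeps generalizing the attribute with the most distinct values until at most k records sit in undersized classes, then suppresses those records. It loosely follows Datafly rather than reproducing it, and `datafly_sketch`, `generalizers`, and `max_levels` are hypothetical names introduced for this example.

```python
from collections import Counter


def datafly_sketch(rows, generalizers, max_levels, k):
    """rows: list of QI tuples; generalizers[i](value, level) returns the generalized value.

    Returns the chosen level per attribute and the rows kept after suppression.
    """
    levels = [0] * len(generalizers)

    def view():
        # Apply the current generalization level of each attribute to every row.
        return [
            tuple(g(v, lv) for g, v, lv in zip(generalizers, row, levels))
            for row in rows
        ]

    while True:
        current = view()
        sizes = Counter(current)
        outliers = sum(n for n in sizes.values() if n < k)
        candidates = [i for i in range(len(levels)) if levels[i] < max_levels[i]]
        if outliers <= k or not candidates:
            break
        # Heuristic: generalize the attribute that currently has the most distinct values.
        target = max(candidates, key=lambda i: len({row[i] for row in current}))
        levels[target] += 1

    # Suppress the records that are still in classes smaller than k.
    kept = [row for row in current if sizes[row] >= k]
    return levels, kept
```

Running this once per salary group with k = 10 and k = 5 would match the (k1, k2) setup described above, assuming the groups are anonymized independently.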

5. After implementing my algorithm, calculate:

Distortion: How much generalization/suppression I applied (i.e., how far I deviated from the original data).
Precision: A measure of how specific the remaining data is. High precision means less generalization.
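
One common way to compute these two metrics is sketched below. The definitions are assumptions for illustration: distortion is taken as the mean fraction of each hierarchy that was climbed, and precision follows the Sweeney-style formula of one minus the average per-cell generalization ratio. The notebooks may define them differently.

```python
def distortion(levels, heights):
    """Mean of level / hierarchy height over the QIs, in [0, 1]."""
    return sum(lv / h for lv, h in zip(levels, heights)) / len(levels)


def precision(cell_levels, heights):
    """1 - average over all cells of (level applied to the cell / hierarchy height).

    cell_levels[r][a] is the level applied to attribute a of record r;
    a suppressed record counts as the top level for every attribute.
    """
    total = sum(lv / h for row in cell_levels for lv, h in zip(row, heights))
    return 1 - total / (len(cell_levels) * len(heights))
```

Under these definitions, when every record uses the same level per attribute and nothing is suppressed, precision equals 1 minus distortion; suppression lowers precision further.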

6. For missing values (e.g., 'Occupation = ?'), consider them generalized to the top level of the hierarchy. This ensures that they still contribute to the anonymized dataset.

7. Once implemented, test my algorithm on the Adult dataset and ensure that it satisfies both k1 and k2. I'll need to fine-tune the hierarchies and generalization steps to minimize utility loss while achieving the desired anonymity levels.

# Part 2:
I'll need to ensure that for each group of records sharing the same generalized quasi-identifier values, the entropy of the sensitive attribute is high enough to satisfy the specified diversity level ℓ.

Steps to Implement Entropy ℓ-Diversity:
- Generalize the dataset based on my chosen generalization levels for quasi-identifiers.
- Calculate the entropy of the sensitive attribute within each q*-block.
- Check that the entropy of every block is at least log(ℓ).
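
The entropy check can be sketched as follows, assuming records are dictionaries and using the natural logarithm on both sides of the comparison. The function names and the small floating-point tolerance are choices made for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def is_entropy_l_diverse(records, qi_names, sensitive, l):
    """True if every q*-block has sensitive-attribute entropy of at least log(l)."""
    blocks = defaultdict(list)
    for r in records:
        blocks[tuple(r[q] for q in qi_names)].append(r[sensitive])
    threshold = math.log(l)
    # The tolerance avoids rejecting blocks that hit the threshold exactly.
    return all(entropy(vals) >= threshold - 1e-12 for vals in blocks.values())
```

A block with three equally frequent occupations has entropy log(3), so it is entropy 3-diverse but not 4-diverse, while a block split 3-to-1 between two occupations fails even ℓ = 2.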

Recursive (c, ℓ)-diversity
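Recursive (c, ℓ)-diversity requires that, in each equivalence class, the most frequent sensitive value appears fewer than c times the combined count of the ℓ-th most frequent value and all less frequent values: `r_1 < c * (r_ℓ + r_(ℓ+1) + ... + r_m)`, where `r_i` is the count of the i-th most frequent sensitive value and m is the number of distinct sensitive values in the class.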

Impact on Precision and Distortion:
- For smaller c values (e.g., c = 0.5), the generalization levels may need to increase further to meet the stricter diversity requirement. This results in higher distortion and lower precision.
- For larger c values (e.g., c=2), the algorithm may not need to generalize as much to satisfy the diversity constraints, resulting in lower distortion and higher precision.
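
A minimal sketch of the recursive (c, ℓ)-diversity check for one equivalence class is shown below. It uses the standard formulation in which the count of the most frequent sensitive value must be less than c times the summed counts of the ℓ-th most frequent value and everything rarer; `is_recursive_cl_diverse` is a hypothetical name.

```python
from collections import Counter


def is_recursive_cl_diverse(sensitive_values, c, l):
    """Check r_1 < c * (r_l + r_(l+1) + ... + r_m) for a single equivalence class."""
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(counts) < l:
        return False  # fewer than l distinct sensitive values can never qualify
    return counts[0] < c * sum(counts[l - 1:])
```

For a class with occupation counts 5, 3, and 2 and ℓ = 2, the tail sum is 5, so the class passes with c = 2 (5 < 10) but fails with c = 0.5 (5 is not below 2.5), which illustrates why smaller c values push toward more generalization.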

Steps to Implement:
- Implement Generalization Logic: Adjust generalization levels to find a balance between data utility and privacy.
- Check Recursive (c, ℓ)-diversity: Implement a function to verify if each equivalence class meets the recursive (c, ℓ)-diversity.
- Adjust for Different c Values: Implement the check for different c values as specified.
- Evaluate Results: Calculate distortion and precision for each configuration.