-
Notifications
You must be signed in to change notification settings - Fork 0
autrin/k-anonymity_l-diversity
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
# Part1: 1. This is about designing and implementing a heuristic algorithm to ensure (k1, k2)-anonymity for the Adult dataset from the UCI Machine Learning Repository. The key challenge here is anonymizing the dataset to protect users' privacy (based on their salary) while maintaining as much utility of the dataset as possible. The two levels of anonymity are: k1 = 10 for users with salaries ≤ 50K (stronger privacy). k2 = 5 for users with salaries > 50K (less strict privacy). 2. QIs: The four attributes I need to anonymize through generalization or suppression are: Age: Numerical attribute. Education: Categorical attribute with multiple levels (e.g., Bachelors, HS-grad). Marital-Status: Categorical attribute with different marital statuses. Race: Categorical attribute with 5 distinct values (White, Asian-Pac-Islander, etc.). Sensitive Attribute: The occupation is treated as sensitive and must remain in the dataset. However, I need to ensure that the anonymized data prevents users from being easily identified based on this attribute. 3. For each QI, I need to define generalization hierarchies: Age: Group into ranges (e.g., 20-30, 30-40) or broader ranges if necessary. Education: Consider collapsing similar education levels (e.g., grouping 'Bachelors' and 'Masters' into 'Higher Education'). Marital-Status: Consider merging some categories like 'Married-civ-spouse' and 'Married-AF-spouse.' Race: Group smaller races into an "Other" category, if necessary. 4. my algorithm needs to: Determine k1 or k2 based on the salary of the individual. Generalize or suppress QIs to ensure that each equivalence class (a set of records that are indistinguishable) contains at least k1 or k2 individuals. Minimize utility loss: This is the critical part. If I generalize too much (e.g., turning 'Age' into a large range like 20-50), I lose precision, but if I generalize too little, I may not meet the required anonymity level. I can extend an existing algorithm like DataFly (which generalizes QIs based on their distinct values) or μ-Argus, or I can develop a custom heuristic that balances privacy and utility. 5. After implementing my algorithm, calculate: Distortion: How much generalization/suppression I applied (i.e., how far I deviated from the original data). Precision: A measure of how specific the remaining data is. High precision means less generalization. 6. For missing values (e.g., 'Occupation = ?'), consider them generalized to the top level of the hierarchy. This ensures that they still contribute to the anonymized dataset. 7. Once implemented, test my algorithm on the Adult dataset and ensure that it satisfies both k1 and k2. You’ll need to fine-tune the hierarchies and generalization steps to minimize utility loss while achieving the desired anonymity levels. # Part 2: I'll need to ensure that for each group of records sharing the same generalized quasi-identifier values, the entropy of the sensitive attribute is high enough to satisfy the specified diversity level ℓ. Steps to Implement Entropy l-Diversity: - Generalize the dataset based on my chosen generalization levels for quasi-identifiers. - Calculate the entropy for the sensitive attribute within each q∗-block. - Check if the entropy meets the required threshold log(ℓ). Recursive (c, ℓ)-diversity Here’s a brief breakdown of the Recursive (c, ℓ)-diversity: Recursive (c, ℓ)-diversity requires that for each equivalence class, the most frequent ℓ−1 sensitive values appear less than c times the frequency of the ℓ-th most frequent sensitive value. Impact on Precision and Distortion: - For smaller c values (e.g.,c=0.5), the generalization levels might need to increase more in order to meet the stricter diversity requirements. This would result in higher distortion and lower precision. - For larger c values (e.g., c=2), the algorithm may not need to generalize as much to satisfy the diversity constraints, resulting in lower distortion and higher precision. Steps to Implement: - Implement Generalization Logic: Adjust generalization levels to find a balance between data utility and privacy. - Check Recursive (c, ℓ)-diversity: Implement a function to verify if each equivalence class meets the recursive (c, ℓ)-diversity. - Adjust for Different c Values: Implement the check for different c values as specified. - Evaluate Results: Calculate distortion and precision for each configuration.
About
A k-anonymity and l-diversity problem with the Adult data
Topics
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published