Algorithmic Audit: Modeling the New York City Civilian Complaint Review Board and Auditing the Model Fairness
For Full Paper, Click Here.
For Paper Covering the Background/Investigation of Inequity (Paper 1), Click Here.
Please note that this README functions only as an overview of the work done! Please take a look at the final papers (linked above) or their corresponding Python notebooks. Contents:
- Background
- Tools Used
- The Data
- The Logistic Regression Model
- Overview of Parity Measures and Fairness Results
- Conclusion
This paper and analysis were completed during Spring 2022 as part of the course DSC 167: Fairness and Algorithmic Decision Making at UC San Diego's Halıcıoğlu Data Science Institute. The purpose of the assignment, and the aim of this paper, was to find a system or mechanism with a potential inequity, model its decision-making process, and evaluate/audit that model using parity measures and frameworks of distributive justice. I hope that this paper can teach data scientists, and those interested in the field, about how ethics and fairness come into play in data science and the algorithms we data scientists create.
The New York City Civilian Complaint Review Board (CCRB), the system in this context, is an agency independent from the New York Police Department that investigates complaints against NYPD officers regarding allegations of excessive or unnecessary force, abuse of authority, discourtesy, or offensive language. Throughout the investigation process, the review board may obtain further evidence from the NYPD (e.g., body camera footage) and conduct interviews of anyone involved to use alongside any details the complainant submitted in their original complaint. Specific complainant/victim attributes collected include age, gender, and ethnicity. The complainant may also provide details on the officer’s sex, race, and physical appearance.
The potential inequity I investigated involves the "disposition" of a given complaint, or the ruling the CCRB determines. A substantiated complaint is one for which the alleged conduct was found to have happened and to have violated NYPD rules, potentially leading to punishment or repercussions for the accused officer. The data shows that complaints submitted by Black complainants were substantiated less often than those submitted by non-Black complainants. This discrepancy was the launching pad for my investigation of this inequity and audit of the CCRB.
- Python
- Jupyter Notebooks
- Scikit-learn for Modeling
- Pandas for Data Cleaning, Exploration, etc.
- NumPy for Data Cleaning, Exploration, etc.
- Matplotlib for Data Visualizations
The data originally comes from ProPublica and contains information regarding complaints made starting in 1985 up to 2020. The model uses data starting in 2000 because key demographic features like complainant ethnicity were not tracked before then. All complaints include the type of complaint made, the precinct in which the alleged action took place, demographics on the complainant, and demographics on the accused officer. See allegations.csv and allegations_cleaned.csv.
Borough was not a column initially included in the dataset, so the precincts were mapped to their appropriate boroughs using data pulled from the NYPD's website. See precinct_boroughs.csv.
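The mapping itself amounts to a simple join on the precinct number. Below is a minimal sketch, assuming the lookup file uses `precinct` and `borough` column names (the exact names in the repo may differ):

```python
import pandas as pd

# Load the ProPublica complaints and the precinct-to-borough lookup table.
allegations = pd.read_csv("allegations.csv")
precinct_boroughs = pd.read_csv("precinct_boroughs.csv")

# Attach a borough to every complaint via its precinct number.
# Column names ("precinct", "borough") are assumed for illustration.
allegations = allegations.merge(precinct_boroughs, on="precinct", how="left")
```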
In order to understand the different categories of complaints and the potential consequences an officer may face, I referenced the CCRB's rules. This information was used in feature engineering.
In the first, investigatory paper, data on NYC arrests and on stop, question, and frisk patterns were examined. Please visit this link for the arrests data and this link for the stop, question, and frisk data.
A simple logistic regression model with L2 regularization was used to model the CCRB's decision process. To train the model, a simple 75%-25% train-test split was used. To combat the class imbalance present (only about 25% of complaints were deemed substantiated), the `class_weight` hyperparameter, which assigns a weight to each class that the model uses when penalizing misclassifications, was used. To determine the proper decision threshold, different utility functions were compared, ultimately leading to a threshold of 0.527 instead of the default 0.5. This means the model classifies a complaint as substantiated if the resulting regression prediction is at least 0.527. See the paper for more details.
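A minimal sketch of this setup is below. The synthetic data, `class_weight="balanced"`, and the other hyperparameters are illustrative stand-ins; the exact configuration lives in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data with roughly a 25% positive class, mirroring the imbalance described above.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.75], random_state=0)

# 75%-25% train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# L2 regularization is scikit-learn's default penalty; class_weight="balanced"
# up-weights the minority (substantiated) class during training.
model = LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Classify as substantiated when the predicted probability clears the tuned 0.527 threshold.
THRESHOLD = 0.527
probs = model.predict_proba(X_test)[:, 1]
preds = (probs >= THRESHOLD).astype(int)
```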
The features used in the model are:
- `contact_reason` (text indicating why the officer approached the civilian)
- `mos_ethnicity` (officer’s ethnicity)
- `rank_incident` (officer’s rank at time of incident)
- `mos_gender` (officer’s gender)
- `complainant_gender` (complainant’s gender)
- `mos_age_incident` (officer’s age at time of incident)
- `complainant_age_incident` (complainant’s age at time of incident)
- `borough` (the borough in which the incident took place)
- `black` (whether the complainant is Black)
- `allegation` (brief description of the allegation)
- `fado_type` (type of complaint)
- time/date related features (`month_received`, `year_received`, `month_closed`, and `year_closed`)
All categorical features were one-hot encoded except `allegation` and `fado_type`, which were ordinal encoded. The numerical features were scaled.
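Using scikit-learn, that encoding scheme can be expressed with a `ColumnTransformer`. This is a sketch only; the column groupings mirror the description above, and the encoder options and exact lists in the notebook may differ.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Illustrative column groupings based on the feature list above.
one_hot_cols = ["contact_reason", "mos_ethnicity", "rank_incident",
                "mos_gender", "complainant_gender", "borough"]
ordinal_cols = ["allegation", "fado_type"]
numeric_cols = ["mos_age_incident", "complainant_age_incident",
                "month_received", "year_received", "month_closed", "year_closed"]

preprocessor = ColumnTransformer(
    [
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), one_hot_cols),
        ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ordinal_cols),
        ("scale", StandardScaler(), numeric_cols),
    ],
    remainder="passthrough",  # e.g., the binary `black` flag passes through unchanged
)

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)),
])
```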
The class imbalance makes accuracy ill-suited for this model, so the F1 score was used instead. The test performance metrics across all groups are as follows:
- Accuracy score: 0.624
- Recall score: 0.526
- Precision score: 0.330
- F1 score: 0.406
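Metrics like these come straight from scikit-learn's metric functions; for example, reusing `y_test` and `preds` from the training sketch above:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# `y_test` and `preds` come from the training sketch above (threshold already applied).
print("Accuracy :", accuracy_score(y_test, preds))
print("Recall   :", recall_score(y_test, preds))
print("Precision:", precision_score(y_test, preds))
print("F1       :", f1_score(y_test, preds))
```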
For a breakdown of different parity measures and fairness in machine learning, take a look at Google's Machine Learning Glossary: Fairness.
With the original model, a single threshold was used. Under this model, both demographic parity and equalized odds were not satisfied, but predictive value parity was met. The original model substantiated complaints from Black complainants far less often than it did for non-Black complainants. The violation of equalized odds means the model gave more complaints submitted by non-Black complainants the “benefit of the doubt,” classifying more CCRB-unsubstantiated complaints as substantiated than it did for Black complainants.
To try to improve fairness along these measures, two different threshold tests were conducted to find better thresholds per group (complainant ethnicity in this case). First, the thresholds that maximize utility for each individual group were used. These thresholds satisfied demographic parity and equalized odds, but not predictive value parity. Second, the thresholds that satisfy all three parity measures (demographic parity, equalized odds, and predictive value parity) were found using the training data and then tested on a separate dataset. See the paper for a more in-depth breakdown of how these tests were conducted.
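For concreteness, here is a simplified sketch of the group-wise audit these checks require: per-group selection rate (demographic parity), true and false positive rates (equalized odds), and positive predictive value (predictive value parity). The function name and layout are illustrative, not the notebook's code.

```python
import numpy as np
import pandas as pd

def parity_report(y_true, y_pred, is_black):
    """Group-wise rates used to check the three parity measures.
    All inputs are 1-D arrays; `is_black` is a boolean group indicator."""
    y_true, y_pred, is_black = map(np.asarray, (y_true, y_pred, is_black))
    rows = {}
    for name, mask in [("Black", is_black.astype(bool)), ("non-Black", ~is_black.astype(bool))]:
        yt, yp = y_true[mask], y_pred[mask]
        rows[name] = {
            "selection rate (demographic parity)": yp.mean(),
            "TPR (equalized odds)": yp[yt == 1].mean(),
            "FPR (equalized odds)": yp[yt == 0].mean(),
            "PPV (predictive value parity)": yt[yp == 1].mean(),
        }
    return pd.DataFrame(rows)

# Usage (hypothetical group indicator for the test rows):
# parity_report(y_test, preds, is_black_test)
```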
Depending on what is most important to the decision-making body (the CCRB), different threshold(s) would be chosen. Maximizing utility overall enforces equality for all complainants, while maximizing utility for each group is more equitable. Satisfying demographic parity ensures that substantiation doesn’t depend on complainant ethnicity. Equality of odds focuses on ensuring that the chance of obtaining the benefit of substantiation is equal across groups, while predictive value parity focuses on ensuring that deserving complainants receive substantiation and undeserving complainants don’t. The table below demonstrates different group thresholds that could be used, depending on the goal (maximizing utility or enforcing a parity measure).
| Threshold for Black Complainants | Threshold for non-Black Complainants | Test | Demographic Parity | Equality of Odds | Predictive Value Parity |
|---|---|---|---|---|---|
| 0.527 | 0.527 | Max utility overall | Not satisfied | Not satisfied | Satisfied |
| 0.522 | 0.546 | Max utility per group | Satisfied | Satisfied | Not satisfied |
| 0.522 | 0.541 | Enforce equality of odds | N/A | Strictly satisfied | N/A |
| 0.522 | 0.535 | Enforce predictive value parity | N/A | N/A | Not satisfied |
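Applying any row of the table amounts to comparing each complaint's predicted probability against its group's cut-off. A small illustrative helper (name hypothetical) might look like:

```python
import numpy as np

def predict_with_group_thresholds(probs, is_black, t_black, t_nonblack):
    """Substantiate a complaint when its predicted probability clears its
    group's threshold, e.g. 0.522 / 0.546 for the max-utility-per-group row."""
    probs = np.asarray(probs)
    is_black = np.asarray(is_black, dtype=bool)
    thresholds = np.where(is_black, t_black, t_nonblack)
    return (probs >= thresholds).astype(int)
```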
Hopefully this paper shows that, oftentimes, an algorithm or model may (unintentionally) be unfair or inequitable if parity measures or metrics of fairness are not considered. As for this specific example of the CCRB, the model misses out on much of the data gathered during the investigation process, yet it still demonstrates evidence of inequity and unfairness. The bottom line is that any model or algorithm absolutely should consider its fairness!