diff --git a/content/post/smallset_timelines.md b/content/post/smallset_timelines.md
new file mode 100644
index 0000000..4dd457f
--- /dev/null
+++ b/content/post/smallset_timelines.md
@@ -0,0 +1,128 @@
+---
+title: "Smallset Timelines for Communicating Data Preprocessing Decisions"
+description: "Data preprocessing is messy and nuanced but full of consequential decisions. A cartoon strip can be generated from your preprocessing code to aid understanding and reproduction."
+date: "2024-11-18"
+draft: false
+categories:
+ - "research"
+tags:
+ - "visualisation"
+ - "data"
+ - "integratedAI"
+---
+
+##### Posted by _Lexing Xie_ and _Lydia Lucchesi_.
+
+
+
+ +
+Smallset Timelines, and the associated [R package](https://cloud.r-project.org/web/packages/smallsets/index.html) [smallsets](https://lydialucchesi.github.io/smallsets/), facilitate visual documentation of data preprocessing.
+
+
+
+
+
+Data preprocessing is a crucial intermediate stage in quantitative data analysis. During this stage, data practitioners decide how to resolve dataset issues and transform, clean, and format the dataset(s). It
+can be a challenging stage, full of decisions that have the potential to influence analytical outcomes. Yet, data preprocessing is often treated as behind-the-scenes work and overlooked in research dissemination. This discrepancy between the practice and the presentation of data analytics
+limits the replication, interpretation, and use of research outputs.
+
+
+
+The two central contributions in [Lydia's 2024 PhD thesis](https://lydialucchesi.github.io/thesis/thesis_LydiaLucchesi.pdf) are Smallset Timelines and smallsets. The Smallset Timeline is a static
+and compact visualisation that documents the sequence of decisions in a preprocessing pipeline;
+it is composed of small data snapshots of different preprocessing steps. The smallsets software builds a Smallset Timeline from a user’s data preprocessing script that contains structured
+comments with snapshot instructions. Together, Smallset Timelines and smallsets are designed to support the production of accessible data preprocessing documentation.
+
+This post illustrates these contributions with four examples:
+
+1. eBird data in citizen science
+1. HMDA home loan data, reflecting nuances in defining and reporting on race
+1. Examining fairness in income classification from the American Community Survey
+1. NASA software defect data
+
+We conclude this overview with an example notebook that shows how easily smallsets can be added to existing data preprocessing code, followed by an FAQ.
+
+
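To make the idea of structured comments concrete, here is a minimal Python sketch. It assumes a comment pattern of the form `# smallsets start|snap|end <data> caption[...]caption`. The example script, the column names, and the `find_snapshots` helper are hypothetical and only illustrate how snapshot instructions can live alongside ordinary preprocessing code. It is not how smallsets itself is implemented.

```python
import re
from typing import NamedTuple

# Hypothetical preprocessing script. The structured comments mark where
# snapshots of the data frame `df` should be taken (assumed syntax).
SCRIPT = '''\
# smallsets start df caption[Remove rows where C2 is False.]caption
df = df[df["C2"]]

# smallsets snap df caption[Fill missing C6 values with the column mean.]caption
df["C6"] = df["C6"].fillna(df["C6"].mean())

# smallsets end df caption[Create C9 by summing C3 and C4.]caption
df["C9"] = df["C3"] + df["C4"]
'''


class Snapshot(NamedTuple):
    line_number: int  # 1-based line of the structured comment
    kind: str         # "start", "snap" or "end"
    data_name: str    # name of the data object to snapshot
    caption: str      # text describing the preprocessing step


# Matches one structured comment and captures its three parts.
PATTERN = re.compile(
    r"^#\s*smallsets\s+(start|snap|end)\s+(\w+)\s+caption\[(.*)\]caption\s*$"
)


def find_snapshots(script: str) -> list[Snapshot]:
    """Return the snapshot instructions found in a script, in order."""
    snapshots = []
    for number, line in enumerate(script.splitlines(), start=1):
        match = PATTERN.match(line.strip())
        if match:
            kind, data_name, caption = match.groups()
            snapshots.append(Snapshot(number, kind, data_name, caption))
    return snapshots


for snap in find_snapshots(SCRIPT):
    print(f"line {snap.line_number}: {snap.kind:<5} {snap.data_name} -> {snap.caption}")
```

Each extracted instruction pairs a point in the script with a caption, which is the information a timeline needs to show one data snapshot per documented step.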
+
+#### **Example 1: eBird Data in Citizen Science**
+
+
+ +
+ + +
+ +
+
+#### **Example 2: HMDA Home Loan Data - Nuances in Defining and Processing Race**
+
+
+ +
+ +
+ +
+ + +
+ +#### **Example 3: Examining Fairness in Income Classification** + + +
+ +
+ Smallset Timeline of ACS California data preprocessed with the validity-median
+setting. Smallset selected with random sampling. The preprocessing script and smallsets
+code for this figure are in the code section below.
+
+ +
+ +
+ The effect of four different preprocessing settings on data and prediction. Plot
+a) shows dataset imbalance by gender. Plots b) and c) show group fairness measures in predictions from a logistic regression model. Error bars refer to 95% Newcombe intervals.
+
+ +
+
+#### **Example 4: A Widely Used Dataset of Software Defects**
+
+
+ +
+ + +
+
+#### **Example Notebook for the Fairness Example**
+
+
+ +
+ +
+ +
+ +
+
+#### **FAQ** (detailed answers coming soon, new questions most welcome)
+
+* _Will smallsets automate data preprocessing?_ In short, no.
+* _Is Python code supported?_ Yes, in IPython notebooks.
+* _Will smallsets support preprocessing code across different scripts?_ Not yet.
+* _Will smallsets support word embeddings, large language models and the like?_ Not yet. Let us know what you think is important to support.
+
+
+
+#### **Resources**
+
+* [Smallset Timelines: A Visual Representation of Data Preprocessing Decisions](https://arxiv.org/abs/2206.04875), Lydia R. Lucchesi, Petra M. Kuhnert, Jenny L. Davis, and Lexing Xie, Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022
+
+* [Visualisation and Software to Communicate Data Preprocessing Decisions](https://lydialucchesi.github.io/thesis/thesis_LydiaLucchesi.pdf), Lydia R. Lucchesi, PhD Thesis, The Australian National University, 2024
diff --git a/static/img/smallset/acs.png b/static/img/smallset/acs.png
new file mode 100644
index 0000000..fb766a4
Binary files /dev/null and b/static/img/smallset/acs.png differ
diff --git a/static/img/smallset/ebird.png b/static/img/smallset/ebird.png
new file mode 100644
index 0000000..0e6ac02
Binary files /dev/null and b/static/img/smallset/ebird.png differ
diff --git a/static/img/smallset/fairness.png b/static/img/smallset/fairness.png
new file mode 100644
index 0000000..a85d31c
Binary files /dev/null and b/static/img/smallset/fairness.png differ
diff --git a/static/img/smallset/gray_general.pdf b/static/img/smallset/gray_general.pdf
new file mode 100644
index 0000000..a1a4dd3
Binary files /dev/null and b/static/img/smallset/gray_general.pdf differ
diff --git a/static/img/smallset/hmda_A.png b/static/img/smallset/hmda_A.png
new file mode 100644
index 0000000..0a5b873
Binary files /dev/null and b/static/img/smallset/hmda_A.png differ
diff --git a/static/img/smallset/hmda_B.png b/static/img/smallset/hmda_B.png
new file mode 100644
index 0000000..6ca3f56
Binary files /dev/null and b/static/img/smallset/hmda_B.png differ
diff --git a/static/img/smallset/notebook1.png b/static/img/smallset/notebook1.png
new file mode 100644
index 0000000..5480b19
Binary files /dev/null and b/static/img/smallset/notebook1.png differ
diff --git a/static/img/smallset/notebook2.png b/static/img/smallset/notebook2.png
new file mode 100644
index 0000000..abb946f
Binary files /dev/null and b/static/img/smallset/notebook2.png differ