Introducing software development practices and tools for research in the behavioral and social sciences
Author: Pablo Caceres
Contact: pcaceres@wisc.edu
Research in the behavioral and social sciences (B&SS) increasingly relies on complex computational procedures. Nonetheless, researchers in the B&SS usually have little formal training in software development for scientific computing. This limits their ability to produce data processing pipelines that are reproducible, reusable, reliable, maintainable, extensible, and shareable with the wider scientific community. Introducing a set of practices and tools from software development can significantly alleviate this situation and improve the long-term sustainability of research that relies on heavy computation.
In this talk, I provide a selection of practices and tools that require relatively low effort in exchange for a high impact on researchers' computational workflows. I also provide a minimal example illustrating the application of these simple principles in an end-to-end data analysis project.
In this tutorial, we will reproduce the contents of this repo step by step. Therefore, it is recommended to create a directory to host both the sf_for_beh_ss repo and your reproduction. To do this, run in the command line:
# make the directory
mkdir tutorial
# navigate inside the directory
cd tutorial
# clone the tutorial materials
git clone https://github.com/pabloinsente/sf_for_beh_ss.git
To run the examples, you'll need Python 3.7 installed on a Linux/Mac machine.
For Windows users, there are two ways to make the code work:
- Installing Cygwin and running everything from the Cygwin console
- Installing the Ubuntu 18.04 LTS distribution in the Windows Subsystem for Linux. If you use this option, you'll need to type this to access your Windows Desktop files (replace yourName with your user name):
cd /mnt/c/Users/yourName/Desktop
Check your Python installation by typing in the command line:
python --version
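The output should look similar to (any 3.7.x release is fine):
Python 3.7.5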
If you have a different Python version, go and install Python 3.7; look under the "Looking for a specific release?" section.
Python tip: managing multiple versions of Python can get messy. One solution is to use pyenv. Here is an excellent tutorial about installing and using pyenv on multiple operating systems.
Python scripts and Jupyter notebooks are provided in the /src directory. The step-by-step instructions are next in this document:
- Creating a simple and well-organized data file system
- Using virtual environments
- Using version control systems
- Example 1: Writing a basic reproducible script
- Example 2: Setting up machine learning experiment tracking
- Testing your code
- Summary and conclusions
- Resources to learn more
## Creating a simple and well-organized data file system

In the social sciences, it is common to find code repositories where everything is dumped into the same directory: data, code, charts, manuscripts, etc. I've done this multiple times in the past and I regret it. There are many ways in which you can organize your projects. We'll generate a repository structure based on a few conventions from software development, and we'll use the command line to populate our project.
Note about the command line: using the command line may be confusing, and it can make you feel like you might break your computer with a single typo. You can make your experience better by installing a terminal emulator on your computer. Terminal emulators add capabilities like auto-completion, coloring, easy copy-paste, multiple terminals in the same window, and more; there are several free options for every operating system.
About learning to use the command line: learning the command line and bash is beyond the scope of this tutorial. There are many good resources out there (see here), but the trick is just using it as much as you can for your day-to-day tasks (and googling). A few commands are worth mentioning for this tutorial:
- cd: change directory
- mkdir: make a directory
- ls: list files
- touch: create files
- rm: remove files
- rm -r: remove directories
We'll start by creating a project directory:
# make a directory to host your project
mkdir my_awesome_project
# navigate into your project directory
cd my_awesome_project
At the root of your project, it's usually expected to see at least three elements:
- README.md: think about this as the abstract of a paper plus instructions about installation and usage.
- requirements.txt: to indicate the required software dependencies (using pip). More about this later.
- LICENSE.txt: to inform potential users about the usability of your code. GitHub provides a guide about how to choose a license here, and how to add one here
Let's add the files:
touch README.md requirements.txt LICENSE.txt
README.md content: creating README files is repetitive. We'll use this template to add our content:
# Title
## Requirements
## Installation
## Usage
To open README.md and LICENSE.txt in VS Code:
code README.md LICENSE.txt
To add the contents of your LICENSE.txt, you can use this site. Copy-paste the license text and save the file once you're done.
It is a good idea to separate every element of your project into sub-directories:
mkdir src data docs results tests
We will add some filler files to our directories in the meantime:
touch ./src/eda.py ./src/eda.ipynb
touch ./src/stats_example.py ./src/stats_refactor.py ./src/stats_helper.py
touch ./src/ml.py ./src/nn.py
touch ./src/__init__.py ./tests/__init__.py
touch ./data/fake_data.csv
touch ./docs/code_notes.md
touch ./results/fake_plot.jpg
touch ./tests/test_my_code.py
Now, if you ls your directories, you'll see the files:
ls ./data ./docs ./results ./src ./tests
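The output should look similar to:

./data:
fake_data.csv

./docs:
code_notes.md

./results:
fake_plot.jpg

./src:
eda.ipynb  eda.py  __init__.py  ml.py  nn.py  stats_example.py  stats_helper.py  stats_refactor.py

./tests:
__init__.py  test_my_code.py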
Note: GitHub will not upload directories if they are empty. Hence, we created some empty files. More on this later.
## Using virtual environments

Virtual environments are a way to isolate the software requirements of your project. In brief, they say: "use this version of Python, and these versions of these packages, and here is where you can find them".
Using virtual environments is a good idea because it avoids interference between the dependencies of the multiple projects living on your system. It also avoids altering the dependencies of your system installation of Python. Finally, virtual environments facilitate the reproducibility of your projects by specifying the exact environment in which your code was run.
Note about environments and dependencies: other programming languages like R, Julia, etc., have their own solutions for environment isolation and dependency management. Today's examples are based on Python. If you need to use multiple programming languages, the best solution is to use Docker containers, which let you "package" your whole software system (code, runtime, system tools, system libraries, and settings) to be reproduced on another machine. Docker is beyond the scope of this tutorial. You can learn more here, here, and here.
There are multiple alternatives for creating virtual environments in Python. We'll use venv because it's lightweight and simple to use.
In the root of your project directory type:
python3 -m venv venv
Note: if you're not sure where in your file system your terminal currently is, type this to see the path:
pwd
Creating your venv is the first step. To actually use it, you need to activate it by running:
source venv/bin/activate
As a sanity check:
which python
The output should point to your current directory. It should look similar to:
/home/yourname/Desktop/my_awesome_project/venv/bin/python
As long as your venv is active, Python will search that directory for dependencies, and pip will install dependencies there as well. To list the packages installed in the venv:
pip list
The output should look similar to this (Version may vary):
Package | Version |
---|---|
pip | 19.2.3 |
setuptools | 41.2.0 |
If you see more packages, your pip installation is probably not pointing to the /venv directory (this often happens when you forget to activate your venv).
Once we have the venv activated (this is easy to forget), we can safely install dependencies using pip.
One way to install packages is to simply type pip install package-name. A better way is to specify the package names and versions in the requirements.txt file. Open the file in VS Code by typing code requirements.txt, and copy-paste:
altair==4.0.0
jupyterlab==1.2.4
numpy==1.17.4
pandas==0.25.3
scikit-learn==0.21.3
scipy==1.3.3
statsmodels==0.10.2
tensorflow==2.0.0
wandb==0.8.20
watermark==2.0.2
selenium==3.141.0
pytest==5.3.2
Before installing the dependencies, check the pip version:
pip --version
If you see a version older than 19.2.3, upgrade with:
pip install --upgrade pip
Now you're ready to install the dependencies by running:
pip install -r requirements.txt
You can check the installation with:
pip list
To deactivate the environment:
deactivate
Or simply close your terminal.
If you want to delete the environment run:
rm -rf venv/
The main advantage of the requirements.txt file is that it allows other people to reproduce your dependencies exactly.
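Tip: if you installed packages one by one and want to capture the current state of your venv in a requirements file, pip can write it for you (run this with the venv activated):

pip freeze > requirements.txt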
## Using version control systems

Version control systems are tools for managing and tracking code changes over time in a semi-automated manner. If you have ever created something like this...
my_analysis_script.py
my_analysis_script_final.py
my_analysis_script_final_FINAL.py
my_analysis_script_final_FINAL_FOR_REAL.py
my_analysis_script_2.py
my_analysis_script_2_this_is_the_last_one.py
...
...you may need to use version control. There are many version control systems around (Subversion, Mercurial, etc), but Git is the most popular by far.
What is Git: Git is version control software that manages, tracks, and logs changes to your code on your machine. Git is commonly used along with GitHub as a hosting service.
What is GitHub: GitHub is a hosting service for Git. It allows you to keep everything related to your project in the cloud instead of only on your own machine.
Learning Git may take a while. Fortunately, relatively few commands are needed to track your projects effectively; the rest can be googled as needed. First, check whether Git is already installed:
git --version
If you need to install Git, go to this link and follow the installation instructions for your system.
Run this in the root of your project directory:
git status
The output should look like this:
fatal: not a git repository (or any parent up to mount point /)
This is fine: it means that Git has not been initialized in that directory yet. To start tracking, initialize a repository:
git init
This creates an empty Git repository (or reinitializes an existing one). You'll not see the repository itself because it lives in a .git directory, and directories starting with a "." are normally hidden.
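You can confirm that the hidden .git directory exists by listing all files, including hidden ones:

ls -a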
To confirm the initialization, type:
git status
It should say something like this:
On branch master
No commits yet
...
Next, create a .gitignore file:
touch .gitignore
The .gitignore file tells Git: "DON'T track these files". Whatever you put in there should not appear on GitHub later.
Populating .gitignore: listing all the files to be ignored by hand is repetitive, so we'll use this webpage to automatically generate .gitignore files based on the dependencies we're using. Once on the page, ask for python, jupyter, and venv in the search bar, then copy-paste the generated text into your own .gitignore.
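For reference, the generated file will include entries like these (a small excerpt):

__pycache__/
*.py[cod]
.ipynb_checkpoints
venv/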
Open it in VS Code and paste the generated text:
code .gitignore
git add -A
This stages the files to be committed. This is how we tell Git: "prepare these files to be committed". The -A flag stands for "all changes".
git commit -m "First commit"
Committing records the staged files locally so they can be pushed to the remote repository later. This is how we tell Git: "save these changes locally, I'll send them to GitHub later". The -m flag (short for --message) attaches a message to your commit, which is useful for recording what changes you made to your code.
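You can review your commit history at any time with:

git log --oneline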
To push our files to our remote repository, we need to create one in the first place.
Go to https://github.com/ and create a new empty repository (don't add README or LICENSE). Then copy the remote repository URL.
To connect our local repository with our remote one, run this replacing <GITHUB_URL> with your remote URL:
git remote add origin <GITHUB_URL>
To verify that the remote was added correctly:
git remote -v
The output should look like:
origin https://github.com/pabloinsente/sf_for_beh_ss.git (fetch)
origin https://github.com/pabloinsente/sf_for_beh_ss.git (push)
Now we are ready to push our changes to GitHub (our remote bucket for git and our code):
git push origin master
This should prompt you to enter your username and password.
Note about connecting to GitHub: if you push and fetch a lot, you may want to avoid typing your username and password every time by connecting to GitHub with SSH. Here is a GitHub guide about how to configure SSH.
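The gist of that guide is to generate a key pair and add the public key to your GitHub account settings; generating the key looks like this (the email address is a placeholder):

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"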
After pushing, if you go to your GitHub repo, you should see the added changes.
We have accomplished three things:
- A project structure
- An isolated virtual environment to manage our dependencies
- A version control system to track our progress
## Example 1: Writing a basic reproducible script

Now we need some code that processes data in an automated and reproducible fashion. We will walk through the eda.ipynb and stats_example.py files to see an example of how this may work. You just need to copy-paste them from the workshop repo sf_for_beh_ss/src into your own my_awesome_project/src directory. Remember also to copy-paste the mental_health_tech_data.csv file from sf_for_beh_ss/data to your my_awesome_project/data. Assuming that sf_for_beh_ss and my_awesome_project are under the same directory, you can copy-paste by running:
cp ../sf_for_beh_ss/src/eda.ipynb ../sf_for_beh_ss/src/stats_example.py ./src
cp ../sf_for_beh_ss/data/mental_health_tech_data.csv ./data
From the root of your repository run:
cd src
jupyter lab eda.ipynb
This should open JupyterLab. Further instructions are in the notebook.
Before moving forward, remember to push your changes by:
git add -A
git commit -m "eda results"
git push origin master
From the root of your repository run:
cd src
code stats_example.py stats_refactor.py stats_helper.py
To run the stats_example.py script:
python stats_example.py
The output should look like:
Pearson Chi-square: 175.95516961872426
P-value: 0.0
Degrees of freedom: 1
Test interpretation: reject null hypothesis
Expected frequencies
No Yes
No 377.040767 241.959233
Yes 384.959233 247.040767
Now we will refactor the stats_example.py file by separating the function definitions (stats_helper.py) from the function calls (stats_refactor.py); a sketch of what the split might look like appears after the commands below. Further instructions will be given in the workshop. Once you're done, run:
python stats_refactor.py # to print to the console
python stats_refactor.py > ../results/chi2.txt # to print to a .txt file
Print the results from your file:
cat ../results/chi2.txt
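For reference, a minimal sketch of what the split might look like (the function and column names here are illustrative, not necessarily the workshop's exact code):

# stats_helper.py: reusable function definitions
import pandas as pd
from scipy.stats import chi2_contingency

def run_chi2(df, col_a, col_b):
    """Chi-square test of independence between two categorical columns."""
    table = pd.crosstab(df[col_a], df[col_b])
    chi2, p, dof, expected = chi2_contingency(table)
    return chi2, p, dof, expected

# stats_refactor.py: a thin script that only loads the data and calls the helper
import pandas as pd
from stats_helper import run_chi2

df = pd.read_csv('../data/mental_health_tech_data.csv')
chi2, p, dof, expected = run_chi2(df, 'treatment', 'family_history')  # illustrative column names
print(f'Pearson Chi-square: {chi2}')
print(f'P-value: {p}')
print(f'Degrees of freedom: {dof}')

Keeping the function definitions separate from the calls is what makes the functions importable from other scripts (and from tests, as we'll see later).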
Now, you have a data analysis script that is:
- reusable: you can use parts of your code for further analysis and/or projects
- maintainable: easy to fix
- extensible: easy to add more functionality
- shareable: others can clone your repo and run your script easily
Before moving forward, remember to push your changes by:
git add -A
git commit -m "stats results"
git push origin master
## Example 2: Setting up machine learning experiment tracking

Machine learning usually entails many rounds of iteration over hyper-parameters, architectures, data partitions, etc. This makes it hard to keep track of your experiments and metrics over time, which may hinder reproducibility. Several tools have been created recently to tackle this issue (e.g., MLflow, Comet, etc.). In our case, we will use Weights & Biases to showcase a very simple example of how this might work.
This should be installed already (if you pip-installed the requirements.txt). Otherwise, it can be installed with:
pip install wandb
wandb login
This should prompt you to log in. If you don't have an account, create one and log in. Follow the instructions and paste the API key into the command line. If you did this right, you should see a "Successfully logged in to Weights & Biases!" message.
Since we don't have time to write an ML pipeline from scratch, we will use the scripts provided in the /src folder of the workshop repo, and add some code to those scripts to make things work. Again, copy-paste ml.py and nn.py from sf_for_beh_ss/src into your own my_awesome_project/src:
cp ../sf_for_beh_ss/src/ml.py ../sf_for_beh_ss/src/nn.py ./src
Then open the files in vscode:
code ./src/ml.py ./src/nn.py
Tracking configuration and metrics with wandb is done in four steps:

# Step 1: import wandb (and the Keras callback used in step 4)
import wandb
from wandb.keras import WandbCallback

# Step 2: initialize wandb project tracking
wandb.init(project='my-awesome-project')

# Step 3: add tracking configuration
config = wandb.config  # config holds and saves hyperparameters and inputs
config.epochs = 100
config.dropout = 0.2
...

# Step 4: tell wandb to log the experiment configuration and metrics.
# For instance, at the end of a Keras model, we just need to add a WandbCallback
# when fitting the model:
model.fit(X_train_transform,
          y_train,
          epochs=config.epochs,
          validation_data=(X_test_transform, y_test),
          callbacks=[WandbCallback()])
Weights & Biases supports multiple Python frameworks: scikit-learn, TensorFlow, Keras, PyTorch, fast.ai, and XGBoost. Each framework follows the same steps with minimal variations. See the documentation to learn more about this.
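For frameworks without a ready-made callback, the generic pattern is to log metrics yourself with wandb.log. Here is a minimal, self-contained sketch using scikit-learn; the toy dataset and metric are illustrative stand-ins, not the contents of the workshop's ml.py:

import wandb
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# a built-in toy dataset stands in for the workshop data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# initialize tracking, fit the model, then log metrics as a key-value dictionary
wandb.init(project='my-awesome-project')
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
wandb.log({'test_accuracy': model.score(X_test, y_test)})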
To run the scikit-learn example, navigate to the src/ directory and run:
cd src/
python ml.py
If successful, wandb should output something like:
wandb: Synced some-funny-name: https://app.wandb.ai/username/projectname/runs/hash-number
Then, you can click on that link to see your results.
To run the Tensorflow/Keras example:
python nn.py
If the scripts ran successfully, wandb will generate a URL where you can see the project data and metrics online.
Before moving forward, remember to push your changes by:
git add -A
git commit -m "ml results"
git push origin master
## Testing your code

Code testing is an uncommon yet very important part of writing software for scientific computing in a reliable and reproducible fashion. There are multiple testing frameworks in the Python ecosystem. We will use pytest because of its simplicity and popularity.
This should be installed already (if you pip-installed the requirements.txt). Otherwise, it can be installed with:
pip install pytest
Check the installation:
pytest --version
Again, you just need to copy-paste the test_my_code.py file from sf_for_beh_ss/tests into your own my_awesome_project/tests:
cp ../sf_for_beh_ss/tests/test_my_code.py ./tests
pytest works by searching for files named test_something.py or something_test.py (note the "test" keyword) and running any function or method whose name begins with "test". Let's check our unit test's contents before running:
cd tests
code test_my_code.py
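To give a flavor of what such a file contains, here is a minimal, self-contained pytest-style test (a sketch only; the actual workshop file may test different things):

# test_my_code.py: a minimal sketch of a pytest unit test
import pandas as pd
from scipy.stats import chi2_contingency

def test_chi2_on_balanced_table():
    # a perfectly balanced 2x2 table should not reject independence
    df = pd.DataFrame({'a': ['No', 'No', 'Yes', 'Yes'] * 10,
                       'b': ['No', 'Yes', 'No', 'Yes'] * 10})
    table = pd.crosstab(df['a'], df['b'])
    chi2, p, dof, expected = chi2_contingency(table)
    assert dof == 1    # a 2x2 table has one degree of freedom
    assert p > 0.05    # balanced data: we should fail to reject the null

pytest discovers and runs this function automatically because both the file name and the function name start with "test".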
To run the unit test:
pytest
If successful, pytest should output something like this:
=========1 passed in 0.51s=========
## Summary and conclusions

In this tutorial, we have accomplished the following:
- A project structure
- An isolated virtual environment to manage our dependencies
- A version control system to track our progress and changes
- An automated data analysis script
- A machine learning experiment tracking system
- A semi-automated unit testing script
By combining all these elements, we created a project workflow that is:
- reusable: you can use parts of your code for further analysis and/or projects
- maintainable: easy to fix
- extensible: easy to add extra functionality
- shareable: others can clone your repo and run your scripts easily
- reliable: you can trust your results (with appropriate testing)
- reproducible: others can produce the same results given the same data and dependencies
Of course, this is a minimal and very simple example. All the attributes we mentioned (reliable, reproducible, etc.) are not a matter of all or nothing, but guiding principles. Our hope is that the practices and tools used in this tutorial contribute to getting closer to such ideals.
## Resources to learn more

Software development is an enormous field with a lot to offer to people doing computationally intensive research. In this tutorial, the mantra was to provide a minimal set of practices and tools. Below is a list of resources to learn more:
- How to Write Beautiful Python Code With PEP 8
- Python Code Quality: Tools & Best Practices
- Refactoring (Book)
- Clean Code (Book)
- Maintainable Code in Data Science
- Software Testing for Data Scientist
- Getting Started With Testing in Python
- Python Testing with pytest (Book)
- Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., ... & Waugh, B. (2014). Best practices for scientific computing. PLoS Biology, 12(1), e1001745.
- Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLoS Computational Biology, 13(6), e1005510.
- Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research.
- Hart, E. M., Barmby, P., LeBauer, D., Michonneau, F., Mount, S., Mulrooney, P., ... & Hollister, J. W. (2016). Ten simple rules for digital data storage.
- Rule, A., Birmingham, A., Zuniga, C., Altintas, I., Huang, S. C., Knight, R., ... & Rose, P. W. (2019). Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks. PLoS Computational Biology, 15(7).
- Perez-Riverol, Y., Gatto, L., Wang, R., Sachsenberg, T., Uszkoreit, J., Leprevost, F. da V., … Vizcaíno, J. A. (2016). Ten Simple Rules for Taking Advantage of Git and GitHub. PLOS Computational Biology, 12(7), e1004947.
- Taschuk, M., & Wilson, G. (2017). Ten simple rules for making research software more robust. PLOS Computational Biology, 13(4), e1005412.
- Hinsen, K. (2015). Technical Debt in Computational Science. Computing in Science & Engineering, 17(6), 103–107.