Light Weight Reproducible Research for Data Scientists
A minimal standard for data analysis and other scientific computations is that they be reproducible: that the code and data are assembled in a way so that another group can re-create all of the results (e.g., the figures in a paper). – Karl Broman
Preface
My thoughts and views on this topic are hardly unique. The following resources have been instrumental in shaping my world view:
- Software-Carpentry
- Data-Carpentry
- Good Enough Reproducibility
- Creating A Reproducible Workflow
- Code and Data for the Social Sciences
- CTB on Growing A Development Process
Motivations
I’ve spent the last 7 years largely working on solo projects. Over the last 6 months, I’ve gotten to work with a lot of other very smart people in varying situations. You’re constantly warned that poorly written and poorly documented code is a nightmare. Compound that with the scientific method and data-driven decisions and it gets worse.
My hope for this document is to codify practices that have worked well for me, and ones that I’ve seen work well for others, and then to follow them.
Personal Statement:
Data science means different things to different people.
I personally believe that data science shouldn’t be done in a vacuum; it’s a form of software development and should be treated as such.
A well thought out and executed workflow should allow research and its resulting code to be understandable, auditable, and transferable.
Guiding principles for this document:
Any proposed activity that isn’t directly related to model building and experimentation shouldn’t unduly burden me.
Flexibility is fine as long as it doesn’t undermine the general goal of reproducible research.
Practices should be reviewed at the closing of any project (or major version release), or whenever the hell it seems appropriate.
Activities (mandated or otherwise) should make code understandable, auditable, or transferable.
Example Project Structure
SuperAwesomeProject/
├── bin/
├── data/
├── docs/
│   ├── CHANGELOG.txt
│   ├── DATA.md
│   ├── REPORT.md
│   └── SOW.md
├── notebooks/
├── results/
├── README.md
├── requirements.txt
└── src/
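The layout above can be created with a short script. This is a minimal sketch, not a required tool: the function name `scaffold_project` is hypothetical, and the directory and file names are copied from the example structure.

```python
from pathlib import Path

# Names are taken from the example layout; `scaffold_project` is a
# hypothetical helper, not part of any library.
DIRECTORIES = ["bin", "data", "docs", "notebooks", "results", "src"]
FILES = [
    "docs/CHANGELOG.txt",
    "docs/DATA.md",
    "docs/REPORT.md",
    "docs/SOW.md",
    "README.md",
    "requirements.txt",
]


def scaffold_project(root):
    """Create the empty project skeleton under `root` and return its path."""
    root = Path(root)
    for name in DIRECTORIES:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() does not overwrite the contents of an existing file.
        (root / name).touch(exist_ok=True)
    return root
```

Calling `scaffold_project("SuperAwesomeProject")` is safe to repeat, since existing folders and files are left in place.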
bin/
This folder should contain any binaries required for the project (should they exist). The version and access date should be recorded when they are compiled or fetched and placed in this folder.
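One way to record the version and access date is a small manifest kept next to the binaries. The sketch below is an assumption about how that could look: the file name `MANIFEST.json`, the function `record_binary`, and the added SHA-256 checksum are all hypothetical choices rather than an established convention.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def record_binary(binary_path, version, source, manifest_name="MANIFEST.json"):
    """Append version, source, access date and checksum for one binary."""
    binary_path = Path(binary_path)
    manifest_path = binary_path.parent / manifest_name
    # Load earlier entries so the manifest keeps a history.
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    entries.append({
        "file": binary_path.name,
        "version": version,
        "source": source,
        "accessed": date.today().isoformat(),
        # The checksum (an addition to the text above) lets a later reader
        # confirm they hold the same binary.
        "sha256": hashlib.sha256(binary_path.read_bytes()).hexdigest(),
    })
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entries[-1]
```

A plain text note with the same fields would serve just as well; the point is that the information lives in `bin/` alongside the binary it describes.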
data/
Raw (primary) data should be in this folder. If the raw data is too large to fit on a local share, use a soft link instead. (In that case, these files should NEVER be checked into a repo!)
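Creating the soft link can be scripted so that every collaborator ends up with the same path inside `data/`. This is a sketch under assumptions: `link_raw_data` and the default link name `raw` are hypothetical, and it assumes a platform where the user is allowed to create symbolic links.

```python
from pathlib import Path


def link_raw_data(source_dir, project_root, link_name="raw"):
    """Create data/<link_name> as a symbolic link to `source_dir`."""
    source_dir = Path(source_dir).resolve()
    link = Path(project_root) / "data" / link_name
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # replace a stale link rather than failing
    link.symlink_to(source_dir, target_is_directory=True)
    return link
```

Listing the link's path in `.gitignore` is one way to make sure it never ends up in the repo.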
docs/
You don’t need all of these documents, but these are some you MIGHT want to include.
- CHANGELOG.txt - This document should act as a high-level “notebook” for the project, tracking in reverse chronological order when major decisions and changes were made.
- DATA.md - This document should describe the source data in detail: where it came from, how it was accessed, descriptions of variable names, encoding, etc. If intermediate processing is done to tidy or transform the data, that should be recorded and described here as well. (Given the data sprawl in many projects, it’s reasonable to assume that there may be a data dictionary for each source or transformation.)
- REPORT.md - A final write-up describing the research effort, any conclusions drawn, insights gained, etc.
- SOW.md - A scoping document of some sort. It should be at least a semi-formal document that outlines the goals and scope of the project. Refer to it often to keep from going down a rabbit hole.
notebooks/
Not all projects will have notebooks, but many will.
R Markdown/Jupyter notebooks should go here. These notebooks represent the lion’s share of the hours spent on a project. They should be numbered in the order that they should be run, and each should represent a minimum unit of work.
├── notebooks/
├── 01 - DataExploration
├── 02 - TransformationAndPreProcessing
├── 03a - RandomForestModel
├── 03b - SupportVectorMachine
└── 04 - VotingClassifier
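The numbering convention in the listing above can be checked mechanically. The sketch below assumes names follow the `NN - Name` or `NNx - Name` pattern shown; the function `notebooks_in_run_order` is hypothetical and simply reports the run order plus anything that is not numbered.

```python
import re
from pathlib import Path

# Matches prefixes such as "01 - " or "03a - " from the listing above.
PREFIX = re.compile(r"^(\d+)([a-z]?) - ")


def notebooks_in_run_order(notebook_dir):
    """Return (ordered, unnumbered) notebook names found in `notebook_dir`."""
    ordered, unnumbered = [], []
    for path in Path(notebook_dir).iterdir():
        match = PREFIX.match(path.name)
        if match:
            # Sort numerically on the number, then on the optional letter.
            ordered.append((int(match.group(1)), match.group(2), path.name))
        else:
            unnumbered.append(path.name)
    return [name for _, _, name in sorted(ordered)], sorted(unnumbered)
```

Sorting on the parsed number rather than the raw name keeps `10 - ...` after `09 - ...` even if someone forgets the leading zero.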
Notebooks should be written cleanly enough, and with enough documentation, that someone unfamiliar with your project can pick them up and understand at least what happened and what steps were taken to complete them.
Finished Notebooks should be clean and well put together.
This is a notebook I used for a 20-minute challenge at a company; it’s bad. I’ve seen much worse though.
Once I have some non-work related notebooks available that I am comfortable with I’ll add them as better examples.
IF YOU’RE STORING/DISPLAYING THESE IN A PLACE WHERE THEY CAN’T BE RENDERED NATIVELY KEEP COPIES AS MARKDOWN OR PDF SO THAT PEOPLE CAN READ THEM
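For Jupyter notebooks, the readable copies can be produced with the `jupyter nbconvert` command. The sketch below wraps it from Python; the helper names `nbconvert_command` and `export_notebooks` are hypothetical, it assumes Jupyter is installed and on the PATH, and it does not cover R Markdown. Note that converting to PDF also needs a working LaTeX installation.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook, fmt="markdown"):
    """Build the `jupyter nbconvert` command for one notebook."""
    return ["jupyter", "nbconvert", "--to", fmt, str(notebook)]


def export_notebooks(notebook_dir, fmt="markdown"):
    """Convert every .ipynb file in `notebook_dir`; requires Jupyter."""
    for notebook in sorted(Path(notebook_dir).glob("*.ipynb")):
        # check=True raises CalledProcessError if a conversion fails.
        subprocess.run(nbconvert_command(notebook, fmt), check=True)
```

Running `export_notebooks("notebooks")` after finishing a unit of work keeps the rendered copies in step with the notebooks themselves.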
results/
Intermediate data, transformations, CSVs, graphs, pickled models, etc. should go in here. Basically, if it prints to disk and isn’t primary, put it here. (Further organizational hierarchy may be appropriate here: multiple models, graphs, intermediate data, etc.)
git gud:
- Only check in files that need to be tracked - git IS your versioning system. If you’re checking in files with names like “ver_3_final”, re-evaluate why you’re doing it.
- Commit files once a unit of work is complete - this keeps your log cleaner and makes it easier to stop at natural breaking points.
- Choose a branch/merge workflow - go check out git-flow or github-flow.
- Define a checklist for merges - all pull requests and subsequent merges should pass some minimum QC. Decide what that looks like and, if possible, require independent review for approval.
- Don’t check in large, constantly changing files - if you need to, use git-lfs or a different method. Checking in these files directly will choke git quickly.
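Two of the points above, version numbers in file names and large files, are easy to check before committing. The sketch below is one possible check, not a standard tool: the function `check_staged_files`, the name pattern, and the 5 MB limit are all hypothetical and should be tuned to the project.

```python
import re
from pathlib import Path

# Hypothetical pattern: flags "ver_3", "ver2" or "final" as separate tokens,
# but not words that merely contain them (e.g. "server1", "finalize").
VERSION_IN_NAME = re.compile(r"(?<![a-z])(ver_?\d+|final)(?![a-z])", re.IGNORECASE)
MAX_BYTES = 5 * 1024 * 1024  # assumed limit, not a git rule


def check_staged_files(paths, max_bytes=MAX_BYTES):
    """Return (path, reason) pairs for files worth a second look."""
    problems = []
    for path in map(Path, paths):
        if VERSION_IN_NAME.search(path.stem):
            problems.append((str(path), "version encoded in file name"))
        if path.is_file() and path.stat().st_size > max_bytes:
            problems.append((str(path), "large file, consider git-lfs"))
    return problems
```

The list of paths could come from `git diff --cached --name-only`, which prints the files currently staged for commit.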
Coding guidelines
- PEP-8
- At the bare minimum, comment what a block of code accomplishes.
- If you write a function, put in a docstring and populate it appropriately.
- If you utilize someone else’s code from SO, link back.
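As an illustration of the commenting and docstring points above, here is a small function written that way. The function itself (`min_max_scale`) is a made-up example, and the NumPy-style docstring layout is just one common choice; any consistent docstring style would satisfy the guideline.

```python
def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1].

    Parameters
    ----------
    values : list of float
        Input numbers; must contain at least two distinct values.

    Returns
    -------
    list of float
        Scaled values in the same order as the input.

    Raises
    ------
    ValueError
        If all values are equal, so the range is zero.
    """
    lowest, highest = min(values), max(values)
    if lowest == highest:
        raise ValueError("cannot scale values with zero range")

    # Shift so the minimum is 0, then divide by the range so the maximum is 1.
    span = highest - lowest
    return [(value - lowest) / span for value in values]
```

For example, `min_max_scale([2, 4, 6])` returns `[0.0, 0.5, 1.0]`.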