The world is starving for more science and science is starving for more software.
Every scientist spends hours at their computer: designing experiments, documenting lab work, and analyzing data, all within inflexible, outdated point-and-click tools. No software product can fully keep up with scientific use cases, which by definition are constantly changing.
That history of clicks is as important to the quality and reproducibility of scientific results as a detailed lab protocol, yet these procedures are not recorded and are prone to mistakes and variations. Excel, the workhorse data analysis tool, can hide calculation errors in opaque cell formulas. Manual data entry into spreadsheets is tedious and error-prone.
The obvious solution is to do data analysis in code, and many scientists are eager to write their own scripts to accelerate their work. In my experience, almost all experimentalists have tried Python or R for things like plotting, but most don’t yet feel comfortable writing a robust end-to-end data pipeline.
AI is amazing at writing code, and now is the perfect moment in history for every scientist to level up as a programmer. But while LLMs will happily write usable Python code, the result is typically a one-off script, almost as ephemeral as a point-and-click Prism workflow[^1]. There are a few non-obvious steps required to turn that code into robust, reproducible, well-engineered software.
I’m here to help. This article is meant as a resource for scientists who have begun to write code but want a bit more structure and best-practice guidance for fitting it into their day-to-day work[^2].
This post will be about principles and a standard project structure for managing data and code. If there’s enough interest, in a future post I’ll walk through a step-by-step tutorial of Cursor and Python for LLM-driven data analysis, using a mock plate reader dataset.
Project templates
Software engineers rarely have to think much about how to organize their project folders, because there are well-established conventions for every language (Python, Go, R) and type of project (web development, ML, library). This is a huge time saver both for the programmer building the project and for the programmer reviewing the code, since they know exactly where to put (or look for) every type of file or functional module.
But even though scientists still do a large fraction of their work on their local filesystem, there are no similar standards (that I’m aware of) for how they should organize their files.
Using a conventional folder structure sets you up to easily layer on software best practices as you level up as a programmer, because these tools often rely on certain unspoken file and folder conventions[^3].
The main advantage of having a standardized, well-organized file and folder structure, though, is reproducibility.
Reproducibility
Two guiding principles for organizing your project should be[^4]:
1. Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. In fact, they should be able to re-run every piece of it and obtain exactly the same result. That unfamiliar person is probably you, several months later, after you have forgotten the details.
2. Everything you do, you will probably have to do over again.
Reproducibility is not simply a “nice-to-have” for sharing work externally; it is an absolute necessity for the iterative nature of the research process. If you want to be an efficient, productive scientist, you need to be able to re-run any analysis six months later.
Reproducibility implies that everything you need to produce the final tables and figures is available within the project folder. You should be able to copy the folder to any computer with internet access and still run the full analysis. If this involves long-running or compute-heavy tasks, you have other options[^5], but we’ll leave those out of scope for this article.
Source of truth
In the course of data analysis, you will generate many versions of the data. If you’re not careful, you may confuse raw data with modified data. And if you later spot a mistake in your reasoning, you’ll want the ability to go back and re-run the full analysis, which requires a fixed starting point as the source of truth: often the raw instrument data.
Reserve a separate `raw/` folder for data that goes untouched, such as instrument data, and a `data/` folder for processed data generated by the project code. As a rule, I like all files in `data/` to be immutable (never changed once created) and completely derived from files in `raw/` using code from the project.
You’ll also need to store metadata about your samples, controls, plate maps, and so on. Metadata is similar to raw data in that it is a starting point for analysis, but it is often created by hand and often updated during the course of the analysis. It can go in a separate `metadata/` folder. Use YAML or CSV format for metadata so it is easy to load into an analysis script.
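As a minimal sketch of what that loading looks like (the file names `samples_metadata.csv` and `platemap.yaml` are hypothetical placeholders; substitute your own):

```python
import pandas as pd
import yaml  # provided by the PyYAML package

# Tabular metadata (one row per sample) loads straight into a DataFrame.
samples = pd.read_csv("metadata/samples_metadata.csv")

# Nested or free-form metadata (e.g. a plate map) fits YAML well.
with open("metadata/platemap.yaml") as f:
    platemap = yaml.safe_load(f)
```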
Scripts and Notebooks
Scientific computation comes in three flavors: libraries (classes and functions), workflows (scripts) and analyses (notebooks). Typically a project will have all three.
Libraries contain reusable, well-defined, and tested chunks of code that you can drop into scripts and notebooks. Testing is a key advantage—scripts and notebooks are hard to test but functions make testing extremely easy. If you have complex calculations in a script or notebook, you should not feel comfortable until you have pulled those key lines of code into a well-defined function and have written several tests against it.
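For example, here is a minimal sketch of pulling a calculation into a library function and testing it. The function, its normalization scheme, and the file locations are all hypothetical; the point is the shape of the code, not the specific math.

```python
# src/stats.py
import numpy as np

def normalize_to_control(values, control_values):
    """Scale measurements so that the mean of the control wells equals 1.0."""
    control_mean = np.mean(control_values)
    if control_mean == 0:
        raise ValueError("Control mean is zero; cannot normalize.")
    return np.asarray(values, dtype=float) / control_mean


# test_stats.py (run with `pytest` from the project root)
import pytest
from src.stats import normalize_to_control  # assumes src/ is importable, e.g. via an src/__init__.py

def test_controls_normalize_to_one():
    result = normalize_to_control([2.0, 4.0], control_values=[2.0, 2.0])
    assert list(result) == [1.0, 2.0]

def test_zero_control_raises():
    with pytest.raises(ValueError):
        normalize_to_control([1.0], control_values=[0.0])
```

Now the same function can be imported by any script or notebook, and a single `pytest` command checks that it still behaves.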
Workflows are predefined, standardized, generalizable operations that can be run in batch and left to run unsupervised. The typical workflow is just a short Python or Bash script, but they can range up to multi-step, parallelized compute jobs running on clusters. They are often used for data preprocessing or common, repetitive procedures.
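Here is a sketch of what such a workflow script might look like, reading only from `raw/` and writing its output to `data/` (the file pattern and the idea of concatenating per-sample CSVs are assumptions for illustration):

```python
# scripts/01_process_raw_data.py
from pathlib import Path
import pandas as pd

RAW = Path("raw")
DATA = Path("data")

def main():
    DATA.mkdir(exist_ok=True)

    # Read every raw instrument file, tag each with a sample ID taken from
    # its filename, and combine them into one tidy table.
    frames = []
    for path in sorted(RAW.glob("sample_*.csv")):
        df = pd.read_csv(path)
        df["sample_id"] = path.stem
        frames.append(df)

    processed = pd.concat(frames, ignore_index=True)

    # Raw files are never modified; all derived output goes to data/.
    processed.to_csv(DATA / "processed_data.csv", index=False)

if __name__ == "__main__":
    main()
```

Because the script is deterministic and touches nothing in `raw/`, you can delete `data/` at any time and regenerate it.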
Analyses are interactive, iterative, heavy on visualization and summary statistics, and unique to a project. They convert processed data into figures, tables, and summary statistics (results). Analyses usually take the form of computational notebooks such as Jupyter, by far the most popular, although there are other options. They intertwine documentation, visualizations, and code, turning the data analysis into a human-readable story.
Notebooks serve two essential purposes: acting as a running log of day-to-day computational work, and sharing the final results of an analysis. These two roles aren’t necessarily one and the same; it’s good to keep several notebooks for logging daily computational experiments, while reserving a single, separate notebook for the cleaned-up, well-documented final analysis.
Results
Keep a `results/` folder to capture the tables and figures from your notebooks. Just add a command to save the file after each plotting code block, and liberally use `.to_csv()` for any important tables.
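For instance, at the end of a notebook cell (the figure, table, and column names below are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd

# A made-up summary table, standing in for whatever your analysis produces.
summary = pd.DataFrame({"sample_id": ["sample_1", "sample_2"], "mean_signal": [1.0, 2.3]})

fig, ax = plt.subplots()
ax.bar(summary["sample_id"], summary["mean_signal"])
ax.set_ylabel("mean signal")

# Save outputs to results/ in addition to displaying them inline.
fig.savefig("results/fig1.png", dpi=300)
summary.to_csv("results/table1.csv", index=False)
```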
The results folder can also contain your final writeup. This can be another notebook, or a Markdown or LaTeX document that pulls in your figures and tables.
Naming things
You should not put metadata in data filenames. That means avoiding descriptive folder and file names such as “hek_6hr_20mM.csv”, because you can rarely fit all the metadata needed to describe a sample into a single filename. Assign sample IDs and name your files by those (“sample_1.csv”).
Instead, metadata belongs in a table, where it often makes sense to also include the filename as a field. In the example above, consider using a CSV-formatted table with columns `sample_id`, `cell_line`, `timepoint`, `concentration`, and `filename`.
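A quick sketch of what that looks like in practice (the table contents are invented for illustration): the metadata lives in one CSV, and every script or notebook looks samples up by ID instead of parsing filenames.

```python
import pandas as pd

# Hypothetical contents of metadata/samples_metadata.csv:
#
#   sample_id,cell_line,timepoint,concentration,filename
#   sample_1,HEK293,6hr,20mM,sample_1.csv
#   sample_2,HEK293,24hr,20mM,sample_2.csv

meta = pd.read_csv("metadata/samples_metadata.csv")

# Everything about a sample is one lookup away, no filename parsing needed.
sample_1 = meta.loc[meta["sample_id"] == "sample_1"].iloc[0]
raw = pd.read_csv(f"raw/{sample_1['filename']}")
```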
Don’t put versions in your filenames. If you are comfortable with it, use git for version control. If not, don’t sweat it[^7]; just create a new folder for each version of your analysis (such as `v1/`) and repeat the same folder and file naming structure within it. You don’t need to copy and version your raw data if it didn’t change, but you should keep your processed data in a new `data/` folder alongside the version of the code that generated it. If you re-ran the experiment in the lab, of course, then create another version for the new raw data.
Put libraries (source code), workflows (scripts), and analyses (notebooks) into their own folders: `src/`, `scripts/`, and `notebooks/`, respectively. Name notebooks and scripts starting with a number, like `01_format_data.py`. This is not best practice in software, and your programmer friends will scoff[^8], but trust me that it makes your life easier to know the ordering when you have to re-run an analysis.
READMEs
Drop a `README.md` file into any folder whose contents are not absolutely obvious. Don’t write so much detail that it becomes a chore, and don’t worry too much about structure. Just write what you would tell yourself or a colleague who hasn’t seen your project before.
Also keep a `README.md` in the top-level folder. You don’t need to write a ton here either: just a quick overview of the project and a “Getting started” section on how to run it and where to find key results.
Makefiles
Everything you do, from the raw data to the final output, should be encoded in your scripts and notebooks. Then designate a single “wrapper” script that gathers all upstream dependencies and acts as the master switch to re-run the entire analysis from scratch. There are many options here, but I recommend `make` to manage your scripts and notebooks, because it comes preinstalled in most command-line environments and it is better than a simple Bash script[^9]. Reproducing your project then becomes as simple as running `make build`[^10] on the console.
Usually, reproducing your project requires running multiple steps in order, from raw data and metadata through processing scripts to analysis scripts, and a Makefile can track these dependencies. It checks file timestamps and won’t re-run a step if its outputs are already newer than its inputs (including the script itself). This comes in handy so you don’t need to repeat all of your long-running steps.
A Makefile also gives you a handy command-line tool. You can add the full Bash command for a set of scripts to the Makefile and then create a simple shortcut alias to execute them. For example, you can name a group of plotting commands `plot`, and then `make plot` will run them on the command line (using the dependency checking I mentioned above). You can have another set of commands under another name (e.g., `test`) and selectively run those. This smart execution is much better than a Bash script, which will just run every single command line by line.
Makefiles may seem foreign at first, but they’re easy and your LLM can write them for you.
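As a rough sketch of such a Makefile (the targets, dependencies, and commands mirror the example folder structure below and are assumptions to adapt, not a prescription):

```make
# Makefile (recipes must be indented with a tab character)
.PHONY: build plot test

# `make build` re-runs the whole pipeline, skipping anything already up to date.
build: results/fig1.png

# Processed data depends on the raw data and the processing script.
data/processed_data.csv: raw/sample_1.csv scripts/01_process_raw_data.py
	python scripts/01_process_raw_data.py

# Figures and tables depend on the processed data and the analysis script.
results/fig1.png: data/processed_data.csv scripts/02_analyze_data.py
	python scripts/02_analyze_data.py

plot: results/fig1.png

test:
	pytest
```

Run `make build` after editing any script or raw file, and only the affected steps re-execute.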
Folder structure
Let’s put it all together.
Open your file browser and create a clean directory for your project. Use lowercase letters and hyphens instead of spaces for the folder name (e.g., `analysis-project`) if you want to fit in with the coders.
Create the following folders and follow the file naming patterns. If you’re not using git, then put your `data/`, `src/`, `scripts/`, `notebooks/`, and `results/` folders under a `v1/` folder.
```
raw/
├── README.md
└── sample_1.csv
metadata/
├── README.md
└── samples_metadata.csv
src/
├── data.py
├── stats.py
└── utils.py
data/
├── README.md
└── processed_data.csv
scripts/
├── README.md
├── 01_process_raw_data.py
└── 02_analyze_data.py
notebooks/
├── README.md
├── 01_exploratory_analysis.ipynb
└── 02_final_results.ipynb
results/
├── README.md
├── table1.csv
├── fig1.png
└── report.md
Makefile
README.md
```

That should get you started, and I’ll walk you through a mock analysis project in a future post.
[^1]: GraphPad Prism is the go-to GUI stats and plotting software for many biologists.

[^2]: This is a rewrite of a post I wrote in 2018, which was an update of one I had written a decade ago, which itself was just amplifying ideas I had read in a journal article from 2009. The original posts were geared toward purely computational scientists, but I’ve targeted this version at experimentalists. While dry-lab scientists have generally leveled up their software practices in recent years and no longer need this guide, we now have a surge of scientists from all backgrounds dipping their toes into scripts and Jupyter notebooks with help from LLMs, which presents a new audience for these ideas.

[^3]: For example, Python packaging is easier if your library code (e.g., classes and functions) is kept separate from your scripts and notebooks, and git version control is easier if you keep your code and data in separate folders.

[^4]: Taken verbatim from Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424.

[^5]: Docker! The cloud!

[^7]: But as soon as you’re ready to level up your software skills, you should learn git for version control. Software Carpentry has a great resource for scientists to learn git.

[^8]: The reason code files don’t usually start with a number is that it prevents you from importing them into other code, but we don’t care about that for scripts and notebooks.

[^9]: Software Carpentry also has a great tutorial on Make for scientists.

[^10]: Or `make run`, or whatever you want to name it, but “build” is a very common term for compiling a software project using Makefiles.