Project Layout
There isn’t a one-size-fits-all solution to project layout, both because there are subjective elements of taste involved, and because projects of different designs and complexity levels have different needs.
However, even on small projects, starting with a clean layout will give the project better room to grow and help both you and others (including future you) understand the project.
Core Principles
Regardless of scope or design, a few core principles will help keep your project directory trees on track.
Clarity
The overarching principle that should guide your project layout is clarity and understandability. Is it obvious where to find things, or where to put new things?
We don’t always know a perfect layout in advance, especially early in our data science careers. We will name things confusingly or put them in obscure locations. But when we notice that it’s hard to find something, or hard to explain where something is and why, that’s an opportunity to reflect on the organization process and how we might make it clearer in the future.
There are a few heuristics that can help assess and improve clarity:
- Naming. If it is difficult to name a file, directory, or other element of our project, that is a sign that we may not have a clear idea of its function yet. We may be trying to combine too many things in one place, or subdividing things too finely. Sometimes we will need non-obvious names, but if naming is difficult, take that as a sign that something in the design could probably be improved.
- Documentation Complexity. If you find that it takes a lot of explanation to describe where something is and why, the design might not be right yet.
- Confusion. If you, or team members, are regularly confused about what goes in one place vs. another, perhaps there is a clearer distinction to make.
Name Things Consistently
Consistent, parallel naming (applying the same name pattern to the same kinds of things) will help you keep your files straight, reduce the risk of clobbering output files, and make it easier to work with your files. Some examples:
- Naming evaluation metric plots: figures/<dataset>/eval-<metric>.png, for all metrics and datasets.
- Naming saved models: models/<dataset>-<model>-<variant>.pkl.zst, e.g. models/ml32m-bpr-tuned.pkl.zst.
The precise naming strategy is not important, and will differ from project to project (e.g., dataset/model/variant might be too many axes for identifying a model, or it might be too few); consistency is important. There are a few reasons this is helpful:
- It’s easier for you (or others) to find the specific file you’re looking for, because you know where to look.
- You can grow and expand the project (e.g. adding more models or datasets) without renaming existing files or confusing outputs. For this reason, it’s sometimes useful to include dataset names in file names or paths even when you are starting with one.
- It’s easy to look at files for specific aspects of the project using standard filename matching (i.e. globbing). For example, with the model file names above, you can see how much disk space is used by the ML-32M models by running du -h models/ml32m-*.pkl.zst.
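As a concrete sketch of how a consistent scheme pays off, the following Python snippet builds model paths through a small helper and uses globbing to enumerate related files. The MODELS_DIR constant and model_path helper are hypothetical, just one way to centralize the naming scheme above:

```python
from pathlib import Path

# Hypothetical constant and helper for the naming scheme
#   models/<dataset>-<model>-<variant>.pkl.zst
MODELS_DIR = Path("models")

def model_path(dataset: str, model: str, variant: str) -> Path:
    """Build a model file path from its identifying axes."""
    return MODELS_DIR / f"{dataset}-{model}-{variant}.pkl.zst"

# Every script that saves or loads a model goes through the same helper,
# so names stay parallel across the whole project:
path = model_path("ml32m", "bpr", "tuned")  # models/ml32m-bpr-tuned.pkl.zst

# Consistent names also make related files easy to enumerate by globbing,
# e.g. to total the disk space used by the ML-32M models:
ml32m_files = sorted(MODELS_DIR.glob("ml32m-*.pkl.zst"))
total_mb = sum(p.stat().st_size for p in ml32m_files) / 2**20
```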
Keep Things Small
It’s also useful to use subdirectories to organize your project so that individual directories, especially the top-level directory, stay relatively small. My (extremely rough) rule of thumb is that if ls -l of the top-level directory’s contents does not fit in one “ordinary-sized” terminal display, I want to break things into subdirectories. In any case, I almost always use at least a few subdirectories at the top level.
In my experience, when things are located in well-named subdirectories instead of one large pile of files, it’s easier to see what is in a particular part of the project.
Data Flows One Way
I usually find it easier to understand a project layout if data flows one way: that is, if our workflow uses data in D1 to produce outputs in D2, we don’t then use D2 to produce more outputs in D1. Transitively, this means that the directories or sections of a project should be a directed acyclic graph.
There are times when it is useful to violate this rule in order to make the project easier to understand in other ways, but it’s a very good default.
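One way to make the one-way rule concrete is to treat each output directory as depending on the directories its producing stage reads from, and require that dependency graph to be acyclic. The sketch below, with hypothetical stage names and directories (not from any particular project), checks such a graph with Python’s standard-library graphlib:

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map of workflow stages to the directories they read and write.
stages = {
    "train":     {"reads": {"data"},            "writes": {"models"}},
    "recommend": {"reads": {"data", "models"},  "writes": {"outputs"}},
    "evaluate":  {"reads": {"data", "outputs"}, "writes": {"results"}},
}

# Build a directory-level graph: each written directory depends on the
# directories that the stage writing it reads from.
deps: dict[str, set[str]] = {}
for stage in stages.values():
    for out_dir in stage["writes"]:
        deps.setdefault(out_dir, set()).update(stage["reads"])

try:
    order = list(TopologicalSorter(deps).static_order())
    print("data flows one way:", " -> ".join(order))
except CycleError as exc:
    print("cycle in data flow:", exc.args[1])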
What Lives in Root
There are a few files that (almost) always live in your project root (the top directory of your project). A quick non-exhaustive list:
- A README.md describing the purpose of the project, how to start working with it, etc. This is also a good place to describe your project layout.
- Project definition and dependency files, for example:
  - Python: pyproject.toml, uv.lock
  - Rust: Cargo.toml, Cargo.lock
  - Node.js: package.json, package-lock.json
  - Pixi: pixi.toml, pixi.lock
- The .git directory containing the actual Git repository contents
- The .dvc directory containing DVC configuration and cache
- Project-wide tool configuration files (.editorconfig, type checker or linter configurations, etc.)
- Top-level workflow automation, such as a dvc.yaml with your primary DVC workflow stages (e.g. the final output). In some projects, this will be the only dvc.yaml.
- For publicly-distributed projects, a LICENSE.md file with the software license.
Simple Project Layout
For a simple project, primarily in notebooks or 1–3 scripts, we can use very simple layouts, like:
- data/: Directory containing the input data files, and possibly initially-processed versions of them.
- outputs/: Directory containing statistical model outputs (if needed).
- figures/: Directory containing figures generated in your notebooks and saved to files for inclusion in other documents.
Notebooks, along with the general root files, just go in the top-level directory.
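If you prefer to set this layout up from code rather than by hand, a few lines of Python are enough. This is only a sketch of one way to scaffold the directories named above; the .gitkeep placeholders are a common convention, not a requirement:

```python
from pathlib import Path

# Scaffold the simple layout; the directory names are the ones described above.
for name in ("data", "outputs", "figures"):
    Path(name).mkdir(exist_ok=True)

# Optionally keep the (initially empty) output directories in Git with
# placeholder files.
for name in ("outputs", "figures"):
    (Path(name) / ".gitkeep").touch()
```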
More Complex Projects
Suppose we have a project where we are training models, generating outputs, and computing evaluation results for those outputs. It is useful to directly save the outputs, not just the final metrics, so that we can do secondary analyses of the outputs, and so we can change how we compute the evaluation metrics without re-generating all of the outputs.
We might organize such a project like this:
- src/: Python source code. Includes a single directory src/myproject/ that is a Python package containing helper modules that are used in multiple parts of the project.
- data/: Input data, as above. Also includes the processed versions of the input data, and the splits for train-test evaluation.
- models/: Serialized versions of models trained on the training data. If we only want to save outputs, not the trained models themselves, we would omit this directory.
- outputs/: Outputs from applying the models to the test data (and, if applicable, any other data we want to produce model outputs for). Typically saved in Parquet or another relevant format.
- results/ or evals/: Results of measuring the test outputs against the test data and computing our evaluation metrics.
- figures/: Saved charts for use outside of notebooks.
- scripts/: Python scripts for various tasks (e.g. training a model).
Analysis notebooks either live in the project root, or in a notebooks/ directory.
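One convenient use of the shared package in src/ is a small module that defines the layout in a single place, so scripts and notebooks never hard-code directory paths independently. This is a hypothetical sketch; the module name paths.py and the constants are illustrative, not prescribed by anything above:

```python
# Hypothetical src/myproject/paths.py: the layout is defined once here, and
# every script or notebook imports these constants instead of spelling out
# directory names on its own.
from pathlib import Path

# Project root, assuming this module lives at src/myproject/paths.py and the
# package is imported from the source tree (e.g. an editable install).
PROJECT_ROOT = Path(__file__).resolve().parents[2]

DATA_DIR = PROJECT_ROOT / "data"
MODELS_DIR = PROJECT_ROOT / "models"
OUTPUTS_DIR = PROJECT_ROOT / "outputs"
RESULTS_DIR = PROJECT_ROOT / "results"
FIGURES_DIR = PROJECT_ROOT / "figures"
```

A training script can then save to, say, MODELS_DIR / "ml32m-bpr-tuned.pkl.zst" and agree with every other part of the project on where that file lives, regardless of the directory it is run from.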
Dataset-Oriented Layout
When building a complex project with many data sets or data sources, it is sometimes useful to organize it by data source or set instead of by role. In such a project, I’ll usually have top-level src/ and scripts/ directories, along with a directory for each data set or family of data sets. Within the dataset directories, again follow a consistent layout, perhaps with data/, models/, and results/ directories.
When such a project has top-level analytic results (computed across multiple data sources), there will still be a top-level results/ directory for those results.
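A small helper can keep the per-dataset structure parallel across datasets. This is again a hypothetical sketch; the dataset_dirs function and the particular subdirectory names are just one way to encode the convention:

```python
from pathlib import Path

def dataset_dirs(root: Path, dataset: str) -> dict[str, Path]:
    """Return the standard subdirectories for one dataset's directory."""
    base = root / dataset
    return {name: base / name for name in ("data", "models", "results")}

# For example, the ML-32M portion of a project rooted at the current directory:
dirs = dataset_dirs(Path("."), "ml32m")
# dirs["models"] is Path("ml32m/models")
```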
I use this layout in a couple of major projects: