If you’re new to Make, check out Mike Bostock’s article Why Use Make; it’s excellent! This post is intended as a follow-up to Mike’s introduction.
I love Makefiles because they allow me to describe my workflow as a directed acyclic graph. Makefiles are a great example of declarative programming. When I specify a rule like the following:
```make
targetfile: sourcefile
	command
```
I am saying that the `targetfile` depends on the `sourcefile`. Whenever I issue the command `make targetfile`, Make checks to see if anything in `targetfile`’s dependency graph needs to be recompiled and runs the necessary commands to bring `targetfile` up to date. I enjoy using Make because it provides:
- A framework for writing reproducible research.
- A transparent caching mechanism. Downloading data often takes a lot of time, while cleaning data once it’s downloaded is relatively fast. By breaking these into two rules, I only need to download the data once, and then I can focus on data cleaning and data analysis without re-running code from previous steps.
- A mechanism for building projects in parallel. Running `make -j` (or `lsmake` on the Grid) tells Make to run commands in parallel. All I have to specify is how each file in my project is built; Make figures out how to run everything in parallel.
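To make the caching point concrete, here is a minimal sketch of the two-rule split described above (the script names `src/download.py` and `src/clean.py` are hypothetical):

```make
# Slow step: only re-runs if src/download.py changes.
data/raw/data.csv: src/download.py
	python $<

# Fast step: editing src/clean.py re-runs cleaning only,
# without re-downloading the raw data.
data/processed/data.csv: src/clean.py data/raw/data.csv
	python $<
```

These two steps are serialized because one depends on the other, but any targets elsewhere in the graph that are independent of them can build concurrently under `make -j`.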
Makefiles as Glue
I often find myself using different tools for different jobs. I like using Python for web scraping, R for data visualization, and Stata for certain statistical models. Makefiles make it easy to combine different tools:
```make
DATA = data/processed/data.csv

$(DATA): src/download.py
	python $<

reports/figures/graph.pdf: src/graph.R $(DATA)
	Rscript $<

reports/figures/table.tex: src/table.do $(DATA)
	stata-mp -b do $<
```
Compiling a Bunch of Files at Once
Often the projects I work on require a lot of analyses. Imagine the following directory structure:
```
.
├── Makefile
├── data
│   └── processed
│       └── data.dta
└── src
    └── tables
        ├── table1.do
        ├── table2.do
        └── table3.do
```
Putting the following two rules in my Makefile allows me to recompile all tables with a single `make tables` command:
```make
%.log: %.do data/processed/data.dta
	cd $(dir $<); stata-mp -b do $(notdir $<)

DO_FILES = $(shell find src/tables -name "*.do")
LOG_FILES = $(patsubst %.do,%.log,$(DO_FILES))

tables: $(LOG_FILES)
```
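For the directory tree above, the two variables expand roughly as follows (the order of `find`’s output may vary); each `.log` file is the batch log Stata writes, which doubles as the timestamp Make checks:

```
DO_FILES  = src/tables/table1.do src/tables/table2.do src/tables/table3.do
LOG_FILES = src/tables/table1.log src/tables/table2.log src/tables/table3.log
```

So `make tables` rebuilds any log that is older than its `.do` file or the data, and `make -j3 tables` runs the three Stata jobs in parallel.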
Working with Databases
Make cannot inspect when a database table was last modified. Imagine we have a script that updates a table of patent data. We can work this into a Makefile by creating a corresponding sentinel file whose timestamp records when the database table was last updated:
```make
data/processed/patents.table: src/patents.py
	python $<
	echo "Data stored in PostgreSQL database." > $@
```
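The same sentinel pattern can live inside the update script itself. Here is a minimal sketch (the helper name is hypothetical, and the actual database update is elided):

```python
"""Minimal sketch of the sentinel-file pattern: after the database
update succeeds, write a small file whose modification time tells
Make when the table last changed. The update itself is elided."""
import pathlib


def touch_sentinel(path, message="Data stored in PostgreSQL database."):
    """Record a successful database update in a file Make can timestamp."""
    p = pathlib.Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(message + "\n")
    return p
```

If the script writes the sentinel itself, the `echo` line in the rule above becomes unnecessary; either way, all Make ever sees is the file’s timestamp.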
There are a crazy number of alternatives to Make.
For the most part, I’ve found Make does everything I need it to do. Although the syntax is ugly, I appreciate how it ships with Unix-like operating systems (I find it annoying when I want to install a project and first have to install the installation tool). That being said, I am very interested in experimenting with Luigi (I’ve heard great things).
If you want to learn more about how I structure my projects, check out Cookiecutter Data Science.