doit for task automation
Managing computational workflows can be difficult. In particular, tracking which input and output files have been updated recently, so that you don’t repeat work, is tedious and error-prone if done by hand.
doit solves this problem by tracking which files within a workflow need to be created or updated at any given time, and executing the required tasks in the correct order. If you’ve ever written a Makefile, none of this will be new to you - I do hope you’ll appreciate the much simpler and easier-to-use syntax, though!
Here I’ll go over a few examples and use cases.
Getting pydoit
pip install doit
Hello World
doit allows you to build a series of rules that describe tasks to be completed. Each task has a series of dependencies, an action, and a target (or output). The simplest doit workflow would be something like this:
file_in --> processing_program --> file_out
When you run doit, one of a few things can happen:

- pydoit sees that file_out doesn’t exist, so it executes the task to make sure it is created.
- pydoit sees that file_out exists, but file_in has changed since the last time the task ran. So it re-executes the task to make sure that file_out is re-formed.
- pydoit sees that file_out exists, and nothing upstream of it has changed. So it does nothing.
Minimum Example:
Put all three files into a folder:

├── dodo.py
├── ShouldISayHello
└── HelloWorld.py
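The three files themselves aren’t reproduced here, but a minimal dodo.py for this layout might look something like the sketch below. The task body is an assumption based on the file names: it supposes HelloWorld.py reads ShouldISayHello and writes Hello.txt.

```python
# dodo.py -- hypothetical sketch, not the original file.
# Assumes HelloWorld.py checks ShouldISayHello and writes Hello.txt.

def task_hello():
    """Run HelloWorld.py to produce Hello.txt."""
    return {
        # re-run if either the script or its input file changes
        'file_dep': ['HelloWorld.py', 'ShouldISayHello'],
        # the file this task promises to create
        'targets': ['Hello.txt'],
        # shell command to execute
        'actions': ['python HelloWorld.py'],
    }
```

Any function whose name starts with `task_` is picked up by doit as a task; the returned dictionary declares its dependencies, targets, and actions.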
and run:

doit
resulting in:

.
├── dodo.py
├── Hello.txt
├── HelloWorld.py
└── ShouldISayHello
Pretty simple.
Doit for batch processing
By using a yield statement in our task, we can dynamically generate sub-tasks over any arbitrary list. The dodo file is pretty self-explanatory - just take special notice of the use of Python’s yield and the name keyword.
Grab these files and put them in a folder.
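The original dodo file isn’t shown here, but a sketch of what it plausibly looks like follows. The sub-task names and the placeholder action are assumptions chosen to match the output below; the real file may differ.

```python
# dodo.py -- hypothetical sketch of the batch-processing dodo file.
import glob

def task_ingest():
    """Yield one sub-task per *.in file in the data/ landing area."""
    for path in glob.glob('data/*.in'):
        out = path.replace('.in', '.out')
        yield {
            # each sub-task needs a unique name; shown as ingest:<name>
            'name': 'ingest_%s' % path,
            'file_dep': [path],
            'targets': [out],
            # placeholder action -- real processing would go here
            'actions': ['touch %s' % out],
        }
```

Because the task function is a generator, doit expands it into one sub-task per input file, and each sub-task’s dependencies are tracked independently.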
Run the following commands:
mkdir data
cd data
for i in {1..3}; do touch $i.in; done
cd ..
doit -n 3
Outputs something like:
.  ingest:ingest_data/1.in
.  ingest:ingest_data/2.in
.  ingest:ingest_data/3.in
Doit just processed those files in parallel using 3 processes.
Let’s add some files into our landing area:
cd data
for i in {5..20}; do touch $i.in; done
cd ..
doit -n 3
Outputs something like:
-- ingest:ingest_data/2.in
-- ingest:ingest_data/1.in
-- ingest:ingest_data/3.in
.  ingest:ingest_data/5.in
.  ingest:ingest_data/17.in
.  ingest:ingest_data/18.in
.  ingest:ingest_data/16.in
.  ingest:ingest_data/15.in
.  ingest:ingest_data/14.in
.  ingest:ingest_data/20.in
.  ingest:ingest_data/12.in
.  ingest:ingest_data/10.in
.  ingest:ingest_data/6.in
.  ingest:ingest_data/7.in
.  ingest:ingest_data/8.in
.  ingest:ingest_data/11.in
.  ingest:ingest_data/9.in
.  ingest:ingest_data/13.in
.  ingest:ingest_data/19.in
It skipped over the first 3 ingestion tasks as they’d been completed.
It should be pretty obvious how one might hook doit up to a cron job to watch a landing area for new data entering a pipeline.
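For instance, a hypothetical crontab entry (the path is a placeholder) could re-run the pipeline every five minutes; doit’s dependency tracking ensures only new or changed files actually get processed:

```shell
# Hypothetical crontab entry: run the pipeline every 5 minutes.
# Assumes dodo.py lives in /path/to/pipeline -- adjust to your setup.
*/5 * * * * cd /path/to/pipeline && doit -n 3
```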
Final Thoughts
Because doit provides a simple and modular way of describing individual tasks, it makes for a lightweight method of describing and executing entire workflows.
It’s a powerful framework with plenty of knobs to turn. Documentation of some of the cooler features is here