On Software Craftsmanship

Last week I found myself engaged with a group of students from Los Alamos National Laboratory in our Software Engineering for Scientists & Engineers class, known informally as Software Craftsmanship. Apart from the epic New Mexico skies, the grand vistas, and the welcome relief from the heat and humidity of my hometown Austin, what I particularly loved about the week was the focus on craftsmanship.

This class had a high proportion of people I’d worked with before learning Python programming, data analysis, and/or machine learning, so it was easy to build rapport. Questions and dialog flowed easily. One student had this to say:

The interactivity of the in-person class, paired with the detailed course slides, was very effective. The source control (git), readable code, refactoring, and unit testing sections were all very useful and will be directly impactful to my work. There were multiple instances throughout the week where I learned something that would have saved me significant time on a problem I had encountered within the last 6 months.

One of the things we cover in the class is code review, the practice of submitting your code for review and critique before it’s accepted into a project, in some ways similar to the academic peer-review process. At Diller Digital, we try to model this process by submitting and responding to feedback on the course materials. In response to a session of Software Engineering earlier this year, students suggested we learn source control and the details of git at the start of the class and then use it in a workflow typical of small teams in an R&D environment. Diller Digital has a git server (powered by Gitea, a close analog to GitHub and GitLab), and we created a class repository and developed a couple of small libraries that can serve as best-practice examples of variable naming, use of the logging library, Sphinx-ready documentation, unit testing, and packaging using standard tooling. One of the many jokes about git is that you can learn how to do 90% of what you’ll need while understanding only 10% of what’s actually going on. I’m not sure about the numbers there, but I do know that using and practicing what you’ve learned makes all the difference.

The in-person, instructor-led format makes engagement much easier and lowers the barriers to asking questions and providing individualized help. But one of the important principles behind that is the role of effortful thinking in learning. I like the way Derek Muller (of Veritasium fame) explains in this video how we have two systems in our brain: one fast, for instinctive, rapid-fire processing of the kind you’re using to parse the words on this page, and one slow, the effortful, brain-taxing system required for understanding something.

It’s probably that effortful system you’re using trying to understand my point, and you’ll surely use it trying to tell whether 437 is evenly divisible by 7 in your head. It’s not quite as simple as two distinct systems, as the author of that idea, Daniel Kahneman, makes clear in his book, Thinking, Fast and Slow, but it gives us a useful mental model for talking about software craftsmanship, and why we teach the way we do at Diller Digital. One of the main takeaway points is that effortful thinking is necessary for learning, but not all effortful thinking results in useful learning.

One of the first ideas we introduce in Software Engineering is that of cognitive load and its management. Cognitive load is a measure of effortful thinking: it’s the effort required to understand something, and we would like that effort to be spent on important things like the business logic of an algorithm and not on trivial things like indentation and syntax. That’s the purpose of using a coding standard. Once your brain gets used to seeing code that’s formatted in a common way (for Python it’s embodied in PEP 8), the syntax becomes transparent (it’s handed off to the fast-thinking part of our brain), and you can see through it to the logic of the code and spend your effort understanding that. Code that’s not formatted that way introduces a small cognitive tax on each line that adds up to measurable fatigue over time. If you want an example of that kind of fatigue, try this little game.
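To make that cognitive tax concrete, here is a small hypothetical sketch (not taken from the course materials) of the same calculation written twice. The function names and sample data are invented for illustration. Both versions return the same result; only the second follows PEP 8 layout and uses descriptive names.

```python
# Hard to scan: cramped spacing, cryptic names, several statements on one line
def f(a,b):x=[i*b for i in a if i>0];return sum(x)/len(x)


# Same logic, laid out in PEP 8 style with names that carry meaning
def mean_scaled_positive(values, scale):
    """Return the mean of the positive entries of ``values``, each times ``scale``."""
    scaled = [value * scale for value in values if value > 0]
    return sum(scaled) / len(scaled)


readings = [1, -2, 3]  # hypothetical sample data
print(f(readings, 2) == mean_scaled_positive(readings, 2))
```

The interpreter treats the two functions the same way; the difference is entirely in how much effort a human reader spends before reaching the logic.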

So managing cognitive load informs choices of layout, use of white space, and selecting the names of Python objects, and this is one of the important things we teach in Software Engineering. But it also informs the way we design our courses. We introduce ideas and demonstrate them and then have our students spend effort internalizing them, first in a simple “Give It A Try” exercise and eventually in a comprehensive exercise. The goal is to direct our students’ effort to increasingly independent tasks, in what is sometimes called a “fading scaffold”, where early effort is guided closely, and in later efforts, students have more room to make and recover from mistakes. This is also the thinking behind the presence in some courses of “Live Coding” scripts, where demos and exercises are set up already, and the student only has to focus on the learning goal and not on typing all of the supporting code around it. These have proven to be especially popular in our Machine Learning and Deep Learning classes.

This also suggests a strategy for the effective use of Large Language Models for coding. Use them to reduce effort where it’s not critical to gaining understanding or a skill. But don’t let them replace effortful thinking where it counts most: in learning and in crafting your scientific, engineering, or other analytical workflow. And if you want a guide in your learning journey, we’re here to help. Click here for the course schedule.

I have taken four courses with Diller Digital and this [Software Engineering] is by far the most useful one. Many of us have learned programming as a need to do research, but we do not have any formal background in computational programming. I think this course takes basic Python programming skills to a more formal level, aligned with people with programming background allowing us to improve the quality of code we produce, the efficiency in the implementation and collaboration. 
Also, hosting the course in person made a big difference for me. I was easily engaged the entire day, the exercises and the possibility to ask in person made the entire course smoother.

I think this course material is incredibly helpful for people who don’t have professional software engineering experience. Of all the courses I took from Diller Digital, I found this the most foundational and immediately useful.

“Popping the Hood” in Python

Last weekend found me elbow-deep in the guts of my car, re-aligning the timing chain after replacing a cam sprocket. While appreciating the joys of working on a car with only 4 cylinders and a relatively spacious engine bay, I found myself reflecting on one of the things I love best about the Python programming language: the ability to proverbially “pop the hood” and see what’s going on behind the abstractions. (With a background in Mechanical Engineering, car metaphors come naturally to me.)

As an open-source, well-documented scripting language, Python is already accessible. But there are some tools that let you get pretty deeply into the inner workings, in case you want to understand how things work or optimize performance.

Use the Source!

The first and easiest way to see what’s going on is to look at the inline help using Python’s built-in help() function, which displays the docstring using a pager. But I almost always prefer using ? and ?? in IPython or Jupyter to display just the docstring or all of the source code, if available. For example, consider the relatively simple parseaddr function from email.utils:

In [1]: import email.utils

In [2]: email.utils.parseaddr?
Signature: parseaddr(addr, *, strict=True)
Docstring:
Parse addr into its constituent realname and email address parts.

Return a tuple of realname and email address, unless the parse fails, in
which case return a 2-tuple of ('', '').

If strict is True, use a strict parser which rejects malformed inputs.
File:      /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/email/utils.py
Type:      function
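Outside IPython or Jupyter, roughly the same information is available from the standard library's inspect module. The snippet below is a minimal sketch of that idea, not a reimplementation of the ? syntax; the exact signature it prints depends on your Python version.

```python
import email.utils
import inspect

# The call signature, as an inspect.Signature object
signature = inspect.signature(email.utils.parseaddr)

# The docstring, with indentation cleaned up
docstring = inspect.getdoc(email.utils.parseaddr)

print(f"parseaddr{signature}")
print(docstring)
```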

In our Python Foundations course, I can usually elicit some groans by encouraging my students to “Use the Source” with the ?? syntax, which displays the source code, if available:

In [3]: email.utils.parseaddr??
Signature: parseaddr(addr, *, strict=True)
Source:   
def parseaddr(addr, *, strict=True):
    """
    Parse addr into its constituent realname and email address parts.

    Return a tuple of realname and email address, unless the parse fails, in
    which case return a 2-tuple of ('', '').

    If strict is True, use a strict parser which rejects malformed inputs.
    """
    if not strict:
        addrs = _AddressList(addr).addresslist
        if not addrs:
            return ('', '')
        return addrs[0]

    if isinstance(addr, list):
        addr = addr[0]

    if not isinstance(addr, str):
        return ('', '')

    addr = _pre_parse_validation([addr])[0]
    addrs = _post_parse_validation(_AddressList(addr).addresslist)

    if not addrs or len(addrs) > 1:
        return ('', '')

    return addrs[0]
File:      /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/email/utils.py
Type:      function

Looking at the next-to-last line, you see there’s a path to the source code. That’s available programmatically in the module’s .__file__ attribute, so you could open and print the contents if you want. If we do that for Python’s this module, we can expose a fun little Easter Egg.

In [4]: import this
# <output snipped - but try it for yourself and see what's there.>

In [5]: with open(this.__file__, 'r') as f:
   ...:     print(f.read())
   ...: 
s = """Gur Mra bs Clguba, ol Gvz Crgref

Ornhgvshy vf orggre guna htyl.
Rkcyvpvg vf orggre guna vzcyvpvg.
Fvzcyr vf orggre guna pbzcyrk.
Pbzcyrk vf orggre guna pbzcyvpngrq.
Syng vf orggre guna arfgrq.
Fcnefr vf orggre guna qrafr.
Ernqnovyvgl pbhagf.
Fcrpvny pnfrf nera'g fcrpvny rabhtu gb oernx gur ehyrf.
Nygubhtu cenpgvpnyvgl orngf chevgl.
Reebef fubhyq arire cnff fvyragyl.
Hayrff rkcyvpvgyl fvyraprq.
Va gur snpr bs nzovthvgl, ershfr gur grzcgngvba gb thrff.
Gurer fubhyq or bar-- naq cersrenoyl bayl bar --boivbhf jnl gb qb vg.
Nygubhtu gung jnl znl abg or boivbhf ng svefg hayrff lbh'er Qhgpu.
Abj vf orggre guna arire.
Nygubhtu arire vf bsgra orggre guna *evtug* abj.
Vs gur vzcyrzragngvba vf uneq gb rkcynva, vg'f n onq vqrn.
Vs gur vzcyrzragngvba vf rnfl gb rkcynva, vg znl or n tbbq vqrn.
Anzrfcnprf ner bar ubaxvat terng vqrn -- yrg'f qb zber bs gubfr!"""

d = {}
for c in (65, 97):
    for i in range(26):
        d[chr(i+c)] = chr((i+13) % 26 + c)

print("".join([d.get(c, c) for c in s]))
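The loop in that source builds a ROT13 lookup table: every letter is shifted 13 places within its own case, and anything else passes through unchanged. As a quick check of that reading, the sketch below rebuilds the same table with more descriptive (hypothetical) variable names, applies it to the first line only, and compares the result with the standard library's rot13 codec.

```python
import codecs

encoded_title = "Gur Mra bs Clguba, ol Gvz Crgref"

# Rebuild the table from the module: shift each letter 13 places within its case
rot13_table = {}
for first_letter in (65, 97):  # ord("A") and ord("a")
    for offset in range(26):
        rot13_table[chr(offset + first_letter)] = chr((offset + 13) % 26 + first_letter)

# Characters missing from the table (spaces, punctuation) are left as they are
decoded_by_table = "".join(rot13_table.get(char, char) for char in encoded_title)

# The standard library ships the same transformation as a text codec
decoded_by_codec = codecs.decode(encoded_title, "rot13")

print(decoded_by_table)
print(decoded_by_codec)
```

Because 13 is half of 26, applying the table twice returns the original text, so the same code both encodes and decodes.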

Another way to do this is to use the inspect module from Python’s standard library. Among its many useful functions is getsource, which returns the source code as a string:

In [6]: import inspect
In [7]: my_source_code_text = inspect.getsource(email.utils.parseaddr)
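The inspect module has a few close relatives of getsource that are handy in scripts. Here is a brief sketch using getsourcefile and getsourcelines alongside it; the variable names are just for illustration, and the path and line numbers printed will differ between Python installations.

```python
import email.utils
import inspect

# The full source as one string
source_text = inspect.getsource(email.utils.parseaddr)

# The file the function was defined in
source_file = inspect.getsourcefile(email.utils.parseaddr)

# The source as a list of lines, plus the line number where it starts
source_lines, first_line_number = inspect.getsourcelines(email.utils.parseaddr)

print(source_file)
print(f"starts at line {first_line_number}, {len(source_lines)} lines long")
print(source_text.splitlines()[0])
```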

This works for libraries and functions that are written in Python, but there is a class of functions that are implemented in C (for the most popular version of Python, known as CPython) and called builtins. Source code is not available for those in the same way. The len function is an example:

In [8]: len??
Signature: len(obj, /)
Docstring: Return the number of items in a container.
Type:      builtin_function_or_method
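The same limitation shows up when you ask the inspect module for the source of a builtin: getsource raises a TypeError. A small wrapper can make that explicit. The helper name get_source_or_none is hypothetical, a sketch rather than anything provided by the standard library.

```python
import email.utils
import inspect


def get_source_or_none(obj):
    """Return the Python source for ``obj``, or None if it is unavailable."""
    try:
        return inspect.getsource(obj)
    except (TypeError, OSError):
        # TypeError: the object is a builtin implemented in C
        # OSError: the source file could not be located
        return None


print(inspect.isbuiltin(len))                              # len is a C builtin
print(get_source_or_none(len))                             # no Python source
print(get_source_or_none(email.utils.parseaddr) is not None)
```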

For these functions, it takes a little more digging, but this is Open Source Software, so you can go to the Python source code on GitHub and look in the module containing the builtins (called bltinmodule.c). Each of the builtin functions is defined there with the prefix builtin_, and the source code for len is at line 1866 (at least in Feb 2025 when I wrote this):

static PyObject *
builtin_len(PyObject *module, PyObject *obj)
/*[clinic end generated code: output=fa7a270d314dfb6c input=bc55598da9e9c9b5]*/
{
    Py_ssize_t res;

    res = PyObject_Size(obj);
    if (res < 0) {
        assert(PyErr_Occurred());
        return NULL;
    }
    return PyLong_FromSsize_t(res);
}

There you can see that most of the work is done by another function PyObject_Size(), but you get the idea, and now you know where to look.
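From the Python side, you can watch that delegation happen: for a class written in Python, len() ends up calling the object's __len__ method. The two classes below are hypothetical examples. The second one deliberately returns a negative length, which CPython rejects with a ValueError; my understanding is that this is the kind of failure the res < 0 branch above passes along.

```python
class SampleBatch:
    """Hypothetical container used only for this illustration."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        # len(batch) is routed here
        return len(self._samples)


class BrokenLength:
    """Deliberately returns an invalid (negative) length."""

    def __len__(self):
        return -1


batch = SampleBatch([0.1, 0.2, 0.3])
batch_size = len(batch)

try:
    len(BrokenLength())
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(batch_size)
print(error_message)
```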

Step by Step

To watch the Python interpreter step through the code a line at a time and explore code execution, you can use the Python Debugger pdb, or its tab-completed and syntax-colored cousin ipdb. These allow you to interact with the code as it runs and execute arbitrary code in the context of any frame of execution, including printing out the value of variables. They are the basis for most of the Python debuggers built into IDEs like Spyder, PyCharm, or VS Code. Since they are best demonstrated live, and since we walk through their use in our Software Engineering for Scientists & Engineers class, I’ll leave it at that.
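For readers who want a starting point before seeing it live, here is a minimal sketch. The function mean_of_squares and its debug flag are hypothetical. With debug=False the code runs straight through; with debug=True the built-in breakpoint() call pauses execution and opens pdb at that line (by default), where commands such as n (next line), p total (print a variable), and c (continue) are available.

```python
def mean_of_squares(values, debug=False):
    """Return the mean of the squared values."""
    total = 0.0
    for value in values:
        if debug:
            # Pauses here and opens the debugger prompt
            breakpoint()
        total += value * value
    return total / len(values)


# Runs without stopping; pass debug=True in an interactive session to step through
result = mean_of_squares([1, 2, 3])
print(result)
```

Another common entry point is running a whole script under the debugger with `python -m pdb your_script.py`, where the script name is a placeholder.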

Inside the Engine

Like Java and Ruby, Python runs in a virtual machine, commonly known as the “Interpreter” or “runtime”. So in contrast to compiling code in, say, C, where the result is an executable object file consisting of system- and machine-level instructions that can be run as an application by your operating system, when you execute a script in Python, your code gets turned into bytecode. Bytecode is a set of instructions for the Python virtual machine. It’s what we would write if we were truly writing for the computer (see my comments on why you still need to learn programming).
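You can watch the source-to-bytecode step happen with the built-in compile() function, which turns a string of source code into a code object that exec() can then run. This is only a sketch of the idea; the source string and the "<example>" filename label are made up for illustration, and the raw bytes printed will vary between Python versions.

```python
source = "total = 2 + 3"

# Compile the source text into a code object containing bytecode
code_object = compile(source, "<example>", "exec")

# Run the bytecode in a fresh namespace
namespace = {}
exec(code_object, namespace)

print(type(code_object).__name__)
print(code_object.co_code)   # raw bytecode; differs between Python versions
print(namespace["total"])
```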

But while it’s written for the virtual machine, it’s not entirely opaque, and it can sometimes be instructive to take a look. In my car metaphor, this is a bit like removing the valve cover and checking the timing marks inside. Usually we don’t have to worry about it, but it can be interesting to see what’s going on there, as I learned when producing an answer for a Stack Overflow question.

In the example below, we make a simple function add. The bytecode is visible in the add.__code__.co_code attribute, and we can disassemble it using the dis library and turn the bytecode into something slightly more friendly for human eyes:

In [9]: import dis
In [10]: def add(x, y):
    ...:     return x + y
    ...: 
In [11]: add.__code__.co_code
Out[11]: b'\x95\x00X\x01-\x00\x00\x00$\x00'
In [12]: dis.disassemble(add.__code__)
  1           RESUME                   0

  2           LOAD_FAST_LOAD_FAST      1 (x, y)
              BINARY_OP                0 (+)
              RETURN_VALUE

In the output of disassemble, the number in the first column is the line number in the source code. The middle column shows the bytecode instruction (see the docs for their meaning), and the right-hand side shows the arguments. For example, on line 2, LOAD_FAST_LOAD_FAST pushes references to x and y onto the stack, and the following instruction, BINARY_OP, executes the + operation on them.
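If you would rather work with the instructions as data than read printed output, the dis module also provides get_instructions, which yields one Instruction object per bytecode instruction. A short sketch follows; note that the instruction names depend on the Python version, so you may not see LOAD_FAST_LOAD_FAST on releases other than the one used above.

```python
import dis


def add(x, y):
    return x + y


# Each Instruction carries the opcode name and a readable form of its argument
instructions = list(dis.get_instructions(add))
for instruction in instructions:
    print(instruction.opname, instruction.argrepr)

opnames = [instruction.opname for instruction in instructions]
```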

Incidentally, if you’ve ever noticed files with the .pyc extension or folders called __pycache__ (which are full of .pyc files) in your project directory, that’s where Python stores (or caches) bytecode when a module is imported so that next time, the import is faster.
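To see where that cache ends up, the sketch below writes a throwaway module to a temporary directory, asks importlib where its cached bytecode would live, and then compiles it explicitly with py_compile rather than waiting for an import. The module name hello_module.py is hypothetical, and the exact .pyc file name depends on your interpreter version.

```python
import importlib.util
import pathlib
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = pathlib.Path(tmp_dir) / "hello_module.py"
    source_path.write_text("GREETING = 'hello'\n")

    # Where Python would cache the bytecode for this source file
    expected_location = importlib.util.cache_from_source(str(source_path))

    # Compile explicitly; returns the path of the .pyc file it wrote
    pyc_path = py_compile.compile(str(source_path))
    pyc_exists = pathlib.Path(pyc_path).exists()

    print(expected_location)
    print(pyc_path)
    print(pyc_exists)
```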

In Conclusion

There’s obviously a lot more to say about bytecodes, the execution stack, the memory heap, etc. But my goal here is not so much to give a lesson in computer science as to give an appreciation for the accessibility of the Python language to curious users. Much as I think it’s valuable to be able to pop the hood on your car and point to the engine, the oil dipstick, the brake fluid reservoir, and the air filter, I believe it’s valuable to understand some of what’s going on “under the hood” of the Python code you may be using for data analysis or other kinds of scientific computing.