Class Questions Archives - Diller Digital Blog

On Python environments, pip, and uv

Often at the start of classes I teach, I get a question about virtual environments — why do we use them? What are they? How do they work? What is your opinion about uv (or the latest packaging tool)? And as I may have mentioned before on this blog, I enjoy a good question, so I thought I’d share a bit on why we do what we do with virtual environments at Diller Digital.

Virtual Environments

First of all, what is a virtual environment? For most applications on a computer or mobile device, you just install it once to get started and then download periodic updates to keep it current. Creative suites with lots of modules and extensions like PhotoShop have a single executable application with a single collection of libraries that extend their capabilities. The MathWorks’ MATLAB works this way, too, with a single runtime interpreter and a collection of libraries curated by The MathWorks. In this kind of monolithic setup, the publisher of the software implicitly guarantees the interoperability of the component parts by controlling the software releases and conditions of whatever extension marketplace they offer.

The world of Python is different: there is no single entity curating the extensions; different libraries are added for different use cases, resulting in different sets of dependencies for different applications; and there is no guarantee of backward compatibility. In fact, managing dependencies and preserving working environments is part of the price to pay for Free Open Source Software (FOSS). And that’s where virtual environments come into play. If the world of FOSS can feel like the Wild West, then think of virtual environments as a way to make a safe “corral” where you have a Python interpreter (runtime) and a set of libraries that work together and can control when and if they are updated so they don’t break code that is working.

As a practical example, let’s consider the TensorFlow library, which Diller Digital uses in one version of our Deep Learning for Scientists & Engineers course (we also provide a version that uses PyTorch). TensorFlow has components that are compiled in C/C++, and when Python 3.12 was released in October 2023, it included several internal changes that caused many C-extension builds to fail, including TensorFlow’s. At that time, the latest version of TensorFlow was 2.14, and it could only be run with Python versions between 3.8 and 3.11. It was early 2024 before TensorFlow 2.16 was released, which finally enabled use of Python 3.12. By that time, Python 3.13 was being released, and the cycle started again. Thus, users of TensorFlow have specific version requirements for their Python runtime, and it sometimes varies by hardware.

Consider, though, that a user of TensorFlow may also have another unrelated project that could really benefit from a recent release of another library, say Pandas, that is only available (or only provides wheels for) newer versions of Python. In this case, the TensorFlow dependencies conflict with the Pandas dependencies, and if Python were a monolithic application, you would have to choose between one or the other.

Virtual environments ease that dilemma by creating independent runtime environments, each with their own Python executable (potentially of different versions) and set of libraries. Recent versions of Python (if downloaded and installed from Python.org) can sit next to each other in an operating system and run independently. In the macOS, these are generally found in the Library/Frameworks/ directory and can be accessed with python3 (which is an alias to the latest one installed) or using the full version number.

❯ which python3
/Library/Frameworks/Python.framework/Versions/3.13/bin/python3
❯ which python3.12
/Library/Frameworks/Python.framework/Versions/3.12/bin/python3.12
❯ which python3.13
/Library/Frameworks/Python.framework/Versions/3.13/bin/python3.13
❯ which python3.14
/Library/Frameworks/Python.framework/Versions/3.14/bin/python3.14

❯ which python3
/Library/Frameworks/Python.framework/Versions/3.13/bin/python3
❯ which python3.12
/Library/Frameworks/Python.framework/Versions/3.12/bin/python3.12
❯ which python3.13
/Library/Frameworks/Python.framework/Versions/3.13/bin/python3.13
❯ which python3.14
/Library/Frameworks/Python.framework/Versions/3.14/bin/python3.14

Note how each minor version (the 12 or 13 in 3.12 or 3.13) has its own runtime.

In Windows it’s similar, but the locations are different.

C:\Users\tim> where python
C:\Users\tim\AppData\Local\Programs\Python\Python314\python.exe
C:\Users\tim\AppData\Local\Programs\Python\Python313\python.exe
C:\Users\tim\AppData\Local\Programs\Python\Python312\python.exe
C:\Users\tim\AppData\Local\Microsoft\WindowsApps\python.exe

C:\Users\tim> where python
C:\Users\tim\AppData\Local\Programs\Python\Python314\python.exe
C:\Users\tim\AppData\Local\Programs\Python\Python313\python.exe
C:\Users\tim\AppData\Local\Programs\Python\Python312\python.exe
C:\Users\tim\AppData\Local\Microsoft\WindowsApps\python.exe

That last entry in the Windows example is sneaky — execute it and you’re taken to the Windows App store. I don’t recommend installing that way.

What we often do in Diller Digital classes is to create a virtual environment dedicated to the class. This makes sure we don’t break anything already set up and working, and if future changes to the libraries break the examples we used in class, there will at least be a working version in that environment. The native-Python way to do this (by which I mean “the way that uses only standard libraries that ship with Python”) is using the venv library. Many Python libraries have a “main” mode that you can invoke with the -m flag, so setting up a new Python environment looks like this:

python -m venv my_new_environment

Which python you use to invoke venv determines the version of the runtime that gets installed into the virtual environment. In Windows, if I wanted something other than the version that appears first in the list, I’d have to specify the entire path. So, for example, to create a version that used Python 3.12, I’d type:

C:\Users\tim\AppData\Local\Programs\Python\Python312\python.exe -m venv my_new_environment

In macOS as I showed above, I’d type

python3 -m venv my_new_environment

and it would use version 3.13. The virtual environment amounts to a new directory in the place where I invoked the command with contents like this (for macOS environments):

my_new_environment├── bin
├── etc
├── include
├── lib
├── pyvenv.cfg
└── share

There may be some minor variations with platform, but critically, there’s a lib/python3.13/site-packages directory where libraries get installed, and a bin directory with executables (it’s the Scripts directory on Windows). Note that the Python runtime executables are actually just links to the Python executable used to create the virtual environment. When you “activate” a virtual environment, two things happen:
1 – The prompt changes, so that my_new_environment now appears at the start of the command prompt, and
2 – Some path-related environment variables are modified such that my_new_environment/bin is added to system path, and Python’s import path resolves to my_new_environment/lib/python3.13/site-packages.

The environment is activated with a script located at my_new_environment/bin/activate (macOS) or my_new_environment\Scripts\activate (Windows).

❯ source my_new_environment/bin/activate

my_new_environment ❯ which python
/Users/timdiller/my_new_environment/bin/python

my_new_environment ❯ python
Python 3.13.9 (v3.13.9:8183fa5e3f7, Oct 14 2025, 10:27:13) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path
['', '/Library/Frameworks/Python.framework/Versions/3.13/lib/python313.zip', '/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13', '/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/lib-dynload', '/Users/timdiller/my_new_environment/lib/python3.13/site-packages']

That’s it. That’s the magic. The fun begins when you start installing packages…

PIP, the Package Installer for Python

In Python, the standard installation tool is called pip, the Package Installer for Python. For many use cases, pip works just fine, and if I were to only ever use a Mac with an SSD, I would probably just stick with pip. Although you can use pip to install packages from anywhere (including local code or from a code repository like GitHub) pip‘s default repository is PyPI, the Python Package Index, a central repository for community-contributed libraries. All of the major science and engineering packages including NumPy, SciPy, and Pandas are published through PyPI.

When someone uses pip install to install a package into their Python environment, broadly speaking, two things happen:
1 – pip resolves the dependencies for the package to be installed (properly listing dependencies is up to the author(s)) and computes a list of everything that needs to be installed or updated; this is a non-trivial task involving a mini-language for specifying acceptable versions and SAT Solving, and accomplishing it efficiently was an early distinctive advantage for early 3rd-party packaging tools like Enthought’s EDM and Continuum’s (now Anaconda’s) conda; since version 20.3, pip generally handles dependency resolution as well as any of its competitors (more on this below).
2 – pip places copies of the packages into the site-packages directory, downloading new versions as needed. pip keeps a cache of libraries so that it doesn’t have to download each one from PyPI every time. Unlike some of the newer tools, pip does not try to de-duplicate installed files across environments with hard links or shared package stores. Each environment is independent and isolated.

pip defines the canonical package management behavior for Python: safe, debug-friendly, and transparent, if somewhat inefficient in terms of file system utilization. In practice, I’ve found little to complain about using pip on the macOS, but in Windows, the file-copying step can be painfully slow, especially if there is any kind of malware scanning tool involved.

Enter the 3rd-party Package Managers

In the early days of scientific computing in Python, Enthought published the Enthought Python Distribution (EPD), a semantic-versioned Python distribution with a runtime and a reasonably complete set of libraries that were extensively tested and guaranteed to “play nicely” together. In 2011, Continuum brought a competing product to market, Anaconda, with a focus on broad access to open source packages, including the newest releases. Anaconda shipped with a command line tool conda for managing its own version of virtual environments. Meanwhile, Enthought focused on rigorously-tested, curated package sets with security-conscious institutional customers in mind and evolved EPD into the Enthought Deployment Manager (EDM), which also defined “applications” that could be deployed within an institution using the edm command-line tool. This involved managing virtual environments in a registry or database that could be synced with an institutional server. Both edm and conda implement managed environments and employ shared package stores and hard links where possible to reduce the storage footprint of virtual environments. But, notably, they also both follow a cultural norm of “playing nicely” (more or less — edm is substantially better about this than conda) with pip and following the same (more or less) command-line interface. conda has the notable disadvantage of a substantially more-invasive installation. In my experience, it is harder to debug conda environments (I do this a lot on the first days of class) and can take substantial effort to reverse an installation of Anaconda and remove all of the “magic” it contributes.

Meanwhile, improvements to pip over the last several years, especially since late 2020, effectively erased the dependency-solving advantages offered by edm and conda, and the main advantages offered by them are space efficiency and familiarity. Enthought’s packages continue to be rigorously tested and curated, and Anaconda’s conda-forge remains a trusted source for the latest versions of packages. But neither of them was particularly optimized for the fast, ephemeral environments needed for continuous integration workflows.

Astral – `uv`

That brings us to a newer player, uv, published by Astral, whose focus is on high-performance tooling for Python written in Rust. (Rust is a compiled language that prioritizes memory and thread safety and is often found in web services and system software.) uv addresses the needs presented by continuous integration environments, where source code repositories like GitHub enable automatic actions for testing and building code as part of the regular development process. Many providers of cloud services like GitHub charge for compute time, so that minimizing the time to set up environments for testing translates directly to cost savings. Depending on the situation, uv can operate from 10-100 times faster than pip.

That performance comes with tradeoffs, however. uv intentionally diverges from some long-standing pip conventions in order to optimize for speed, reproducibility, and CI-friendly workflows. Most notably, it decouples installs from shell activation, aggressively deduplicates packages across environments, and it treats dependency resolution as a first-class, cached artifact in a lock file.

For example, whereas venv is quite explicit about which runtime is used and what the virtual environment is named, uv will implicitly create an environment with a default name as needed, and activation is often unnecessary. And consider caching, uv‘s aggressive use of references to a global package store deliberately discards the independent isolation of venv environments; environments are no longer self-contained as they are under venv management.

These choices optimize for speed and reproducibility in CI pipelines, but come at the cost of some of the simplicity and transparency that make venv and pip effective teaching tools. Those design choices make a lot of sense for automated build systems that create and destroy environments repeatedly, but they are less compelling in interactive or instructional settings, where environment creation happens infrequently and clarity matters more than raw speed. They also add a lot of additional conceptual “surface area” to cover during class.

Conclusion

We acknowledge the curated safety and testing of edm environments, the ubiquity of conda environments and the conda-forge ecosystem, and the impressive engineering and benefit to modern CI pipelines of uv. However, we generally prefer to stick with the standard, simple, tried-and-true venv + pip tooling for teaching and day-to-day development. It’s the easiest foundation to build on if students want to explore the other options.

NumPy Indexing — the lists and tuples Gotcha

In a recent session of Python Foundations for Scientists & Engineers, a question came up about indexing a NumPy ndarray. Beyond getting and setting single values, NumPy enables some powerful efficiencies through slicing, which produces views of an array’s data without copying, and fancy indexing, which allows use of more-complex expressions to extract portions of arrays. We have written on the efficiency of array operations, and the details of slicing are pretty well covered, from the NumPy docs on slicing, to this chapter of “Beautiful Code” by the original author of NumPy, Travis Oliphant.

Slicing is pretty cool because it allows fast efficient computations of things like finite difference, for say, computing numerical derivatives. Recall that the derivative of a function describes the change in one variable with respect to another:

\frac{dy}{dx}

And in numerical computations, we can use a discrete approximation:

\frac{dy}{dx} \approx \frac{\Delta x}{\Delta y}

And to find the derivative at any particular location i, you compute the ratio of differences:

\frac{\Delta x}{\Delta y}\big|_i = \frac{x_{i+1} - x_{i}}{y_{i+1} - y{i}}

NumPy allows you to use slicing to avoid setting up costly-for-Python for: loops by specifying start, stop, and step values in the array indices. This lets you subtracting all of the i indices from the i+1 indices at the same time by specifying one slice that starts at element 1 and goes to the end (the i+1 indices), and another that starts at 0 and goes up to but not including the last element. No copies are made during the slicing operations. I use examples like this to show how you can get 2 and sometimes 3 or more orders of magnitude speedups of the same operation with for loops.

>>> import numpy as np

>>> x = np.linspace(-np.pi, np.pi, 101)
>>> y = np.sin(x)

>>> dy_dx = (
...     (y[1:] - y[:-1]) /
...     (x[1:] - x[:-1])
... )
>>> np.sum(dy_dx - np.cos(x[:-1] + (x[1]-x[0]) / 2))  # compare to cos(x)
np.float64(-6.245004513516506e-16)  # This is pretty close to 0

>>> import numpy as np

>>> x = np.linspace(-np.pi, np.pi, 101)
>>> y = np.sin(x)

>>> dy_dx = (
...     (y[1:] - y[:-1]) /
...     (x[1:] - x[:-1])
... )
>>> np.sum(dy_dx - np.cos(x[:-1] + (x[1]-x[0]) / 2))  # compare to cos(x)
np.float64(-6.245004513516506e-16)  # This is pretty close to 0

Fancy indexing is also well documented (but the NumPy docs now use the more staid term “Advanced Integer Indexing“, but I wanted to draw attention to a “Gotcha” that has bitten me a couple of times. With fancy indexing, you can either make a mask of Boolean values, typically using some kind of boolean operator:

>>> a = np.arange(10)
>>> evens_mask = a % 2 == 0
>>> odds_mask = a % 2 == 1
>>> print(a[evens_mask])
[0 2 4 6 8]

>>> print(a[odds_mask])
[1 3 5 7 9]

>>> a = np.arange(10)
>>> evens_mask = a % 2 == 0
>>> odds_mask = a % 2 == 1
>>> print(a[evens_mask])
[0 2 4 6 8]

>>> print(a[odds_mask])
[1 3 5 7 9]

Or you can specify the indices you want, and this is the Gotcha, with tuples or lists, but the behavior is different either way. Let’s construct an example like one we use in class. We’ll make a 2-D array b and construct at positional fancy index that specifies elements in a diagonal. Notice that it’s a tuple, as shown by the (,) and each element is a list of coordinates in the array.

>>> b = np.arange(25).reshape(5, 5)
>>> print(b)
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
>>> upper_diagonal = (
...     [0, 1, 2, 3],  # row indices
...     [1, 2, 3, 4],  # column indices
... )
>>> print(b[upper_diagonal])
[ 1  7 13 19]

>>> b = np.arange(25).reshape(5, 5)
>>> print(b)
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
>>> upper_diagonal = (
...     [0, 1, 2, 3],  # row indices
...     [1, 2, 3, 4],  # column indices
... )
>>> print(b[upper_diagonal])
[ 1  7 13 19]

In this case, the tuple has as many elements as there are dimensions, and each element is a list (or tuple, or array) of the indices to that dimension. So in the example above, the first element comes from b[0, 1], the second from b[1, 2] so on pair-wise through the lists of indices. The result is substantially different if you try to construct a fancy index from a list instead of a tuple:

>>> upper_diagonal_list = [
    [0, 1, 2, 3],
    [1, 2, 3, 4]
]
>>> b_with_a_list = b[upper_diagonal_list]
>>> print(b_with_a_list)
[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]]

 [[ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]
  [20 21 22 23 24]]]

>>> upper_diagonal_list = [
    [0, 1, 2, 3],
    [1, 2, 3, 4]
]
>>> b_with_a_list = b[upper_diagonal_list]
>>> print(b_with_a_list)
[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]]

 [[ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]
  [20 21 22 23 24]]]

What just happened?? In many places, lists and tuples have similar behaviors, but not here. What’s happening with the list version is different. This is in fact a form of broadcasting, where we’re repeating rows. Look at the shape of b_with_a_list:

>>> print(b_with_a_list.shape)
(2, 4, 5)

>>> print(b_with_a_list.shape)
(2, 4, 5)

Notice that its dimension 0 has 2 elements, which is the same as the number of items in upper_diagoal_list. Notice the dimension 1 has 4 elements, corresponding to the size of each element in upper_diagoal_list. Then notice that dimension 2 matches the size of the rows of b, and hopefully it will be clear what’s happening. In upper_diagoal_list we’re constructing a new array by specifying the rows to use, so the first element of b_with_a_list (seen as the first block above) consist of rows 0, 1, 2, and 3 from b, and the second element is the rows from the second element of upper_diagonal_list. Let’s print it again with comments:

>>> print(b_with_a_list)
[[[ 0  1  2  3  4]   # b[0] \
  [ 5  6  7  8  9]   # b[1]  | indices are first element of
  [10 11 12 13 14]   # b[2]  | upper_diagonal_list
  [15 16 17 18 19]]  # b[3] /

 [[ 5  6  7  8  9]   # b[1] \
  [10 11 12 13 14]   # b[2]  | indices are second element of
  [15 16 17 18 19]   # b[3]  | upper_diagonal_list
  [20 21 22 23 24]]] # b[4] /

>>> print(b_with_a_list)
[[[ 0  1  2  3  4]   # b[0] \
  [ 5  6  7  8  9]   # b[1]  | indices are first element of
  [10 11 12 13 14]   # b[2]  | upper_diagonal_list
  [15 16 17 18 19]]  # b[3] /

 [[ 5  6  7  8  9]   # b[1] \
  [10 11 12 13 14]   # b[2]  | indices are second element of
  [15 16 17 18 19]   # b[3]  | upper_diagonal_list
  [20 21 22 23 24]]] # b[4] /

Forgetting this convention has bitten me more than once, so I hope this explanation helps you resolve some confusion if you should ever run into it.

Managing Pandas’ deprecation of the Series first() and last() methods.

Have you stumbled across this warning in your code after updating Pandas: “FutureWarning: last is deprecated and will be removed in a future version. Please create a mask and filter using `.loc` instead“? In this post, we’ll explore how that method works and how to replace it.

I’ve always loved the use-case driven nature of methods and functions in the Pandas library. Pandas is such a workhorse in scientific computing in Python, particularly when it comes to things like timeseries data and dealing with calendar-labeled data in particular. So it was with a touch of frustration and puzzlement that I discovered that the last() method had been deprecated, and its removal from Pandas’ Series and DataFrame types is planned. In the Data Analysis with Pandas` course, we used have an in-class exercise where we recommended getting the last 4 weeks’ data using something like this:

In [1]: import numpy as np
   ...: import pandas as pd
   ...: rng = np.random.default_rng(42)
   ...: measurements = pd.Series(
   ...:    data=np.cumsum(rng.choice([-1, 1], size=350)),
   ...:    index=pd.date_range(
   ...:        start="01/01/2025",
   ...:        freq="D",
   ...:        periods=350,
   ...:   ),
   ...:)
In [2]: measurements.last('1W')
<ipython-input-5-ec16e51fe7ce>:1: :1: FutureWarning: last is deprecated and will be removed in a future version. Please create a mask and filter using `.loc` instead
  measurements.last('1W')
Out[2]:
2025-12-15   -7
2025-12-16   -8
Freq: D, dtype: int64

This has the really useful behavior of selecting data based on where it falls in a calendar period. Thus the command above usefully returns the two elements from our Series that occur in the last calendar week, which begins (in ISO format) on Monday, Dec 15.

The deprecation warning says “FutureWarning: last is deprecated and will be removed in a future version. Please create a mask and filter using .loc instead.” Because .last() is a useful feature, I wanted to take a closer look to see if I could understand what’s going on and what the best way to replace it would be.

Poking into the code a bit, we can see that the .last() method is a convenience function that uses pd.tseries.frequencies.to_offset() to turn '1W', technically a designation of period, into an offset, which is subtracted from the last element of the DatetimeIndex, yielding the starting point for a slice on the index. From the definition of last:

 ...
    offset = to_offset(offset)

    start_date = self.index[-1] - offset
    start = self.index.searchsorted(start_date, side="right")
    return self.iloc[start:]

Note that side='right' in searchsorted() finds the first index greater than start_date. We could wrap all of this into an equivalent statement that yields no FutureWarning thus:

In [3]: start = measurement.index[-1] - to_offset('1W')
In [4]: measurement.loc[measurement.index > start]
Out[4]:
2025-12-15   -7
2025-12-16   -8
Freq: D, dtype: int64

There’s a better option, though, which is to use pd.DateOffset. It’s a top-level import, and it gives you control over when the week starts, which to_offset does not. Remember we are using ISO standards, so Monday is day 0:

In [5]: start = measurements.index[-1] - pd.DateOffset(weeks=1, weekday=0)
In [6]: measurements.loc[measurements.index > start]
Out[6]:
2025-12-15   -7
2025-12-16   -8
Freq: D, dtype: int64

Slicing also works, even if the start point doesn’t coincide with a location in the index. Mixed offset specifications are possible, too:

In [7]: measurements.loc[measurements.index[-1] - pd.DateOffset(days=1, hours=12):]
Out[7]:
2025-12-15   -7
2025-12-16   -8
Freq: D, dtype: int64

The strength of pd.DateOffset is that it is calendar aware, so you can specify the day of the month, for example:

In [8]: measurements.loc[measurements.index[-1] - pd.DateOffset(day=13):]
Out[8]:
2025-12-13   -7
2025-12-14   -6
2025-12-15   -7
2025-12-16   -8
Freq: D, dtype: int64

There’s also the non-calendar-aware pd.Timedelta you can use to count back a set time period without taking day-of-week or day-of-month into account. Note: as with all Pandas location-based slicing, it is endpoint inclusive, so 1 week yields 8 days’ measurements:

In [9]: measurements.loc[measurements.index[-1] - pd.Timedelta(weeks=1):]
Out[9]:
2025-12-09   -9
2025-12-10   -8
2025-12-11   -7
2025-12-12   -8
2025-12-13   -7
2025-12-14   -6
2025-12-15   -7
2025-12-16   -8
Freq: D, dtype: int64

You may have noticed I prefer slicing notation, whereas the deprecation message suggests using a mask array. There’s a performance advantage to using slicing, and the notation is more compact than the mask array but less so than the .last() method. In IPython or Jupyter, we can use %timeit to quantify the difference:

In [10]: %timeit measurements.loc[measurements.index[-1] - pd.DateOffset(day=13):]
45.7 μs ± 2.36 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [11]: %timeit measurements.last('4D')
56.3 μs ± 14.9 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [12]: %timeit measurements.loc[measurements.index >= measurements.index[-1] - pd.DateOffset(day=13)]
89.2 μs ± 6.31 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

After spending some time with git blame and the Pandas-dev source code repository, the reasons for the deprecation of the first and last methods make sense:

there is unexpected behavior when passing certain kinds of offsets
they don’t behave analogously to SeriesGroupBy.first() and SeriesGroupBy.last()
they don’t respect time zones properly

Hopefully this has been a useful exploration of pd.Series.last (and .first), their deprecation, and how to replace them in your code with the more-explicit and better-defined masks and slices. Happy Coding!

Arrays and Lists

I often get questions about the difference between the Python list and the NumPy ndarray, and we talk about this a lot in Python Foundations for Scientists & Engineers. But recently I had a question about the array object that is part of the Python language, and why don’t we use it for computations? Why is NumPy necessary if Python has an array data type? Well, this is a great question!

First of all, let’s take a quick look at the Python array. It’s found in the array standard library and usually imported as arr:

>>> import array as arr

Like a NumPy ndarray, it has a data type, and every element of the array has to conform to that type. Set the data type in in the first argument when creating the array. 'l' in this case specifies a 4-byte signed integer:

>>> a = arr.array('l', range(5))
>>> a
array('l', [0, 1, 2, 3, 4])

It has the same indexing and slicing behavior as a Python list, and we even see that it has the same “multiplication” behavior as lists: i.e. “multiplying” means repetition, not element-wise, arithmetic multiplication as with the NumPy ndarray.

With that in mind, computations look the same for arrays as for lists:

>>> b = arr.array(a.typecode, (v + 1 for v in a))
>>> b
array('l', [1, 2, 3, 4, 5])

Let’s use IPython’s %timeit magic to see how array computations fare in comparison with Python lists and NumPy ndarrays. For the list and array, we’ve specified a list comprehensions as the most efficient way to do the computations and a simple vector computation for the NumPy ndarray, and our test is to increment sequence of 10 integers:

In [1]: import array as arr
   ...: import numpy as np
   ...:
   ...: %timeit [val + 1 for val in list(range(10))]
   ...: %timeit arr.array('q', [val + 1 for val in arr.array('q', range(10))])
   ...: %timeit np.arange(10) + 1
279 ns ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
825 ns ± 5.22 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.03 μs ± 98.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

It’s surprising to note that list comprehension is fastest, beating array computation by a factor of 3 or so, and beating ndarray computation by a factor of 3.7! Suspecting that a 10-integer sequence may not be representative, let’s bump it up to 1000:

In [2]: %timeit [val + 1 for val in list(range(1000))]
   ...: %timeit arr.array('q', [val + 1 for val in arr.array('q', range(1000))])
   ...: %timeit np.arange(1000) + 1
28.4 μs ± 3.47 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
82.4 μs ± 9.05 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
1.82 μs ± 175 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With more numbers to chew on, NumPy clearly pulls ahead, suggesting the overhead of creating an ndarray may dominate computation time for small arrays, but the actual computation time is relatively small. The Python array is still relatively inefficient compared to the Python list. Let’s make some helper functions to compare how long it takes to increment each value in lists, arrays, and NumPy ndarrays accross a broad range of different sizes. Each function will accept a list, array, or ndarray and return the time (in ms) required to increment each value by 1.

In [3]: from datetime import datetime
   ...:
   ...: def inc_list_by_one(l):
   ...:     start = datetime.now()
   ...:     res = [val + 1 for val in l]
   ...:     return (datetime.now() - start).total_seconds() * 1000
   ...:
   ...: def inc_array_by_one(a):
   ...:     start = datetime.now()
   ...:     res = arr.array(a.typecode, [val + 1 for val in a])
   ...:     return (datetime.now() - start).total_seconds() * 1000
   ...:
   ...: def inc_numpy_array_by_one(arr):
   ...:     start = datetime.now()
   ...:     res = arr + 1
   ...:     return (datetime.now() - start).total_seconds() * 1000

Now let’s record the computation times for a range of array sizes and plot them using matplotlib and seaborn:

In [14]: import matplotlib.pyplot as plt
    ...: import seaborn as sns
    ...:
    ...: times = []
    ...: for size in range(1, 10_000_001, 200_000):
    ...:     times.append((
    ...:         size,
    ...:         inc_list_by_one(list(range(size))),
    ...:         inc_array_by_one(arr.array('q', range(size))),
    ...:         inc_numpy_array_by_one(np.arange(size)),
    ...:     ))
    ...: times_arr = np.array(times)
    ...:
    ...: sns.regplot(x=times_arr[:, 0], y=times_arr[:, 1], label='Python list')
    ...: sns.regplot(x=times_arr[:, 0], y=times_arr[:, 2], label='Python array')
    ...: sns.regplot(x=times_arr[:, 0], y=times_arr[:, -1], label='Numpy array')
    ...: plt.xlabel('Array Size (num elements)')
    ...: plt.ylabel('Time to Increment by 1 (ms)')
    ...: plt.legend()
    ...: plt.show()

…and now it’s clear that computation times for Python lists and arrays scale with the size of the object much faster than NumPy ndarrays do. And it’s also clear that the Python array is no improvement for computations, static typing notwithstanding. Let’s quantify:

In [26]: print(f'Python list {np.polyfit(times_arr[:, 0], times_arr[:, 1], 1)[0]:.2e} (ms/element)')
    ...: print(f'Python array {np.polyfit(times_arr[:, 0], times_arr[:, 2], 1)[0]:.2e} (ms/element)')
    ...: print(f'NumPy ndarray {np.polyfit(times_arr[:, 0], times_arr[:, -1], 1)[0]:.2e} (ms/element)')
Python list 4.02e-05 (ms/element)
Python array 6.50e-05 (ms/element)
NumPy ndarray 2.01e-06 (ms/element)

So if we take the slope of those curves as a gauge of performance, Python arrays are roughly 40% slower in computation than lists, and the NumPy ndarray computations run in about 95% less time, on average for arrays that are not trivially small. This is because NumPy takes advantage of knowing the data type of each element and setting up efficient strides in memory for computations, which can be optimized at the operating system and hardware levels.

What, then, is the advantage of the Python array over a Python list? With helper functions similar to the ones above, we can return sys.getsizeof() sample list and array objects over the same range of sizes. Plotting the results, we see something interesting:

Memory consumption for both Python arrays and NumPy ndarrays scale exactly linearly with the number of elements (the q typcode specifies a signed integer with a minimum 8 bytes, which corresponds to the default int64 dytpe for integer NumPy ndarrays.) But notice how the Python list memory consumption increases stepwise. This is an implementation detail of the Python list type that allows it to quickly add elements without shifting memory after each append operation. And this is likely the explanation for why lists can be faster and more efficient than arrays. When a new value is written to an array, it likely often occurs that the some or all of the object needs to be shuffled in memory to find contiguous locations, but a list is optimized for appending, so a certain number of additions can be made before any reshuffling happens.

To test this, we can compare the computation times for an array computed with a list comprehension against an array that is copied and overwritten:

def inc_array_by_one_with_copy(a):
    start = datetime.now()
    res = copy(a)
    for i, val in enumerate(a):
        res[i] = val + 1
    return (datetime.now() - start).total_seconds() * 1000

Plotted, we see that the copy-fill approach is marginally faster, and there is less variation, likely due to fewer memory relocations.

Computation time v array size for arrays created with list comprehension and copy-fill strategies

In conclusion, it helps to understand that the array type is really intended as a Python wrapper around a C array, and there are some functions and codes that use it to efficiently return a Python object. But as for computations, I hope I’ve convinced you that the NumPy ndarray is the right type for computations in terms of both memory and computation efficiency. As for Python arrays, now you know what they are, and you can safely ignore them in favor of the ndarray for your scientific computations.

Virtual Environments

PIP, the Package Installer for Python

Enter the 3rd-party Package Managers

Astral – uv

Conclusion

Astral – `uv`