Doctesting for PyData Libraries
Published October 30, 2023
Sheila-nk
Sheila Kahwai
Discovering the PyData World
Hey there, my name is Sheila Kahwai, and before this internship, I was a PyData newbie! Yes, I hadn't dipped my toes into the world of NumPy, and my first time locally building SciPy happened to be a month into my internship.
Even though I had no experience working with SciPy or NumPy, I knew I had the potential to create something valuable for the PyData community. So, when I was assigned the task of building a pytest plugin, something I am all too familiar with, I thought, "Maybe a month tops, right? Quick operation, in and out!" Lol, was I in for a surprise!
It was a journey filled with unexpected roadblocks. There were moments I thought I was seeing the light at the end of the tunnel only to realize that the tunnel had light wells. But through it all, I remained positive because my primary goal was to learn and grow, and this internship was an endless source of knowledge and personal growth.
Navigating the Doctesting Landscape
Let's dive into the technical stuff now. The refguide-check tool is a utility in SciPy and NumPy that checks docstrings. One of its essential functions is doctesting: running docstring examples to verify they are accurate and valid. Docstring examples are critical because they serve as documentation, showing users how to use your code. However, having them is not enough; they must also be accurate.
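As a quick illustration of what's being tested, here is what a docstring example looks like to the standard library's `doctest` module (the `add` function is made up for illustration):

```python
# The standard doctest workflow: each line after a >>> prompt is executed,
# and its printed output is compared, character for character, with the
# text that follows it in the docstring.

def add(a, b):
    """Add two numbers.

    Examples
    --------
    >>> add(2, 3)
    5
    """
    return a + b

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # reports any example whose actual output differs
```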
NumPy and SciPy use a modified form of doctesting in their refguide-check utilities. My mentor, Evgeni Burovski, managed to isolate this functionality into a separate package called "scpdt". Scpdt is not your ordinary doctesting tool. It has the following capabilities:
- Floating-Point Awareness: Scpdt is acutely aware of floating-point intricacies. For example, it recognizes that `1/3` isn't precisely equal to `0.333` due to floating-point precision. It incorporates a core check using `np.allclose(want, got, atol=..., rtol=...)`, allowing users to control the absolute and relative tolerances (see the sketch after this list).
- Human-Readable Skip Markers: Scpdt introduces user-friendly skip markers like `# may vary` and `# random`. These markers differ from the standard `# doctest: +SKIP` in that they selectively skip the output verification while ensuring the example source remains valid Python code.
- Handling NumPy's Output Formatting: NumPy has a unique output formatting style, such as abbreviating large arrays and adding whitespace, that can confound standard doctesting, which is whitespace-sensitive. Scpdt ensures accurate testing despite these quirks.
- User Configurability: Through a `DTConfig` instance, users can tailor the behavior of doctests to meet their specific needs.
- Flexible Doctest Discovery: One can use `testmod(module, strategy='api')` to check only a module's public objects, which is ideal for complex packages. The default `strategy=None` mirrors the standard doctest module's behavior.
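To make this concrete, here is a minimal sketch of these features in action. The `demo` function is made up, the expected outputs are indicative, and the default tolerances and the top-level `testmod` import are assumptions, so treat this as a sketch rather than the package's definitive behavior:

```python
# A sketch of scpdt-style doctests. Plain doctest compares output strings
# character by character; scpdt compares numbers via np.allclose and
# honors the human-readable skip markers.

import numpy as np

def demo():
    """Illustrate scpdt's relaxed checks.

    Examples
    --------
    >>> 1 / 3              # passes under np.allclose despite the rounding
    0.333
    >>> np.random.rand(3)  # random
    array([0.41, 0.72, 0.01])
    """

if __name__ == "__main__":
    import sys
    from scpdt import testmod

    # strategy='api' would check only a module's public objects;
    # the default strategy=None mirrors the stdlib doctest module.
    testmod(sys.modules[__name__])
```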
But here's the twist: Scpdt could only perform doctesting on SciPy's and NumPy's public modules through a helper script, and that wasn't ideal. So, guess who stepped in to bridge the gap?
Bridging the Gap with Pytest
Pytest already has a doctesting module, but unfortunately, it doesn't meet the specific needs of the PyData libraries. Therefore, the crucial task was to ensure pytest could leverage the power of Scpdt for doctesting. This involved overriding some of doctest's functions and classes to incorporate scpdt's alternative doctesting objects. It also meant modifying pytest's behavior by implementing hooks, primarily for initialization and collection.
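In pytest terms, that wiring looks roughly like the sketch below. This is not the plugin's actual code: the `--scpdt-doctests` flag and the `ScpdtDoctestModule` class are illustrative stand-ins for the real initialization and collection hooks:

```python
# A simplified sketch of the hook pattern described above -- not the
# plugin's real code. pytest_addoption handles initialization;
# pytest_collect_file reroutes collection to a scpdt-aware collector.

import pytest

class ScpdtDoctestModule(pytest.Module):
    """Stand-in for a collector that swaps in scpdt's doctest machinery."""

    def collect(self):
        # The real plugin would yield doctest items built with scpdt's
        # parser, runner, and checker; this placeholder yields nothing.
        return []

def pytest_addoption(parser):
    # Initialization hook: register an opt-in flag for scpdt doctests.
    parser.addoption("--scpdt-doctests", action="store_true",
                     help="collect doctests with the scpdt-aware collector")

def pytest_collect_file(file_path, parent):
    # Collection hook: route Python files to the custom collector.
    if parent.config.getoption("--scpdt-doctests") and file_path.suffix == ".py":
        return ScpdtDoctestModule.from_parent(parent, path=file_path)
```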
Once all the technical juggling was done, it was time for what my mentor called "dogfooding" (a term he picked up from Joel Spolsky's essay). The term simply means putting your own product to the test by using it, and I had to make sure the plugin functioned as expected. I did this by running doctests locally on SciPy's modules. It was an eye-opener, exposing issues like faulty collection: for example, the plugin wasn't collecting compiled functions or NumPy universal functions (ufuncs) for doctesting.
With the bugs and vulnerabilities exposed during this process, I was able to refine the plugin further. I then created a pull request to demonstrate how the pytest plugin could be seamlessly integrated into SciPy. The process is fairly straightforward:
- Installation: Install the plugin via pip.
- Configuration: Customize your doctesting through a `conftest.py` file (see the sketch after this list).
- Running Doctests in SciPy: If you're running doctests on SciPy, execute the command `python dev.py test --doctests` in your shell.
- Running Doctests on Other Packages: If you're not working with SciPy, use the command `pytest --pyargs <your-package> --doctest-modules` to run your doctests.
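For the configuration step, the customization lives in `conftest.py`. The attribute names below (`dt_config`, `atol`, `rtol`, `default_namespace`) are assumptions based on the `DTConfig` description above, so consult the plugin's README for the exact interface:

```python
# conftest.py -- a sketch of per-project doctest configuration.
# Attribute names are assumptions; check the plugin README for the exact
# hook it reads the configuration from.

import numpy as np
from scpdt import DTConfig

dt_config = DTConfig()
# Loosen the absolute/relative tolerances that np.allclose uses when
# comparing floating-point doctest output.
dt_config.atol = 1e-8
dt_config.rtol = 1e-2
# Make numpy available in every doctest without an explicit import.
dt_config.default_namespace = {"np": np}
```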
Voila! 🎉
Future Goals
I am currently in the process of integrating the plugin into SciPy; for more details, you can check out the PR. Looking ahead, our goal is to publish the plugin on PyPI and extend its integration to NumPy and other PyData libraries.
If you run into challenges with floating-point arithmetic, face output issues related to whitespace and array abbreviations, need to validate example source code without output testing, or simply desire a customized doctesting experience, consider giving this plugin a try.
The Journey's End
Throughout this incredible journey, I cherished every moment spent working, learning from my mentors: Evgeni Burovski and Melissa Weber Mendonça, and being part of the Quansight team. I'm incredibly grateful for this opportunity, and I look forward to continuing my contributions to the pytest plugin even after the internship.
Curious? Check out the plugin repository on GitHub. Feel free to contribute – the more, the merrier! 🚀🐍
Stay tuned for more exciting developments!