Misapplying LazyRecursiveDefaultDict

A cautionary tale of how I misapplied the wrong software tool to a problem, and what I've learned from it.

by Danver Braganza on 2021-01-10

This story goes back several years, to when I was working at Hipmunk on the Backend Platform Team. Among the many varied responsibilities of this team¹, we were responsible for maintaining our integrations with our partners via their APIs. Most of these partner APIs returned data as nested dictionaries, encoded as JSON. You may be able to imagine that some of these APIs were less reliable than others, and so we had adapted with a bag of tricks to deal with malformed and outright bad data. The story I’m about to tell you describes when I got carried away trying to solve one specific problem, the consequences it had, and the lessons I’ve learned from it.

The nature of the problem

First, lets abstract away the specifics of our API integration. In essence, we would make some kind of standard, authenticated request to our partner, and the response would be in the form of a JSON encoding of some nested dictionary tree structure. It would look something like this:

{
    "status": "success",
    "result": {
        "mandatory_key": "some attribute",
	"optional_key": "some attribute",
	"nested_object_1": {
            "key_1": "value 1",
	    "optional_nested_object_2a": {
                "key_1": "value 1"
		"possible_display_key": "display me if present",
            },
	    "nested_object_2b": {
		"possible_display_key": "display me as a fallback",
            }
        }
    }
}

Pay attention to possible_display_key, which appears twice in this response. This is a stripped-back example of what the problem was, both in the number of missing fields, and the depth of nesting. optional_nested_object_2a may or may not be present in the response returned by the partner, depending on whether their data covered that use case. For a fictional example, the entire hotel image subtree might be missing from the results for a specific hotel if the partner we queried had no images for that hotel.

Our code that tried to pull out these values was written with the end uses in mind. So these missing subtrees made for clunky code, like the following:

def extract_display_field(response):
    nested_object_1 = response['result']['nested_object_1']
    if 'optional_nested_object_2a' in nested_object_1 and 'possible_display_key' in nested_object_1['optional_nested_object_2a']:
        return nested_object_1['optional_nested_object_2a']['possible_display_key']
    if 'nested_object_2b' in nested_object_1 and 'possible_display_key' in nested_object_1['nested_object_2b']:
        return nested_object_1['nested_object_2']['possible_display_key']
    return 'some default'

Of course, I need you to put in some effort to imagine what it actually looked like, since those nestings could be 3-6 layers deep of potentially missing structures². Our parser code was trying to express that we wanted to do something with the element at a certain path, if it existed. Because entire subtrees might be missing, however, we needed to be careful about how we traversed the structure, so that we didn’t throw a KeyError if we accidentally stepped off the tree.

Over time, we settled on an idiom that used the default parameter of dict.get(), and it looked like the following:

def extract_display_field(response):
    return (
        response['result']['nested_object_1'].get(
            'optional_nested_object_2a', {}).get('possible_display_key')
        or response['result']['nested_object_1'].get(
            'nested_object_2b', {}).get('possible_display_key')
	or 'some default
    )

Because of the depth, we’d need to chain those .gets so they’d look like:

return response.get('foo', {}).get('bar', {}'
       ).get('baz', {}).get('quux', {}) or 'default'

Why not just catch the KeyError?

Why not rely on KeyError to detect when a value was missing? It’s a common adage in Python that it’s Easier to Ask Forgiveness Than Permission, and so most idiomatic Python would use:

try:
    val = response['foo']['bar']['baz']
except KeyError:
    val = "default"

Well, the reason was just one of convenience. try and except define clauses, which cannot be part of expressions in the language. What we wanted was to be able to write concise expressions like:

response['foo']['bar']['baz'] or response['foo']['bar']['baz2']

without needing to pad out the number of lines.

Why pick out fields rather than unmarshalling objects

Another question that you might have is why not parse the entire API response structure into objects, and then use methods of those objects to find what we want, rather than querying dict trees? The answer there is cost. We had scores of these API integrations to maintain, and our partners all had different domain models for their objects.

For example, some APIs would represent the Lat/Long coordinates of an Airport as part of the Airport object. Others might have a Location object within the Airport, which then had the coordinates in there. The purpose of our integration code was to pick out the coordinates, and put them into the common structured object model we had for all such requests. Maintaining another set of objects to represent each partner’s structure would bloat the code for no additional purpose.

Enter LazyRecursiveDefaultDict

This was where I had my bright idea. I wasn’t the first person to come up with the idea of a recursive defaultdict.

However, applying it to the problem of parsing these APIs seemed so clever, that I was genuinely pleased with myself for having thought of it. That must have been why I was so eager to implement it, despite the concerns my reviewers raised about the excess complexity.

The code itself was very small. First let’s have the code, and then I’ll go through the points I want to you notice. Remember, this is a cautionary tale, so please don’t add this in any production context. Also, I don’t have access to the original code anymore, so I threw this code together for this blog article.

from collections import defaultdict

class LazyRecursiveDefaultDict(defaultdict):
    """Do not use in production."""

    def __init__(self, tree=None):
        if tree is None:
            tree = {}

        super(LazyRecursiveDefaultDict, self).__init__(
            LazyRecursiveDefaultDict, tree)

    def __getitem__(self, key):
        retval = super(LazyRecursiveDefaultDict, self).__getitem__(key)
        if isinstance(retval, dict):
            return LazyRecursiveDefaultDict(retval)
        return retval

A LazyRecursiveDefaultDict is a subclass of defaultdict. Defaultdict is a type of collection available in the Python standard library. It behaves like a dictionary, with the added ability to specify a default value for missing keys. Judicious use of defaultdict simplifies a lot of code.

This code uses a recursive definition at __init__ to set the default constructed dictionary to be an empty LazyRecursiveDefaultDict. This way, any number of nested, missed accesses would just return an empty LazyRecursiveDefaultDict

>>> response["foo"]["bar"]["wrong"][
...     "somewhere"]["these"]["keys"]["are"]["missing"]
defaultdict(<class '__main__.LazyRecursiveDefaultDict'>, {})

The reason this is Lazy, is so that it doesn’t have to process the entire tree of dictionaries, converting them all, at construction time. By overriding __getitem__, I was able to make sure that whenever the item being returned was a dictionary, the appropriate construction code would come into play.

Rollout, adoption, eventual fall

I initially wrote this while implementing an API integration, whose rollout was successful. I reported about LazyRecursiveDefaultDict to my teammates, and recommended that we start adopting it more widely. And then, after I’d publicly let everyone know that this tool was available for everyone’s use, the problems started!

Some of our existing parsers relied on the fact that nested accesses would throw KeyErrors if not met. They had been written over many years of edge cases, planned and unplanned mutations to external APIs, and changes to our systems, and so there was a lot of surface area to cover. I had thought that all we needed to do was to convert the response tree to LazyRecursiveDefaultDict, to start reaping the benefits. Instead, we found that this exposed all sorts of subtle errors started showing up in testing.

In addition, we found that:

Debugging became harder, now that there was another recursive data structure to reason about
There was a lot of needless copying during every dictionary access
LazyRecursiveDefaultDict didn’t convert the objects returned from APIs if they were an element of a list
LazyRecursiveDefaultDicts were escaping their contexts, since parsers might send them for logging or error reporting, and they had the potential to cause subtle problems there too.

None of these problems were unsolvable. However, solving those problems would require writing more code, making this class even more complex. That forced me to think about how much I was willing to invest in this as a solution, and whether it was the right tool for the job.

Restoration, and Replacement

As soon as I realized that LazyRecursiveDefaultDict was not the panacea we were looking for, I made sure to ticket work to remove it. In the end, a simple solution was much better, a function that looked something like:

def get_from_tree(
    tree: dict,
    path: typing.List[typing.Union[str, int]],
    default=None):
    """Implementation left as an exercise for the reader.

    Hint: use a for loop to traverse the path, picking
    your way down the tree. If you get a KeyError
    (or an IndexError!), return the default value instead.
    """

# Call it this way:
get_from_tree(response, ["foo", "bar", "baz", "quux"], "default")

The function was much easier to understand, which lowered the barrier to adoption by the members of our team. Additionally, things that looked like a dictionary access behaved like a dictionary access, which reduced the amount of questions that a newcomer to our codebase might have.

LazyRecursiveDefaultDict required, or at least strongly insisted on, total buy-in within the parser, since you apply it to the response returned from the partner before operating on it. In contrast, the accessor function allows the developers to opt in on a expression-by-expression basis, allowing a slower rollout of the cleaner code.

Finally, the function suffered from none of the other shortcomings of LazyRecursiveDefaultDict that were caused by its method of implementation.

For all of these reasons, this made the function solution the better choice.

Lessons

By way of conclusion, let me list the main lessons I learned from this story.

We engineers can be seduced by code that is “elegant”, at the expense of real quality metrics. I doubt I would have been so excited by this solution if it hadn’t been lazy and recursive, and I certainly would have been more heedful of my colleagues criticisms.
You like solutions even more when you feel like you’ve had a hand in inventing them. There was something sufficiently novel about LazyRecursiveDefaultDict that I felt invested in it, and I wanted to see it succeed. I had even planned to write a blog post about it, describing how awesome it was. Many years later, this is what that blog post turned into.
You should build the smallest thing you can that solves the problem. This sounds a bit like Occam’s Razor but for producing things.
As a way of operationalizing the point above, and to avoid getting carried away again by fancy solutions, I now make sure to consider as least one alternative implementation for everything novel I create. Now, when writing design docs, I force myself to brainstorm at least one alternative option for every significant tech choice. Often the alternatives are strikingly inferior to the chosen design, but sometimes they can outperform it on a few quality metrics. In those cases, it is a prompt to think if there’s yet another solution that I’m missing, or if it’s a case of a legitimate tradeoff.
You’ll find your solutions adopted more readily if you require less buy-in from your customers. As a platform engineer, my customers were the other engineers on my team. I was able to get further by making my solution easy to adopt, and without requiring them to relearn what the dictionary access syntax could possibly mean in a different context.
You’re underestimating the edge cases hidden in legacy code. I’ve never not been surprised at just how much depth even a moderately old codebase quickly acquires. This is why best practices such as strong test coverage and aggressive refactoring are very important. And, when it comes time to make a large change to a legacy codebase, actively seek ways to go incrementally, if possible.

I’ve observed across multiple companies that the team named “Platform” tends to end up responsible for a really wide variety of projects. Teams with names such as Reporting, Billing, Customer Outreach, can have narrow focus, but the Platform Team ends up looking after a grab-bag of responsibilities related to growing the technological capabilities of the organization, and everything else that doesn’t fit elsewhere. Open questions: does this set the Platform Team up for success? Would I structure my ideal organization to include a platform team? ↩︎
I am a huge proponent of code reuse, but one place where it needs to applied with much more caution is in the representation of objects in APIs. Some team initially creates an object, and defines a method showing how to represent it as jsonizable dict. Because that method exists, now every team building an API that returns that object is tempted to reuse that same specific representation. In the best-case scenario, that means that all clients can share parsers for that same object in multiple contexts. However, in the worst case scenario, you get bulky, unnatural APIs that just return a bag of objects, their graphs and their attributes, with no thought for the suitability in the API context. ↩︎