The Empty Case is not Special
Instead of explicitly handling the empty case in functions, try this instead.
by Danver Braganza on 2020-09-07
You may often encounter a pattern in code where a function that needs to iterate
over a collection begins by specifically handling the empty case with an
if-statement. The empty case defines what that function should return if the input
collection has no members, say if
str.join is called with no fragments, or the
sum(). Writing out the branch to specifically handle that empty code is a habit of many programmers.
Here is a trivial example, re-implementing the
sum function for ints, the
describes the pattern. Although I've used Python for the examples here, the
principles should apply to most common programming languages.
def sum(summands): if not summands: return 0 total = 0 for sum in summand: total += sum return total
The real examples you will run into of this pattern are much more complex, since
they might involve database accesses or network calls to retrieve elements, and
rather than straight iteration, the function might be recursive. The reduce
function will also almost alway be more complex than
+. You may also see this
run on multiple collections, for example finding the intersection between two sets:
def intersection(a, b): if not a or not b: return  return set(a).intersection(set(b))
Nevertheless, I want to focus on the first example, because makes it easy to see the part of this that is troublesome.
if not summands: return 0
When you run into this pattern in any codebase of moderate complexity or higher, you'll also find that this pattern is fractal, where one method call at the top level may trigger lots of such branches.
When running this fuction on a Composite
object, many times getting the answer of
a computation at the top level requires performing such a calculation at more
than one level of nesting, and each sub-level may choose to handle the empty
case explicitly. To properly render a
Widget, you may need to query for all the associated
Sprockets, which may or may not have attached
Splines that need reticulating.
Often, if the codebase was worked on mainly by one developer, or by a team which has adopted this practice as their norms, you will encounter special behaviour for the empty case liberally, though not always uniformly, applied to functions.
This creates a weight of code distributed throughout your application, all set up to handle flow through special branch statements. This increases the amount of code that your team needs to read, understand, test and maintain. This also increases the area where bugs, even small silly typos, can hide. Among the hiking community there's a saying that "Ounces are pounds, and pounds are pain". The same holds true for source code, and every line of source code we can safely delete is a line of source code that will maintain itself forever.
Motivations for handling the empty case on its own
There could be an infinite number of reasons why a team might decide to handle the empty case specifically. Here are just three that I want to address as being the ones I have seen most often.
There are definitely functions out there that are undefined for an empty
collection. For example, in Python 3,
statistics.mean will raise a
StatisticsError if called with an empty data list. After accidentally
triggering a few of these cases early on in their career, many a programmer
might decide that software quality involves "always coding defensively".
This was a practice I adopted myself just after I was promoted from junior engineer. With my increased responsibility, I felt an increase in caution. Because I had yet to learn the value of a concise codebase, my fears manifested in code that looked like:
def func(arg1, arg2, arg3...argn): if not isinstance(arg1, ArgOneExpectedType): raise ValueError("Arg1 was of the wrong type") if not re.match(ARG_1_RE, arg1): # Some example check. raise ValueError("arg1 must match EMAIL_RE") # repeat for arg2, arg3...argn # Finally, the real function begins.
Verifying pre-conditions before a function is not always bad. But it's more important if the function is going to be called in a wide variety of places--and if you can't trust those callers to read your documentation correctly. Finally, while adding these kinds of checks, it's always important to balance that against the cost of carrying that code.
Another justification for handling the empty case specifically is that it is more efficient. I am willing to accept that this justification is very often true, although it is not universally.
For example, let us consider a function that takes care to check for the empty case specifically, but in practice never experiences the empty case. On every invocation, that function will fruitlessly check to see if it can return early. We will end up doing a fixed amount of useless work on each invocation, and as such it will be better to remove that check. More generally, the efficiency gained by having a empty check depends not only on the cost of the operation it lets you skip, but also on the frequency that this empty check is likely to occur.
This is why I will often delay writing such a check for function calls, even if they require API calls or database accesses. If you're willing to pay the overheard of a network call for a single row, what makes avoiding it for an empty set worthwhile?
Although it seems like it should be obvious to short-circuit a database lookup for rows by id if the list of ids is empty, you should probably investigate why that is a common search pattern in the first place. Leaving these empty queries in place also surfaces them in the database's query log, where they might help you form a complete picture of your application's performance.
Test Driven Development
Teams that follow methodologies based on Test Driven Development may often end up with code that handles the empty case since the tests for empty input are the first ones written. A common mantra in the TDD world is Red/Green/Refactor, comprising the following steps:
- Red: Write a failing test
- Green: Write code to make the test pass
- Refactor: Remove duplication
The most strict proponents insist that you need to write only one test at a time, and then write the simplest possible code that will make it pass. Following such a practice rigidly will generate code for handling the empty case. However, the refactor step where duplicated code is removed must then be observed as devotedly. Don't forget to check if the code you wrote already handles the empty case gracefully, or could be easily extended to do so.
How to take away the specialness of the empty case
Here is a list of some of the practices that I've adopted when programming. They are helpful in general, and can support you in reducing the code you need to handle the empty case.
Always test for empty input
The importance of testing to the delivery of correct, maintainable software cannot be overstated. Ideally, all software should boast a thorough level of code coverage. This is a minimum necessity before you can delete lines from any useful codebase.
Even for brand new code, covering the empty case with tests is important. It may be true that your function should never be called with empty input in practice—however, it is possible that future changes could invalidate that assertion, or that another programmer might jump through hoops to pass an empty collection to your function.
Even though your goal is to remove the one or two lines of code that handling the empty case represents, you are still responsible for delivering the tests that prove they were unnecessary.
It is okay for your code to fail in these cases, as long as it fails in a way that is known and safe. Consider a function that deletes rows from a database that match one of a list of values. It is okay if passing an empty list of values throws a noisy exception1. It is not okay if doing so cheerfully deletes the entire table.
Know when your underlying algorithms correctly handle the empty case
Many standard library algorithms in most programming languages are already designed to handle the empty case naturally. Many library authors are also on board, and take care of this for you.
This means that your application code can have one less responsibility, and you should try to take advantage of it, when possible.
Design your algorithms to handle the empty case gracefully
Make the Zero Value Useful is a maxim from the Go programming community, which roughly states that you should design your algorithms to take advantage of the uninitialized values of your types. The same approach can be of help when designing your algorithms to gracefully handle the empty case.
If the result of running your algorithm on an empty collection is defined to be different than running your main algorithm on a collection of one or more elements, think about whether that definition is correct, or if it should be changed. An analogy from mathematics would be the concept of Continuity. Intentionally adding discontinuities to your software functions increases the number of cases that you need to handle, and test, and maintain. It is better to try to avoid it.
For a trivial example, you may have a function that when given a list of numbers
returns the string
"[1 2 3]" if the list is non-empty, but is defined, when
given an empty list, to return
rather than""`. A case like
this is strong signal that the definition should be amended.
Accept that computers are fast
After making sure that your base case is covered, and adjusting your functions definitions to be smooth as much as possible, the next thing to consider is performance. There's an oft-repeated statement of Donald Knuth, that states that premature optimization is the root of all evil 2.
This advice follows for the 97% of cases that don't need optimization. Get comfortable with saying, "Computers are Fast". When I was an early programmer, I would often focus specifically on finding the right tree or data structure, believing that always finding the algorithm with the lowest complexity was what identified a competent developer. Now that I am more comfortable with the ability to say, "Computers are Fast", I'm much happier and more productive because of it.
Use an O(n2) algorithm instead of a hashmap? Well, n is small, and computers are fast. Call the same instance method ten times, rather than implementing and testing a bulk method? Computers are fast. Brute-force solution? Computers are fast.
When it comes to handling the empty case, it may be necessary when writing some CPU-bound performance-critical software within tight loops. But for everything else, computers are fast.
Make use of lazy instantiation and connection pools
The main benefit of handling the empty case on its own is that it allows you to skip the O(1) setup costs of your algorithm. If all of that setup cost is in-memory operations, you can usually just accept it and move on.
On the other hand, that setup operation might actually be expensive because it involves operations like authenticating with a remote server, setting up a database connection or opening a file. In these cases a better approach would be to use lazy instatiation or connection pooling.
With lazy instantiation, the expensive part of the setup operation is encapsulated, and only executed if actually accessed. Connection pools pre-declare a set of resources for all your functions to share, amortizing the setup costs over the long run.
You do introduce an overhead by using these strategies, and it seems counter to the maxim that less code is better. However, these strategies often are already implemented for situations where decades of prior work has found them necessary, such as talking to databases. By using these solutions, you keep the extra branches out of the meat of your algorithm, making it easier to understand. You may also greatly improve the performance of your function for all workloads3.
Memoize your functions
Memoization is the process of trading memory space for performance, by caching or storing the results of a previous computation within your program. You create a tiny (or not-so-tiny) look-up table of some frequent inputs of your program, and you keep the results there in case they are needed later. If someone asks for the results to one of those inputs again, you don't need to do the work of computing it, and you can serve up the cached value.
Think of handling the empty case as ad-hoc, in-code memozation for just one input. If you have identified that a function is performance-critical, why handle just one case, when with a little bit of effort, you can handle all the most-often used cases?
My favourite tool for this task in Python is the
@lru_cache decorator in the
functools module. If you are using a different programming language, you may
find a similar library.
Again, your mileage my vary, so before you do this, make sure you measure to gather evidence if it's worth caching many inputs, just one, or none.
Handle it when you need to
Sometimes, you will not be able to avoid handling the empty case. It may be that
the underlying function is undefined for empty collections, like
statistics.mean. Perhaps performance is critical, and your function is called
another one, so that your O(1) is some other function's O(n).
Whatever the situation, you have studied and understood your specific design space, and in that case, treating the empty case as a special case is as valid a tool as any other. The intent of this article is not prohibit handling the empty case in all situations. The intent was to point out that many times, special-casing the empty case is a result of algorithm design choices that can be changed, or premature optimization. By only using it when necessary, you will end up with less and simpler code, which hopefully means more maintainable code, and happier programmers.
Thank you for reading. If you have some comments or feedback, you can share it with me at this article's link on Hacker News
- Assuming that this exception is handled before being shown to a user, or that such behaviour is acceptible in your use case. [return]
- I think the precise formulation of this is:
(premature_optimization ** 2) / .97 = evil;) [return]
- You mileage may vary. When doing any changes strictly for performance, profiling and benchmarking is a necessity. [return]
Other articles you may like
- Standardize the Way You Write Your Range Conditionals By always spelling your range conditionals start <= index < capacity, you make your code more easily grokkable and reduce the chance of errors.
- Connect Four implemented in 3 lines of Python A tiny implementation of Connect 4 on the terminal, with explanation.
- Hangman implemented in 3 lines of Python (2) Just because you can, doesn't mean you shrdlu.