Library Design Gotchas: Configuration loading

Posted on Nov 14, 2019

A friend was complaining about a library they were trying to use that was failing to load its configuration from a file. The resulting dive into the code inspired this post about inappropriate choices made when designing how a library is configured.

It isn’t my intention to pick on pyart. I appreciate the hard work the developers did to create it and open source it. It is just the example at hand.

In the instance above, simply executing this line within the production environment at my friend’s job was failing:

import pyart

and the backtrace pointed to the configuration file loader. The library tries to load its configuration from a file. Unfortunately, in the production environment, the Python code is bundled into a custom archive. Their environment adds additional module loaders to the Python import machinery so that simple import foo statements can import from the archive. pyart is directly using SourceFileLoader, which is not aware of this. It tries to treat the archive as a directory and fails. Reading more of that code, pyart is violating several conventions that make configuring the library difficult.
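To make the failure mode concrete, here is a minimal sketch of reading a bundled data file by asking the package’s own loader for it instead of assuming a directory on disk. It is not pyart’s code. A plain zip file stands in for the custom archive, and the names demo_bundled_pkg, default_config.txt and bundle.zip are made up for illustration.

```python
import importlib
import pkgutil
import sys
import tempfile
import zipfile
from pathlib import Path


def build_demo_archive(directory: Path) -> Path:
    """Create a zip archive holding a tiny package plus a data file (hypothetical names)."""
    archive = directory / "bundle.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("demo_bundled_pkg/__init__.py", "")
        zf.writestr("demo_bundled_pkg/default_config.txt", "verbose=false\n")
    return archive


def read_bundled_default() -> bytes:
    """Import a package from an archive and read its data file through the loader."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = build_demo_archive(Path(tmp))
        sys.path.insert(0, str(archive))
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_bundled_pkg")
            # pkgutil.get_data delegates to whichever loader imported the
            # package, so it works for directories and zip archives alike.
            return pkgutil.get_data("demo_bundled_pkg", "default_config.txt")
        finally:
            # Undo the changes made to the interpreter's import state.
            sys.path.remove(str(archive))
            sys.modules.pop("demo_bundled_pkg", None)
```

Code that builds a filesystem path next to __file__ and opens it directly would fail here, because there is no such directory. A custom loader would need to support get_data for this approach to work, which is an assumption about the environment.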

Keep configuration within the API

By its nature, a library cannot control the execution environment users will use it in. This means the sole method of configuring the library must not depend on the execution environment. In this case it is loading a file and retrieving constants from it. In another case it might be reading values from environment variables. These methods constrain the places your library can be used.

At the lowest level a library should offer a configuration option that is entirely within the bounds of the language it is written in, as that is the only constraint to which the user of your library is also bound. In Python this can mean letting a user pass a Configuration object to the library initialization routine. In C, this may involve a struct. This allows the user to retrieve configuration from wherever they want, create the configuration object, then start the library. They might be retrieving one setting from the user’s location, another from the current temperature, and a third from a radio transmission! They might be running code on embedded hardware that has no filesystem! This will still work because clearly they have enough infrastructure to get the language runtime executing.

If you’d like to expose a default configuration, it is trivial to export a DefaultConfiguration object or constant that creates a Configuration with the right values.
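As a rough sketch of what this could look like in Python, assuming a made-up library with three settings (the names Configuration, DEFAULT_CONFIGURATION, Library, units, cache_size and verbose are all illustrative, not taken from any real package):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Configuration:
    """Every setting the (hypothetical) library understands, as plain data."""
    units: str = "metric"
    cache_size: int = 128
    verbose: bool = False


# Exported default, so users who do not care can simply pass it along.
DEFAULT_CONFIGURATION = Configuration()


class Library:
    """Stand-in for the library's entry point; it only sees a Configuration."""

    def __init__(self, config: Configuration = DEFAULT_CONFIGURATION):
        self.config = config

    def describe(self) -> str:
        return f"units={self.config.units} cache_size={self.config.cache_size}"


# The caller decides where values come from: a file, a database, a sensor.
custom = replace(DEFAULT_CONFIGURATION, cache_size=16)
lib = Library(custom)
```

Nothing in this sketch touches the filesystem or the environment, so it runs anywhere the interpreter does.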

Configuration should not be global state

When the user passes some configuration to your library, avoid modifying global library state with it. There are rare situations when an application needs to store configuration in global state, but I’ve yet to encounter a situation where a library has to do this.

Global state is bad for a bunch of other reasons too. My top two: forced singletons and difficulty testing.

If an application wants to use the same library in multiple places in the code base, global state makes it impossible to use different configurations at those two points.

Testing also becomes annoying. Any tests the user of your library writes can no longer safely be run in parallel. If they want to speed up test execution, they now have to jump through a lot of hoops. If the user is not aware of internal global state, they will encounter hard-to-reproduce race conditions that will leave them scratching their heads.
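A small sketch of the alternative, assuming a hypothetical Formatter class that keeps its configuration on the instance. Two differently configured instances run side by side in threads without interfering, which is what a parallel test run needs:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    separator: str = ","


class Formatter:
    """Hypothetical library object; its configuration lives on the instance."""

    def __init__(self, config: Configuration):
        self._config = config

    def join(self, items):
        return self._config.separator.join(items)


def render(separator: str) -> str:
    # Each call builds its own Formatter, so no state is shared between threads.
    return Formatter(Configuration(separator=separator)).join(["a", "b", "c"])


with ThreadPoolExecutor(max_workers=2) as pool:
    # map() returns results in the same order as the inputs.
    results = list(pool.map(render, [",", "|"]))
```

Had the separator been stored in a module-level variable, the two calls would have raced to set it.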

Provide helpers

Once your basic configuration is moved into a data structure, it is definitely nice to provide utility functions that can read a configuration from a byte stream or a file. This is particularly useful if your configuration is complicated, or you need some validation beyond just “this is a string, this is a bool”. When the user is in a circumstance where they can use these helpers, they will be glad you provided them. Just don’t let it be the only way to configure the library. Don’t give in to the temptation to mix I/O and parsing!
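One way to keep I/O and parsing apart, sketched with an invented JSON format and invented function names (parse_configuration, read_configuration, read_configuration_file). The parser takes bytes and nothing else, and the I/O helpers are thin wrappers around it:

```python
import io
import json
from dataclasses import dataclass
from typing import BinaryIO


@dataclass(frozen=True)
class Configuration:
    name: str
    verbose: bool = False


def parse_configuration(raw: bytes) -> Configuration:
    """Pure parsing and validation: bytes in, Configuration out, no I/O."""
    data = json.loads(raw.decode("utf-8"))
    if not isinstance(data, dict):
        raise ValueError("configuration must be a JSON object")
    name = data.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("'name' must be a non-empty string")
    verbose = data.get("verbose", False)
    if not isinstance(verbose, bool):
        raise ValueError("'verbose' must be a boolean")
    return Configuration(name=name, verbose=verbose)


def read_configuration(stream: BinaryIO) -> Configuration:
    """Helper for any binary stream: a socket file, BytesIO, an open file."""
    return parse_configuration(stream.read())


def read_configuration_file(path) -> Configuration:
    """Convenience helper for users who do have a filesystem."""
    with open(path, "rb") as stream:
        return read_configuration(stream)


config = read_configuration(io.BytesIO(b'{"name": "demo", "verbose": true}'))
```

A user with no filesystem can still call parse_configuration on bytes they obtained some other way, or skip the helpers entirely and construct Configuration directly.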

Keep imports and initialization separate

I feel the situation is particularly bad in Python-land because arbitrary code execution is allowed during import. Avoid running code during an import. The import exists to declare classes and functions! Initialization must be a separate process. The library user may not want to use your library for several days after their app first starts running!

If pyart had not tried to initialize itself as soon as it was imported, things would have been usable even with the previous two violations. The application could’ve imported pyart along with all its other imports. Then it could read the default_configs.py file using some code that knew how to find the file in a production environment, write it out to a normal directory, modify the environment variable, then initialize pyart. Now the application code has to do all this before it can even import pyart. This makes the application code ugly. It has the import statement somewhere two levels deep in some function.

If you need global state, keep a flag (use threading.Event for thread-safety) that records whether init() or a similar function has been called, and make the other parts of the library API fail if it hasn’t.
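A minimal sketch of such a guard. The names init, greeting, NotInitializedError and the settings dictionary are invented for illustration. Importing this module only defines things, and nothing is initialized until the application asks for it:

```python
import threading

# Set once init() has completed; is_set() is safe to call from any thread.
_initialized = threading.Event()
_settings = {}


class NotInitializedError(RuntimeError):
    """Raised when the API is used before init() has been called."""


def init(settings: dict) -> None:
    """Explicit initialization, called by the application when it is ready."""
    _settings.update(settings)
    # Flip the flag last, so other threads never see half-populated settings.
    _initialized.set()


def _require_init() -> None:
    if not _initialized.is_set():
        raise NotInitializedError("call init() before using this library")


def greeting() -> str:
    """Example API function that refuses to run before init()."""
    _require_init()
    return f"hello, {_settings['name']}"
```

This sketch does not protect against two threads calling init() at the same time. If that matters, wrapping the body of init() in a threading.Lock would be one option.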

You may have a legitimate use case where you’d like users to be able to use parts of your library as an application to do some basic tasks. For example, your library may be able to process a CSV file.

The library should ship with separate scripts that use the library as a library, and ask the user to run those scripts as binaries. In languages like Python, you can even hide the script-like behavior behind a guard:

if __name__ == '__main__':
  do_some_work()
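Fleshing that out a little, here is a sketch of the split for the CSV case, with invented names (count_rows, main). The library function knows nothing about command lines or file paths. The script part is a thin wrapper that the guard above would call, for example with raise SystemExit(main()).

```python
import argparse
import csv
import io
from typing import Iterable, Optional, Sequence


def count_rows(lines: Iterable[str]) -> int:
    """Library function: count CSV records from any iterable of text lines."""
    return sum(1 for _ in csv.reader(lines))


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Script entry point: parse arguments, do the I/O, call the library."""
    parser = argparse.ArgumentParser(description="Count rows in a CSV file.")
    parser.add_argument("path", help="CSV file to read")
    # With argv=None, argparse falls back to sys.argv[1:].
    args = parser.parse_args(argv)
    with open(args.path, newline="") as handle:
        print(count_rows(handle))
    return 0
```

Importing this module has no side effects, so applications can call count_rows directly while command-line users get the same behavior through main.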

Ship it!

Library design is always challenging as authors have to balance usability and configurability. Exposing a nice way to configure the library leads to a pleasant initial experience for the user as they learn to use your library, and also lets them adapt a library to the needs of their application without jumping through hoops. It makes them happy!