Delivering perf reports

A teammate and I are developing a generic scheduling system to replace the built-in Kubernetes CronJob scheduler for triggering thousands of periodic jobs. The built-in scheduler is too slow for our needs – it takes about ten minutes to iterate through the full list of schedules, but some of our jobs need to run more frequently than that.

This post isn’t about schedulers, though. It’s about treating a performance characteristic report as a deliverable. When future developers eventually need to scale the system, it will be valuable to know which scaling knobs can be turned and how. And presenting the report early in the project lifecycle helps spread that understanding to outside stakeholders.

By performance report I mean a quantitative exploration of how the proposed system would behave under varying constraints and loads. For example, we first had to demonstrate that our new scheduler could handle the current workload (otherwise, there’d be no reason to consider it). Then we simulated changing factors outside of our control, like the number of jobs and the duration of those jobs, as well as aspects of the system itself, like the number of worker threads. Aside from describing the service’s response to stress, this exercise directly influenced how we architected the system with an eye toward scalability.
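To give a sense of what that looks like in practice, here is a heavily simplified sketch of such a parameter sweep. It is illustrative only: the job counts, durations, worker counts, and the closed-form lag estimate are placeholders, not our actual simulation code or numbers.

```python
# Illustrative sketch of a performance sweep, not our real simulation.
# All names and numbers below are placeholders.
from dataclasses import dataclass
from itertools import product

@dataclass
class Scenario:
    num_jobs: int             # how many schedules the system must track
    job_duration_s: float     # worker time spent triggering one job
    num_workers: int          # worker threads dispatching triggers
    interval_s: float = 60.0  # how often each job must fire

def worst_case_lag(s: Scenario) -> float:
    """Crude closed-form estimate: if the work demanded per second exceeds
    worker capacity, the dispatch queue (and the lag) grows without bound;
    otherwise, approximate lag as the time to drain one cycle of triggers."""
    demand = s.num_jobs * s.job_duration_s / s.interval_s  # worker-seconds needed per second
    if demand >= s.num_workers:
        return float("inf")  # the scheduler cannot keep up at this load
    return s.num_jobs * s.job_duration_s / s.num_workers

# Sweep the factors we cared about: external ones (job count, job duration)
# and internal ones (worker threads).
for jobs, dur, workers in product([1_000, 10_000, 20_000], [0.05, 0.5], [4, 16, 64]):
    lag = worst_case_lag(Scenario(jobs, dur, workers))
    print(f"jobs={jobs:>6}  dur={dur:>4}s  workers={workers:>3}  ->  lag ~ {lag:.1f}s")
```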

I wouldn’t recommend doing this kind of stress test for every project, though. It might be suitable when the load on the system is expected to grow substantially in the medium term, or when the expected load is stable but something about the proposed system makes its performance profile hard to estimate – say, it’s implemented on a stack the team isn’t familiar with. Conducting scaling experiments can be overkill for systems built on familiar technologies that are expected to have stable, unremarkable load. In our case, the project’s existence was spurred by degrading performance, so it made sense to keep scaling top of mind.

We’re shipping this performance characteristic report right alongside the software. Results are captured and explained in a Jupyter notebook that we demoed at one of our company’s internal weekly demo days. We think that the new scheduling system can support 20x our current load – twice the 10x goal we had for the project. If and when future engineers anticipate the need to scale, this report will be there for them. It will show them approximately when the current implementation’s limits will be reached, and may provide insight into what stop-gap measures can be used before an entirely new solution is needed.
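As a sketch of what that lookup might look like in a notebook cell (the load multiples, lag numbers, and lag budget below are placeholders for illustration, not figures from our report):

```python
# Placeholder data for illustration – not the results from our report.
# Maps "multiple of current load" to the worst-case scheduling lag observed
# in the sweep; the budget is an assumed ceiling on acceptable lag.
lag_by_load_multiple = {1: 0.8, 5: 4.1, 10: 9.7, 20: 22.0, 40: float("inf")}
lag_budget_s = 30.0

# First load multiple at which the lag budget is blown, if any.
limit = next(
    (m for m, lag in sorted(lag_by_load_multiple.items()) if lag > lag_budget_s),
    None,
)
if limit is not None:
    print(f"The current implementation runs out of headroom before {limit}x load")
else:
    print("No limit reached within the swept range")
```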