This week we are taking a small break from the Explaining Bazel Cache discussions. No worries, that series is still ongoing. But I want to experiment with running a second blog series in parallel. The hypothesis I want to validate is: by introducing a new blog series before the previous one ends, I could retain more of my regular readers during the transition between the two series.
In that spirit, let’s start a series specifically about tips and tricks related to running Bazel in CI.
Series Introduction
In this blog series, I want to achieve two goals. First, I hope to discuss the reasoning and motivation behind each improvement you could make to CI in general. These decisions should originate from your business needs, which are then translated into a set of goals and targets. The execution of these improvements should be driven by experiments and metrics, as they help you continuously validate your hypotheses and create real business impact.
Second, I want to dive into some common patterns that I have encountered while setting up Bazel in CI. There are various topics that I want to cover, such as:
- Incremental steps in setting up Bazel cache layers in CI
- Flaky-tastic tests and where to find them (Spoiler alert: It's in CI)
- Selecting revision under test to stay at HEAD
- Why / When / How would you go about implementing a cache for your git-pull
- Explaining the implications of different execution isolation technology
- How to evaluate different CI-as-a-service vendors
These topics may or may not be related to Bazel directly. But if you have ever set up Bazel CI at scale, you will almost certainly have encountered them in one way or another. They come from articles and Google Docs that I wrote internally at my old jobs, or for my customers, to help explain different aspects of modern software development practices to an internal audience. So the mysticism around these topics is real: few people understand git, fewer understand how to set up git in CI, and very few know what it takes to provide SLAs/SLOs over git performance.
By rewriting these topics for a public context, I hope to help clear things up for a wider audience.
DevX and Business Value
As a consultant who specializes in the Developer Experience (DevX) field, I have an opinionated mental model of how a dedicated DevX team in an organization should function. That said, no two organizations are alike: they differ in culture due to different upbringings, in technical capability, in talent development plans, and so on. As a consultant, I still try to keep a fresh mental model of how the role of a DevX team should evolve over time along each of these maturity axes. Using this model as the North Star, I then craft a custom solution that caters to each customer’s unique needs.
But there are recurring themes in DevX that are worth highlighting, so that folks who are looking for a quick summary, or an early introduction to the field, can get an overview of what it’s like to build these from the ground up.
Consider a business in the year 2022.
Starting from the customers of the business and their needs, we can trace back to the product features being queued up for development. Speeding up the feature pipeline while keeping shipped features running reliably would have a direct positive impact on the business.
But how do we speed up development?
How do we retain the talent and knowledge behind the shipped features to keep them running reliably?
The answer is engineering productivity or, to use a better-known phrase, Developer Experience. By keeping developers happy and providing them with the right tools for the job, we can increase their productivity. This, in turn, becomes the first push that gets the business flywheel moving.
DevX cannot replace the missing headcount in your hiring pipeline. However, DevX provides a small productivity multiplier that could enable a team of 5 to produce the same value as a team of 6 or 7 would without it. It is the same as giving a sushi chef a set of sharp knives, or providing a carpenter with a capable router: it’s all about giving the builders the right tools for the job, so that they do not have to skin the fish with a spoon or carve into the wood with a broomstick.
Software Development Funnel
There are a lot of components that make up the software development life cycle.
- Closest to the customer is Production (Prod). That’s where your app runs. That’s where your features meet your customers and revenue is generated.
- Further back, we have Deployment or Continuous Deployment (CD), which is the process and methodology of shipping your features to Prod.
- Before CD, we have Artifact releases, where the deployment bundles are kept, scanned for security issues, backed up, and so on.
- Before Artifact releases, we have to Build them from our source code.
- Before Building the release artifacts, we would want to validate whether the source code has met a certain quality standard, and that is where Testing comes in.
- Before Testing, we need to write and keep track of different source code revisions.
- And the data exploration, information gathering process, which we use to guide the code creation process, comes before all that.
You can picture all of this using a familiar mental model from e-commerce called the Conversion Funnel. E-commerce platforms model the process from a customer first opening their website until a sale is generated as a Funnel.
- A lot of customers may arrive at the top of the Funnel, but only a portion of them would add items to their cart.
- Among the customers with items in their cart, only a subset would proceed to the payment and shipping page.
- And even at the final payment step, many may still cancel instead of proceeding with the sale.
The Conversion Funnel represents the amount of friction that customers experience on the way to a sale on an e-commerce website. By building a shorter checkout experience, or a better search engine, you enable a better customer experience and thus shrink the funnel depth. By enabling your customers to navigate from search through checkout faster, you decrease the funnel friction, increase the Conversion Rate and generate more sales.
Similarly, here is what a Software Development Funnel would look like.
The goal of a dedicated DevX team in an organization is to track and groom this funnel, decreasing friction and removing obstacles in the software development lifecycle. If engineers can write code more easily, more tests will be run. If engineers can run tests faster, there will be more builds. If builds happen more frequently, there will be more artifacts and the business can deploy more often. This enables faster bug fixes. Customers get to experience newer features sooner, which, in turn, creates more revenue.
So given such a funnel, where would you start when you have to build a DevX team?
It depends on the business' interests.
The first step I often recommend to customers is to implement a measurement system to understand the current state of the company. Such a metric system does not need to be highly detailed or high-resolution to begin with. But it should function as a guide that highlights where within the funnel the “conversion rate”, or “feature development velocity”, is the slowest. At minimum, the collected data should help you identify the top 3 sections of the funnel that would yield the highest return on the engineering hours you invest in improving them next.
These choices always differ between organizations and should be tailored to be cost-effective. Each portion of the funnel can be improved using different solutions. Each solution comes with a different set of implications: some are as simple as spending X more dollars per engineer, some are as complicated as maturing your talent development plan to target a specific section of the engineering job market.
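To make the measurement step a bit more concrete, here is a minimal sketch of a coarse funnel metric. Everything in it is an assumption made for illustration: the stage names, the numbers, the `StageMetrics` type and the choice of "engineer-hours spent per week" as a proxy for where to look first. Real return on investment also depends on how expensive each stage is to improve, which this sketch does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageMetrics:
    """Coarse weekly measurements for one funnel stage (all values hypothetical)."""
    name: str
    runs_per_week: int       # how often engineers go through this stage
    median_minutes: float    # median wall-clock time of one run
    failure_rate: float      # fraction of runs that fail and are retried (< 1.0)

    def engineer_hours_per_week(self) -> float:
        # Expected attempts per successful run is 1 / (1 - failure_rate),
        # assuming every retry fails independently at the same rate.
        attempts = 1.0 / (1.0 - self.failure_rate)
        return self.runs_per_week * attempts * self.median_minutes / 60.0


def top_segments(stages, n=3):
    """Return the n stages with the most engineer-hours spent per week."""
    return sorted(stages, key=lambda s: s.engineer_hours_per_week(), reverse=True)[:n]


# Made-up sample data, only to show the shape of the calculation.
stages = [
    StageMetrics("code review", 400, 30.0, 0.0),
    StageMetrics("test", 2000, 12.0, 0.2),
    StageMetrics("build", 1500, 8.0, 0.05),
    StageMetrics("artifact release", 100, 15.0, 0.1),
    StageMetrics("deploy", 50, 45.0, 0.1),
]

for stage in top_segments(stages):
    print(f"{stage.name}: {stage.engineer_hours_per_week():.0f} engineer-hours/week")
```

With this made-up data, test, build and code review come out on top. The point is not the exact formula but that even a low-resolution model like this is enough to start ranking funnel sections against each other.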
A case for CI and Bazel
Right, but we are here to talk about CI. So in which situation does an org pick CI?
Continuous Integration (CI), in my experience, is the part of the funnel that often presents itself as the one that:
- Is easier to measure
- Has the most troubles
- Has several known solutions that could solve those troubles
- Has solutions that often come at a cheaper price
For these reasons, for many of the businesses I have interacted with, after the initial measurements, CI seems to always land among the top funnel segments that are up for improvement.
Think about it: if you are an organization that values the engineering hours spent running tests, compiling software and building container images, there is only a small set of environments where running all of them is possible:
- Local Development environment
- Remote Development environment
- CI environment
If you rank these in terms of how much control the engineering users have versus how much control the DevX team has, you would get a matrix like this:
If you are building a DevX team that sets out to solve the Build and Test sections of the funnel, and you want to put your resources into what brings the most impact and the highest value, you would want to pick CI among these 3. With the tight budget of a small and new team, picking CI, where you have the most control, lets you iterate faster and gives you the highest chance of success. In other words, improving the CI environment comes with lower risks, easier wins and thus higher returns on investment. Once you have successfully improved CI and have proven to the organization that DevX is worth investing in, you can gain more resources to expand your efforts toward other environments, or other sections of this funnel.
And why use Bazel for CI?
Right, this is a Bazel blog after all.
First of all, let me just come out and say it: Bazel is most likely NOT the right solution for your startup, if you are building one.
As a solution for CI, Bazel comes with a VERY expensive price tag:
- You need to have a recruitment plan to hire Bazel experts
- You need to have a talent development plan to train your existing employees on Bazel
- You need special infrastructure to run Bazel at scale
- You will need to build custom software to support Bazel
So... don’t use Bazel?
Yes... and no.
The thing about Bazel is that it’s a specialized tool that helps you solve problems you would only encounter once your organization has hit a certain ceiling. Once you have hit one of these ceilings, there is no better solution than Bazel.
Example 1: Selective testing.
When your tech stack has reached a certain scale, it’s really hard to devise a comprehensive test plan for a given set of changes. Some organizations have to test ‘everything’ all at once, which, in some cases, costs more compute power than running their production servers. Some organizations decide to test only a ‘subset’ around the self-contained change and expose themselves to the risk of things breaking at a later stage. This seems to be ok at first, until you realize that it pushes the cost of the fix to a later section of the funnel, i.e. deployment or production, where the cost is multiplied 10-20 times compared to fixing it during build and test. Bazel solves this by providing a mature way to declaratively define dependencies between the components in your system. Using such a dependency map, you can figure out a precise and cost-effective way to do selective testing across the stack.
But build-tool-x also provides a dependency map
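As a rough illustration of how a dependency map enables selective testing, the sketch below walks a toy dependency graph backwards from a changed target to find the tests that could be affected. This is not how Bazel implements it internally. The target labels, the `DEPS` map, the `affected_tests` helper and the `_test` suffix convention are all made up for this example.

```python
from collections import defaultdict, deque

# Hypothetical dependency map: each target lists the targets it depends on.
# The labels mimic Bazel's style but are invented for this example.
DEPS = {
    "//lib/auth:auth": [],
    "//lib/db:db": [],
    "//svc/users:users": ["//lib/auth:auth", "//lib/db:db"],
    "//svc/billing:billing": ["//lib/db:db"],
    "//svc/users:users_test": ["//svc/users:users"],
    "//svc/billing:billing_test": ["//svc/billing:billing"],
    "//lib/auth:auth_test": ["//lib/auth:auth"],
}


def affected_tests(deps, changed):
    """Return every test target that transitively depends on a changed target."""
    # Invert the edges so we can walk from a target to its dependents.
    dependents = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            dependents[dep].add(target)

    # Breadth-first walk over the reverse dependencies.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    # Assumption: test targets are recognizable by a "_test" suffix.
    return sorted(t for t in seen if t.endswith("_test"))


print(affected_tests(DEPS, ["//lib/auth:auth"]))
```

A change to the auth library selects only the auth and users tests and leaves the billing test alone. In a real Bazel workspace, a similar question can be asked with `bazel query` using its `rdeps` and `kind` functions, though the exact expression depends on how your workspace is laid out.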
That brings me to example 2: Support for Multi-tenancy.
As your company matures, the Python monolith is now slowly adding in some ReactJS for faster frontend iteration. Then you decide that some components are a bit too slow, so let’s rewrite the critical things as Go microservices. The finance guys are now asking for Java Spring Boot to be included, while the data guys are doing some experimentation with Scala. Those guys in Infra are asking to replace the load balancer with a new solution written in Rust. This fragmentation puts a heavy burden on a centralized DevX team to support.
Bazel’s rules ecosystem provides a platform, a common language, for a centralized DevX team to support the different tenants of the funnel. With Bazel Starlark rules acting as the contract interface, the DevX team has a much easier time defining its boundary of support toward its customers. It also lets different tenants define their own support boundaries with each other via custom rules and custom API visibility.
Equipped with a mature toolchain / platform model, Bazel is the best build tool for cross compilation from one OS to another. It also provides the best support for static compilation: linking a binary against a different sysroot is a lot better with hermetic toolchains and sandboxed execution. If you've ever been through the pain of fixing a CGo code base that works differently on your laptop vs production servers, or the pain of discovering that your Python app actually depended on a system C library that was only available on an older version of CentOS, you would have wanted to use Bazel.
So... use Bazel?
Bazel’s cost of adoption is high. Thanks to big organizations such as Google, Apple, Twitter, VMware, Alibaba, Tencent, Adobe Cloud, Uber, Lyft, SpaceX, Spotify, Pinterest, Tinder, Reddit... who have adopted the technology and are contributing improvements back upstream, this cost is getting lower at a rapid rate. But by no means is it something suitable for a startup of 5 engineers to start using.
What’s more interesting is that the cost of migrating from an existing tech stack to Bazel is also getting lower. There is more open source tooling available. There is more documentation, and there are more tech talks and blogs about migrating to Bazel now than ever. There are professional services that would help you run the Bazel infrastructure, write custom build rules to cater to your business’ needs and help train your employees on Bazel’s arcane magic runes.
This means that the choice of adopting Bazel should be, and now can be, a pragmatic one. Don’t blindly pick the technology just for the hype, or because of Fear Of Missing Out. Gain insights into your business’ funnel, your development funnel. Identify bottlenecks, develop hypotheses on solutions and if Bazel fits your puzzle, then give it a try.
In future posts in this series, I will start to explore some of the more technical concepts around running Bazel in CI and link these solutions to their associated costs as well as their business values. Before that, I hope that this post could help explain the methodology and mental models that I would apply during the evaluation of those solutions.