From 6aefc477d74babfc093a5721510af9a0ec12a790 Mon Sep 17 00:00:00 2001
From: Mike Kruskal
Date: Fri, 10 Feb 2023 09:30:14 -0800
Subject: [PATCH] Add documentation for our new GHA infrastructure

PiperOrigin-RevId: 508680514
---
 .github/README.md | 196 ++++++++++++++++++++++++++++++++++++++++++++++
 ci/README.md      |  17 ++++
 2 files changed, 213 insertions(+)
 create mode 100644 .github/README.md
 create mode 100644 ci/README.md

diff --git a/.github/README.md b/.github/README.md
new file mode 100644
index 0000000000..b1c780323c
--- /dev/null
+++ b/.github/README.md
@@ -0,0 +1,196 @@

This directory contains all of our automatically triggered workflows.

# Test runner

Our top-level `test_runner.yml` is responsible for kicking off all tests, which
are represented as reusable workflows. It is carefully constructed to satisfy
the design laid out in go/protobuf-gha-protected-resources (see below), since
duplicating that logic across every workflow file would be difficult to
maintain. As an added bonus, we can manually dispatch our full test suite with
a single button and monitor the progress of all jobs simultaneously in GitHub's
Actions UI.

There are five ways our test suite can be triggered:

- **Post-submit tests** (`push`): These are run over newly submitted code that
we can assume has been thoroughly reviewed. There are no additional security
concerns here, so these jobs can be given highly privileged access to our
internal resources and caches.

- **Pre-submit tests from a branch** (`pull_request`): These are run over every
PR as changes are made. Since they come from branches in our repository, they
have secret access by default and can also be given highly privileged access.
However, we expect *many* of these events per change, and likely many from
abandoned or exploratory changes. Given the much higher frequency, we restrict
the ability to *write* to our more expensive caches.

- **Pre-submit tests from a fork** (`pull_request_target`): These are run over
every PR from a forked repository as changes are made. They have much more
restricted access, since they could come from anywhere. To protect our secret
keys and our resources, tests will not run until a commit has been labeled
`safe to submit`, and further commits require further approvals before the test
suite runs again. Once marked as safe, we provide read-only access to our
caches and Docker images, but generally disallow any writes to shared
resources.

- **Continuous tests** (`schedule`): These are run on a fixed schedule,
currently daily. They help identify non-hermetic issues in tests that don't run
often (for example, due to test caching) and catch breakages during slow
periods like weekends and holidays. Like post-submit tests, they run over
already submitted code and are highly privileged in the resources they can use.

- **Manual testing** (`workflow_dispatch`): Our test runner can be triggered
manually over any branch. These runs are treated similarly to pre-submit tests
from a branch, but can be given highly privileged access because only the
protobuf team can trigger them.

# Staleness handling

While Bazel handles code generation seamlessly, we also support build systems
that don't. This leaves a handful of cases where we need to check in generated
files, which can become stale over time. To provide a good developer
experience, we've implemented a system to make this more manageable.

- Stale files should have a corresponding `staleness_test` Bazel target. These
targets should be marked `manual` to avoid getting picked up in normal CI runs,
but they will fail if the files become stale. They also provide a `--fix` flag
to update the stale files (see the example below).

- Bazel tests never depend on the checked-in versions; they generate fresh
copies on the fly during the build.

- Non-Bazel tests always regenerate the necessary files before starting. This
is done using our `bash` and `docker` actions, which should be used for any
non-Bazel tests. This way, no test fails due to stale files.

- A post-submit job immediately regenerates any stale files and commits them if
they've changed.

- A scheduled job runs late at night every day to make sure the post-submit is
working as expected (that is, it runs all the staleness tests).

The `regenerate_stale_files.sh` script is the central script responsible for
regenerating all of these stale files.
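For illustration, running one of these targets typically looks like the sketch
below. The target label here is hypothetical; the real `staleness_test`
targets have their own names.

```bash
# Hypothetical target label, shown only to illustrate the usage pattern.

# Check whether the checked-in files are stale (the target is tagged "manual",
# so it only runs when requested explicitly):
bazel test //src:example_staleness_test

# Regenerate the checked-in files in place:
bazel run //src:example_staleness_test -- --fix
```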
# Caches

We have a number of different caching strategies to help speed up tests. These
live either in GCP buckets or in our GitHub repository cache. The former has a
lot of resources available, so we don't have to worry as much about bloat. The
GitHub repository cache, on the other hand, is limited to 10GB and starts
pruning old caches once it exceeds that threshold. Therefore, we need to be
very careful about the size and quantity of our caches in order to maximize
their benefit.

## Bazel remote cache

As described in https://bazel.build/remote/caching, remote caching allows us to
offload a lot of our build steps to a remote server that holds a cache of
previous builds. We use our GCP project for this storage and configure *every*
Bazel call to use it. This provides substantial performance improvements at
minimal cost.

We do not allow forked PRs to upload updates to our Bazel caches, but they do
read from them. Every other event is given read/write access to the caches.
Because Bazel behaves poorly under certain environment changes (such as a
different toolchain or operating system), we use fine-grained caches, and each
job should typically have its own cache to avoid cross-contamination.

## Bazel repository cache

When Bazel starts up, it downloads all the external dependencies for a given
build and stores them in the repository cache. This cache is *separate* from
the remote cache and only exists locally. Because we have so many Bazel
dependencies, this can be a source of frequent flakes due to network issues.

To avoid this, we keep a cached version of the repository cache in GitHub's
action cache. Our full set of repository dependencies ends up being ~300MB,
which is fairly expensive given our 10GB maximum. The most expensive entries
seem to come from Java, which has some very large downstream dependencies.

Given the cost, we take a more conservative approach for this cache. Only push
events will ever write to it, but all events can read from it. Additionally, we
only store three caches for any given commit, one per platform. This means that
multiple jobs try to update the same cache, leading to a race. GitHub rejects
all but one of these updates, so we designed the system so that caches are only
updated if they've actually changed. That way, over time (and multiple pushes)
the repository caches will incrementally grow to encompass all of our
dependencies. A scheduled job runs monthly to clear these caches and prevent
unbounded growth as our dependencies evolve.
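As a rough sketch of how these two caches come together on a Bazel command
line: the flags below are standard Bazel options, but the bucket name and local
path are placeholders, not our actual configuration.

```bash
# Placeholder values -- the real bucket and paths are configured by our CI
# workflows and are not shown here.

# Privileged events: read from and write to the remote cache, and reuse a
# locally restored repository cache.
bazel test //... \
  --remote_cache=https://storage.googleapis.com/example-cache-bucket \
  --google_default_credentials \
  --repository_cache=/path/to/restored/repository-cache

# Forked PRs: still read from the remote cache, but never upload new results.
bazel test //... \
  --remote_cache=https://storage.googleapis.com/example-cache-bucket \
  --noremote_upload_local_results
```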
## ccache

In order to bring non-Bazel builds up to par with Bazel, we make use of
[ccache](https://ccache.dev/). It intercepts all calls to the compiler and
caches the results; subsequent calls that hit the cache short-circuit and
return the already computed result. This has minimal effect on any *single*
job, since we typically only run a single build. However, by storing the ccache
results in GitHub's action cache, we can substantially decrease the build time
of subsequent runs.

One useful feature of ccache is that you can set a maximum cache size, and it
will automatically prune older results to stay below that limit. On Linux and
macOS CMake builds, we generally end up with ~30MB caches and set a 100MB
limit. On Windows, with debug symbol stripping, we end up with ~70MB and set a
200MB limit.

Because CMake builds tend to be our slowest, bottlenecking the entire CI
process, we use a fairly expensive strategy with ccache. All events cache their
ccache directory, keyed by the commit and the branch, which means that each PR
and each branch writes its own set of caches. When initially looking up which
cache to use, each job first looks for a recent cache on its current branch. If
it can't find one, it accepts a cache from the base branch (for example, PRs
initially use the latest cache from their target branch).

While these ccache entries quickly overrun our GitHub action cache, they also
quickly become stale. Since GitHub prunes caches based on when they were last
used, this simply means we see faster turnover.
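The moving parts described above boil down to something like the following
sketch; the cache directory and size limit are illustrative, and the real
values live in our ccache action and CI scripts.

```bash
# Illustrative values only -- the actual configuration lives in our ccache
# action and CI scripts.

# Keep the cache somewhere we can save/restore via GitHub's action cache, and
# cap its size so ccache prunes old entries automatically.
export CCACHE_DIR="$GITHUB_WORKSPACE/.ccache"
ccache --max-size=100M

# Route all compiler invocations through ccache for a CMake build.
cmake . \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
cmake --build .

# Optionally report hit/miss statistics at the end of the job.
ccache --show-stats
```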
## Bazelisk

Bazelisk automatically downloads a pinned version of Bazel on first use. This
can lead to flakes, so we cache the result, keyed on the Bazel version. Only
push events write to this cache, but it's unlikely to change very often.

## Docker images

Instead of downloading a fresh Docker image for every test run, we can save it
as a tarball with `docker image save` and later restore it with
`docker image load`. This can decrease download times and also reduce flakes.
Note that Docker's load can actually be significantly slower than a pull in
certain situations, so we reserve this strategy for Docker images that are
causing noticeable flakes.

## Pip dependencies

The `actions/setup-python` action we use for Python supports automated caching
of pip dependencies. We enable this to avoid downloading these dependencies on
every run, which can lead to flakes.

# Custom actions

We've defined a number of custom actions to abstract out shared pieces of our
workflows.

- **Bazel**: Use this for running all Bazel tests. It can take either a single
Bazel command or a more general bash command. In the latter case, it provides
environment variables for running Bazel with all of our standardized settings.

- **Bazel-Docker**: Nearly identical to the **Bazel** action, but additionally
runs everything in a specified Docker image.

- **Bash**: Use this for running non-Bazel tests. It takes a bash command and
runs it verbatim. It also handles the regeneration of stale files (which does
use Bazel), which non-Bazel tests might depend on.

- **Docker**: Nearly identical to the **Bash** action, but additionally runs
everything in a specified Docker image.

- **ccache**: Sets up a ccache environment and initializes some environment
variables for standardized usage of ccache.

- **Cross-compile protoc**: Abstracts out the compilation of protoc using our
cross-compilation infrastructure. It sets a `PROTOC` environment variable that
gets automatically picked up by a lot of our infrastructure. This is most
useful in conjunction with the **Bash** action for non-Bazel tests.

diff --git a/ci/README.md b/ci/README.md
new file mode 100644
index 0000000000..01c8373145
--- /dev/null
+++ b/ci/README.md
@@ -0,0 +1,17 @@

This directory contains CI-specific tooling.

# Clang wrappers

CMake allows compiler wrappers such as ccache, which intercepts compiler calls
and short-circuits on cache hits, to be injected by specifying
`CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER` during CMake's
configure step. Unfortunately, Xcode doesn't provide anything like this, so we
use basic wrapper scripts that invoke ccache + clang.

# Bazelrc files

In order to allow platform-specific `.bazelrc` flags during testing, we keep
three different versions here, along with a shared `common.bazelrc` that they
all include. Our GHA infrastructure selects the appropriate file for any test
and overwrites the default `.bazelrc` in our workspace, which is intended for
development only.
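The selection step amounts to copying one of these files over the workspace
default. A hypothetical sketch is below; the per-platform file names are
assumptions, not necessarily the real ones.

```bash
# Hypothetical sketch of the selection step; the per-platform file names are
# assumptions. RUNNER_OS is provided by GitHub Actions.
case "$RUNNER_OS" in
  Linux)   cp ci/linux.bazelrc .bazelrc ;;
  macOS)   cp ci/macos.bazelrc .bazelrc ;;
  Windows) cp ci/windows.bazelrc .bazelrc ;;
esac
```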