Linux Kernel Cgroup Fork Contention: Yi Tao's Patch Series Targets a Deep Scalability Bottleneck

By Martin Holloway. Published 9h ago, 6 min read, based on 3 sources.

On September 9, 2025, kernel developer Yi Tao posted the fourth revision of a three-patch series to the Linux kernel mailing list (LKML); its first patch is tagged PATCH v4 1/3. The series targets a well-known but stubborn source of contention in the cgroup subsystem: the collision between cgroup process migration and fork/exec operations under load.

The fix is structural. Rather than tuning around the existing locking design, the patch proposes replacing a global percpu_rwsem in the cgroup core with a more granular mechanism, with the stated goal of reducing lock contention that currently surfaces during both fork() and exec() hot paths.

The Problem: A Global Lock in a High-Throughput World

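The cgroup subsystem uses a percpu_rwsem, a per-CPU read-write semaphore, as a system-wide gate on certain process lifecycle operations. The read side is taken during fork() and exec(); the write side is taken when a process is migrated between cgroups. A percpu_rwsem is optimized for readers: in the common case they touch only per-CPU state, while a writer must wait for every in-flight reader to finish and blocks all new readers until it releases the lock. On systems that spawn processes at high rates (container orchestrators, build farms, parallel test harnesses, CI/CD pipelines) this asymmetry creates a measurable bottleneck: each migration has to wait out the fork()/exec() traffic already in flight, and while it waits for and then holds the write side, every other fork() and exec() on the machine queues behind it, whether or not it involves the cgroups being changed.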
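
The gate behaviour described above can be modelled in a few lines. This is a toy, single-process sketch in Python, not the kernel's percpu_rwsem: the class name GlobalGate and its try_* methods are invented for illustration, and where the real semaphore would put a task to sleep, the model simply returns False.

```python
import threading


class GlobalGate:
    """Toy stand-in for one system-wide reader-writer gate (illustrative only)."""

    def __init__(self):
        self._mutex = threading.Lock()  # protects the two fields below
        self._readers = 0               # number of readers currently inside
        self._writer = False            # True while a writer holds the gate

    def try_read_lock(self):
        with self._mutex:
            if self._writer:
                return False            # a real reader would block here
            self._readers += 1
            return True

    def read_unlock(self):
        with self._mutex:
            self._readers -= 1

    def try_write_lock(self):
        with self._mutex:
            if self._writer or self._readers:
                return False            # a real writer would wait for readers to drain
            self._writer = True
            return True

    def write_unlock(self):
        with self._mutex:
            self._writer = False


gate = GlobalGate()
fork_admitted = gate.try_read_lock()         # a fork() enters the read side
migrate_blocked = not gate.try_write_lock()  # a migration cannot proceed yet
gate.read_unlock()                           # the fork() finishes
migrating = gate.try_write_lock()            # now the migration takes the write side
fork_stalled = not gate.try_read_lock()      # any fork(), anywhere, is held off
gate.write_unlock()
```

The model leaves out one detail of real reader-writer semaphores: new readers are typically held back as soon as a writer starts waiting, not only once it holds the lock.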

The v4 designation on Yi Tao's submission indicates this work has been through at least three prior rounds of review, a standard sign that the problem is technically non-trivial and the solution has been refined iteratively through mailing list feedback.

Why fork() Carries This Cost

To appreciate why this contention matters, it helps to be precise about what fork() actually does at the kernel level. A fork() call causes the kernel to duplicate the parent's page tables, set up a copy-on-write mapping of the address space, and initialize a new task_struct — all before a subsequent exec() tears most of that down to load the new process image. Even with copy-on-write semantics, the page table allocation and teardown are not free, particularly at scale or on processes with large virtual address spaces.

The POSIX standard does offer an alternative: posix_spawn() and posix_spawnp() are designed to create a new child process from a specified process image without requiring a copy of the parent's address space. The intent is to avoid the overhead that fork-then-exec incurs. On platforms with a native kernel implementation, such as certain microkernel or RTOS environments, this delivers a real speedup. On Linux, posix_spawn() is not a system call; glibc implements it in user space. Since glibc 2.24 that implementation issues a clone() with CLONE_VM | CLONE_VFORK followed by exec(), so the child borrows the parent's address space and the page-table copy is avoided. What it cannot avoid is the kernel's common process-creation and exec paths: the child is still created as a new task and still calls exec(), so the cgroup read lock remains on the critical path.
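
From user space, the two launch styles look like this. A minimal sketch in Python, assuming a POSIX system and Python 3.9 or later (for os.waitstatus_to_exitcode); the helper names spawn_and_wait and fork_exec_and_wait are illustrative. Python's os.posix_spawn hands the request to the C library's posix_spawn(), so what happens underneath depends on the libc in use.

```python
import os
import sys


def spawn_and_wait(argv):
    """Launch argv via posix_spawn() and return the child's exit code."""
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


def fork_exec_and_wait(argv):
    """Launch argv via an explicit fork() followed by exec()."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image. If exec fails, exit at once
        # so the child never falls through into the parent's code.
        try:
            os.execv(argv[0], argv)
        finally:
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


# A child that exits with a recognisable status code.
cmd = [sys.executable, "-c", "raise SystemExit(3)"]
spawn_code = spawn_and_wait(cmd)
fork_code = fork_exec_and_wait(cmd)
```

Both helpers report the same exit code for the same command; the difference lies in how the child is created, not in what the caller observes.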

This context matters for Yi Tao's patch because it frames why the percpu_rwsem contention is particularly painful on Linux: the kernel has no dedicated spawn path that bypasses process creation and exec, so every spawned process, however it is launched from user space, takes the lock on its way in.

What the Patch Changes

The patch series replaces the single global percpu_rwsem with a design intended to decouple the concurrency domains of cgroup migration and process creation. The precise replacement mechanism — whether a per-cgroup lock, a finer-grained hierarchy of semaphores, or another primitive — is detailed in the patch itself and subject to maintainer review, which is characteristic of LKML's iterative process. The v4 revision suggests the approach has already survived scrutiny on the broad strokes and is now being refined at the implementation level.
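
To make "more granular" concrete, here is one hypothetical shape such a design could take, sketched in Python. It is not taken from the patch, whose actual mechanism is not described here: the KeyedGate class, its methods, and the choice of key are all assumptions, and the model is single-threaded and non-blocking for brevity.

```python
from collections import defaultdict


class KeyedGate:
    """Hypothetical gate with independent reader/writer state per key."""

    def __init__(self):
        self._readers = defaultdict(int)  # key -> number of active readers
        self._writers = set()             # keys that are currently write-locked

    def try_read_lock(self, key):
        if key in self._writers:
            return False                  # only readers of this key are held off
        self._readers[key] += 1
        return True

    def read_unlock(self, key):
        self._readers[key] -= 1

    def try_write_lock(self, key):
        if key in self._writers or self._readers[key] > 0:
            return False
        self._writers.add(key)
        return True

    def write_unlock(self, key):
        self._writers.discard(key)


gate = KeyedGate()
migrating_a = gate.try_write_lock("group-a")           # a migration touching group-a
fork_in_a_stalled = not gate.try_read_lock("group-a")  # work in group-a must wait
fork_in_b_admitted = gate.try_read_lock("group-b")     # unrelated work proceeds
gate.read_unlock("group-b")
gate.write_unlock("group-a")
```

The point of the sketch is only the scoping: a writer excludes the readers that share its key, instead of every reader on the system.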

The practical target is throughput in environments that combine high process-spawn rates with active cgroup membership changes — exactly the workload profile of a Kubernetes node processing pod churn, or a systemd-managed system under aggressive service cycling. On such systems, the write-side pressure from cgroup migration can cause fork() latency to spike in ways that are difficult to diagnose without kernel-level tracing, because the contention is invisible at the application layer.
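
The latency side of this, at least, can be sampled from user space. A rough sketch in Python, assuming a POSIX system; the function name sample_fork_latency and the sample count are arbitrary choices, and the numbers it produces say nothing about why a given fork() was slow, which is where kernel tracing comes in.

```python
import os
import statistics
import time


def sample_fork_latency(n=50):
    """Return n durations, in microseconds, of os.fork() as seen by the parent."""
    samples = []
    for _ in range(n):
        start = time.perf_counter_ns()
        pid = os.fork()
        if pid == 0:
            os._exit(0)                  # child: exit immediately, skip cleanup handlers
        elapsed_ns = time.perf_counter_ns() - start
        os.waitpid(pid, 0)               # reap the child so zombies do not pile up
        samples.append(elapsed_ns / 1000)
    return samples


samples = sample_fork_latency()
median_us = statistics.median(samples)
worst_us = max(samples)
print(f"fork() latency: median {median_us:.1f} us, worst {worst_us:.1f} us")
```

A wide gap between the median and the worst sample on a busy node is a hint, not a diagnosis, that something on the fork path is intermittently blocking.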

Historical Pattern: Locking Granularity as a Recurring Kernel Theme

We have seen this pattern before. The transition from the Big Kernel Lock (BKL) to finer-grained per-subsystem locking was the defining internal story of the Linux 2.5/2.6 era — a multi-year project that unlocked SMP scalability on server hardware and, indirectly, made Linux viable for the workloads that would eventually define cloud infrastructure. The cgroup percpu_rwsem story is smaller in scope but follows the same arc: a coarse synchronization primitive that made sense at design time becomes a bottleneck as workload intensity scales up, and a motivated developer does the careful work of replacing it with something more surgical.

The difference today is velocity. What took years of BKL removal now happens through structured patch series reviewed asynchronously on LKML, with automated testing infrastructure providing regression signals that simply did not exist in the early 2000s. Yi Tao's v4 submission is already a product of that tightened loop.

Broader Implications for Container and Cloud Workloads

The populations most immediately affected are operators running container runtimes — containerd, CRI-O, and similar — on kernels that have not yet integrated this fix, and the downstream distributions that package those kernels. For them, the current behavior is a known tax on process-spawn throughput that requires either rate-limiting container churn or accepting latency variance on the fork/exec path.

Worth flagging: the patch is at v4 on LKML, not yet merged into Linus Torvalds' tree or any -next branch. The timeline to a stable kernel release depends on maintainer sign-off, merge window timing, and regression testing — a process that routinely takes one to several kernel cycles. Operators looking for relief should watch the cgroup and sched subsystem trees for merge confirmation before planning backport or upgrade timelines.

The posix_spawn Gap

The absence of a native posix_spawn() syscall on Linux is worth noting as a separate, longer-horizon gap. The POSIX interface was standardized specifically to allow implementations to avoid fork-exec overhead, and on Linux that promise is only partly fulfilled: glibc builds posix_spawn() in user space from a vfork-style clone() followed by exec(), which skips the address-space copy but still runs the kernel's full task-creation and exec machinery. This is not a criticism of the current patch work, which is practical and targets a real bottleneck; it is simply context for why the fork/exec path remains under optimization pressure even decades into Linux's lifespan. A true kernel-level posix_spawn() would address a different layer of the same underlying cost.

What to Watch

The near-term signal to track is the response from the cgroup subsystem maintainers; Tejun Heo in particular has historically shaped the direction of cgroup locking architecture. If the v4 series receives Reviewed-by and Acked-by tags from the relevant maintainers, that is a strong indicator of a merge in an upcoming window. Beyond this patch, the broader trajectory is toward finer-grained concurrency primitives throughout the process lifecycle code, driven by the reality that modern cloud infrastructure treats fork() and exec() as throughput operations, not occasional events.

The kernel community's willingness to revisit and refine foundational locking decisions — even in subsystems that have been stable for years — is one of the less-celebrated but consequential reasons Linux continues to scale into workloads its original authors could not have anticipated.