Batch stat calls in find -exec by replacing \; with + #340

Open
jeanschmidt wants to merge 2 commits into actions:main from jeanschmidt:find_exec_batching

jeanschmidt commented Apr 22, 2026

Summary

Batch stat calls in find -exec by replacing \; with + in listDirAllCommand(), reducing process spawns from one-per-file to one-per-batch.

Problem

The listDirAllCommand() function in packages/k8s/src/k8s/utils.ts generates a shell command used to list every file (with its size) under a directory. It is invoked during workspace copy verification in both execCpToPod and execCpFromPod — each of which runs the command up to 15 times in a retry loop on both the runner side (local spawn) and the job pod side (K8s exec). That's potentially 60 invocations of this command per job.

The original command:

find . -type f -not -path '*/_runner_hook_responses*' -exec stat -c '%s %n' {} \;

The \; terminator tells find to spawn a separate stat process for every single file it discovers. For a workspace with 10,000 files, that means 10,000 fork+exec cycles — each creating a new process, loading the stat binary, running it on one file, and exiting. The overhead is almost entirely in process creation, not in the actual stat syscall.

This is especially painful inside Kubernetes job pods, where:

  • The runner pod is memory-constrained (512Mi), and thousands of short-lived processes cause RSS spikes and page-cache churn.
  • The K8s exec path has per-call latency from the WebSocket round-trip, amplifying the wall-clock cost.
  • The output is accumulated in a Node.js Writable stream buffer (execCalculateOutputHashSorted), and the drip-feed of one-line-at-a-time from individual stat calls increases GC pressure compared to receiving larger chunks.

Solution

find . -type f -not -path '*/_runner_hook_responses*' -exec stat -c '%s %n' {} +

The + terminator tells find to batch as many filenames as fit into each stat invocation, bounded by the OS argument-length limit (ARG_MAX, typically 2 MiB on Linux; a find implementation may use a smaller internal buffer). For a 10,000-file workspace, this typically results in 1–3 stat processes instead of 10,000, depending on path lengths and that buffer size.

This is a POSIX-standard feature (find -exec {} + has been in POSIX since 2004 / IEEE Std 1003.1-2004) and is supported by every find implementation used in GitHub Actions runner images (GNU findutils, BusyBox find, macOS find).
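
The difference in spawn count can be observed directly. The sketch below is not part of this PR: it swaps stat for a tiny sh command that prints one line per invocation, so counting lines counts processes. The temporary directory, the file count of 200, and the variable names are all illustrative assumptions.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch workspace with 200 empty files.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
i=1
while [ "$i" -le 200 ]; do
  : > "$tmp/file_$i"
  i=$((i + 1))
done

# The -exec command prints exactly one line each time it is started,
# so the line count equals the number of processes find spawned.
per_file=$(find "$tmp" -type f -exec sh -c 'echo spawned' sh {} \; | wc -l | tr -d ' ')
batched=$(find "$tmp" -type f -exec sh -c 'echo spawned' sh {} + | wc -l | tr -d ' ')

echo "spawns with \\; terminator: $per_file"
echo "spawns with + terminator:  $batched"
```

With \; the count equals the number of files; with + it depends on how many paths fit into one argument list, so it should be far smaller for a directory like this one.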

Behavioral equivalence

The output is identical: one %s %n (size + filename) line per file. The only difference is how many filenames are passed per stat invocation. Since the downstream consumer (execCalculateOutputHashSorted / localCalculateOutputHashSorted) splits on newlines, sorts, and hashes — the batching is invisible to the hash comparison logic.
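
A quick way to check the equivalence claim locally is to hash the sorted listing from both variants. This is only a sketch: the scratch workspace, the hash_listing helper, and the use of sha256sum are assumptions for illustration, and the hook's actual hash function may differ. It relies on GNU or BusyBox stat for the -c option.

```shell
#!/bin/sh
set -eu

# Hypothetical workspace, including a directory the filter should exclude.
ws=$(mktemp -d)
trap 'rm -rf "$ws"' EXIT
mkdir -p "$ws/src" "$ws/_runner_hook_responses"
printf 'hello' > "$ws/src/a.txt"
printf 'hi' > "$ws/b.txt"
printf 'ignored' > "$ws/_runner_hook_responses/r.json"

# Illustrative stand-in for the sort-then-hash step (sha256sum is an assumption).
hash_listing() { sort | sha256sum | cut -d' ' -f1; }

before=$(cd "$ws" && find . -type f -not -path '*/_runner_hook_responses*' \
  -exec stat -c '%s %n' {} \; | hash_listing)
after=$(cd "$ws" && find . -type f -not -path '*/_runner_hook_responses*' \
  -exec stat -c '%s %n' {} + | hash_listing)

echo "before: $before"
echo "after:  $after"
```

Both variants emit the same set of lines, so the two digests should match once the lines are sorted.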

What this does NOT change

  • The output format (unchanged — same stat -c '%s %n' format string)
  • The hash calculation (unchanged — lines are sorted before hashing, so ordering differences from batching are irrelevant)
  • The retry/verification logic (unchanged — same 15-attempt loop with 1s delay)
  • The find filter (unchanged — same -type f -not -path exclusion)

Performance impact

| Workspace size | \; (before) | + (after) | Speedup |
|---|---|---|---|
| 100 files | ~100 processes | 1 process | ~10x |
| 1,000 files | ~1,000 processes | 1–2 processes | ~50–100x |
| 10,000 files | ~10,000 processes | 1–3 processes | ~100x+ |

The savings multiply across the retry loop (up to 15 iterations × 2 sides × 2 copy directions = 60 invocations per job in the worst case).
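
The worst-case arithmetic can be spelled out as follows. The figures come from the description above; batches=3 takes the upper end of the 1–3 estimate for a 10,000-file workspace and is an assumption rather than a measurement.

```shell
#!/bin/sh
set -eu

attempts=15      # retry loop iterations
sides=2          # runner side and job pod side
directions=2     # execCpToPod and execCpFromPod
files=10000      # example workspace size
batches=3        # assumed upper bound on stat processes per invocation with +

invocations=$((attempts * sides * directions))
per_file_total=$((invocations * files))
batched_total=$((invocations * batches))

printf 'command invocations per job: %s\n' "$invocations"
printf 'stat processes, one per file: %s\n' "$per_file_total"
printf 'stat processes, batched:      %s\n' "$batched_total"
```

Under these assumptions, the worst case drops from 600,000 stat processes per job to 180.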

Files changed

  • packages/k8s/src/k8s/utils.ts: one-token change in listDirAllCommand(), \\; → +

Test plan

- Replace `\;` with `+` in find -exec for stat

Using `{} +` batches filenames into fewer stat
invocations instead of spawning one process per
file, reducing fork/exec overhead in large dirs.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
Copilot AI review requested due to automatic review settings April 22, 2026 21:06
@jeanschmidt jeanschmidt requested review from a team and nikola-jokic as code owners April 22, 2026 21:06
Copilot AI left a comment

Pull request overview

This PR improves performance of workspace copy verification by batching stat invocations produced by listDirAllCommand() (used when hashing directory contents on both runner and pod sides).

Changes:

  • Replace find -exec ... {} \; with find -exec ... {} + to batch multiple files per stat process.

Comment thread packages/k8s/src/k8s/utils.ts Outdated
Comment thread packages/k8s/src/k8s/utils.ts Outdated
- Add `--` before `{}` in stat command to prevent filenames starting
  with `-` from being interpreted as options
- Add tests for listDirAllCommand covering batched exec, end-of-options
  marker, directory quoting, path exclusion, and file type filtering

Notes:
Without the `--` end-of-options marker, files whose names begin with a
dash (e.g. `-rf`) could be misinterpreted as flags by `stat`, causing
silent failures or incorrect output during workspace file listing.
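
The effect of the end-of-options marker can be reproduced outside find. The sketch below calls stat directly on a file literally named `-rf` in a scratch directory; the directory and variable names are illustrative, and it assumes a stat that supports -c (GNU coreutils or BusyBox).

```shell
#!/bin/sh
set -eu

d=$(mktemp -d)
trap 'rm -rf "$d"' EXIT
printf 'abc' > "$d/-rf"

# Without --, stat tries to parse "-rf" as option letters and rejects it.
without=$(cd "$d" && if stat -c '%s %n' -rf >/dev/null 2>&1; then echo ok; else echo failed; fi)

# With --, everything after the marker is treated as a file operand.
with=$(cd "$d" && stat -c '%s %n' -- -rf)

echo "without --: $without"
echo "with --:    $with"
```

When find starts from `.`, the paths it substitutes for `{}` begin with `./`, so the marker acts mainly as a defensive measure for start paths or names that could begin with a dash.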

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2 participants