I'm writing this up because on Zulip, simulacrum suggested nobody had done this before. If this is already known, I hope I don't seem patronizing by writing it up.
There were a number of regressions related to a recent LLVM version bump which resulted in rampant resource utilization for a number of crates. To a non-expert this is confusing: shouldn't crater have detected that? Unfortunately, it looks like crater has a serious problem with OOMs in general: #564 #562 #544 #516 #490 #484. I'm hoping that by digging through a lot of the available logs I can make some suggestions, maybe put in a PR (though I've never worked in this codebase), and generally improve the quality of crater runs.
I did some very basic poking around the 1.54, 1.55, 1.56, 1.57, and 1.58 runs, only considering published crates. The sorts of failures we see look basically the same from version to version, so I focused on just 1.58, since I'm much more concerned about systematic behavior among the build failures and what can be done about it.
The 1.58 run has 14,921 build-fail/regressed crates. Of those...
- 7,545 have some sort of rustc error with an error code, which could conceivably be parsed (see the sketch below this list)
- 2,560 have some kind of custom build script error. Mostly these are failed attempts to locate or build a C/C++ dependency
- 1,409 blindly try to link against some library that doesn't exist on the system
- 389 contain `#![experimental]`
- 370 encounter some other kind of OOM, most often in the linker, then in compiling a C/C++ dependency
- 193 try to `include_bytes!`/`include_str!` some file that's not in the repo
- 162 time out with no output
- 114 experience some other kind of linker error (one crate even tries to build itself with asan and fails)
- 24 have a truncated log
There are also a lot that I didn't categorize, such as attempts to compile with macOS frameworks, use of `llvm_asm!`, a missing `eh_personality`, and a lot of crates that require the user to turn on a non-default feature to build.
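For that first bucket, "parsed" could mean something as simple as pulling the `error[E0xxx]`-style codes out of the logs and bucketing failures by them. Here's a minimal sketch (assuming the `regex` crate; this is not any actual crater code):

```rust
use std::collections::HashMap;

use regex::Regex;

/// Count the rustc error codes (e.g. E0432, E0658) that appear in a build log.
/// Purely illustrative; crater's real log handling is presumably different.
fn error_code_histogram(log: &str) -> HashMap<String, usize> {
    let re = Regex::new(r"error\[(E\d{4})\]").unwrap();
    let mut counts = HashMap::new();
    for caps in re.captures_iter(log) {
        *counts.entry(caps[1].to_string()).or_insert(0) += 1;
    }
    counts
}
```

Even a crude histogram like that would make it obvious when thousands of crates suddenly start failing with the same error code after a toolchain bump.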
My biggest concern with the current setup is that the number of CPUs a build is spawned with is sporadic, and this alone causes a significant number of OOMs. The most hilarious case of this that I've found is `memx`. The author has quite diligently written 35 integration test binaries, which means that on a 64-core machine each integration test has only about 44 MB to work with (the 1.5 GB memory limit divided across 35 parallel jobs). That's enough for rustc, actually, but not for all the `ld` processes. `regex` fails most crater runs for the same reason, but its codebase is much more memory-intensive: with only 4 CPUs, `regex` will OOM building its tests.
This is why the vast, vast majority of spurious-fixed and spurious-regressed crates are regressed to or fixed from an OOM. They OOM when they're randomly assigned to an environment that happens to have too many CPUs, then most likely are assigned to an environment next time that has far fewer.
The build timeouts (`no output for 300 seconds`) are also interesting. Since there are only 162 of them, I tried to reproduce all of them myself. Most of them are not reproducible, but I did find a few true positives lurking in there:
- `savage` and `sdc-parser` push the 1.5 GB limit even with a single job. They probably look like timeouts, but only on account of the memory limit.
- `fungui` could possibly be considered a compiler hang on Rust 1.57, but not on the current nightly. It's not clear to me whether crater could have spotted a compile time regression it otherwise missed if this had been noticed.
- `ilvm` needs 30 minutes to compile on Rust 1.57, the 1.58 beta, and the current nightly. I think it qualifies as a compiler hang; the codebase is pretty small and simple for that long a compilation.
If we only saw 4 build timeouts instead of 162, perhaps they could have been manually inspected on every crater run. So perhaps there's an opportunity here?
Some ideas:
The root problem with all the spurious OOMs is that the peak memory usage of a build scales with the number of CPUs available, but crater doesn't scale the available memory up even as it randomly varies the number of available CPUs by a factor of 10 or more. Setting a job limit on cargo would only be a partial solution, because plenty of build scripts compile C libraries and fan parallelism out to however many CPUs they detect. I think it would be a huge improvement to limit the number of CPUs, or to provide a memory limit that scales up with the number of CPUs (rough sketch below).
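To make that second option concrete, here's a minimal sketch of a memory limit that tracks the CPU count. `SandboxLimits` and both constants are hypothetical placeholders, not crater's actual configuration or API:

```rust
/// Hypothetical sandbox limits. Both constants are placeholders, not tuned
/// values; the point is only that the memory cap grows with parallelism.
struct SandboxLimits {
    cpus: usize,
    memory_bytes: usize,
}

impl SandboxLimits {
    fn for_host(available_cpus: usize) -> Self {
        const BASE_MEMORY: usize = 1536 << 20; // 1.5 GB, the current fixed limit
        const PER_CPU_MEMORY: usize = 256 << 20; // extra headroom per parallel job (made up)

        SandboxLimits {
            cpus: available_cpus,
            memory_bytes: BASE_MEMORY + PER_CPU_MEMORY * available_cpus,
        }
    }
}

fn main() {
    // With 4 CPUs this gives ~2.5 GB; with 64 CPUs, ~17.5 GB.
    let limits = SandboxLimits::for_host(64);
    println!("{} CPUs -> {} MB", limits.cpus, limits.memory_bytes >> 20);
}
```

The right per-CPU figure would need tuning against real runs; the point is only that the cap grows with parallelism instead of staying fixed at 1.5 GB.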
The build timeouts, as well as the things that crater summaries already categorize as `error`, are also quite interesting. Quite a few just look like this:
```
[INFO] fetching crate asfa 0.9.0...
[ERROR] this task or one of its parent failed!
[ERROR] no output for 300 seconds
[ERROR] note: run with `RUST_BACKTRACE=1` to display a backtrace.
```
This crate, at the same version, was `test-pass` in 1.56, `test-fail` in 1.57, and `error` in 1.58. This sort of output smells like a transient network error. Is there a retry mechanism for crate builds? And even if there isn't, it would be good to get a lot more logging around downloads so that we have more hope of diagnosing these. The `error` crates aren't `build-fail` (which is what the title says), but this seems like the same pathology the timeout crates suffer from; almost like a sudden loss of networking.
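If there isn't a retry mechanism, something along these lines would at least separate "the network hiccuped once" from "this crate can't be fetched at all". This is a generic sketch with made-up attempt counts and delays, not crater's or rustwide's actual fetch path:

```rust
use std::{thread, time::Duration};

/// Retry a fallible operation with a linear backoff, logging each failure so
/// that transient network errors at least leave a trace in the build log.
/// Purely illustrative; crater/rustwide's real plumbing will differ.
fn with_retries<T, E: std::fmt::Display>(
    attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last_err = None;
    for attempt in 1..=attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                eprintln!("[WARN] attempt {attempt}/{attempts} failed: {err}");
                last_err = Some(err);
                if attempt < attempts {
                    thread::sleep(Duration::from_secs(10 * attempt as u64));
                }
            }
        }
    }
    Err(last_err.expect("attempts must be at least 1"))
}
```

Even without the retries, the logging alone would make these `error` results much easier to diagnose after the fact.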