I'm writing this up because on Zulip, simulacrum suggested nobody had done this before. If this is already known, I hope I don't seem patronizing by writing it up.
There were a number of regressions related to a recent LLVM version bump which resulted in rampant resource utilization for a number of crates. To a non-expert this is confusing: shouldn't crater have detected that? Unfortunately, it looks like crater has a serious problem with OOMs in general: #564 #562 #544 #516 #490 #484. I'm hoping that by digging through a lot of the available logs I can make some suggestions, maybe put in a PR (though I've never worked in this codebase), and generally improve the quality of crater runs.
I did some very basic poking around the 1.54, 1.55, 1.56, 1.57, and 1.58 runs, only considering published crates. The sorts of failures we see look basically the same from version to version, so I focused on just 1.58, since I'm much more concerned about systematic behavior among the build failures and what can be done about it.
The 1.58 run has 14,921 build-fail/regressed crates. Of those...
- 7,545 have some sort of rustc error with an error code, which could conceivably be parsed (see the sketch below this list)
- 2,560 have some kind of custom build script error. Mostly these are failed attempts to locate or build a C/C++ dependency
- 1,409 blindly try to link against some library that doesn't exist on the system
- 389 contain `#![experimental]`
- 370 encounter some other kind of OOM, most often in the linker, then in compiling a C/C++ dependency
- 193 try to `include_bytes!`/`include_str!` some file that's not in the repo
- 162 time out with no output
- 114 experience some other kind of linker error (one crate even tries to build itself with asan and fails)
- 24 have a truncated log
There are also a lot that I didn't categorize, such as attempts to compile with macOS frameworks, use of `llvm_asm!`, a missing `eh_personality`, and a lot of crates that require the user to turn on a non-default feature to build.
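For that first bucket, "parsed" could mean something as simple as pulling the `error[E0xxx]`-style codes out of the logs and bucketing failures by them. Here's a minimal sketch (assuming the `regex` crate; this is not any actual crater code):

```rust
use std::collections::HashMap;

use regex::Regex;

/// Count the rustc error codes (e.g. E0432, E0658) that appear in a build log.
/// Purely illustrative; crater's real log handling is presumably different.
fn error_code_histogram(log: &str) -> HashMap<String, usize> {
    let re = Regex::new(r"error\[(E\d{4})\]").unwrap();
    let mut counts = HashMap::new();
    for caps in re.captures_iter(log) {
        *counts.entry(caps[1].to_string()).or_insert(0) += 1;
    }
    counts
}
```

Even a crude histogram like that would make it obvious when thousands of crates suddenly start failing with the same error code after a toolchain bump.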
My biggest concern with the current setup is that the number of CPUs a build is spawned with is sporadic, and this alone causes a significant number of OOMs. The most hilarious case of this that I've found is `memx`. The author has quite diligently written 35 integration test binaries, which means that on a 64-core machine each integration test has only about 44 MB to work with (the 1.5 GB memory limit divided across 35 parallel jobs). That's enough for rustc, actually, but not for all the `ld` processes. `regex` fails most crater runs for the same reason, but its codebase is much more memory-intensive: with only 4 CPUs, `regex` will OOM building its tests.
This is why the vast, vast majority of spurious-fixed and spurious-regressed crates are regressed to or fixed from an OOM. They OOM when they're randomly assigned to an environment that happens to have too many CPUs, then most likely are assigned to an environment next time that has far fewer.
The build timeouts (`no output for 300 seconds`) are also interesting. Since there are only 162 of them, I tried to reproduce all of them myself. Most of them are not reproducible, but I did find a few true positives lurking in there:
- `savage` and `sdc-parser` push the 1.5 GB limit even with a single job. They probably look like timeouts, but only on account of the memory limit.
- `fungui` could possibly be considered a compiler hang on Rust 1.57, but not on the current nightly. It's not clear to me whether crater could have spotted a compile time regression it otherwise missed if this had been noticed.
- `ilvm` needs 30 minutes to compile on Rust 1.57, the 1.58 beta, and the current nightly. I think it qualifies as a compiler hang; the codebase is pretty small and simple for that long a compilation.
If we only saw 4 build timeouts instead of 162, perhaps they could have been manually inspected on every crater run. So perhaps there's an opportunity here?
Some ideas:
The root problem with all the spurious OOMs is that the peak memory usage of a build scales with the number of CPUs available, but crater doesn't scale the available memory up even as it randomly varies the number of available CPUs by a factor of 10 or more. Setting a job limit on cargo would only be a partial solution, because plenty of build scripts compile C libraries and fan parallelism out to however many CPUs they detect. I think it would be a huge improvement to limit the number of CPUs, or to provide a memory limit that scales up with the number of CPUs (rough sketch below).
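To make that second option concrete, here's a minimal sketch of a memory limit that tracks the CPU count. `SandboxLimits` and both constants are hypothetical placeholders, not crater's actual configuration or API:

```rust
/// Hypothetical sandbox limits. Both constants are placeholders, not tuned
/// values; the point is only that the memory cap grows with parallelism.
struct SandboxLimits {
    cpus: usize,
    memory_bytes: usize,
}

impl SandboxLimits {
    fn for_host(available_cpus: usize) -> Self {
        const BASE_MEMORY: usize = 1536 << 20; // 1.5 GB, the current fixed limit
        const PER_CPU_MEMORY: usize = 256 << 20; // extra headroom per parallel job (made up)

        SandboxLimits {
            cpus: available_cpus,
            memory_bytes: BASE_MEMORY + PER_CPU_MEMORY * available_cpus,
        }
    }
}

fn main() {
    // With 4 CPUs this gives ~2.5 GB; with 64 CPUs, ~17.5 GB.
    let limits = SandboxLimits::for_host(64);
    println!("{} CPUs -> {} MB", limits.cpus, limits.memory_bytes >> 20);
}
```

The right per-CPU figure would need tuning against real runs; the point is only that the cap grows with parallelism instead of staying fixed at 1.5 GB.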
The build timeouts, as well as the things that crater summaries already categorize as `error`, are also quite interesting. Quite a few just look like this:
```
[INFO] fetching crate asfa 0.9.0...
[ERROR] this task or one of its parent failed!
[ERROR] no output for 300 seconds
[ERROR] note: run with `RUST_BACKTRACE=1` to display a backtrace.
```
This crate, at the same version, was `test-pass` in 1.56, `test-fail` in 1.57, and `error` in 1.58. This sort of output smells like a transient network error. Is there a retry mechanism for crate builds? And even if there isn't, it would be good to get a lot more logging around downloads so that we have more hope of diagnosing these. The `error` crates aren't `build-fail` (which is what the title says), but this seems like the same pathology the timeout crates suffer from; almost like a sudden loss of networking.
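If there isn't a retry mechanism, something along these lines would at least separate "the network hiccuped once" from "this crate can't be fetched at all". This is a generic sketch with made-up attempt counts and delays, not crater's or rustwide's actual fetch path:

```rust
use std::{thread, time::Duration};

/// Retry a fallible operation with a linear backoff, logging each failure so
/// that transient network errors at least leave a trace in the build log.
/// Purely illustrative; crater/rustwide's real plumbing will differ.
fn with_retries<T, E: std::fmt::Display>(
    attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last_err = None;
    for attempt in 1..=attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                eprintln!("[WARN] attempt {attempt}/{attempts} failed: {err}");
                last_err = Some(err);
                if attempt < attempts {
                    thread::sleep(Duration::from_secs(10 * attempt as u64));
                }
            }
        }
    }
    Err(last_err.expect("attempts must be at least 1"))
}
```

Even without the retries, the logging alone would make these `error` results much easier to diagnose after the fact.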