| 76911c29 | 07-Jan-2026 |
SSD <[email protected]> |
Partial support for no_std in cranelift_codegen (#12222)
* Move most things from std to core and alloc
* Port assembler_x64 to no_std
* before adding prelude to each file
* Most of the files now
Partial support for no_std in cranelift_codegen (#12222)
* Move most things from std to core and alloc
* Port assembler_x64 to no_std
* before adding prelude to each file
* Most of the files now work with no_std
* update isle to use alloc and core
* some instances shouldn't have been renamed, fixes cargo test
* add cranelift-assembler-x64 (no_std) to CI
* fix codegen_meta, missed one spot with std::slice
* automatically remove prelude with cargo fix
* update isle changes
* update assembler changes
* update assembler changes
* use latest codegen changes + fix FxHash problem
* add imports
* fix floating issues with libm
* remove unused import
* temporarily remove OnceLock
* add no_std arm support and add it into CI
* Move most things from std to core and alloc
* Port assembler_x64 to no_std
* before adding prelude to each file
* Most of the files now work with no_std
* update isle to use alloc and core
* some instances shouldn't have been renamed, fixes cargo test
* add cranelift-assembler-x64 (no_std) to CI
* automatically remove prelude with cargo fix
* update isle changes
* update assembler changes
* update assembler changes
* use latest codegen changes + fix FxHash problem
* add imports
* fix floating issues with libm
* remove unused import
* temporarily remove OnceLock
* add no_std arm support and add it into CI
* Move most things from std to core and alloc
* Port assembler_x64 to no_std
* before adding prelude to each file
* Most of the files now work with no_std
* update isle to use alloc and core
* add cranelift-assembler-x64 (no_std) to CI
* automatically remove prelude with cargo fix
* update isle changes
* update assembler changes
* use latest codegen changes + fix FxHash problem
* add imports
* fix floating issues with libm
* temporarily remove OnceLock
* add no_std arm support and add it into CI
* revert Cargo.toml formating
* remove prelude and fix cargo.toml
* cargo fmt
* remove empty lines
* bad renames
* macro_use only on no_std
* revert OnceLock change
* only use stable libm features
* update regalloc2
* update comment
* use continue instead
* Update vets
---------
Co-authored-by: Alex Crichton <[email protected]>
show more ...
|
| 82c0a09b | 26-Mar-2025 |
Chris Fallin <[email protected]> |
Simplify aegraphs by removing union-find and canonical eclass IDs. (#10471)
I was recently re-thinking through some of the core data structure design in our aegraph implementation, and wondered: do
Simplify aegraphs by removing union-find and canonical eclass IDs. (#10471)
I was recently re-thinking through some of the core data structure design in our aegraph implementation, and wondered: do we really need the union-find data structure, the notion of "canonical" ID for an eclass separate from its latest ID (root of union-node tree), and the hashcons-key canonicalization using all of this? It's an awful lot of complexity and has led to some fairly subtle bugs (e.g., #6126), and is generally unsatisfying.
I had the realization: the only case where the distinction between canonical and latest ID matters is when we expand an eclass after its initial (eager) rewriting, which happens before we see its uses. If we hypothesize that this happens rarely, then it should be fine to canonicalize based only on latest ID -- we shouldn't lose much (and we can measure this loss empirically).
The chief case where this kind of "late eclass expansion" still happens is: if we have some expression E1 that eventually rewrites to E2 via some simplification, and E2 already exists earlier in the program, then E1 will join the eclass. If we then have some `E3 := E1 + 1`, and later `E4 := E2 + 1`, we need the union-find canonicalization for E4 to GVN to E3. Otherwise, the latest ID for the eclass that eventually contains E1 and E2 is different at the time that E3 uses it (and is GVN'd and rewritten) and when E4 does. Put another way: E3 captures a snapshot of its operand's eclass before a new node joins it, and is never reprocessed when that happens, so E3 remains distinct.
But if the `E2 -> E1` rewrite is truly "directional" toward a better representation that we will always want to choose -- say, `x + 0 -> x`, or any constant-propagation in general -- then if the eager rewriting for E2 produces E1's eclass ID directly *without* adding E2 to its nodes, then all users will still canonicalize as before. This "only return the rewrite target, don't union with it" before is exactly our `subsume` operator.
Put another way: subsumption prevents growing eclasses later, so snapshots in time remain the latest, and everyone canonicalizes with the same ID. We move to a true immutable data structure, with simple hashconsing with no magic.
The rewrite semantics are then much simpler too: if any value is marked "subsume", we pick it (pick an arbitrary one if multiple) as our rewrite result; otherwise, create an eclass with the original rewrite input and all rewrite outputs (by creating union-nodes in the DFG). No auto-subsume or factoring in availability -- that's it. "Subsume" means: always pick the rewritten value, don't keep the original, and we use it whenever that's unambiguously true for better canonicalization.
Given that, it turns out that we can remove the union-find mechanism entirely. It also turns out that we can unify the scoped-hashmap for "effectful/idempotent ops" with the ordinary hashmap for pure ops; everything can be scoped. To get working LICM we do need to retain our "available block" mechanism, but only to insert values at a higher scope level in the scoped hashmap -- it is now heuristic, not load-bearing for correctness.
I suspect that the fixpoint loop computing analysis results can go away too, now that we truly don't update arguments -- we have a simple immutable data structure with everything snapshotted at creation -- but I haven't made that change yet.
This change appears to be "in the noise" in both runtime and compile time -- a Sightglass run on the default suite shows only
``` compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 551234.50 ± 514580.62 (confidence = 99%)
new.so is 1.00x to 1.01x faster than old.so!
[61669181 72513567.85 98139932] new.so [60991071 73064802.35 120044089] old.so
execution :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 232827.80 ± 204621.12 (confidence = 99%)
old.so is 1.00x to 1.01x faster than new.so!
[67208140 72812782.32 89996076] new.so [69531172 72579954.52 80530142] old.so ```
which seem like suitably small swings that are fine. Spot-checking the aegraph stats on the same function before-and-after shows the same optimizations happening in all functions I examined, and we see the compile-tests showing no movement except for a value renumbering in one case. So: no effect objectively, but deletes code and significantly simplifies the core algorithm.
show more ...
|
| b9f2a306 | 07-Nov-2023 |
Nick Fitzgerald <[email protected]> |
Cranelift: Break op cost ties with expression depth in egraphs (#7456)
* Cranelift: Switch egraph `Cost` to a struct with named fields
Mechanical change.
* Cranelift: Break op cost ties with expre
Cranelift: Break op cost ties with expression depth in egraphs (#7456)
* Cranelift: Switch egraph `Cost` to a struct with named fields
Mechanical change.
* Cranelift: Break op cost ties with expression depth in egraphs
This means that, when the opcode cost is the same, we prefer shallow and wide expressions to narrow and deep. For example, `(a + b) + (c + d)` is preferred to `((a + b) + c) + d`. This is beneficial because it exposes more instruction-level parallelism and shortens live ranges.
Co-Authored-By: Trevor Elliott <[email protected]>
* Cranelift: Bitpack the egraph `Cost` structure
Co-Authored-By: Chris Fallin <[email protected]> Co-Authored-By: Trevor Elliott <[email protected]>
* Make it so you can't construct `Cost::inifinity()` by accident
* Use fold to code golf
---------
Co-authored-by: Trevor Elliott <[email protected]> Co-authored-by: Chris Fallin <[email protected]>
show more ...
|
| 77d030cc | 20-Oct-2023 |
Chris Fallin <[email protected]> |
egraphs: don't let rematerialization override LICM. (#7306)
This reworks the way that remat and LICM interact during aegraph elaboration. In principle, both happen during the same single-pass "code
egraphs: don't let rematerialization override LICM. (#7306)
This reworks the way that remat and LICM interact during aegraph elaboration. In principle, both happen during the same single-pass "code placement" algorithm: we decide where to place pure instructions (those that are eligible for movement), and remat pushes them one way while LICM pushes them the other.
The interaction is a little more subtle than simple heuristic priority, though -- it's really a decision ordering issue. A remat'd value wants to sink as deep into the loop nest as it can (to the use's block), but we don't know *where* the uses go until we process them (and make LICM-related choices), and we process uses after defs during elaboration. Or more precisely, we have some work at the use before recursively processing the def, and some work after the recursion returns; and the LICM decision happens after recursion returns, because LICM wants to know where the defs are to know how high we can hoist. (The recursion is itself unrolled into a state machine on an explicit stack so that's a little hard to see but that's what is happening in principle.)
The solution here is to make remat a separate just-in-time thing, once we have arg values. Just before we plug the final arg values into the elaborated instruction, we ask: is this a remat'd value, and if so, do we have a copy of the computation in this block yet. If not, we make one. This has to happen in two places (the main elab loop and the toplevel driver from the skeleton).
The one downside of this solution is that it doesn't handle *recursive* rematerialization by default. This means that if we, for example, decide to remat single-constant-arg adds (as we actually do in our current rules), we won't then also recursively remat the constant arg to those adds. This can be seen in the `licm.clif` test case. This doesn't seem to be a dealbreaker to me because most such cases will be able to fold the constants anyway (they happen mostly because of pointer pre-computations: a loop over structs in Wasm computes heap_base + p + offset, and naive LICM pulls a `heap_base + offset` out of the loop for every struct field accessed in the loop, with horrible register pressure resulting; that's why we have that remat rule. Most such offsets are pretty small.).
Fixes #7283.
show more ...
|