path: root/src/passes
Commit message [Author, Date, Files changed, -Lines/+Lines]
...
* Fix the type of reused RefFunc in Precompute (#6976) [Alon Zakai, 2024-09-30, 1 file, -1/+2]
  When we precompute something, we try to avoid allocating a new copy. That's important to avoid many allocations each time we run Precompute - otherwise, each time we see a br we'd allocate a fresh one for it and for its values. But we had a bug where we reused a RefFunc as the value of a br without updating the type. It's actually tricky to reach a situation where we find a RefFunc to reuse and it is different from the actual one we want, but the fuzzer found one. Fixes the fuzz bug reported on #6845 (but unrelated to that PR).
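  A minimal sketch of the bug class, using real Binaryen IR field names but a hypothetical helper (this is not the actual patch): when a node is reused to avoid an allocation, every field must be refreshed, including the cached type.

  ```cpp
  #include "wasm.h" // Binaryen IR: wasm::RefFunc has `func` and `type`

  // Hypothetical helper: repoint an existing RefFunc node at a new
  // target function instead of allocating a fresh node.
  static wasm::RefFunc* reuseRefFunc(wasm::RefFunc* refFunc,
                                     wasm::Name newFunc,
                                     wasm::Type newType) {
    refFunc->func = newFunc;
    refFunc->type = newType; // the update that was missing in the bug
    return refFunc;
  }
  ```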
* [NFC] Print type names in more places when logging (#6975) [Alon Zakai, 2024-09-30, 1 file, -0/+5]
* [FP16] Implement conversion operations. (#6974) [Brendan Dahl, 2024-09-26, 1 file, -0/+12]
  Note: FP16 is a little different from F32/F64 since it can't represent the full 2^16 integer range. 65504 is the max whole integer. This leads to some slightly strange behavior when converting integers greater than 65504, since they become infinity. Specified at https://github.com/WebAssembly/half-precision/blob/main/proposals/half-precision/Overview.md
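  To illustrate the limited range (my own sketch, not Binaryen code): the largest finite f16 value is 65504, and under IEEE round-to-nearest-even, values at or beyond 65520 overflow to infinity. This simplified rounder handles normal values only, ignoring subnormals and NaN.

  ```cpp
  #include <cmath>

  // Round a value to the nearest representable f16 (normals only).
  double roundToF16(double x) {
    if (std::fabs(x) >= 65520.0) { // beyond the max finite f16, 65504
      return std::copysign(INFINITY, x);
    }
    int exp;
    std::frexp(x, &exp); // x = m * 2^exp with 0.5 <= |m| < 1
    // f16 keeps 11 significant bits (1 implicit + 10 stored), so the
    // spacing between representable values near x is 2^(exp - 11).
    double step = std::ldexp(1.0, exp - 11);
    return std::nearbyint(x / step) * step;
  }
  // roundToF16(65504.0) == 65504.0, but roundToF16(65535.0) is infinite.
  ```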
* [NFC-ish] Stop creating unneeded blocks around calls when inlining (#6969) [Alon Zakai, 2024-09-26, 1 file, -6/+3]
  Inlining was careful about nested calls like this:

  ```wast
  (call $a
    (call $b)
  )
  ```

  If we inlined the outer call first, we'd have

  ```wast
  (block $inlined-code-from-a
    ..code..
    (call $b)
  )
  ```

  After that, the inner call is a child of a block, not of a call. That is, we've moved the inner call to another parent. To replace that inner call when we inline, we'd need to update the new parent, which would require work. To avoid that work, the pass simply created a block in the middle:

  ```wast
  (call $a
    (block
      (call $b)
    )
  )
  ```

  Now the inner call's immediate parent will not change when we inline the outer call. However, it turns out that this was entirely unnecessary. We find the calls using a post-order traversal, and we store the actions in a vector that we traverse in order, so we only ever process things in the optimal order of children before parents. And in that order there is no problem: inlining the inner call first leads to

  ```wast
  (call $a
    (block $inlined-code-from-b
      (..code..)
    )
  )
  ```

  That does not affect the outer call's parent.

  This PR removes the creation of the unnecessary blocks. This doesn't improve the final output, as optimizations remove the unneeded blocks later anyhow, but it does make the code simpler and a little faster. It also makes debugging less confusing. But this is not truly NFC, because --inlining (but not --inlining-optimizing) will actually emit fewer blocks now (only --inlining-optimizing is used by default in production).

  The diff on tests here is very small when ignoring whitespace. The remaining differences are just emitting fewer obviously-unneeded blocks. There is also one test that needed manual changes, inlining-eh-legacy, because it tested that we do Pop fixups, but after emitting one fewer block, those fixups were not needed. I added a new test there with two nested calls, which does end up needing those fixups. I also added such a test in inlining_all-features so that we have coverage for such nested calls (we might remove the eh-legacy file some day, and other existing tests with nested calls that I found were more complex).
* [NFC] Early-exit PickLoadSigns if there are no memories (#6971) [Alon Zakai, 2024-09-26, 1 file, -0/+5]
  In WasmGC modules there is often no memory at all, and we can skip walking the code in this pass in such cases.
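  The early exit has roughly this shape (a hypothetical pass shell, simplified from the real one):

  ```cpp
  #include "pass.h"
  #include "wasm.h"

  struct PickLoadSignsSketch : wasm::Pass {
    void run(wasm::Module* module) override {
      // Loads require a memory; WasmGC modules often have none, so we
      // can skip walking all the code.
      if (module->memories.empty()) {
        return;
      }
      // ... walk functions and optimize load signedness as before ...
    }
  };
  ```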
* [NFC-ish] Avoid repeated ReFinalize etc. when inlining (#6967) [Alon Zakai, 2024-09-24, 1 file, -6/+24]
  We may inline multiple times into a single function. Previously, if we did so, we did the "fixups" such as ReFinalize and non-nullable local fixes once per such inlining. But that is wasteful, as each ReFinalize etc. scans the whole function; they could be done once after we copy all the code from all the inlinings, which is what this PR does: it splits doInlining() into one function that inlines code and one that does the updates after, and the update is done after all inlinings.

  This turns out to be very important, a 5x speedup on two large real-world wasm files I am looking at. The reason is that we actually inline more than once in half the cases, and sometimes far more - in one case we inline over 1,000 times into a function! (and ReFinalized 1,000 times too many)

  This is practically NFC, but it turns out that there are some tiny noticeable differences between running ReFinalize once at the end vs. once after each inlining. These differences are not really functional or observable in the behavior of the code, and optimizations would remove them anyhow, but they are noticeable in two tests here. The changes to tests are, in order:

  * Different block names, just because the counter we use sees more things.
  * In a testcase with unreachable code, we inline twice into a function, and the first inlining brings in an unreachable; ReFinalizing early leads to it propagating differently than if we wait to ReFinalize. (It actually leads to another cycle of inlining in that case, as a fluke.)
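  The shape of the change, as a generic self-contained sketch with hypothetical names: do the cheap per-inlining work in a loop, and pay the whole-function fixup cost once at the end.

  ```cpp
  #include <functional>
  #include <vector>

  // Apply all actions, then finalize once, instead of finalizing after
  // every action (each finalization rescans the whole function).
  template<typename Action>
  void applyAllThenFixup(std::vector<Action>& actions,
                         const std::function<void(Action&)>& apply,
                         const std::function<void()>& fixup) {
    for (auto& action : actions) {
      apply(action); // cheap: copy one callee body in
    }
    fixup(); // expensive: a ReFinalize-style whole-function pass, once
  }
  ```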
* [NFC] Parallelize the actual inlining part of the Inlining pass (#6966) [Alon Zakai, 2024-09-24, 2 files, -30/+97]
  We already parallelized collecting info about functions and finding inlining opportunities, but the actual inlining - copying the code into the target function - was done sequentially. It turns out that a lot of work happens there: this PR makes the pass over 2x faster. Details:

  * Move nameHint to InliningAction, so it is there when we apply the actions.
  * Add a DoInlining internal pass which calls doInlining().
  * Refactor OptUtils a little to make it easy to run DoInlining before opts.
  * Also remove the return value of doInlining, which was not used.
* [NFC] Eagerly create segments when parsing datacount (#6958) [Thomas Lively, 2024-09-19, 1 file, -0/+5]
  The purpose of the datacount section is to pre-declare how many data segments there will be, so that engines can allocate space for them and not have to back-patch subsequent instructions in the code section that refer to them. Once we use IRBuilder in the binary parser, we will have to have the data segments available by the time we parse instructions that use them, so eagerly construct the data segments when parsing the datacount section.
* [NFC] Eagerly create Functions in binary parser (#6957) [Thomas Lively, 2024-09-19, 1 file, -0/+3]
  In preparation for using IRBuilder in the binary parser, eagerly create Functions when parsing the function section so that they are already created once we parse the code section. IRBuilder will require the functions to exist when parsing calls so it can figure out what type each call should have, even when there is a call to a function whose body has not been parsed yet.
* [NFC] Add isSSA to LazyLocalGraph, and use it in OptimizeAddedConstants (#6952) [Alon Zakai, 2024-09-18, 1 file, -7/+5]
  This makes the pass 15% faster.
* [NFC] Avoid collecting unnecessary parents in OptimizeAddedConstants (#6953) [Alon Zakai, 2024-09-18, 1 file, -2/+24]
  This makes the pass 20% faster.
* Fix selects of packed fields in GlobalStructOptimization (#6947) [Alon Zakai, 2024-09-17, 1 file, -2/+4]
  We emit a select between two objects when only two objects exist of a particular type. However, if the field is packed, we did not handle truncating the written values.
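  A sketch of the missing step (assumed packed-field semantics): a value stored to an i8 or i16 field keeps only its low bits, so the select-based rewrite must truncate what it writes.

  ```cpp
  #include <cstdint>

  // Truncate a value the way storage to a packed field would
  // (bits is 8 for i8 fields, 16 for i16 fields).
  uint32_t truncateToPackedField(uint32_t value, unsigned bits) {
    return value & ((1u << bits) - 1);
  }
  // truncateToPackedField(0x1FF, 8) == 0xFF
  ```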
* Require string-style identifiers to be UTF-8 (#6941) [Thomas Lively, 2024-09-16, 1 file, -2/+5]
  In the WebAssembly text format, strings can generally be arbitrary bytes, but identifiers must be valid UTF-8. Check for UTF-8 validity when parsing string-style identifiers in the lexer. Update StringLowering to generate valid UTF-8 global names even for strings that may not be valid UTF-8, and test that text round-tripping works correctly after StringLowering. Fixes #6937.
* Fix Heap2Local on pops inside of newly-created blocks (#6938) [Alon Zakai, 2024-09-16, 1 file, -0/+17]
* [NFC] Make Precompute use a lazy LocalGraph (#6934) [Alon Zakai, 2024-09-12, 1 file, -4/+4]
  To do this, add locations and getInfluences to LazyLocalGraph. Both cannot really be computed in a fine-grained manner, so just compute them all on the first request. That is not as efficient as our lazy computation of getSets and setInfluences, but they are also less important, and this change makes the pass 20% faster.
* [NFC] Ignore invalid precomputed fallthrough values (#6933) [Alon Zakai, 2024-09-11, 1 file, -19/+13]
  Followup to #6928. If the fallthrough precomputes to an invalid type, it makes no sense to precompute the non-fallthrough, as it can't return anything useful.
* [NFC] Optimize propagateLocals() (#6928) [Alon Zakai, 2024-09-11, 1 file, -116/+156]
  That is the slowest part of --precompute-propagate, which is one of our slowest passes. This makes that pass 25% faster.

  The main change is to make the maps consider missing elements as non-constant, rather than storing a None element in them. That saves allocating entries for the common case of a non-constant set/get.

  Also switch to a simple vector for the work queue, which is possible if we only add work when a get/set becomes constant (which can only happen once). Another benefit to adding work in that manner is that we don't start by adding every single get/set as "work" at the start.
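  A generic sketch of the two ideas (my illustration; the names are hypothetical): a missing map entry means "not constant", so the common non-constant case allocates nothing, and an item enters the work queue only at its one-time transition to constant, so a plain vector suffices as the queue.

  ```cpp
  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  struct ConstantLattice {
    std::unordered_map<uint32_t, int64_t> constants; // id -> known value
    std::vector<uint32_t> work;                      // ids to propagate

    // Called when an id is proven constant. The transition happens at
    // most once, so each id is queued at most once.
    void markConstant(uint32_t id, int64_t value) {
      if (constants.emplace(id, value).second) {
        work.push_back(id);
      }
    }

    // A missing entry means non-constant: no None sentinel is stored.
    bool isConstant(uint32_t id) const { return constants.count(id) > 0; }
  };
  ```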
* [EH] Fix pop enclosed within a block in DCE (#6922) [Heejin Ahn, 2024-09-10, 1 file, -12/+13]
  #6400 fixed this case, but only when a `pop` is an immediate child of the current expression; a `pop` can also be nested deeper down. We conservatively run the EH fixup whenever we have a `pop` and create `block`s in the optimization. We considered using `FindAll<Pop>` to make it precise, but we decided the quadratic time complexity was not worth it. Fixes #6918.
* Replace the old topological sort everywhere (#6902) [Thomas Lively, 2024-09-10, 2 files, -21/+16]
  To avoid having two separate topological sort utilities in the code base, replace remaining uses of the old DFS-based, CRTP topological sort with the newer Kahn's algorithm implementation. This would be NFC, except that the new topological sort produces a different order than the old one, so the output of some passes is reordered.
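  For reference, a self-contained sketch of Kahn's algorithm (not the actual Binaryen utility): repeatedly emit a vertex whose remaining in-degree is zero.

  ```cpp
  #include <cstddef>
  #include <queue>
  #include <vector>

  std::vector<size_t>
  topologicalSort(const std::vector<std::vector<size_t>>& successors) {
    std::vector<size_t> inDegree(successors.size());
    for (const auto& succs : successors) {
      for (size_t s : succs) {
        ++inDegree[s];
      }
    }
    std::queue<size_t> ready;
    for (size_t i = 0; i < successors.size(); ++i) {
      if (inDegree[i] == 0) {
        ready.push(i);
      }
    }
    std::vector<size_t> order;
    while (!ready.empty()) {
      size_t v = ready.front();
      ready.pop();
      order.push_back(v);
      for (size_t s : successors[v]) {
        if (--inDegree[s] == 0) {
          ready.push(s);
        }
      }
    }
    return order; // shorter than the vertex count iff there was a cycle
  }
  ```

  A "minimum" sort like the minSort mentioned elsewhere in this log can be built from the same loop by replacing the FIFO queue with a heap ordered by a user-provided comparator.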
* [NFC] Remove excessive debug logging from binary reading (#6927) [Alon Zakai, 2024-09-10, 1 file, -0/+1]
  We were doing debug logging for every LEB byte. It turns out that the isDebugEnabled() calls are expensive when called so frequently: in a release+assertion build, even with debug disabled, these checks are the highest thing in the profile. This PR removes the checks, which makes binary reading 12% faster.
* [NFC] Use a lazy LocalGraph in Heap2Local (#6925) [Alon Zakai, 2024-09-10, 1 file, -8/+6]
  That pass only cares about reference locals, and even among those, only ones where we see a struct.new/array.new flow through locals. This makes the pass 40% faster.
* [NFC] OptimizeAddedConstants: Early exit if there are no memories (#6926) [Alon Zakai, 2024-09-10, 1 file, -0/+5]
  The pass optimizes loads and stores, so without a memory there is nothing to do. This only helps if the user set --low-memory-unused and also has no memory, which is likely rare, but it's a trivial change so it seems worthwhile. In particular, this pass constructs a LocalGraph, so when we can avoid work, the savings can be substantial.
* Add a --preserve-type-order option (#6916) [Thomas Lively, 2024-09-10, 2 files, -1/+10]
  Unlike other module elements, types are not stored on the `Module`. Instead, they are collected by traversing the IR before printing and binary writing. The code that collects the types tries to optimize the order of rec groups based on the number of times each type is used. As a result, the output order of types generally has no relation to the input order of types. In addition, most type optimizations rewrite the types into a single large rec group, and the order of types in that group is essentially arbitrary. Changes to the code for counting type uses, sorting types, or sorting rec groups can yield very large changes in the output order of types, producing test diffs that are hard to review and potentially harming the readability of tests by moving output types away from the corresponding input types.

  To help make test output more stable and readable, introduce a tool option that causes the order of output types to match the order of input types as closely as possible. It is implemented by having the parsers record the indices of the input types on the `Module`, just like they already record the type names. The `GlobalTypeRewriter` infrastructure used by type optimizations associates the new types with the old indices just like it already does for names, and also respects the input order when rewriting types into a large recursion group.

  By default, wasm-opt and other tools clear the recorded type indices after parsing the module, so their default behavior is not modified by this change. Follow-on PRs will use the new flag in more tests, which will generate large diffs but leave the tests in stable, more readable states that will no longer change due to other changes to the optimizing type sorting logic.
* [NFC] Standardize Super:: over super:: (#6920) [Alon Zakai, 2024-09-10, 21 files, -24/+24]
  As the name of a class, uppercase seems better here.
* [NFC-ish] Remove LocalGraph from LocalSubtyping (#6921) [Alon Zakai, 2024-09-10, 1 file, -69/+42]
  The LocalGraph there was used for two purposes:

  1. Get the list of gets and sets.
  2. Get only the reachable gets and sets.

  It is trivial to get all the gets and sets in a much faster way, by just walking the code as this PR does. The downside is that we also consider unreachable gets and sets, so unreachable code can prevent us from optimizing, but that seems worthwhile as many passes make that assumption (and they all become maximally effective after --dce). That is the only non-NFC part here.

  Removing LocalGraph + the fixup code for unreachability makes this significantly shorter, and also 2-3x faster.
* [NFC] LazyLocalGraph: Add getSetInfluences() (#6909) [Alon Zakai, 2024-09-09, 1 file, -1/+7]
  This new API on lazy local graphs allows us to use laziness in another place, StackIR opts. This makes writing the binary (which includes StackIR opts, when those are enabled) 10% faster.
* [NFC] Rename topological_orders.h to topological_sort.h (#6915) [Thomas Lively, 2024-09-07, 2 files, -2/+2]
  Now that the header includes topological sort utilities in addition to the utility that iterates over all topological orders, it makes more sense for it to be named topological_sort.h.
* Adds a J2CL specific pass that moves itable entries to vtables (#6888) [Roberto Lublinerman, 2024-09-06, 4 files, -0/+350]
  This allows removing a reference field from all Java objects, reducing the per-object memory and initialization overhead. The pass is designed to run directly on the J2CL output, before other optimizations, since it relies on invariants that might get lost in optimization. If the invariants don't hold, the pass aborts.
* Avoid conflicts with public rec groups in MinimizeRecGroups (#6911) [Thomas Lively, 2024-09-06, 1 file, -7/+73]
  As with all type optimizations, MinimizeRecGroups only changes private types, which are the only types that are safe to modify. However, it is important for correctness that MinimizeRecGroups maintain separate type identities for all types, whether public or private, to ensure that casts that should differentiate two types cannot change behavior.

  Previously the pass worked exclusively on private types, so there was nothing preventing it from constructing a minimal rec group that happened to have the same shape, and therefore the same type identity, as a public rec group. #6886 exhibits a fuzzer test case where this happens and changes the behavior of the program. Fix the bug by recording all public rec group shapes and resolving conflicts with those shapes by updating the shape of the conflicting non-public type. Fixes #6886.
* [NFC] Avoid wasted LocalGraph work in MergeLocals (#6908) [Alon Zakai, 2024-09-05, 1 file, -1/+1]
  We computed both get and set influences, but getGetInfluences() was never called, so that work was entirely pointless. This makes the pass 20% faster.
* [NFC] Add a lazy mode to LocalGraph (#6895) [Alon Zakai, 2024-09-05, 1 file, -3/+4]
  LocalGraph by default will compute all the local.sets that can be read from all local.gets. However, many passes only query a small subset of those. To avoid wasted work, add a lazy mode that only computes sets when asked about a get. This is then used in a single place, LoopInvariantCodeMotion, which becomes 18% faster.
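  A generic sketch of the lazy mode (assumed shape, not the real class): compute the answer for a get only on first request, then memoize it.

  ```cpp
  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  struct LazySetsSketch {
    using SetList = std::vector<uint32_t>;
    std::unordered_map<uint32_t, SetList> cache; // get id -> set ids

    const SetList& getSets(uint32_t get) {
      auto it = cache.find(get);
      if (it == cache.end()) {
        // First request for this get: run the flow analysis now.
        it = cache.emplace(get, computeSets(get)).first;
      }
      return it->second;
    }

  private:
    SetList computeSets(uint32_t get); // the actual, expensive analysis
  };
  ```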
* Only generate string.consts custom section if it is needed (#6893) [Goktug Gokdogan, 2024-09-05, 1 file, -7/+10]
* [NFC] Use Index instead of size_t in topological sort util (#6903) [Thomas Lively, 2024-09-05, 2 files, -41/+41]
  This saves memory and could in principle improve performance, although a quick experiment with 30 samples on ReorderGlobals did not yield a statistically significant improvement. At any rate, using Index is more consistent with other parts of the code base.
* [NFC] Compute only one dependence graph in ReorderGlobals (#6891) [Thomas Lively, 2024-09-04, 1 file, -28/+18]
  We previously computed both forward and reverse dependence graphs, but one of them was only used for a single topological sort that could just as well be computed by reversing the topological sort on the other graph.
* [NFC] Take custom comparator in TopologicalSort::minSort (#6890) [Thomas Lively, 2024-09-04, 1 file, -57/+35]
  Rather than finding the minimum sort with respect to the original order of vertices, find the minimum sort with respect to an arbitrary user-provided comparator. Users of the minSort utility previously had to sort their input graphs according to their desired ordering, but now they can simply provide their comparator instead.

  Take advantage of the new functionality in ReorderGlobals and also standardize on a single data type for representing dependence graphs to avoid unnecessary conversions. Together, these changes slightly speed up ReorderGlobals.

  Move the topological sort code previously in a .cpp file into the header so the comparator can be provided as a lambda template parameter instead of as a `std::function`. This makes ReorderGlobals about 5% faster.
* [NFC] Convert LocalGraph influences accesses to function calls (#6899) [Alon Zakai, 2024-09-04, 6 files, -26/+14]
  This replaces direct access of the data structure graph.*influences[foo] with a call, graph.get*influences(foo). This will allow a later PR to make those calls optionally lazy.
* [EH] Rename BrTarget to Trampoline (NFC) (#6898) [Heejin Ahn, 2024-09-04, 1 file, -30/+34]
  This renames "delegate_br_target" to "delegate_trampoline". So how we translate `try`-`delegate` is:

  - Before:
    ```wast
    (try $delegate_target
      ...
      (try
        (do
          ...
        )
        (delegate $delegate_target)
      )
      ...
    )
    ```
  - After:
    ```wast
    (try_table $delegate_target
      (throw_ref
        (block $delegate_br_target
          ...
          (try_table (catch_all $delegate_br_target)
            ...
          )
          ...
        )
      )
    )
    ```

  So `delegate_br_target` is the destination we branch (via `try_table`) to, in order to rethrow the exnref using `throw_ref`. But given that the translated code does not actually have a `br`, I think this name can be confusing. This renames `br_target` to `trampoline`, given that the block is what we bounce the exnref off of to reach the real delegate target. This is to be consistent with the variable names in the LLVM implementation (which has not been submitted yet).
* [NFC] Convert HeapStoreOptimization to use a CFG (#6896) [Alon Zakai, 2024-09-03, 1 file, -5/+49]
  This does not use the CFG yet, so there is no benefit (and likely some small slowdown). The next PR will actually use it to fix a correctness bug. This PR only sets up the CFG and converts the pass to operate on it, without changing any behavior or tests. Followup to #6882.
* [NFC] Move optimizeSubsequentStructSet() to a new pass, HeapStoreOptimization (#6882) [Alon Zakai, 2024-09-03, 5 files, -193/+250]
  This just moves code out of OptimizeInstructions to the new pass. The existing test is renamed and now runs the new pass instead. The new pass is run right after each --optimize-instructions invocation, so it should not cause any noticeable effects whatsoever, making this NFC.

  The motivation here is that there is a bug in the pass; see the new testcase added at the end, which shows the bug. It is not practical to fix that bug in OptimizeInstructions, since we need more than peephole optimizations to do so. This PR moves the code to a new pass so we can fix it there properly, later. The new pass is named HeapStoreOptimization since the same infrastructure we will need to fix the bug will also help dead store elimination and related things.
* [NFC] Refactor LocalGraph to split up flow() for future laziness work (#6880) [Alon Zakai, 2024-09-03, 1 file, -0/+1]
* [NFC] Avoid repeated work in DeadArgumentElimination scanning (#6869) [Alon Zakai, 2024-09-03, 3 files, -61/+141]
  This pass may do multiple iterations, and before this PR it scanned the entire module each time. That is simpler than tracking stale data, but it can be quite slow. This PR adds staleness tracking, which makes it over 3x faster (and this can be one of our slowest passes in some cases, so this is significant). To achieve this:

  * Add a staleness marker on function info.
  * Rewrite how we track unseen calls. Previously we used atomics in a clever way; now we just accumulate the data in a simple way (easier for staleness tracking).
  * Add staleness invalidation in the proper places.
  * Add a param to localizeCallsTo to allow us to learn when a function is changed.

  This kind of staleness analysis is usually not worthwhile, but given the 3x-plus speedup it seems justified. I fuzzed it directly, and also any staleness bug can lead to validation errors, so normal fuzzing also gives us good coverage here.
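  A generic sketch of the staleness pattern (hypothetical names): each iteration rescans only the functions whose cached info was invalidated since the previous one.

  ```cpp
  #include <string>
  #include <unordered_set>

  struct StaleSet {
    std::unordered_set<std::string> stale;

    void invalidate(const std::string& func) { stale.insert(func); }

    // Scan only what changed. Scanning may invalidate more functions,
    // which the next iteration will pick up.
    template<typename ScanFn> void iterate(ScanFn scan) {
      std::unordered_set<std::string> toScan;
      toScan.swap(stale);
      for (const auto& func : toScan) {
        scan(func);
      }
    }
  };
  ```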
* [FP16] Implement madd and nmadd. (#6878) [Brendan Dahl, 2024-09-03, 1 file, -0/+6]
  Specified at https://github.com/WebAssembly/half-precision/blob/main/proposals/half-precision/Overview.md

  A few notes:
  - The F32x4 and F64x2 versions of madd and nmadd are missing spec tests.
  - For madd, the implementation was incorrectly doing `(b*c)+a` where it should be `(a*b)+c`.
  - For nmadd, the implementation was incorrectly doing `(-b*c)+a` where it should be `-(a*b)+c`.
  - There doesn't appear to be a great way to actually implement a fused nmadd, but the spec allows the double-rounded version I added.
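  The corrected operand order, as a plain scalar sketch (doubles standing in for f16 lanes):

  ```cpp
  #include <cmath>

  // madd(a, b, c) = (a * b) + c; the old code computed (b * c) + a.
  double madd(double a, double b, double c) { return (a * b) + c; }

  // nmadd(a, b, c) = -(a * b) + c; the old code computed (-b * c) + a.
  // A fused, singly-rounded form would be std::fma(-a, b, c); the spec
  // permits the double-rounded version here.
  double nmadd(double a, double b, double c) { return -(a * b) + c; }
  ```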
* [NFC] Change topological sort utilities to functions (#6889) [Thomas Lively, 2024-09-03, 1 file, -3/+2]
  Previously they were structs and their results were accessed with `operator*()`, but that was unnecessarily complicated and could lead to problems with temporary lifetimes being too short. Simplify the utilities by making them functions. This also allows the wrapper templates to infer the proper element types automatically.
* Simplify ReorderGlobals using new topological sort utils (#6885) [Thomas Lively, 2024-08-29, 1 file, -141/+62]
  Use the new TopologicalSort and MinTopologicalSortOf utilities instead of the old CRTP topological sort utility and a bespoke heap-based topological sort in ReorderGlobals. Since there is no longer a heap to pop from, the direction of the custom comparator is now much more intuitive. Further simplify the code by switching from tracking the new order of globals using a sequence of new indices to tracking the order using a sequence of old indices. This change also makes the pass about 20% faster on a large real-world module.
* [NFC] Refactor LocalGraph's core getSets API (#6877) [Alon Zakai, 2024-08-28, 7 files, -15/+15]
  Before, we just had a map that people would access with localGraph.getSetses[get], while now it is a call, localGraph.getSets(get), which more nicely hides the internal implementation details. Also rename getSetses => getSetsMap. This will allow a later PR to optimize the internals of this API. This is performance-neutral as far as I can measure. (We do replace a direct read from a data structure with a call, but the call is in a header and should always get inlined.)
* Rename relaxed SIMD fma instructions to match spec. (#6876) [Brendan Dahl, 2024-08-27, 1 file, -8/+8]
  The instructions relaxed_fma and relaxed_fnma have been renamed to relaxed_madd and relaxed_nmadd. See https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md#binary-format
* [FP16] Implement unary operations. (#6867) [Brendan Dahl, 2024-08-27, 1 file, -0/+21]
  Specified at https://github.com/WebAssembly/half-precision/blob/main/proposals/half-precision/Overview.md
* [NFC] Optimize ParamUtils::getUsedParams() (#6866) [Alon Zakai, 2024-08-26, 4 files, -18/+43]
  This constructed a LocalGraph, which computes the sets that reach each get. But all we need to know is which params are live, so instead we can do a liveness computation (which is just a boolean, not the list of sets). Also, it is simple to get the liveness computation to only work on the parameters and not all the locals, as a further optimization. Existing tests cover this, though I did find that the case of unreachability needed a new test. On a large testcase I am looking at, this makes --dae 17% faster.
* [NFC] Avoid quadratic time when precomputing blocks (#6862) [Alon Zakai, 2024-08-21, 1 file, -0/+67]
  When precomputing fails on a child block of a parent block, there is no point in precomputing the parent, as that will fail as well. This makes --precompute on Emscripten's test_biggerswitch go from 1.44 seconds to 0.02 seconds (not a typo, that is 72x faster). The absolute number is not that big, but we do run this pass more than once, so it saves a noticeable chunk of time.
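  A generic sketch of the memoization (the mechanism as I understand it from the commit message; names are hypothetical): record subtrees that failed to precompute, so enclosing blocks give up immediately instead of re-walking the same children from every ancestor.

  ```cpp
  #include <unordered_set>
  #include <vector>

  struct PrecomputeFailures {
    std::unordered_set<const void*> failed; // subtrees known to fail

    // Returns true if a known-failing child makes trying the parent
    // pointless, recording the parent as failing too.
    bool shouldSkip(const void* parent,
                    const std::vector<const void*>& children) {
      for (const void* child : children) {
        if (failed.count(child)) {
          failed.insert(parent);
          return true;
        }
      }
      return false;
    }
  };
  ```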
* Add a string lowering mode disallowing non-UTF-8 strings (#6861) [Thomas Lively, 2024-08-21, 3 files, -2/+24]
  The best way to lower strings is via the "magic imports" API that uses the names of imported string globals as their values. This approach only works for valid UTF-8 strings, though. The existing string-lowering-magic-imports pass falls back to putting non-UTF-8 strings in a JSON custom section, but this requires the runtime to support that custom section for correctness. To help catch errors early when runtimes do not support the strings custom section, add a new pass that uses magic imports and raises an error if there are any invalid strings.