Add a mode that splits a module into arbitrarily many parts based on a
simple manifest file. This is currently implemented by splitting out one
module at a time in a loop, but this could change in the future if
splitting out all the modules at once would improve the quality of the
output.
In the WebAssembly text format, strings can generally be arbitrary
bytes, but identifiers must be valid UTF-8. Check for UTF-8 validity
when parsing string-style identifiers in the lexer.
Update StringLowering to generate valid UTF-8 global names even for
strings that may not be valid UTF-8 and test that text round tripping
works correctly after StringLowering.
Fixes #6937.
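A check like this can be a short standalone scan over the bytes. The following is a generic UTF-8 validator sketch (not the lexer's actual code):

```cpp
#include <cstdint>
#include <string_view>

// Generic validator: rejects stray continuation bytes, overlong encodings,
// surrogates, and code points above U+10FFFF (RFC 3629).
bool isValidUTF8(std::string_view s) {
  size_t i = 0;
  while (i < s.size()) {
    uint8_t lead = s[i];
    if (lead < 0x80) { ++i; continue; } // ASCII fast path
    size_t len;
    uint32_t cp, min;
    if ((lead & 0xE0) == 0xC0)      { len = 2; min = 0x80;    cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; min = 0x800;   cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; min = 0x10000; cp = lead & 0x07; }
    else return false; // invalid lead byte
    if (i + len > s.size()) return false; // truncated sequence
    for (size_t j = 1; j < len; ++j) {
      uint8_t cont = s[i + j];
      if ((cont & 0xC0) != 0x80) return false; // not a continuation byte
      cp = (cp << 6) | (cont & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF) return false;    // overlong or out of range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
    i += len;
  }
  return true;
}
```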
methods (#6936)
This just moves code around. As a result, isRef() vanishes entirely from the
profiling traces in #6931, since now the core isRef/Tuple/etc. methods are
all inlineable.
This also required some reordering of wasm-type.h, namely to move HeapType
up front. No changes to that class otherwise.
TypeInfo is now in the header. getTypeInfo is now a static method on Type.
This has the downside of moving internal details into the header, and it may
increase compile time a little. The upside is that it makes the --precompute benchmark
from #6931 significantly faster (by 33%), and it will also help the many
Type::isNonNullable() etc. calls scattered around the codebase in other passes.
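Schematically, the change looks like this; the representation below is invented for illustration, not wasm-type.h's real layout:

```cpp
// Before, the definition lived in wasm-type.cpp:
//
//   bool Type::isRef() const { return getTypeInfo(*this)->isRef(); }
//
// so every call crossed a translation-unit boundary. After, TypeInfo and
// the accessors are in the header, so callers everywhere can inline them.
struct TypeInfo {
  bool isRef;
  bool isTuple;
};

class Type {
  const TypeInfo* info; // illustrative representation only

public:
  static const TypeInfo* getTypeInfo(const Type& type) { return type.info; }

  bool isRef() const { return getTypeInfo(*this)->isRef; }
  bool isTuple() const { return getTypeInfo(*this)->isTuple; }
};
```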
Wasm-split generally assumes that calls to secondary functions made
before the secondary module has been loaded and instantiated should go
to imported placeholder functions that can be responsible for loading
the secondary module and forwarding the call to the loaded function.
That scheme makes the loading entirely transparent from the
application's point of view, which is not always a good thing. Other
schemes would make it impossible for a secondary function to be called
before the secondary module has been explicitly loaded, in which case
the placeholder functions would never be called. To improve code size
and simplify instantiation under these schemes, add a new
`--no-placeholders` option that skips adding imported placeholder
functions.
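A minimal sketch of the decision this option controls, with invented names (Module, addPlaceholderImport, and the rest are illustrative stand-ins, not wasm-split's real internals):

```cpp
#include <string>
#include <vector>

// Minimal stand-ins; not wasm-split's real types.
using Name = std::string;
struct Module { std::vector<Name> importedFuncs; };

// A placeholder import loads the secondary module on first call and then
// forwards the call to the real function.
void addPlaceholderImport(Module& primary, const Name& func) {
  primary.importedFuncs.push_back("placeholder." + func);
}

// With --no-placeholders we simply skip creating the imports, shrinking
// the primary module; the embedder must then guarantee the secondary
// module is instantiated before any secondary function can be called.
void setupSecondaryCalls(Module& primary,
                         const std::vector<Name>& secondaryFuncs,
                         bool noPlaceholders) {
  for (const Name& func : secondaryFuncs) {
    if (!noPlaceholders) {
      addPlaceholderImport(primary, func);
    }
  }
}
```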
There are a few heap types that are hard-coded to be considered public
and therefore allowed on module boundaries even in --closed-world mode,
specifically to support js-string-builtins. We previously considered
both open and closed (i.e. final) mutable i8 arrays to be public in this
manner, but js-string-builtins only uses the closed versions, so remove
the open versions.
This fixes a particular bug in which Unsubtyping optimized a private
array type to be equivalent to an ignorable public array type,
incorrectly changing the behavior of a cast, but it does not address the
larger problem of optimizations producing types that are equivalent to
public types. Add a TODO about that problem for now.
Fixes #6935.
To do this, add locations and getInfluences to LazyLocalGraph. Neither can
really be computed in a fine-grained manner, so just compute them all on the
first request. That is not as efficient as our lazy computation of getSets and
setInfluences, but they are also less important, and this change makes the
pass 20% faster.
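Schematically, this is coarse-grained laziness: unlike getSets, which is computed per item, the first influences query pays for computing the whole map, and later queries are cheap lookups (stand-in types; names illustrative):

```cpp
#include <map>
#include <optional>
#include <set>

// Coarse-grained laziness: nothing is computed until the first query,
// and the first query computes everything at once.
struct LazyInfluencesSketch {
  using Node = int; // stand-in for a local.get/local.set pointer

  const std::set<Node>& getInfluences(Node node) {
    if (!influences) {
      computeAll(); // one full pass over the function, only on demand
    }
    return (*influences)[node];
  }

private:
  std::optional<std::map<Node, std::set<Node>>> influences;

  void computeAll() {
    influences.emplace();
    // ... walk the function once, filling in every entry
    //     (the same idea applies to |locations|) ...
  }
};
```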
Followup to #6928. If the fallthrough precomputes to an invalid type, it
makes no sense to precompute the non-fallthrough, as it can't return anything
useful.
The error checking we had to report an error when the input contains
block parameters was in a code path that is no longer executed under
normal circumstances. Specifically, it was part of the
`ParseModuleTypesCtx` phase of parsing, which no longer parses function
bodies. Move the error checking to the `ParseDefsCtx` phase, which does
parse function bodies.
Fixes #6929.
That is the slowest part of --precompute-propagate, which is one of our slowest
passes. This makes that pass 25% faster.
The main change is to make the maps consider missing elements as non-constant,
rather than storing a None element in them. That saves allocating entries for the
common case of a non-constant set/get.
Also switch to a simple vector for the work queue, which is possible if we only
add work when a get/set becomes constant (which can only happen once). Another
benefit of adding work in that manner is that we don't begin by seeding the
queue with every single get/set.
#6400 fixed this case, but it only handled a `pop` that is an immediate
child of the current expression; a `pop` can be nested deeper down.
We conservatively run the EH fixup whenever we have a `pop` and create
`block`s in the optimization. We considered using `FindAll<Pop>` to make
it precise, but we decided the quadratic time complexity was not worth it.
Fixes #6918.
To avoid having two separate topological sort utilities in the code
base, replace remaining uses of the old DFS-based, CRTP topological sort
with the newer Kahn's algorithm implementation.
This would be NFC, except that the new topological sort produces a
different order than the old topological sort, so the output of some
passes is reordered.
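For reference, Kahn's algorithm over an index-based dependence graph looks like the following (a generic sketch, not the exact code in the utility):

```cpp
#include <cstddef>
#include <vector>

// Kahn's algorithm, where graph[i] lists the successors of vertex i.
std::vector<size_t>
kahnSort(const std::vector<std::vector<size_t>>& graph) {
  std::vector<size_t> inDegree(graph.size());
  for (auto& successors : graph) {
    for (size_t succ : successors) {
      ++inDegree[succ];
    }
  }
  // Start with every vertex whose dependencies are already satisfied.
  std::vector<size_t> ready;
  for (size_t i = 0; i < graph.size(); ++i) {
    if (inDegree[i] == 0) {
      ready.push_back(i);
    }
  }
  std::vector<size_t> order;
  while (!ready.empty()) {
    size_t curr = ready.back();
    ready.pop_back();
    order.push_back(curr);
    for (size_t succ : graph[curr]) {
      if (--inDegree[succ] == 0) {
        ready.push_back(succ);
      }
    }
  }
  return order; // shorter than graph.size() iff the graph had a cycle
}
```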
We were doing debug logging for every LEB byte. It turns out that the
isDebugEnabled() calls are expensive when called so frequently: in a
release+assertion build, even with debug disabled, these checks are the
highest thing in the profile. This PR removes the checks, which makes
binary reading 12% faster.
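For context, a generic unsigned LEB128 decoder is shown below (a sketch, not Binaryen's exact reader); the point of the change is that this hot loop now does nothing per byte except decode:

```cpp
#include <cstdint>
#include <vector>

// Generic unsigned LEB128 decoder. Note there is no per-byte logging or
// isDebugEnabled()-style check inside the loop. (Error handling for
// over-long encodings is omitted for brevity.)
uint64_t readU64LEB(const std::vector<uint8_t>& data, size_t& pos) {
  uint64_t result = 0;
  unsigned shift = 0;
  while (true) {
    uint8_t byte = data.at(pos++);
    result |= uint64_t(byte & 0x7F) << shift;
    if (!(byte & 0x80)) {
      break; // high bit clear: last byte of this value
    }
    shift += 7;
  }
  return result;
}
```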
That pass only cares about reference locals, and even among them, only ones
where we see a struct.new/array.new flow through locals. This makes the
pass 40% faster.
The pass optimizes loads and stores, so without a memory there is nothing to
do.
This only helps if the user set --low-memory-unused and also has no memory,
which is likely rare, but it's a trivial change so it seems worthwhile. In particular,
this pass constructs a LocalGraph, so the avoided work can be substantial.
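The early exit itself is tiny; schematically (with a minimal stand-in Module):

```cpp
#include <vector>

struct Memory {};
struct Module { std::vector<Memory> memories; }; // minimal stand-in

void runLoadStoreOptimizations(Module* module) {
  if (module->memories.empty()) {
    return; // no loads or stores can exist, so skip building a LocalGraph
  }
  // ... construct the LocalGraph and optimize loads/stores ...
}
```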
Unlike other module elements, types are not stored on the `Module`.
Instead, they are collected by traversing the IR before printing and
binary writing. The code that collects the types tries to optimize the
order of rec groups based on the number of times each type is used. As a
result, the output order of types generally has no relation to the input
order of types. In addition, most type optimizations rewrite the types
into a single large rec group, and the order of types in that group is
essentially arbitrary. Changes to the code for counting type uses,
sorting types, or sorting rec groups can yield very large changes in the
output order of types, producing test diffs that are hard to review and
potentially harming the readability of tests by moving output types away
from the corresponding input types.
To help make test output more stable and readable, introduce a tool
option that causes the order of output types to match the order of input
types as closely as possible. It is implemented by having the parsers
record the indices of the input types on the `Module` just like they
already record the type names. The `GlobalTypeRewriter` infrastructure
used by type optimizations associates the new types with the old indices
just like it already does for names and also respects the input order
when rewriting types into a large recursion group.
By default, wasm-opt and other tools clear the recorded type indices
after parsing the module, so their default behavior is not modified by
this change.
Follow-on PRs will use the new flag in more tests, which will generate
large diffs but leave the tests in stable, more readable states that
will no longer change due to other changes to the optimizing type
sorting logic.
As the name of a class, uppercase seems better here.
The LocalGraph there was used for two purposes:
1. Get the list of gets and sets.
2. Get only the reachable gets and sets.
It is trivial to get all the gets and sets in a much faster way, by just walking the
code as this PR does. The downside is that we also consider unreachable gets
and sets, so unreachable code can prevent us from optimizing, but that seems
worthwhile as many passes make that assumption (and they all become
maximally effective after --dce). That is the only non-NFC part here.
Removing LocalGraph + the fixup code for unreachability makes this
significantly shorter, and also 2-3x faster.
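A sketch of the direct walk using Binaryen's traversal API (the details may differ from the actual pass):

```cpp
#include <vector>

#include "wasm.h"
#include "wasm-traversal.h"

using namespace wasm;

// Collect every LocalGet/LocalSet by walking the function body directly,
// with no LocalGraph. Unreachable gets and sets are included, which is
// the conservative behavior described above.
struct GetSetCollector : PostWalker<GetSetCollector> {
  std::vector<LocalGet*> gets;
  std::vector<LocalSet*> sets;

  void visitLocalGet(LocalGet* curr) { gets.push_back(curr); }
  void visitLocalSet(LocalSet* curr) { sets.push_back(curr); }
};

// Usage:
//   GetSetCollector collector;
//   collector.walk(func->body);
```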
Do not even construct the Flower helper class until we actually need it.
This avoids even scanning the function and building the internal CFG if
we never get any API call that needs it.
This speeds up LICM by 50% (as now we never construct the CFG if we
don't find a loop), and Stack IR-enabled binary writing by 10% (as many
functions do not have locals in positions that can be optimized using
LocalGraph).
This moves |locations| from the base class to LocalGraph. It is not
needed in the lazy version, so that makes sense for now (we can't keep
it in the base, as then it would need to be mutable, which only makes
sense for laziness).
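The construction-on-demand pattern, schematically (the real classes hold more state; Flower's contents are elided here):

```cpp
#include <memory>

// Stand-in for the expensive helper: its constructor scans the function
// and builds the internal CFG.
struct Flower {
  Flower() { /* ... scan the function, build the CFG ... */ }
};

struct LazyLocalGraphSketch {
  Flower& getFlower() const {
    if (!flower) {
      flower = std::make_unique<Flower>(); // pay the cost only on demand
    }
    return *flower;
  }

private:
  // mutable so that logically-const queries can build the helper lazily
  mutable std::unique_ptr<Flower> flower;
};
```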
Fixes: https://github.com/WebAssembly/binaryen/issues/6779
This new API on lazy local graphs allows us to use laziness in another place,
StackIR opts. This makes writing the binary (which includes StackIR opts, when
those are enabled), 10% faster.
Now that the header includes topological sort utilities in addition to
the utility that iterates over all topological orders, it makes more
sense for it to be named topological_sort.h
This will allow both the old and new topological sort utilities to be
included into the same .cpp file while we phase out the old utility.
Before this change, replace lane was converting all the F16 lanes to F32
and then replacing one lane with the F16 (I32 representation) value, but
it did not then convert all the other lanes back to F16 (I32). To fix
this we can just leave the lanes as I32 and replace the one lane.
Note: previous replace lane tests didn't catch this since they started
with all-zero vectors, so the F32->I32 conversion didn't matter. Also, other
operations don't run into this issue since they iterate over all lanes
and convert the F32s back to F16 (I32).
---------
Co-authored-by: Alon Zakai <alonzakai@gmail.com>
This allows removing a reference field from all Java objects, reducing
the per-object memory and initialization overhead.
The pass is designed to run directly on the J2CL output before other
optimizations, since it relies on invariants that might get lost in
optimization. If the invariants don't hold, the pass aborts.
As with all type optimizations, MinimizeRecGroups only changes private
types, which are the only types that are safe to modify. However, it is
important for correctness that MinimizeRecGroups maintain separate
type identities for all types, whether public or private, to ensure that
casts that should differentiate two types cannot change behavior.
Previously the pass worked exclusively on private types, so there was
nothing preventing it from constructing a minimal rec group that
happened to have the same shape, and therefore the same type identity, as a
public rec group. #6886 exhibits a fuzzer test case where this happens
and changes the behavior of the program.
Fix the bug by recording all public rec group shapes and resolving
conflicts with those shapes by updating the shape of the conflicting
non-public type.
Fixes #6886.
We computed both get and set influences, but getGetInfluences() was
never called, so that work was entirely pointless.
This makes the pass 20% faster.
We previously incremented the use count for a declared supertype only if
it was also a type we had never seen before. Fix the count by treating
the supertype the same as any other type used in a type definition.
Update tests accordingly, including by manually moving input types
around to better match the output.
Many passes need to know both the set of all used types and also the
sets of private or public types. Previously there was no API to get both
at once, so getting both required two API calls that internally
collected all the types twice.
Furthermore, there are many reasons to collect heap types, and they have
different requirements about precisely which types need to be collected.
For example, in some edge cases the IR can reference heap types that do
not need to be emitted into a binary; passes that replace all types
would need to collect these types, but the binary writer would not. The
existing APIs for collecting types did not distinguish between these use
cases, so the code conservatively collected extra types that were not
always needed.
Refactor the type collecting code to expose a new API that takes a
description of which types need to be collected and returns the
appropriate types, their use counts, and optionally whether they are
each public or private.
Keep this change non-functional by commenting on places where the code
could be cleaned up or improved rather than actually making the changes.
Follow-up PRs will implement the improvements, which will necessarily
come with test changes.
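A hypothetical shape for such an API (names and fields invented for illustration; the real interface may differ):

```cpp
#include <unordered_map>

struct Module;   // stand-ins for the real IR types
struct HeapType;

// Callers describe which types they need...
struct TypeCollectionRequest {
  bool binaryTypesOnly = true;     // only types a binary must emit
  bool includeAllIRTypes = false;  // also types merely referenced by IR
  bool classifyPublicPrivate = false;
};

// ...and get back use counts, plus an optional public/private
// classification, from a single traversal of the module.
struct TypeUseInfo {
  unsigned useCount = 0;
  bool isPublic = false; // meaningful only if classification was requested
};

std::unordered_map<HeapType*, TypeUseInfo>
collectTypes(Module& module, const TypeCollectionRequest& request);
```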
LocalGraph by default will compute all the local.sets that can be read from all
local.gets. However, many passes query only a small subset of them. To
avoid wasted work, add a lazy mode that only computes sets when asked about
a get.
This is then used in a single place, LoopInvariantCodeMotion, which becomes
18% faster.
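Schematically, the lazy mode memoizes per get (stand-in types; the real computation is a flow analysis):

```cpp
#include <unordered_map>
#include <unordered_set>

struct LocalGet; // stand-ins for Binaryen's IR nodes
struct LocalSet;

struct LazySetsSketch {
  using Sets = std::unordered_set<LocalSet*>;

  // Sets are computed for a given get only when first asked for, then
  // cached; gets that are never queried cost nothing.
  const Sets& getSets(LocalGet* get) {
    auto [it, inserted] = cache.try_emplace(get);
    if (inserted) {
      computeSets(get, it->second); // flow analysis for just this get
    }
    return it->second;
  }

private:
  std::unordered_map<LocalGet*, Sets> cache;

  void computeSets(LocalGet* get, Sets& out) { /* ... */ }
};
```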
This saves memory and could in principle improve performance, although a
quick experiment with 30 samples on ReorderGlobals did not yield a
statistically significant improvement. At any rate, using Index is more
consistent with other parts of the code base.
Rec groups need to be topologically sorted for the output module to be
valid, but the specific order of rec groups also affects the module size
because types at lower indices require fewer bytes to reference. We
previously optimized for code size when gathering types by sorting the
list of groups before doing the topological sort. This was brittle,
though, and depended on implementation details of the topological sort
to be correct.
Replace the old topological sort with use of the new
`TopologicalSort::minSort` utility, which is a more principled method of
achieving a minimal topological sort with respect to some comparator.
Also draw inspiration from ReorderGlobals and apply an exponential
factor to take the users of a rec group into account when determining
its weight.
We previously computed both forward and reverse dependence graphs, but
one of them was only used for a single topological sort that could just
as well be computed by reversing the topological sort on the other
graph.
This renames the `Catch(All)_P3` enum, which denoted the old Phase 3
`catch(_all)` instructions, to `Catch(All)_Legacy`, which sounds clearer.
This is also to be consistent with
https://github.com/llvm/llvm-project/pull/107187.
Rather than finding the minimum sort with respect to the original order
of vertices, find the minimum sort with respect to an arbitrary
user-provided comparator. Users of the minSort utility previously had to
sort their input graphs according to their desired ordering, but now
they can simply provide their comparator instead.
Take advantage of the new functionality in ReorderGlobals and also
standardize on a single data type for representing dependence graphs to
avoid unnecessary conversions. Together, these changes slightly speed up
ReorderGlobals.
Move the topological sort code previously in a .cpp file into the header
so the comparator can be provided as a lambda template parameter instead
of as a `std::function`. This makes ReorderGlobals about 5% faster.
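A sketch of the header-only utility with the comparator as a template parameter (simplified to an index-based graph; not the utility's real signature):

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Because Cmp is a template parameter, a caller's lambda is inlined into
// the heap operations, rather than being wrapped in a type-erased
// std::function with an indirect call per comparison.
template<typename Cmp>
std::vector<size_t>
minSort(const std::vector<std::vector<size_t>>& graph, Cmp cmp) {
  std::vector<size_t> inDegree(graph.size());
  for (auto& succs : graph) {
    for (size_t s : succs) {
      ++inDegree[s];
    }
  }
  // Min-heap keyed by the user's comparator: among all currently
  // available vertices, always emit the cmp-least one next.
  auto heapCmp = [&](size_t a, size_t b) { return cmp(b, a); };
  std::priority_queue<size_t, std::vector<size_t>, decltype(heapCmp)>
    ready(heapCmp);
  for (size_t i = 0; i < graph.size(); ++i) {
    if (inDegree[i] == 0) {
      ready.push(i);
    }
  }
  std::vector<size_t> order;
  while (!ready.empty()) {
    size_t curr = ready.top();
    ready.pop();
    order.push_back(curr);
    for (size_t s : graph[curr]) {
      if (--inDegree[s] == 0) {
        ready.push(s);
      }
    }
  }
  return order;
}

// Usage: any callable works, and no std::function wrapper is built:
//   auto order = minSort(graph, [&](size_t a, size_t b) {
//     return weight[a] > weight[b];
//   });
```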
This replaces direct access to the data structure, graph.*influences[foo], with a call,
graph.get*influences(foo). This will allow a later PR to make those calls optionally
lazy.
This renames "delegate_br_target" to "delegate_trampoline". So how we
translate `try`-`delegate` is:
- Before:
```wast
(try $delegate_target
  ...
  (try
    (do
      ...
    )
    (delegate $delegate_target)
  )
  ...
)
```
- After:
```wast
(try_table $delegate_target
  (throw_ref
    (block $delegate_br_target (result exnref)
      ...
      (try_table (catch_all_ref $delegate_br_target)
        ...
      )
      ...
    )
  )
)
```
So `delegate_br_target` is the destination we branch (via `try_table`)
to, in order to rethrow the exnref using `throw_ref`.
But given that the translated code does not actually have a `br`, I
think this name can be confusing.
This renames `br_target` to `trampoline`, given that the block is what we
bounce the exnref off of to reach the real delegate target.
This is to be consistent with the variable names in the LLVM
implementation (which has not been submitted yet).
This does not use the CFG yet, so there is no benefit (and likely some small
slowdown). The next PR will actually use it to fix a correctness bug. This PR
only sets up the CFG and converts the pass to operate on it, without changing
any behavior or tests.
Followup to #6882
HeapStoreOptimization (#6882)
This just moves code out of OptimizeInstructions to the new pass. The existing
test is renamed and now runs the new pass instead. The new pass is run right
after each --optimize-instructions invocation, so it should not cause any
noticeable effects whatsoever, making this NFC.
The motivation here is that there is a bug in the pass, see the new testcase
added at the end, which shows the bug. It is not practical to fix that bug in
OptimizeInstructions since we need more than peephole optimizations to do
so. This PR moves the code to a new pass so we can fix it there properly,
later.
The new pass is named HeapStoreOptimization since the same infrastructure
we will need to fix the bug will also help dead store elimination and related
things.
This pass may do multiple iterations, and before this PR it scanned the entire
module each time. That is simpler than tracking stale data, but it can be quite
slow. This PR adds staleness tracking, which makes it over 3x faster (and this
can be one of our slowest passes in some cases, so this is significant).
To achieve this:
* Add a staleness marker on function info.
* Rewrite how we track unseen calls. Previously we used atomics in a clever way;
  now we just accumulate the data in a simple way (easier for staleness tracking).
* Add staleness invalidation in the proper places.
* Add a param to localizeCallsTo to allow us to learn when a function is changed.
This kind of staleness analysis is usually not worthwhile, but given the 3x plus
speedup it seems justified. I fuzzed it directly, and also any staleness bug
can lead to validation errors, so normal fuzzing also gives us good coverage here.
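Schematically, the staleness tracking looks like this (names invented for illustration):

```cpp
#include <unordered_map>

struct Function; // stand-in for the IR type

struct FuncInfo {
  bool stale = true; // summary must be (re)computed before use
  // ... summary data: calls seen, effects, etc. ...
};

std::unordered_map<Function*, FuncInfo> infos;

// Called whenever a function's body changes, e.g. from the new
// localizeCallsTo callback that reports modified functions.
void invalidate(Function* func) { infos[func].stale = true; }

void runIteration() {
  for (auto& [func, info] : infos) {
    if (!info.stale) {
      continue; // summary still valid: skip re-scanning this function
    }
    // ... re-scan the function and rebuild info's summary ...
    info.stale = false;
  }
}
```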
Specified at
https://github.com/WebAssembly/half-precision/blob/main/proposals/half-precision/Overview.md
A few notes:
- The F32x4 and F64x2 versions of madd and nmadd are missing spec tests.
- For madd, the implementation was incorrectly doing `(b*c)+a` where it
should be `(a*b)+c`.
- For nmadd, the implementation was incorrectly doing `(-b*c)+a` where
it should be `-(a*b)+c`.
- There doesn't appear to be a great way to actually implement a fused
nmadd, but the spec allows the double rounded version I added.
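The corrected per-lane semantics, written as scalar reference code (a sketch; the real implementation operates on the I32-represented F16 lanes):

```cpp
#include <cmath>

// Reference semantics per lane. The bug: madd computed (b * c) + a and
// nmadd computed (-b * c) + a.
float madd(float a, float b, float c) {
  return (a * b) + c; // double-rounded; std::fma(a, b, c) would be fused
}

float nmadd(float a, float b, float c) {
  return -(a * b) + c; // negate the product, not one of the inputs
}
```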
Previously they were structs and their results were accessed with
`operator*()`, but that was unnecessarily complicated and could lead to
problems with temporary lifetimes being too short. Simplify the
utilities by making them functions. This also allows the wrapper
templates to infer the proper element types automatically.
Use the new TopologicalSort and MinTopologicalSortOf utilities instead
of the old CRTP topological sort utility and a bespoke heap-based
topological sort in ReorderGlobals. Since there is no longer a heap to
pop from, the direction of the custom comparator is now much more
intuitive.
Further simplify the code by switching from tracking the new order of
globals using a sequence of new indices to tracking the order using a
sequence of old indices.
This change also makes the pass about 20% faster on a large real-world
module.
Reuse the code implementing Kahn's topological sort algorithm with a new
configuration that uses a min-heap to always choose the best available
element.
Also add wrapper utilities that can find topological sorts of graphs
with arbitrary element types, not just indices.
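The wrapper idea, sketched over a stand-in core sort (names are illustrative, not the utilities' real signatures):

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Core index-based sort, e.g. the Kahn's algorithm sketch shown earlier
// in this log; declared here only.
std::vector<size_t>
topologicalSort(const std::vector<std::vector<size_t>>& graph);

// Thin adapter: map arbitrary keys to dense indices, run the index-based
// sort, and map the result back.
template<typename T>
std::vector<T> sortOf(const std::map<T, std::vector<T>>& graph) {
  std::vector<T> keys;
  std::map<T, size_t> indices;
  for (auto& entry : graph) {
    indices.emplace(entry.first, keys.size());
    keys.push_back(entry.first);
  }
  std::vector<std::vector<size_t>> indexGraph(keys.size());
  for (auto& [key, children] : graph) {
    for (auto& child : children) {
      indexGraph[indices.at(key)].push_back(indices.at(child));
    }
  }
  std::vector<T> result;
  for (size_t i : topologicalSort(indexGraph)) {
    result.push_back(keys[i]);
  }
  return result;
}
```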
Before we just had a map that people would access with localGraph.getSetses[get],
while now it is a call localGraph.getSets(get), which more nicely hides the internal
implementation details.
Also rename getSetses => getSetsMap.
This will allow a later PR to optimize the internals of this API.
This is performance-neutral as far as I can measure. (We do replace a direct read
from a data structure with a call, but the call is in a header and should always get
inlined.)
The instructions relaxed_fma and relaxed_fnma have been renamed to
relaxed_madd and relaxed_nmadd.
https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md#binary-format
The parser function for `action` returned a `MaybeResult`, but we were
treating it as returning a normal `Result` and not checking that it had
contents in several places. Replace the current `action()` with
`maybeAction()` and add a new `action()` that requires the action to be
present.
Fixes #6872.
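Schematically, the fix distinguishes an optional parse from a required one (simplified types here, not the parser's real Result/MaybeResult):

```cpp
#include <optional>
#include <string>

struct Action { /* ... */ };

using MaybeAction = std::optional<Action>;

// The maybe-parser returns an empty result when no action is present at
// the cursor; callers must check for contents.
MaybeAction maybeAction() {
  // ... parse an action if one is present, else ...
  return std::nullopt;
}

// The required variant turns absence into an error instead of silently
// treating an empty result as success.
Action action(std::string& err) {
  if (auto parsed = maybeAction()) {
    return *parsed;
  }
  err = "expected action";
  return Action{};
}
```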