| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
0x7800 as FP16 is 32,768 not 32,767 as I had in the comment.
I also added another comment to explain the somewhat confusing behavior.
|
|
|
|
|
|
|
|
| |
This just moves the code out into a function. A later PR will use it in another place.
Add a test that shows the motivation for that later PR: we fail to optimize away
a block return value at the top level of a function. Fixing that will involve calling
the new function here from another place.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
E.g.
(try_table (catch_all $catch)
(throw $e)
)
=>
(try_table (catch_all $catch)
(br $catch)
)
This can then allow other passes to remove the TryTable, if no throwing
things remain.
|
|
|
|
|
| |
The blocking bug https://issues.chromium.org/issues/332931390 has been
fixed.
|
|
|
|
|
|
|
| |
Support 5-segment source mappings, which add a name.
Reference:
https://github.com/tc39/source-map/blob/main/source-map-rev3.md#proposed-format
|
|
|
|
| |
Some versions of libcxx or clang error without this, apparently due to
Type being a forward declaration.
|
|
|
|
|
|
|
| |
This raises the number of locals accepted by the binary parser to the
absolute limit in the spec. A warning is now printed when writing a
binary file if the Web limit of 50,000 locals is exceeded.
Fixes #6968.
|
|
|
|
|
|
|
|
|
|
|
| |
When we precompute something, we try to avoid allocating a new copy. That's important to
avoid many allocations each time we run Precompute - otherwise, each time we see a br
we'd allocate a fresh one, and for its values. But we had a bug where we reused a RefFunc
as the value of a br without updating the type. It's actually tricky to reach a situation where
we find a RefFunc to reuse and it is different from the actual one we want, but the fuzzer
found one.
Fixes the fuzz bug reported on #6845 (but unrelated to that PR).
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Note: FP16 is a little different from F32/F64 since it can't represent
the full 2^16 integer range. 65504 is the max whole integer. This leads
to some slightly strange behavior when converting integers greater than
65504 since they become infinity.
Specified at
https://github.com/WebAssembly/half-precision/blob/main/proposals/half-precision/Overview.md
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Inlining was careful about nested calls like this:
(call $a
(call $b)
)
If we inlined the outer call first, we'd have
(block $inlined-code-from-a
..code..
(call $b)
)
After that, the inner call is a child of a block, not of a call. That is,
we've moved the inner call to another parent. To replace that
inner call when we inline, we'd need to update the new parent,
which would require work. To avoid that work, the pass simply
created a block in the middle:
(call $a
(block
(call $b)
)
)
Now the inner call's immediate parent will not change when we
inline the outer call.
However, it turns out that this was entirely unnecessary. We find
the calls using a post-order traversal, and we store the actions in
a vector that we traverse in order, so we only ever process things
in the optimal order of children before parents. And in that order
there is no problem: inlining the inner call first leads to
(call $a
(block $inlined-code-from-b
(..code..)
)
)
That does not affect the outer call's parent.
This PR removes the creation of the unnecessary blocks. This doesn't
improve the final output as optimizations remove the unneeded
blocks later anyhow, but it does make the code simpler and a little
faster. It also makes debugging less confusing. But this is not truly
NFC because --inlining (but not --inlining-optimizing) will actually
emit fewer blocks now (but only --inlining-optimizing is used by
default in production).
The diff on tests here is very small when ignoring whitespace. The
remaining differences are just emitting fewer obviously-unneeded
blocks. There is also one test that needed manual changes,
inlining-eh-legacy, because it tested that we do Pop fixups, but
after emitting one fewer block, those fixups were not needed. I
added a new test there with two nested calls, which does end up
needing those fixups. I also added such a test in
inlining_all-features so that we have coverage for such nested
calls (we might remove the eh-legacy file some day, and other
existing tests with nested calls that I found were more complex).
|
|
|
|
| |
In WasmGC modules there is often no memory at all, and we can skip
walking the code in this pass in such cases.
|
|
|
|
|
|
| |
This makes it slightly faster.
Followup to
https://github.com/WebAssembly/binaryen/pull/6953#discussion_r1765313401
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We may inline multiple times into a single function. Previously, if we did so, we
did the "fixups" such as ReFinalize and non-nullable local fixes once per such
inlining. But that is wasteful as each ReFinalize etc. scans the whole function,
and could be done after we copy all the code from all the inlinings, which is
what this PR does: it splits doInlining() into one function that inlines code and
one that does the updates after, and the update is done after all inlinings.
This turns out to be very important, a 5x speedup on two large real-world wasm
files I am looking at. The reason is that we actually inline more than once in
half the cases, and sometimes far more - in one case we inline over 1,000
times into a function! (and ReFinalized 1,000 times too many)
This is practically NFC, but it turns out that there are some tiny noticeable
differences between running ReFinalize once at the end vs. once after each
inlining. These differences are not really functional or observable in the
behavior of the code, and optimizations would remove them anyhow, but
they are noticeable in two tests here. The changes to tests are, in order:
* Different block names, just because the counter we use sees more things.
* In a testcase with unreachable code, we inline twice into a function, and
the first inlining brings in an unreachable, and ReFinalizing early will lead
to it propagating differently than if we wait to ReFinalize. (It actually leads
to another cycle of inlining in that case, as a fluke.)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We already parallelized collecting info about functions and finding
inlining opportunities, but the actual inlining - copying the code into
the target function - was done sequentially. It turns out that a lot of
work happens there: this PR makes the pass over 2x faster.
Details:
* Move nameHint to InliningAction, so it is there when we apply the
actions.
* Add a DoInlining internal pass which calls doInlining().
* Refactor OptUtils a little to make it easy to run DoInlining before
opts.
* Also remove the return value of doInlining which was not used.
|
|
|
|
|
|
|
|
|
| |
The purpose of the datacount section is to pre-declare how many data
segments there will be so that engines can allocate space for them
and not have to back patch subsequent instructions in the code section
that refer to them. Once we use IRBuilder in the binary parser, we will
have to have the data segments available by the time we parse
instructions that use them, so eagerly construct the data segments when
parsing the datacount section.
|
|
|
|
|
|
|
|
| |
In preparation for using IRBuilder in the binary parser, eagerly create
Functions when parsing the function section so that they are already
created once we parse the code section. IRBuilder will require the
functions to exist when parsing calls so it can figure out what type
each call should have, even when there is a call to a function whose
body has not been parsed yet.
|
|
|
| |
This makes the pass 15% faster.
|
|
|
| |
This makes the pass 20% faster.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before, a br would send its value to a BreakTargetLocation. That would then be
linked to the target:
{ br's value } => BreakTargetLocation(target name) => { location of target }
This PR skips the middle:
{ br's value } => { location of target }
It just connects breaks directly to the targets. We can do that if we keep a
map of the targets as we go.
This is 2% faster as well as simplifies the code, as an NFC refactoring. But it
also fixes a bug: we have handling on ExpressionLocation that filters values as they
come in (they must accord with the expression's type). We were not doing
that on BreakTargetLocation, leading to an assert. Removing
BreakTargetLocation entirely is easier and better than adding filtering logic
for it.
Fixes #6955
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a struct.get or array.get is optimized to have a null reference
operand, its return type loses meaning since the operation will always
trap. Previously when refinalizing such expressions, we just left their
return type unchanged since there was no longer an associated struct or
array type to calculate it from. However, this could lead to a strange
setup where the stale return type was the last remaining use of some
heap type in the module. That heap type would never be emitted in the
binary, but it was still used in the IR, so type optimizations would
have to keep updating it. Our type collecting logic went out of its way
to include the return types of struct.get and array.get expressions to
account for this strange possibility, even though it otherwise collected
only types that would appear in binaries.
In principle, all of this should have applied to `call_ref` as well, but
the type collection logic did not have the necessary special case, so
there was probably a latent bug there.
Get rid of these special cases in the type collection logic and make it
impossible for the IR to use a stale type that no longer appears in the
binary by updating such stale types during finalization. One possibility
would have been to make the return types of null accessors unreachable,
but this violates the usual invariant that unreachable instructions must
either have unreachable children or be branches or `(unreachable)`.
Instead, refine the return types to be uninhabitable non-nullable
references to bottom, which is nearly as good as refining them directly
to unreachable.
We can consider refining them to `unreachable` in the future, but
another problem with that is that it would currently allow the parsers
to admit more invalid modules with arbitrary junk after null accessor
instructions.
|
|
|
|
|
|
|
| |
This avoids creating a large Literals (SmallVector of Literal) and then copying it. All the places
that construct GCData do not need the Literals afterwards.
This gives a 7% speedup on the --precompute benchmark from #6931
|
|
|
|
|
|
|
|
| |
The module splitting utility has a configuration option for minimizing
new export names, but it was previously applied only to newly exported
functions. Using the new multi-split mode can produce lots of exported
tables and splitting WasmGC programs can produce lots of exported
globals, so minimizing these export names can have a big impact on code
size.
|
|
|
|
|
|
|
|
| |
The configuration for the module splitting utility previous took a set
of functions to keep in the primary module. Change it to take a list of
functions to split into the secondary module instead. This improves the
code quality in multi-split mode because it keeps stub functions
generated by previous splits from being moved into secondary modules
during later splits.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Maintain the invariant that every defined functions belongs to either
the set of kept functions or the set of split functions. Functions are
kept by default except when --keep-funcs is specified without
--split-funcs on the command line. This is mostly NFC except that it
changes the default behavior when no arguments are specified on the
command line to keep all functions.
This will simplify a follow-on PR that switches from passing the kept
functions to the module splitting utility to passing the split
functions.
|
|
|
|
|
| |
We emit a select between two objects when only two objects exist of a
particular type. However, if the field is packed, we did not handle truncating the
written values.
|
|
|
|
|
|
|
|
|
| |
Rather than analyze what module elements from the primary module a
secondary module will need, the splitting logic conservatively imports
all module elements from the primary module into the secondary module.
Run RemoveUnusedElements on the secondary module to remove any of these
imports that happen to be unnecessary. Leave a TODO mentioning the
possibility of being more selective about which module elements get
exported to reduce code size in the primary module, too.
|
|
|
|
|
|
|
| |
Add a mode that splits a module into arbitrarily many parts based on a
simple manifest file. This is currently implemented by splitting out one
module at a time in a loop, but this could change in the future if
splitting out all the modules at once would improve the quality of the
output.
|
|
|
|
|
|
|
|
|
|
|
| |
In the WebAssembly text format, strings can generally be arbitrary
bytes, but identifiers must be valid UTF-8. Check for UTF-8 validity
when parsing string-style identifiers in the lexer.
Update StringLowering to generate valid UTF-8 global names even for
strings that may not be valid UTF-8 and test that text round tripping
works correctly after StringLowering.
Fixes #6937.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
methods (#6936)
This just moves code around. As a result, isRef() vanishes entirely from the
profiling traces in #6931, since now the core isRef/Tuple/etc. methods are
all inlineable.
This also required some reordering of wasm-type.h, namely to move HeapType
up front. No changes to that class otherwise.
TypeInfo is now in the header. getTypeInfo is now a static method on Type.
This has the downside of moving internal details into the header, and it may
increase compile time a little. The upside is making the --precompute benchmark
from #6931 significantly faster, 33%, and it will also help the many
Type::isNonNullable() etc. calls we have scattered around the codebase in
other passes too.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Wasm-split generally assumes that calls to secondary functions made
before the secondary module has been loaded and instantiated should go
to imported placeholder functions that can be responsible for loading
the secondary module and forwarding the call to the loaded function.
That scheme makes the loading entirely transparent from the
application's point of view, which is not always a good thing. Other
schemes would make it impossible for a secondary function to be called
before the secondary module has been explicitly loaded, in which case
the placeholder functions would never be called. To improve code size
and simplify instantiation under these schemes, add a new
`--no-placeholders` option that skips adding imported placeholder
functions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are a few heap types that are hard-coded to be considered public
and therefore allowed on module boundaries even in --closed-world mode,
specifically to support js-string-builtins. We previously considered
both open and closed (i.e. final) mutable i8 arrays to be public in this
manner, but js-string-builtins only uses the closed versions, so remove
the open versions.
This fixes a particular bug in which Unsubtyping optimized a private
array type to be equivalent to an ignorable public array type,
incorrectly changing the behavior of a cast, but it does not address the
larger problem of optimizations producing types that are equivalent to
public types. Add a TODO about that problem for now.
Fixes #6935.
|
| |
|
|
|
|
| |
`wasm_of_ocaml` uses Binaryen's `wasm_as`, `wasm_merge` and `wasm_opt`
as a part of its toolchain.
|
|
|
|
|
|
|
| |
To do this, add locations and getInfluences to LazyLocalGraph. Both cannot
really be computed in a fine-grained manner, so just compute them all on the
first request. That is not as efficient as our lazy computation of getSets and
setInfluences, but they are also less important, and this change makes the
pass 20% faster.
|
|
|
|
|
|
| |
Followup to #6928. If the fallthrough precomputes to an invalid type, it
makes no sense to precompute the non-fallthrough, as it can't return anything
useful.
|
|
|
|
|
|
|
|
|
|
| |
The error checking we had to report an error when the input contains
block parameters was in a code path that is no longer executed under
normal circumstances. Specifically, it was part of the
`ParseModuleTypesCtx` phase of parsing, which no longer parses function
bodies. Move the error checking to the `ParseDefsCtx` phase, which does
parse function bodies.
Fixes #6929.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
That is the slowest part of --precompute-propagate, which is one of our slowest
passes. This makes that pass 25% faster.
The main change is to make the maps consider missing elements as non-constant,
rather than storing a None element in them. That saves allocating entries for the
common case of a non-constant set/get.
Also switch to a simple vector for the work queue, which is possible if we only
add work when a get/set becomes constant (which can only happen once). Another
benefit to adding work in that manner is that we don't start by adding every single
get/set as "work" at the start.
|
|
|
|
|
|
|
|
|
|
|
| |
#6400 fixed this case but that handled only when a `pop` is an
immediate child of the current expression, but a `pop` can be nested
deeper down.
We conservatively run the EH fixup whenever we have a `pop` and create
`block`s in the optimization. We considered using `FindAll<Pop>` to make
it precise, but we decided the quadratic time plexity was not worth it.
Fixes #6918.
|
|
|
|
|
|
|
|
|
| |
To avoid having two separate topological sort utilities in the code
base, replace remaining uses of the old DFS-based, CRTP topological sort
with the newer Kahn's algorithm implementation.
This would be NFC, except that the new topological sort produces a
different order than the old topological sort, so the output of some
passes is reordered.
|
|
|
|
|
|
|
|
| |
We were doing a debug logging for every LEB byte. It turns out that the
isDebugEnabled() calls are expensive when called so frequently: in a
release+assertion build, even with debug disabled, these checks are the
highest thing in the profile. This PR removes the checks, which makes
binary reading 12% faster.
|
|
|
|
|
| |
That pass only cares about reference locals, and even among them, only ones
that we see a struct.new/array.new that flows through locals. This makes the
pass 40% faster.
|
|
|
|
|
|
|
|
| |
The pass optimizes loads and stores, so without a memory there is nothing to
do.
This only helps if the user set --low-memory-unused and also has no memory,
which is likely rare, but it's a trivial change so it seems worthwhile. In particular
this pass constructs a LocalGraph, so if we can avoid work it can be substantial.
|
|
|
|
|
| |
Update the remaining tests whose readability will be affected by the
removal of the old topological sort in #6902, no matter how small their
diffs would have been.
|
|
|
|
|
|
| |
These are the tests that would otherwise have the largest diffs when
changing the topological sort used to sort types.
signature-refining_gto.wat also cannot be automatically updated, so
there is extra benefit to making sure it has stable output.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unlike other module elements, types are not stored on the `Module`.
Instead, they are collected by traversing the IR before printing and
binary writing. The code that collects the types tries to optimize the
order of rec groups based on the number of times each type is used. As a
result, the output order of types generally has no relation to the input
order of types. In addition, most type optimizations rewrite the types
into a single large rec group, and the order of types in that group is
essentially arbitrary. Changes to the code for counting type uses,
sorting types, or sorting rec groups can yield very large changes in the
output order of types, producing test diffs that are hard to review and
potentially harming the readability of tests by moving output types away
from the corresponding input types.
To help make test output more stable and readable, introduce a tool
option that causes the order of output types to match the order of input
types as closely as possible. It is implemented by having the parsers
record the indices of the input types on the `Module` just like they
already record the type names. The `GlobalTypeRewriter` infrastructure
used by type optimizations associates the new types with the old indices
just like it already does for names and also respects the input order
when rewriting types into a large recursion group.
By default, wasm-opt and other tools clear the recorded type indices
after parsing the module, so their default behavior is not modified by
this change.
Follow-on PRs will use the new flag in more tests, which will generate
large diffs but leave the tests in stable, more readable states that
will no longer change due to other changes to the optimizing type
sorting logic.
|
|
|
|
| |
As the name of a class, uppercase seems better here.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The LocalGraph there was used for two purposes:
1. Get the list of gets and sets.
2. Get only the reachable gets and sets.
It is trivial to get all the gets and sets in a much faster way, by just walking the
code as this PR does. The downside is that we also consider unreachable gets
and sets, so unreachable code can prevent us from optimizing, but that seems
worthwhile as many passes make that assumption (and they all become
maximally effective after --dce). That is the only non-NFC part here.
Removing LocalGraph + the fixup code for unreachability makes this
significantly shorter, and also 2-3x faster.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Do not even construct the Flower helper class until we actually need it.
This avoids even scanning the function and building the internal CFG if
we never get any API call that needs it.
This speeds up LICM by 50% (as now we never construct the CFG if we
don't find a loop), and Stack IR-enabled binary writing by 10% (as many
functions do not have locals in positions that can be optimized using
LocalGraph).
This moves |locations| from the base class to LocalGraph. It is not
needed in the lazy version, so that makes sense for now (we can't keep
it in the base, as then it would need to be mutable, which only makes
sense for laziness).
|