| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
Before this fix, the first table (index 0) is counted as its element segment
having "no table index" even when its type is not funcref, which could break
things if that table had a more specialized type.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(call_indirect
..args..
(select
(i32.const x)
(i32.const y)
(condition)
)
)
=>
(if
(condition)
(call $func-for-x
..args..
)
(call $func-for-y
..args..
)
)
To do this we must reorder the condition with the args, and also use
the args more than once, so place them all in locals.
This works towards the goal of polymorphic devirtualization, that is,
turning an indirect call of more than one possible target into more
than one direct call.
|
|
|
| |
Rather than load from the table and call that reference, call using the table.
|
| |
|
|
|
|
|
|
|
|
| |
- fpcast-emu.wast
- generate-dyncalls_all-features.wast
- generaite-i64-dyncalls.wast
- instrument-locals_all-features_disable-typed-function-references.wast
- instrument-memory.wast
- instrument-memory64.wast
|
|
|
|
| |
Adds the part of the spec test suite that this passes (without table.set we
can't do it all).
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that they are all implemented, we can optimize them. This removes the
big if that ignored static operations, and implements things for them.
In general this matches the existing rtt-using case, but there are a few things
we can do better, which this does:
* A cast of a subtype to a type always succeeds.
* A test of a subtype to a type is always 1 (if non-nullable).
* Repeated static casts can leave just the most demanding of them.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A SmallSet starts with fixed storage that it uses in the simplest
possible way (linear scan, no sorting). If it exceeds a size then it
starts using a normal std::set. So for small amounts of data it
avoids allocation and any other overhead.
This adds a unit test and also uses it in LocalGraph which provides
a large amount of additional coverage.
I also changed an unrelated data structure from std::map to
std::unordered_map which I noticed while doing profiling in
LocalGraph. (And a tiny bit of additional refactoring there.)
This makes LocalGraph-using passes like ssa-nomerge and
precompute-propagate 10-15% faster on a bunch of real-world
codebases I tested.
|
|
|
|
|
|
|
|
|
| |
This clang-formats c/cpp files in test/ directory, and updates
clang-format-diff.sh so that it does not ignore test/ directory anymore.
bigswitch.cpp is excluded from formatting, because there are big
commented-out code blocks, and apparently clang-format messes up
formatting in them. Also to make matters worse, different clang-format
versions do different things on those commented-out code blocks.
|
|
|
|
|
| |
Locally I saw a 10% speedup on j2cl but reports of regressions have
arrived, so let's disable it for now pending investigation. The option added
here should make it easy to experiment.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously the set of functions to keep was initially empty, then the profile
added new functions to keep, then the --keep-funcs functions were added, then
the --split-funcs functions were removed. This method of composing these
different options was arbitrary and not necessarily intuitive, and it prevented
reasonable workflows from working. For example, providing only a --split-funcs
list would result in all functions being split out not matter which functions
were listed.
To make the behavior of these options, and --split-funcs in particular, more
intuitive, disallow mixing them and when --split-funcs is used, split out only
the listed functions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Precompute has a mode in which it propagates results from local.sets to
local.gets. That constructs a LocalGraph which is a non-trivial amount of
work. We used to run multiple iterations of this, but investigation shows that
such opportunities are extremely rare, as doing just a single propagation
iteration has no effect on the entire emscripten benchmark suite, nor on
j2cl output. Furthermore, we run this pass twice in the normal pipeline (once
early, once late) so even if there are such opportunities they may be
optimized already. And, --converge is a way to get additional iterations of
all passes if a user wants that, so it makes sense not to costly work for more
iterations automatically.
In effect, 99.99% of the time before this pass we would create the LocalGraph
twice: once the first time, then a second time only to see that we can't
actually optimize anything further. This PR makes us only create it once, which
makes precompute-propagate 10% faster on j2cl and even faster on other things
like poppler (33%) and LLVM (29%).
See the change in the test suite for an example of a case that does require
more than one iteration to be optimized. Note that even there, we only manage
to get benefit from a second iteration by doing something that overlaps with
another pass (optimizing out an if with condition 0), which shows even more
how unnecessary the extra work was.
See #4165
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
if (A) {
if (B) {
C
}
}
=>
if (A ? B : 0) {
C
}
when B has no side effects, and is fast enough to consider running
unconditionally. In that case, we replace an if with a select and a
zero, which is the same size, but should be faster and may be
further optimized.
As suggested in #4168
|
| |
|
|
|
|
|
|
|
|
|
| |
See #4149
This modifies the test added in #4163 which used static casts on
dynamically-created structs and arrays. That was technically not
valid (as we won't want users to "mix" the two forms). This makes that
test 100% static, which both fixes the test and gives test coverage
to the new instructions added here.
|
|
|
|
|
|
|
|
|
|
|
|
| |
We added an optional ReFinalize in OptimizeInstructions at some point,
but that is not valid: The ReFinalize only updates types when all other
works is done, but the pass works incrementally. The bug the fuzzer found
is that a child is changed to be unreachable, and then the parent is
optimized before finalize() is called on it, which led to an assertion being
hit (as the child was unreachable but not the parent, which should also
be).
To fix this, do not change types in this pass. Emit an extra block with a
declared type when necessary. Other passes can remove the extra block.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
These variants take a HeapType that is the type we intend to cast to,
and do not take an RTT.
These are intended to be more statically optimizable. For now though
this PR just implements the minimum to get them parsing and to get
through the optimizer without crashing.
Spec: https://docs.google.com/document/d/1afthjsL_B9UaMqCA5ekgVmOm75BVFu6duHNsN9-gnXw/edit#
See #4149
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR helps with functions like this:
function foo(x) {
if (x) {
..
lots of work here
..
}
}
If "lots of work" is large enough, then we won't inline such a
function. However, we may end up calling into the function
only to get a false on that if and immediately exit. So it is useful
to partially inline this function, basically by creating a split
of it into a condition part that is inlineable
function foo$inlineable(x) {
if (x) {
foo$outlined();
}
}
and an outlined part that is not inlineable:
function foo$outlined(x) {
..
lots of work here
..
}
We can then inline the inlineable part. That means that a call
like
foo(param);
turns into
if (param) {
foo$outlined();
}
In other words, we end up replacing a call and then a check with
a check and then a call. Any time that the condition is false, this
will be a speedup.
The cost here is increased size, as we duplicate the condition
into the callsites. For that reason, only do this when heavily
optimizing for size.
This is a 10% speedup on j2cl. This helps two types of functions
there: Java class inits, which often look like "have I been
initialized before? if not, do all this work", and also assertion
methods which look like "if the input is null, throw an
exception".
|
|
|
|
|
|
|
|
|
|
| |
If we can remove such traps, we can remove ref.as_non_null if the local
type is nullable anyhow.
If we support non-nullable locals, however, then do not do this, as it
could inhibit specializing the local type later. Do the same for tees which
we had existing code for.
Background: #4061 (comment)
|
|
|
|
|
|
| |
This means that when check.py tries to run the lit tests with
BINARYEN_PASS_DEBUG, this is now correctly reflected in the tests. Manually
validated to catch the bug identified in
https://github.com/WebAssembly/binaryen/pull/4130#discussion_r709619855.
|
|
|
|
|
|
|
|
|
| |
That PR reused the same node twice in the output, which fails on the
assertion in BINARYEN_PASS_DEBUG=1 mode.
No new test is needed because the existing test suite fails already
in that mode. That the PR managed to land seems to say that we are
not testing pass-debug mode on our lit tests, which we need to
investigate.
|
|
|
|
|
|
| |
Avoids a crash in calling getHeapType when there isn't one.
Also add the relevant lit test (and a few others) to the list of files to
fuzz more heavily.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is reland of #3071
Do similar optimizations as in #3038 but for memory.fill.
`memory.fill(d, v, 0)` ==> `{ drop(d), drop(v) }` only with `ignoreImplicitTraps` or `trapsNeverHappen`
`memory.fill(d, v, 1)` ==> `store8(d, v)`
Further simplifications can be done only if v is constant because otherwise binary size would increase:
`memory.fill(d, C, 1)` ==> `store8(d, (C & 0xFF))`
`memory.fill(d, C, 2)` ==> `store16(d, (C & 0xFF) * 0x0101)`
`memory.fill(d, C, 4)` ==> `store32(d, (C & 0xFF) * 0x01010101)`
`memory.fill(d, C, 8)` ==> `store64(d, (C & 0xFF) * 0x0101010101010101)`
`memory.fill(d, C, 16)` ==> `store128(d, i8x16.splat(C & 0xFF))`
|
|
|
|
|
|
|
|
|
|
|
|
| |
tablify() attempts to turns a sequence of br_ifs into a single
br_table. This PR adds some flexibility to the specific pattern it
looks for, specifically:
* Accept i32.eqz as a comparison to zero, and not just to look
for i32.eq against a constant.
* Allow the first condition to be a tee. If it is, compare later
conditions to local.get of that local.
This will allow more br_tables to be emitted in j2cl output.
|
|
|
|
|
|
|
|
|
|
|
|
| |
If all a select's inputs are boolean, we can sometimes turn the select
into an AND or an OR operation,
x ? y : 0 => x & y
x ? 1 : y => x | y
I believe LLVM aggressively canonicalizes to this form. It makes sense
to do here too as it is smaller (save the constant 0 or 1). It also allows
further optimizations (which is why LLVM does it) but I don't think we
have those yet.
|
|
|
|
|
|
|
| |
See also:
spec change: https://github.com/WebAssembly/tool-conventions/pull/170
llvm change: https://reviews.llvm.org/D109595
wabt change: https://github.com/WebAssembly/wabt/pull/1707
emscripten change: https://github.com/emscripten-core/emscripten/pull/15019
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An "intrinsic" is modeled as a call to an import. We could also add new
IR things for them, but that would take more work and lead to less
clear errors in other tools if they try to read a binary using such a
nonstandard extension.
A first intrinsic is added here, call.without.effects This is basically the same
as call_ref except that the optimizer is free to assume the call has no
side effects. Consequently, if the result is not used then it can be optimized
out (as even if it is not used then side effects could have kept it around).
Likewise, the lack of side effects allows more reordering and other
things.
A lowering pass for intrinsics is provided. Rather than automatically
lower them to normal wasm at the end of optimizations, the user must
call that pass explicitly. A typical workflow might be
-O --intrinsic-lowering -O
That optimizes with the intrinsic present - perhaps removing calls
thanks to it - then lowers it into normal wasm - it turns into a call_ref -
and then optimizes further, which would turns the call_ref into a
direct call, potentially inline, etc.
|
|
|
|
|
|
|
| |
array.init is like array.new_with_rtt except that it takes
as arguments the values to initialize the array with (as opposed to
a size and an optional initial value).
Spec: https://docs.google.com/document/d/1afthjsL_B9UaMqCA5ekgVmOm75BVFu6duHNsN9-gnXw/edit#
|
|
|
|
|
|
|
|
|
| |
MergeBlocks was written a very long time ago, before the
iteration API, so it had a bunch of hardcoded things for
specific instructions. In particular, that did not handle GC.
This does a small refactoring to use iteration. The refactoring
is NFC, but while doing so it adds support for new relevant
instructions, including wasm GC.
|
|
|
|
|
|
|
|
|
|
|
|
| |
```ts
-x * -y => (x * y)
-x * y => -(x * y)
x * -y => -(x * y), if x != C && y != C
-x * C => x * -C, if C != C_pot || shrinkLevel != 0
-x * C => -(x * C), otherwise
```
We are skipping propagation when lhs and rhs are constants because this should handled by constant folding. Also skip cases like `-x * 4 -> x * -4` for `shrinkLevel != 0`, as this will be further converted to `-(x << 2)`.
|
|
|
|
|
|
| |
We can assume that imported memories (and the profiling data they contain) are
already accessible from the module's environment, so there's no need to export
them. This also avoids needing to add knowledge of "profile-memory" to
Emscripten's library_dylink.js.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To avoid requiring a static memory allocation, wasm-split's instrumentation
defaults to recording profile data in Wasm globals. This causes problems for
multithreaded applications because the globals are thread-local, but it is not
always feasible to arrange for a separate profile to be dumped on each thread.
To simplify the profiling of such multithreaded applications, add a new
instrumentation mode that stores the profiling data in shared memory instead of
in globals. This allows a single profile to be written that correctly reflects
the called functions on all threads.
This new mode is not on by default because it requires users to ensure that the
program will not trample the in-memory profiling data. The data is stored
beginning at address zero and occupies one byte per declared function in the
instrumented module. Emscripten can be told to leave this memory free using the
GLOBAL_BASE option.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some functions run only once with this pattern:
function foo() {
if (foo$ran) return;
foo$ran = 1;
...
}
If that global is not ever set to 0, then the function's payload (after the
initial if and return) will never execute more than once. That means we
can optimize away dominated calls:
foo();
foo(); // we can remove this
To do this, we find which globals are "once", which means they can
fit in that pattern, as they are never set to 0. If a function looks like the
above pattern, and it's global is "once", then the function is "once" as
well, and we can perform this optimization.
This removes over 8% of static calls in j2cl.
|
|
|
|
|
|
| |
Before this, the element segments would be printed as having type
funcref, and then if their table had a specialized type, the element
type would not be a subtype of the table and validation would fail.
|
|
|
|
|
|
| |
This appeared to be a regression from #4117, however this
was always a bug, and that PR just exposed it. That is,
somehow we forgot to indicate the effects of ArrayCopy, and
after that PR we'd vacuum it out incorrectly.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We had already replaced the check on drop, but we can also
use that mode on all the other things there, as the pass never
does reorderings of things - it just removes them.
For example, the pass can now remove part of a dropped thing,
(drop (struct.get (foo)))
=>
(drop (foo))
In this example the struct.get can be removed, even if the foo
can't.
|
|
|
|
|
|
|
|
|
|
|
| |
This finishes the refactoring started in #4115 by doing the
same change to pass a Module into EffectAnalyzer instead of
features. To do so this refactors the fallthrough API and a few
other small things. After those changes, this PR removes the
old feature constructor of EffectAnalyzer entirely.
This requires a small breaking change in the C API, changing
BinaryenExpressionGetSideEffects's feature param to a
module. That makes this change not NFC, but otherwise it is.
|
|
|
|
|
| |
If extra data is found in this section simply propagate it.
Also, remove some dead code from wasm-binary.cpp.
|
|
|
|
|
|
|
|
|
| |
After #4106 we already treat the input file "-" as a shorthand for reading from
stdin at the file.cpp level. However, the "-" input was still treated as a
normal file name at the wasm-io.cpp file and as a result was always treated as
text input. This commit updates wasm-io.cpp to use the stdin code path
supporting both binary and text input for "-".
Fixes #4105 (again).
|
|
|
| |
Followup to #4108
|
|
|
| |
In the JS API this is optional and it defaults to `funcref`.
|
|
|
|
| |
We already supported `-` as meaning stdout for output and this is useful in
similar situations. Fixes #4105.
|
|
|
|
|
|
|
|
| |
Add a class to compute the dominator tree for a CFG consisting
of a list of basic blocks assumed to be in reverse postorder.
This will be useful once cfg-walker emits blocks in reverse-postorder
(which it almost does, another PR will handle that). Then we can write
optimization passes that use block dominance.
|
|
|
|
|
| |
If the types are completely incompatible, we know the cast will fail. However,
ref.cast does allow a null to pass through, which makes it a little more
complicated.
|
| |
|
| |
|