| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR helps with functions like this:
function foo(x) {
if (x) {
..
lots of work here
..
}
}
If "lots of work" is large enough, then we won't inline such a
function. However, we may end up calling into the function
only to get a false on that if and immediately exit. So it is useful
to partially inline this function, basically by creating a split
of it into a condition part that is inlineable
function foo$inlineable(x) {
if (x) {
foo$outlined();
}
}
and an outlined part that is not inlineable:
function foo$outlined(x) {
..
lots of work here
..
}
We can then inline the inlineable part. That means that a call
like
foo(param);
turns into
if (param) {
foo$outlined();
}
In other words, we end up replacing a call and then a check with
a check and then a call. Any time that the condition is false, this
will be a speedup.
The cost here is increased size, as we duplicate the condition
into the callsites. For that reason, only do this when heavily
optimizing for size.
This is a 10% speedup on j2cl. This helps two types of functions
there: Java class inits, which often look like "have I been
initialized before? if not, do all this work", and also assertion
methods which look like "if the input is null, throw an
exception".
|
|
|
|
|
|
|
|
|
|
| |
If we can remove such traps, we can remove ref.as_non_null if the local
type is nullable anyhow.
If we support non-nullable locals, however, then do not do this, as it
could inhibit specializing the local type later. Do the same for tees which
we had existing code for.
Background: #4061 (comment)
|
|
|
|
|
|
|
|
|
| |
That PR reused the same node twice in the output, which fails on the
assertion in BINARYEN_PASS_DEBUG=1 mode.
No new test is needed because the existing test suite fails already
in that mode. That the PR managed to land seems to say that we are
not testing pass-debug mode on our lit tests, which we need to
investigate.
|
|
|
|
|
|
| |
Avoids a crash in calling getHeapType when there isn't one.
Also add the relevant lit test (and a few others) to the list of files to
fuzz more heavily.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is reland of #3071
Do similar optimizations as in #3038 but for memory.fill.
`memory.fill(d, v, 0)` ==> `{ drop(d), drop(v) }` only with `ignoreImplicitTraps` or `trapsNeverHappen`
`memory.fill(d, v, 1)` ==> `store8(d, v)`
Further simplifications can be done only if v is constant because otherwise binary size would increase:
`memory.fill(d, C, 1)` ==> `store8(d, (C & 0xFF))`
`memory.fill(d, C, 2)` ==> `store16(d, (C & 0xFF) * 0x0101)`
`memory.fill(d, C, 4)` ==> `store32(d, (C & 0xFF) * 0x01010101)`
`memory.fill(d, C, 8)` ==> `store64(d, (C & 0xFF) * 0x0101010101010101)`
`memory.fill(d, C, 16)` ==> `store128(d, i8x16.splat(C & 0xFF))`
|
|
|
|
|
|
|
|
|
|
|
|
| |
tablify() attempts to turns a sequence of br_ifs into a single
br_table. This PR adds some flexibility to the specific pattern it
looks for, specifically:
* Accept i32.eqz as a comparison to zero, and not just to look
for i32.eq against a constant.
* Allow the first condition to be a tee. If it is, compare later
conditions to local.get of that local.
This will allow more br_tables to be emitted in j2cl output.
|
|
|
|
|
|
|
|
|
|
|
|
| |
If all a select's inputs are boolean, we can sometimes turn the select
into an AND or an OR operation,
x ? y : 0 => x & y
x ? 1 : y => x | y
I believe LLVM aggressively canonicalizes to this form. It makes sense
to do here too as it is smaller (save the constant 0 or 1). It also allows
further optimizations (which is why LLVM does it) but I don't think we
have those yet.
|
|
|
|
|
|
|
| |
See also:
spec change: https://github.com/WebAssembly/tool-conventions/pull/170
llvm change: https://reviews.llvm.org/D109595
wabt change: https://github.com/WebAssembly/wabt/pull/1707
emscripten change: https://github.com/emscripten-core/emscripten/pull/15019
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An "intrinsic" is modeled as a call to an import. We could also add new
IR things for them, but that would take more work and lead to less
clear errors in other tools if they try to read a binary using such a
nonstandard extension.
A first intrinsic is added here, call.without.effects This is basically the same
as call_ref except that the optimizer is free to assume the call has no
side effects. Consequently, if the result is not used then it can be optimized
out (as even if it is not used then side effects could have kept it around).
Likewise, the lack of side effects allows more reordering and other
things.
A lowering pass for intrinsics is provided. Rather than automatically
lower them to normal wasm at the end of optimizations, the user must
call that pass explicitly. A typical workflow might be
-O --intrinsic-lowering -O
That optimizes with the intrinsic present - perhaps removing calls
thanks to it - then lowers it into normal wasm - it turns into a call_ref -
and then optimizes further, which would turns the call_ref into a
direct call, potentially inline, etc.
|
|
|
| |
Followup to #4137
|
|
|
|
|
|
|
| |
array.init is like array.new_with_rtt except that it takes
as arguments the values to initialize the array with (as opposed to
a size and an optional initial value).
Spec: https://docs.google.com/document/d/1afthjsL_B9UaMqCA5ekgVmOm75BVFu6duHNsN9-gnXw/edit#
|
|
|
|
|
|
|
|
|
| |
MergeBlocks was written a very long time ago, before the
iteration API, so it had a bunch of hardcoded things for
specific instructions. In particular, that did not handle GC.
This does a small refactoring to use iteration. The refactoring
is NFC, but while doing so it adds support for new relevant
instructions, including wasm GC.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
```ts
-x * -y => (x * y)
-x * y => -(x * y)
x * -y => -(x * y), if x != C && y != C
-x * C => x * -C, if C != C_pot || shrinkLevel != 0
-x * C => -(x * C), otherwise
```
We are skipping propagation when lhs and rhs are constants because this should handled by constant folding. Also skip cases like `-x * 4 -> x * -4` for `shrinkLevel != 0`, as this will be further converted to `-(x << 2)`.
|
|
|
|
|
|
|
| |
Validation is performed on multiple threads at once and when there are multiple
validation failures, those threads can all end up in `numToString` at the same
time as they construct their respective error messages. Previously the threads
would race on their access to the snprintf buffers, sometimes leading to
segfaults. Fix the data races by making the buffers thread local.
|
| |
|
|
|
|
|
|
| |
We can assume that imported memories (and the profiling data they contain) are
already accessible from the module's environment, so there's no need to export
them. This also avoids needing to add knowledge of "profile-memory" to
Emscripten's library_dylink.js.
|
|
|
| |
To support CMake 3.10. `add_executable` does not support OBJECT libraries until 3.12.
|
|
|
|
|
|
|
|
| |
It can be confusing during debugging to keep a map of pointers when
we might have removed some of those functions from the module
meanwhile (if you iterate over it in some additional debug logging).
This change has no observable effect, however, as no bug could have
actually occurred in practice given that nothing is done with the
pointers in the actual code.
|
|
|
| |
Helps #3739
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To avoid requiring a static memory allocation, wasm-split's instrumentation
defaults to recording profile data in Wasm globals. This causes problems for
multithreaded applications because the globals are thread-local, but it is not
always feasible to arrange for a separate profile to be dumped on each thread.
To simplify the profiling of such multithreaded applications, add a new
instrumentation mode that stores the profiling data in shared memory instead of
in globals. This allows a single profile to be written that correctly reflects
the called functions on all threads.
This new mode is not on by default because it requires users to ensure that the
program will not trample the in-memory profiling data. The data is stored
beginning at address zero and occupies one byte per declared function in the
instrumented module. Emscripten can be told to leave this memory free using the
GLOBAL_BASE option.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some functions run only once with this pattern:
function foo() {
if (foo$ran) return;
foo$ran = 1;
...
}
If that global is not ever set to 0, then the function's payload (after the
initial if and return) will never execute more than once. That means we
can optimize away dominated calls:
foo();
foo(); // we can remove this
To do this, we find which globals are "once", which means they can
fit in that pattern, as they are never set to 0. If a function looks like the
above pattern, and it's global is "once", then the function is "once" as
well, and we can perform this optimization.
This removes over 8% of static calls in j2cl.
|
|
|
|
|
| |
As wasm-split has gained new functionality, its implementation file has become
large. In preparation for adding even more functionality, split the existing
implementation across multiple files in a new tools/wasm-split subdirectory.
|
|
|
|
|
|
| |
Before this, the element segments would be printed as having type
funcref, and then if their table had a specialized type, the element
type would not be a subtype of the table and validation would fail.
|
|
|
|
|
|
| |
This appeared to be a regression from #4117, however this
was always a bug, and that PR just exposed it. That is,
somehow we forgot to indicate the effects of ArrayCopy, and
after that PR we'd vacuum it out incorrectly.
|
|
|
|
| |
NFC (#4090)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We had already replaced the check on drop, but we can also
use that mode on all the other things there, as the pass never
does reorderings of things - it just removes them.
For example, the pass can now remove part of a dropped thing,
(drop (struct.get (foo)))
=>
(drop (foo))
In this example the struct.get can be removed, even if the foo
can't.
|
|
|
|
|
|
|
|
|
|
|
| |
This finishes the refactoring started in #4115 by doing the
same change to pass a Module into EffectAnalyzer instead of
features. To do so this refactors the fallthrough API and a few
other small things. After those changes, this PR removes the
old feature constructor of EffectAnalyzer entirely.
This requires a small breaking change in the C API, changing
BinaryenExpressionGetSideEffects's feature param to a
module. That makes this change not NFC, but otherwise it is.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Knowing the module will allow us to do more analysis in the
effect analyzer. For now, this just refactors the code to allow
providing a module instead of features, and to infer the
features from the module. This actually shortens the code in
most places which is nice (just pass module instead of
module->features).
This modifies basically all callers to use the new module
form, except for the fallthrough logic. That would require
some more refactoring, so to keep this PR reasonably small
that is not yet done.
|
|
|
|
|
| |
If extra data is found in this section simply propagate it.
Also, remove some dead code from wasm-binary.cpp.
|
|
|
|
|
|
|
|
|
| |
After #4106 we already treat the input file "-" as a shorthand for reading from
stdin at the file.cpp level. However, the "-" input was still treated as a
normal file name at the wasm-io.cpp file and as a result was always treated as
text input. This commit updates wasm-io.cpp to use the stdin code path
supporting both binary and text input for "-".
Fixes #4105 (again).
|
|
|
| |
In the JS API this is optional and it defaults to `funcref`.
|
|
|
|
| |
We already supported `-` as meaning stdout for output and this is useful in
similar situations. Fixes #4105.
|
|
|
|
| |
relevantLiveLocals (#4108)
|
|
|
|
|
|
|
|
| |
Add a class to compute the dominator tree for a CFG consisting
of a list of basic blocks assumed to be in reverse postorder.
This will be useful once cfg-walker emits blocks in reverse-postorder
(which it almost does, another PR will handle that). Then we can write
optimization passes that use block dominance.
|
|
|
| |
The cost type isn't an index in a wasm binary, it's just a number.
|
|
|
|
|
| |
If the types are completely incompatible, we know the cast will fail. However,
ref.cast does allow a null to pass through, which makes it a little more
complicated.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(#4101)
Reverse postorder basically just means that a block's immediate dominator
must precede it in the list. That is useful because then algorithms that look at
dominance can simply process the list in order, and the immediate dominator
will have already been seen before each block.
Another way to put it is that in reverse postorder a block's dominators
appear before it in the list, as do all non-loop predecessors. At least in
reducible graphs that is the case, and our IR, like wasm, is reducible.
It is pretty natural to emit reverse postorder on wasm given the
reducibility: simply process the wasm in postorder, and make sure to
create new basic blocks only when reaching their code - that is, do not
create them "ahead of time". We were doing that in a single place, for
try-catch, so this PR refactors that. Specifically it makes us create the
basic blocks for catches right when we reach them, and not earlier.
So the data structure that used to store them becomes a list of things
to connect to them.
This is useful for #4100 , see more details there.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Enable it in -O3 and -Os and higher.
This helps very little on output from LLVM, but also it does not alter
compile times much anyhow. On code that has not been run through
an optimizing compiler already, this can help quite a lot, e.g., 15% of
code size on some wasm GC samples.
This will not normally help with speed, as optimizing VMs do such
things anyhow. However, this can help baseline compilers and
interpreters and so forth.
|
|
|
|
|
|
|
|
| |
Before, we'd compute the hash of a child, then store that in a map, then
the parent would find the child's hash in the map using the pointer to the
child. But as we do a simple postorder walk, we can use a stack, and
avoid hashing the child pointers.
This makes it 10% faster or so.
|
|
|
|
|
|
|
|
|
|
|
| |
This allows common patterns in J2CL to be optimized, where we write
to various array indices and get the values or the reference from a
struct.
It would be nice to do even better here, and look at actually specific
types, but I think we should be careful to keep the runtime constant.
That seems hard to do if we accumulate a list of types and do
Type::isSubType on them etc. But maybe someone has a better
idea than this PR?
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we replace
A
A
A
with
(local.set A)
(local.get)
(local.get)
then it is ok for A to trap (so long as it does so deterministically), as if
it does trap then the first appearance will do so, and the others not
be reached anyhow.
This helps GC code as often there are repeated struct.gets and such that
may trap.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The goal of this mode is to remove obviously-unneeded code like
(drop
(i32.load
(local.get $x)))
In general we can't remove it, as the load might trap - we'd be removing
a side effect. This is fairly rare in general, but actually becomes quite
annoying with wasm GC code where such patterns are more common,
and we really need to remove them.
Historically the IgnoreImplicitTraps option was meant to help here. However,
in practice it did not quite work well enough for most production code, as
mentioned e.g. in #3934 . TrapsNeverHappen mode is an attempt to fix that,
based on feedback from @askeksa in that issue, and also I believe this
implements an idea that @fitzgen mentioned a while ago (sorry, I can't
remember where exactly...). So I'm hopeful this will be generally useful
and not just for GC.
The idea in TrapsNeverHappen mode is that traps are assumed to not
actually happen at runtime. That is, if there is a trap in the code, it will
not be reached, or if it is reached then it will not trap. For example, an
(unreachable) would be assumed to never be reached, which means
that the optimizer can remove it and any code that executes right before
it:
(if
(..condition..)
(block
(..code that can be removed, if it does not branch out..)
(..code that can be removed, if it does not branch out..)
(..code that can be removed, if it does not branch out..)
(unreachable)))
And something like a load from memory is assumed to not trap, etc.,
which in particular would let us remove that dropped load from earlier.
This mode should be usable in production builds with assertions
disabled, if traps are seen as failing assertions. That might not be true
of all release builds (maybe some use traps for other purposes), but
hopefully in some. That is, if traps are like assertions, then enabling
this new mode would be like disabling assertions in release builds
and living with the fact that if an assertion would have been hit then
that is "undefined behavior" and the optimizer might have removed
the trap or done something weird.
TrapsNeverHappen (TNH) is different from IgnoreImplicitTraps (IIT).
The old IIT mode would just ignore traps when computing effects.
That is a simple model, but a problem happens with a trap behind
a condition, like this:
if (x != 0) foo(1 / x);
We won't trap on integer division by zero here only because of the
guarding if. In IIT, we'd compute no side effects on 1 / x, and then
we might end up moving it around, depending on other code in
the area, and potentially out of the if - which would make it happen
unconditionally, which would break.
TNH avoids that problem because it does not simply ignore traps.
Instead, there is a new hasUnremovableSideEffects() method
that must be opted-in by passes. That checks if there are no side
effects, or if there are, if we can remove them - and we know we can
remove a trap if we are running under TrapsNeverHappen mode,
as the trap won't happen by assumption. A pass must only use that
method where it is safe, that is, where it would either remove the
side effect (in which case, no problem), or if not, that it at least does
not move it around (avoiding the above problem with IIT).
This PR does not implement all optimizations possible with
TNH, just a small initial set of things to get started. It is already
useful on wasm GC code, including being as good as IIT on removing
unnecessary casts in some cases, see the test suite updates here.
Also, a significant part of the 18% speedup measured in
#4052 (comment)
is due to my testing with this enabled, as otherwise the devirtualization
there leaves a lot of unneeded code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Technically this is not a new pass, but it is a rewrite almost from scratch.
Local Common Subexpression Elimination looks for repeated patterns,
stuff like this:
x = (a + b) + c
y = a + b
=>
temp = a + b
x = temp + c
y = temp
The old pass worked on flat IR, which is inefficient, and was overly
complicated because of that. The new pass uses a new algorithm that
I think is pretty simple, see the detailed comment at the top.
This keeps the pass enabled only in -O4, like before - right after
flattening the IR. That is to make this as minimal a change as possible.
Followups will enable the pass in the main pipeline, that is, we will
finally be able to run it by default. (Note that to make the pass work
well after flatten, an extra simplify-locals is added - the old pass used
to do part of simplify-locals internally, which was one source of
complexity. Even so, some of the -O4 tests have changes, due to
minor factors - they are just minor orderings etc., which can be
seen by inspecting the outputs before and after using e.g.
--metrics)
This plus some followup work leads to large wins on wasm GC output.
On j2cl there is a common pattern of repeated struct.gets, so common
that this pass removes 85% of all struct.gets, which makes the total
binary 15% smaller. However, on LLVM-emitted code the benefit is
minor, less than 1%.
|