| Commit message | Author | Age | Files | Lines |
| |
We found one cast that has another cast as its input, and forgot that
the child was possibly a fallthrough value. That is, there might be more
code around the child that needs to be kept.
Rather than fix the middle of the three cases there - the one with
HeapType::isSubType(childIntendedType, intendedType) - I
noticed it is not actually needed. That case checks if the child's
type is more specific than the parent's, and if so, drops the parent
cast. But we already handle that earlier in the same function:
regardless of what the child of a static cast is, if the cast is static
and the input already has the proper type, the cast is unneeded
(see lines 1565-1566).
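As a sketch of the problematic shape (hypothetical locals and types, using current ref.cast syntax), the inner cast's result reaches the outer cast as a fallthrough, so removing the outer cast must keep the surrounding code:
```wat
;; sketch: the outer cast's input is a fallthrough value; the
;; local.tee must be preserved if the outer cast is optimized away
(ref.cast (ref $A)
  (local.tee $t
    (ref.cast (ref $B)
      (local.get $x))))
```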
|
| |
Canonicalize:
(signed)x > -1 ==> x >= 0
(signed)x <= -1 ==> x < 0
(signed)x < 1 ==> x <= 0
(signed)x >= 1 ==> x > 0
(unsigned)x < 1 ==> x == 0
(unsigned)x >= 1 ==> x != 0
This should help #4265, and in general 0 is a more common
constant and a reasonable one to canonicalize to.
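For instance, the first rule in wat form (a sketch with a hypothetical local $x):
```wat
;; (signed)x > -1  ==>  x >= 0
(i32.gt_s (local.get $x) (i32.const -1))
;; becomes
(i32.ge_s (local.get $x) (i32.const 0))
```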
|
| |
Fuzzing followup to #4244.
|
| |
This optimizes this type of pattern:
(local.set $x (struct.new X Y Z))
(struct.set (local.get $x) X')
=>
(local.set $x (struct.new X' Y Z))
Note how the struct.set is removed, and X' moves to where X was.
This removes almost 90% (!) of the struct.sets in j2wasm output, which reduces
total code size by 2.5%. However, I see no speedup from this - I guess that either
this is not on the hot path, or V8 already optimizes it well, or the CPU makes the
stores effectively "free" anyhow...
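Concretely, in wat (a sketch with a hypothetical struct type $T, writing field 0):
```wat
;; before: field 0 is overwritten right after the allocation
(local.set $x
  (struct.new $T (i32.const 1) (i32.const 2)))
(struct.set $T 0
  (local.get $x)
  (i32.const 42))
;; after: the new value is placed directly in the struct.new
(local.set $x
  (struct.new $T (i32.const 42) (i32.const 2)))
```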
|
| |
patterns (#4181)
i32(x) ? i32(x) : 0 ==> x
i32(x) ? 0 : i32(x) ==> {x, 0}
i64(x) == 0 ? 0 : i64(x) ==> x
i64(x) != 0 ? i64(x) : 0 ==> x
i64(x) == 0 ? i64(x) : 0 ==> {x, 0}
i64(x) != 0 ? 0 : i64(x) ==> {x, 0}
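For example, the first pattern as a wat select (a sketch; this assumes x is an effectless expression like a local.get, so repeating it is safe):
```wat
;; i32(x) ? i32(x) : 0  ==>  x
(select
  (local.get $x)  ;; ifTrue
  (i32.const 0)   ;; ifFalse
  (local.get $x)) ;; condition
;; becomes
(local.get $x)
```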
|
| |
Rather than load a reference from the table and then call it, call directly using the table.
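Presumably this covers patterns like the following sketch (hypothetical table $t and function type $ft; call_ref shown with a type annotation):
```wat
;; get a funcref out of the table, then call it
(call_ref $ft
  (table.get $t
    (local.get $i)))
;; becomes a call through the table itself
(call_indirect $t (type $ft)
  (local.get $i))
```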
|
| |
Now that they are all implemented, we can optimize them. This removes the
big if that ignored static operations, and implements optimizations for them.
In general this matches the existing rtt-using case, but there are a few things
we can do better, which this does:
* A cast from a subtype to a supertype always succeeds.
* A test from a subtype to a supertype always returns 1 (if the input is non-nullable).
* Repeated static casts can be collapsed to just the most demanding of them.
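For example, the last point in wat form (a sketch using current ref.cast syntax; $B is a subtype of $A, so the outer, less demanding cast is implied by the inner one):
```wat
(ref.cast (ref $A)
  (ref.cast (ref $B)
    (local.get $x)))
;; becomes
(ref.cast (ref $B)
  (local.get $x))
```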
|
| |
We added an optional ReFinalize in OptimizeInstructions at some point,
but that is not valid: ReFinalize only updates types correctly when all other
work is done, but this pass works incrementally. The bug the fuzzer found
is that a child is changed to be unreachable, and then the parent is
optimized before finalize() is called on it, which leads to an assertion being
hit (as the child is unreachable but not the parent, which should then be
unreachable as well).
To fix this, do not change types in this pass. Emit an extra block with a
declared type when necessary. Other passes can remove the extra block.
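For example (a sketch): if an optimized child ends up unreachable where the surrounding code still expects an i32, the replacement can be wrapped in a block that declares that type, which validates thanks to stack polymorphism:
```wat
(block (result i32)
  (unreachable))
```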
|
| |
These variants take a HeapType that is the type we intend to cast to,
and do not take an RTT.
These are intended to be more statically optimizable. For now, though,
this PR just implements the minimum to get them parsing and to get
through the optimizer without crashing.
Spec: https://docs.google.com/document/d/1afthjsL_B9UaMqCA5ekgVmOm75BVFu6duHNsN9-gnXw/edit#
See #4149
|
| |
If we can remove such traps, we can remove a ref.as_non_null if the local's
type is nullable anyhow.
If we support non-nullable locals, however, then do not do this, as it
could inhibit specializing the local type later. Do the same for tees, for
which we had existing code.
Background: #4061 (comment)
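A sketch of the pattern (hypothetical locals; this assumes the traps are removable, and that $x's type is nullable):
```wat
;; the local accepts a null anyhow, so the non-null assertion can go
(local.set $x
  (ref.as_non_null
    (local.get $y)))
;; becomes
(local.set $x
  (local.get $y))
```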
|
| |
That PR reused the same node twice in the output, which fails the
assertion in BINARYEN_PASS_DEBUG=1 mode.
No new test is needed because the existing test suite already fails
in that mode. That the PR managed to land seems to indicate that we are
not testing pass-debug mode on our lit tests, which we need to
investigate.
|
| |
Avoids a crash when calling getHeapType on an expression that does not have one.
Also add the relevant lit test (and a few others) to the list of files to
fuzz more heavily.
|
| |
This is a reland of #3071.
Do similar optimizations as in #3038, but for memory.fill.
`memory.fill(d, v, 0)` ==> `{ drop(d), drop(v) }` only with `ignoreImplicitTraps` or `trapsNeverHappen`
`memory.fill(d, v, 1)` ==> `store8(d, v)`
Further simplifications can be done only if v is constant because otherwise binary size would increase:
`memory.fill(d, C, 1)` ==> `store8(d, (C & 0xFF))`
`memory.fill(d, C, 2)` ==> `store16(d, (C & 0xFF) * 0x0101)`
`memory.fill(d, C, 4)` ==> `store32(d, (C & 0xFF) * 0x01010101)`
`memory.fill(d, C, 8)` ==> `store64(d, (C & 0xFF) * 0x0101010101010101)`
`memory.fill(d, C, 16)` ==> `store128(d, i8x16.splat(C & 0xFF))`
|
| |
If all a select's inputs are boolean, we can sometimes turn the select
into an AND or an OR operation,
x ? y : 0 => x & y
x ? 1 : y => x | y
I believe LLVM aggressively canonicalizes to this form. It makes sense
to do so here too, as it is smaller (we save the constant 0 or 1). It also allows
further optimizations (which is why LLVM does it), but I don't think we
have those yet.
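The first rule in wat form (a sketch; this assumes $x and $y are known to be boolean, i.e. 0 or 1):
```wat
;; x ? y : 0  =>  x & y
(select
  (local.get $y)
  (i32.const 0)
  (local.get $x))
;; becomes
(i32.and
  (local.get $x)
  (local.get $y))
```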
|
| |
```ts
-x * -y => (x * y)
-x * y => -(x * y)
x * -y => -(x * y), if x != C && y != C
-x * C => x * -C, if C != C_pot || shrinkLevel != 0
-x * C => -(x * C), otherwise
```
We skip this propagation when both lhs and rhs are constants, as that should be handled by constant folding. Also, when `C` is a power of two and `shrinkLevel == 0`, we keep the `-(x * C)` form rather than rewriting `-x * 4` to `x * -4`, as `x * 4` will be further converted to a shift, giving `-(x << 2)`.
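The first rule in i32 wat form (a sketch; integer negation written as subtraction from zero):
```wat
;; -x * -y  =>  x * y (the identity holds in wrapping arithmetic)
(i32.mul
  (i32.sub (i32.const 0) (local.get $x))
  (i32.sub (i32.const 0) (local.get $y)))
;; becomes
(i32.mul
  (local.get $x)
  (local.get $y))
```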
|
| |
NFC (#4090)
|
| |
This finishes the refactoring started in #4115 by doing the
same change to pass a Module into EffectAnalyzer instead of
features. To do so, this refactors the fallthrough API and a few
other small things. After those changes, this PR removes the
old features-based constructor of EffectAnalyzer entirely.
This requires a small breaking change in the C API, changing
BinaryenExpressionGetSideEffects's features param to a
module. That makes this change not NFC, but otherwise it is.
|
| |
Knowing the module will allow us to do more analysis in the
effect analyzer. For now, this just refactors the code to allow
providing a module instead of features, and to infer the
features from the module. This actually shortens the code in
most places, which is nice (just pass module instead of
module->features).
This modifies basically all callers to use the new module
form, except for the fallthrough logic. That would require
some more refactoring, so to keep this PR reasonably small
it is not yet done.
|
| |
If the types are completely incompatible, we know the cast will fail. However,
ref.cast does allow a null to pass through, which makes it a little more
complicated.
|
| |
The goal of this mode is to remove obviously-unneeded code like
(drop
  (i32.load
    (local.get $x)))
In general we can't remove it, as the load might trap - we'd be removing
a side effect. This is fairly rare in general, but actually becomes quite
annoying with wasm GC code, where such patterns are more common,
and we really need to remove them.
Historically the IgnoreImplicitTraps option was meant to help here. However,
in practice it did not quite work well enough for most production code, as
mentioned e.g. in #3934 . TrapsNeverHappen mode is an attempt to fix that,
based on feedback from @askeksa in that issue, and also I believe this
implements an idea that @fitzgen mentioned a while ago (sorry, I can't
remember where exactly...). So I'm hopeful this will be generally useful
and not just for GC.
The idea in TrapsNeverHappen mode is that traps are assumed to not
actually happen at runtime. That is, if there is a trap in the code, it will
not be reached, or if it is reached then it will not trap. For example, an
(unreachable) would be assumed to never be reached, which means
that the optimizer can remove it and any code that executes right before
it:
(if
  (..condition..)
  (block
    (..code that can be removed, if it does not branch out..)
    (..code that can be removed, if it does not branch out..)
    (..code that can be removed, if it does not branch out..)
    (unreachable)))
And something like a load from memory is assumed to not trap, etc.,
which in particular would let us remove that dropped load from earlier.
This mode should be usable in production builds with assertions
disabled, if traps are seen as failing assertions. That might not be true
of all release builds (maybe some use traps for other purposes), but
hopefully in some. That is, if traps are like assertions, then enabling
this new mode would be like disabling assertions in release builds
and living with the fact that if an assertion would have been hit then
that is "undefined behavior" and the optimizer might have removed
the trap or done something weird.
TrapsNeverHappen (TNH) is different from IgnoreImplicitTraps (IIT).
The old IIT mode would just ignore traps when computing effects.
That is a simple model, but a problem happens with a trap behind
a condition, like this:
if (x != 0) foo(1 / x);
We won't trap on the integer division by zero here, only because of the
guarding if. In IIT, we'd compute no side effects for 1 / x, and then
we might end up moving it around, depending on other code in
the area - potentially out of the if, which would make it execute
unconditionally, which would break.
TNH avoids that problem because it does not simply ignore traps.
Instead, there is a new hasUnremovableSideEffects() method
that passes must opt into. That checks if there are no side
effects, or if there are, whether we can remove them - and we know
we can remove a trap if we are running under TrapsNeverHappen mode,
as by assumption the trap won't happen. A pass must only use that
method where it is safe, that is, where it would either remove the
side effect (in which case, no problem), or if not, at least not
move it around (avoiding the above problem with IIT).
This PR does not implement all optimizations possible with
TNH, just a small initial set of things to get started. It is already
useful on wasm GC code, including being as good as IIT on removing
unnecessary casts in some cases, see the test suite updates here.
Also, a significant part of the 18% speedup measured in
#4052 (comment)
is due to my testing with this enabled, as otherwise the devirtualization
there leaves a lot of unneeded code.
|
| |
Technically this is not a new pass, but it is a rewrite almost from scratch.
Local Common Subexpression Elimination looks for repeated patterns,
stuff like this:
x = (a + b) + c
y = a + b
=>
temp = a + b
x = temp + c
y = temp
The old pass worked on flat IR, which is inefficient, and was overly
complicated because of that. The new pass uses a new algorithm that
I think is pretty simple, see the detailed comment at the top.
This keeps the pass enabled only in -O4, like before - right after
flattening the IR. That is to make this as minimal a change as possible.
Followups will enable the pass in the main pipeline, that is, we will
finally be able to run it by default. (Note that to make the pass work
well after flatten, an extra simplify-locals is added - the old pass used
to do part of simplify-locals internally, which was one source of
complexity. Even so, some of the -O4 tests show changes due to
minor factors - just orderings etc. - which can be seen by inspecting
the outputs before and after using e.g. --metrics.)
This plus some followup work leads to large wins on wasm GC output.
On j2cl there is a common pattern of repeated struct.gets, so common
that this pass removes 85% of all struct.gets, which makes the total
binary 15% smaller. However, on LLVM-emitted code the benefit is
minor, less than 1%.
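The example above in wat form (a sketch with hypothetical locals; $temp stands for the new local the pass introduces):
```wat
;; before: (a + b) is computed twice
(local.set $x
  (i32.add
    (i32.add (local.get $a) (local.get $b))
    (local.get $c)))
(local.set $y
  (i32.add (local.get $a) (local.get $b)))

;; after: compute it once and reuse it
(local.set $temp
  (i32.add (local.get $a) (local.get $b)))
(local.set $x
  (i32.add (local.get $temp) (local.get $c)))
(local.set $y
  (local.get $temp))
```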
|
| |
typing (#4069)
(if (result i32)
  (local.get $x)
  (struct.get $B 1
    (ref.null $B)
  )
  (struct.get $C 1
    (ref.null $C)
  )
)
With structural typing it is safe to turn this into this:
(struct.get $A 1
  (if (result (ref $A))
    (local.get $x)
    (ref.null $B)
    (ref.null $C)
  )
)
Here $A is the LUB of the others. This works since $A must have
field 1 in it. But with nominal types it is possible that the LUB in fact
does not have that field, and then we would not validate.
This actually seems like a more general issue that might happen with
other things, even though at the moment perhaps it can't. For simplicity,
avoid this pattern in both nominal and structural typing, to avoid creating
a difference between them.
|
| |
First, move the tiny pattern of call-ref-of-ref-func from Directize
into OptimizeInstructions. This is important because Directize is
a global optimization pass - it looks at the table to see if a
CallIndirect can be turned into a direct call. We only run global
passes at the end of the pipeline, but we don't need any global
data for call-ref of a ref-func, and OptimizeInstructions is the
place for such patterns.
Second, extend that to also handle fallthrough values. This is
less simple, but as call_ref is so inefficient, it's worth doing all
we can.
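The tiny pattern itself, in wat (a sketch; call_ref shown with a type annotation, and $f's type is $ft):
```wat
;; the callee is statically known
(call_ref $ft
  (ref.func $f))
;; becomes a direct call
(call $f)
```
With a fallthrough, the idea is the same: any side effects around the ref.func are kept, while the call itself still becomes direct.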
|
| |
For signed or unsigned extension back to 64 bits after wrapping a partially-filled 64-bit value:
```rust
i64.extend_i32_u(i32.wrap_i64(x)) => x // where maxBits(x) <= 32
i64.extend_i32_s(i32.wrap_i64(x)) => x // where maxBits(x) <= 31
```
|
| |
Similar to #4005, but in OptimizeInstructions instead of RemoveUnusedBrs.
|
| |
Fixes #3973
Loads:
f32.reinterpret_i32(i32.load(x)) => f32.load(x)
f64.reinterpret_i64(i64.load(x)) => f64.load(x)
i32.reinterpret_f32(f32.load(x)) => i32.load(x)
i64.reinterpret_f64(f64.load(x)) => i64.load(x)
Stores:
f32.store(y, f32.reinterpret_i32(x)) => i32.store(y, x)
f64.store(y, f64.reinterpret_i64(x)) => i64.store(y, x)
i32.store(y, i32.reinterpret_f32(x)) => f32.store(y, x)
i64.store(y, i64.reinterpret_f64(x)) => f64.store(y, x)
Also optimize reinterprets that are undone:
i32.reinterpret_f32(f32.reinterpret_i32(x)) => x
i64.reinterpret_f64(f64.reinterpret_i64(x)) => x
f32.reinterpret_i32(i32.reinterpret_f32(x)) => x
f64.reinterpret_i64(i64.reinterpret_f64(x)) => x
|
| |
requiring sign-extension:
```
i64(x) << 56 >> 56 ==> i64.extend8_s(x)
i64(x) << 48 >> 48 ==> i64.extend16_s(x)
i64(x) << 32 >> 32 ==> i64.extend32_s(x)
i64.extend_i32_s(i32.wrap_i64(x)) ==> i64.extend32_s(x)
```
general:
```
i32.wrap_i64(i64.extend_i32_s(x)) ==> x
i32.wrap_i64(i64.extend_i32_u(x)) ==> x
```
|
| |
i32(x) < 0 ? i32(-1) : i32(1) -> x >> 31 | 1
i64(x) < 0 ? i64(-1) : i64(1) -> x >> 63 | 1
This shrinks the code by 2 bytes.
|
| |
Spec for it is here:
https://docs.google.com/document/d/1DklC3qVuOdLHSXB5UXghM_syCh-4cMinQ50ICiXnK3Q/edit#
Also reorder some things in wasm.h that were not in the canonical order (this has
no effect, but the wrong order is confusing to read).
|
| |
For example,
(if (result i32)
  (local.get $x)
  (return
    (local.get $y)
  )
  (return
    (local.get $z)
  )
)
If we moved the returns outside, the if would become unreachable, but we should
not make such type changes in this pass (they are handled by DCE and
Vacuum).
(found by the fuzzer)
|
| |
We tried to ignore unreachable code, but only checked the type of
the entire node. However, an arm might be unreachable, and after moving
code around, more work would be needed to update the type. Such
cases are best left to DCE anyhow, so just check for any unreachability
and stop there.
|
| |
Effects in the moved code are fine if we are moving it out of an if
(which runs just one arm anyhow).
Allow unreachable, which lets us hoist returns, for example.
Allow none, which lets us hoist drop and call, for example. For
this we also need to be careful with subtyping, as at least drop
is polymorphic, so the child types may not have an LUB (see
example in code).
Adds a small ShallowEffectAnalyzer child of EffectAnalyzer that
calls visit to do just a shallow analysis (instead of walk, which
also walks the children).
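For example, hoisting identical returns out of an if (a sketch; assumes $x and $y are i32 locals):
```wat
;; both arms return, so the return can be hoisted
(if
  (local.get $c)
  (return (local.get $x))
  (return (local.get $y)))
;; becomes
(return
  (if (result i32)
    (local.get $c)
    (local.get $x)
    (local.get $y)))
```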
|
| |
(select
  (foo
    (X)
  )
  (foo
    (Y)
  )
  (condition)
)
=>
(foo
  (select
    (X)
    (Y)
    (condition)
  )
)
To make this simpler, refactor optimizeTernary to be templated.
|
| |
(select
  (i32.eqz (X))
  (i32.const 0|1)
  (Y)
)
=>
(i32.eqz
  (select
    (X)
    (i32.const 1|0)
    (Y)
  )
)
This is beneficial as the eqz may be folded into something on the outside. I
see this pattern in real-world code: in a GC benchmark (which is why I
noticed it), and it also shrinks code size by tiny amounts on the emscripten
benchmark suite.
|
| |
In both cases doing the ref.as_non_null last is beneficial as we have
optimizations that can remove it based on where it is consumed.
|
| |
* Note that ref.cast has a fallthrough value.
* Optimize ref.eq on identical inputs.
|
| |
ref.as_non_null is not needed if the value flows into a place that traps
on null anyhow. We replace a trap on one instruction with a trap on
another, but we allow such things (and even changing trap types, which
does not happen here).
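For instance (a sketch with a hypothetical struct type $T; struct.get traps on a null reference anyhow):
```wat
(struct.get $T 0
  (ref.as_non_null
    (local.get $x)))
;; becomes
(struct.get $T 0
  (local.get $x))
```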
|
| |
If we are ignoring implicit traps, and if the cast is from a subtype to a supertype,
then we ignore the possible RTT-related inconsistency and can just drop the
cast.
See #3636
|
| |
This is similar to the optimization of BrOn in #3719 and #3724. When the
type tells us the kind of input we have, we can tell at compile time what
result we'll get: for example, ref.is_func of something with type (ref func)
will always return 1, etc.
There are some code size and perf tradeoffs that should be looked into
more; they are marked as TODOs.
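For example (a sketch, using ref.is_func as it existed at the time; $f has the non-nullable type (ref func)):
```wat
;; the input is statically known to be a non-null function reference
(ref.is_func
  (local.get $f))
;; becomes
(i32.const 1)
```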
|
| |
This was noticed by samparker on LLVM:
https://reviews.llvm.org/D99171
This is apparently a pattern LLVM emits, and doing it there helps by 1-2%
on the real-world Bullet Physics codebase. Seems worthwhile doing here
as well.
|
| |
Same as we already do for struct.set.
|
| |
(#3680)
When storing to an i8, we can ignore any higher bits, etc.
Adds a getByteSize utility to Field to make this convenient.
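For example (a sketch; field 0 of the hypothetical type $T is an i8, so masking the stored value to its low byte is redundant):
```wat
(struct.set $T 0
  (local.get $ref)
  (i32.and (local.get $x) (i32.const 255)))
;; becomes
(struct.set $T 0
  (local.get $ref)
  (local.get $x))
```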
|
| |
Instead of a single big optimize() method, we now use separate functions
per instruction. This gives us smaller functions and less nesting in some cases,
and avoids manual casting and checking, etc.
The reason this was not done originally is that this pass does repeated
applications. That is, if optimize() changed something, it would run again
on the result, perhaps optimizing it further. It did not need to run on the
children, just on the result itself, so it did not do another full walk;
the simplest way was to just loop on optimize(). To replace that,
this PR modifies replaceCurrent(), which the methods now call to report
that the current node can be replaced. There is some code in there now that
keeps doing more processing while changes happen. It's not trivial code, as
it avoids recursion, but that slight complexity seems worthwhile in order to
simplify the bulk of the (very large) pass.
|
| |
Since in principle an unreachable expression can be used in any position. An
exception to this rule is OptimizeInstructions, which avoids replacing
concrete expressions with unreachable ones so that it doesn't need to
refinalize anything. Notably, Type::getLeastUpperBound was already
treating unreachable as the bottom type.
|