That pass only cares about reference locals, and even among those, only the ones
through which we see a struct.new/array.new flow. This makes the pass 40% faster.
|
This replaces direct access to the data structure, graph.*influences[foo], with a call,
graph.get*influences(foo). This will allow a later PR to make those calls optionally
lazy.
|
Before, we just had a map that people would access with localGraph.getSetses[get];
now it is a call, localGraph.getSets(get), which more nicely hides the internal
implementation details.
Also rename getSetses => getSetsMap.
This will allow a later PR to optimize the internals of this API.
This is performance-neutral as far as I can measure. (We do replace a direct read
from a data structure with a call, but the call is in a header and should always get
inlined.)
|
Previously we tracked only whether an expression was relevant to analysis, that is,
whether it interacted with the allocation we were tracing the behavior of. That is
not enough for all cases, though, so also track the form of the interaction, namely
whether the allocation flows through or is fully consumed. An example where that
matters:
(ref.eq
 (struct.get $A 0
  (local.tee $x
   (struct.new_default $A)
  )
 )
 (local.get $x)
)
Here the local.get flows the allocation out, but the struct.get only fully consumes
it. Before this PR we thought the struct.get flowed the allocation, and we misoptimized
this to 1 (the correct result is 0, as the field's default value is a null, which is not
equal to the allocation).
To make this possible, do a bunch of minor refactoring:
* Move ParentChildInteraction out of the class.
* Add a "None" interaction there.
* Replace the set of reached expressions with a map of them to their interactions.
* Add helper functions to get an expression's interaction or to update it when replacing.
The new testcase here shows the main fix. The new assertions are covered by existing
testcases.
|
Followup to #6727, which added support for failing casts in Struct2Local, but it
turns out that Array2Struct changes were required as well. Specifically, when we
turn an array into a struct, casts can look like they behave differently
(what used to be an array input becomes a struct), so as with RefTest, which we
already handled, check if the cast succeeds in the original form and handle
that.
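For example (a hypothetical sketch; the type $array and local $a are illustrative):
(ref.cast (ref $array)
 (local.get $a) ;; $a holds the allocation, which was originally an array
)
After Array2Struct turns the allocation into a struct, this cast would look like it
must fail, so we evaluate it against the original array type and replace it with its
known outcome.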
|
Previously we just did not optimize cases where our escape analysis showed an
allocation flowed into a cast that failed. However, after inlining there can be
real-world cases where that happens, even in traps-never-happen mode (if the
cast is behind a conditional branch), so it seems worth optimizing.
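For example, after inlining we might end up with something like this (a hypothetical
sketch, where $a holds an allocation of type $A, and $B is an unrelated type):
(if (local.get $condition)
 (then
  (drop
   ;; this cast must fail, as the value is an $A and not a $B
   (ref.cast (ref $B)
    (local.get $a)
   )
  )
 )
)
Even under traps-never-happen this is plausible, as the branch may simply never be
taken at runtime.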
|
This is a tiny bit more code but it is more consistent with other
operations, and it saves work later.
|
If an allocation does not escape, then we can compute ref.eq for it: when
compared to itself the result is 1, and when compared to anything else it
is 0 (since it did not escape, anything else must be different).
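For example (a minimal hypothetical sketch, with an eq-compatible type $A and locals
$x, $y):
(local.set $x
 (struct.new_default $A)
)
(drop
 (ref.eq ;; optimizes to (i32.const 1): both sides are the allocation
  (local.get $x)
  (local.get $x)
 )
)
(drop
 (ref.eq ;; optimizes to (i32.const 0): the allocation does not escape, so $y must differ
  (local.get $x)
  (local.get $y)
 )
)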
|
We tracked which expressions we saw an allocated struct/array reach, and then
quickly exited when another one did (as when two allocations mix, we can
optimize neither). It turns out that this helps very little in actual measurements
(the results look within noise - likely we rule out the unoptimizable cases early
otherwise anyhow). Also, the complexity it adds is a problem for an improvement
I want to make to the pass, so remove it.
|
CFP already had logic for truncating packed fields but not for sign-extending,
which this fixes (#6493).
Use the new helper function in Heap2Local as well. This changes the model
there from "truncate on set, sign-extend on get" to "truncate or sign-extend
on get". That is both simpler, by reusing the same logic as CFP, and also more
optimal: truncating on sets made sense since sets are rarer, but if we must then
sign-extend on gets, we can end up doing more work
overall (the truncations on sets are not needed if all gets are signed).
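A sketch of the new model, for a hypothetical packed i8 field lowered to an i32
local $f:
;; unsigned get: truncate
(i32.and
 (local.get $f)
 (i32.const 255)
)
;; signed get: sign-extend
(i32.shr_s
 (i32.shl
  (local.get $f)
  (i32.const 24)
 )
 (i32.const 24)
)
Sets simply store the value as-is.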
Found by #6486
|
In #6480 I forgot that StructGet can be signed, which means we need to emit
a sign-extend.
Arrays already copied the field as part of Array2Struct.
|
Previously we did not optimize a struct or an array with a packed field. As a result, a
single packed field in a struct prevented the entire struct from being localized,
which this fixes. This is also useful for arrays, as packed arrays are common (e.g. for
string data).
|
To keep things simple, this adds an Array2Struct component to the pass. When we
find a non-escaping array, we run that to turn it into a struct, and then run the
existing Struct2Local to convert that to locals. This avoids refactoring Struct2Local
to handle both structs and arrays (with the downside of making the optimization of
arrays a little less efficient, but arrays are rarer, I suspect - that is certainly the case
in the Java output I've seen).
The core EscapeAnalyzer logic is generalized to handle both arrays and structs,
but the changes there are thankfully quite minor.
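For example, an array allocation with a known constant size is handled as if it were
a struct with one field per element (a conceptual sketch with hypothetical types):
(array.new_fixed $arr 2
 (i32.const 10)
 (i32.const 20)
)
;; is treated as if it were
(struct.new $struct-with-2-fields
 (i32.const 10)
 (i32.const 20)
)
which the existing Struct2Local then lowers to locals.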
|
Separate out an EscapeAnalyzer class that does the escape analysis,
and a Struct2Local one that does the optimization.
Also make a few things const here to be safer.
|
We previously supported a non-standard `(func "name" ...` syntax for declaring
functions exported with the quoted name. Since that is not part of the standard
text format, drop support for it, replacing it with the standard `(func $name
(export "name") ...` syntax instead.
Also replace our other usage of the quoted form in our text output, which was
where we quoted names containing characters that are not allowed to appear in
standard names. To handle that case, adjust our output from `"$name"` to
`$"name"`, which is the standards-track way of supporting such names. Also fix
how we detect non-standard name characters to match the spec.
Update the lit test output generation script to account for these changes,
including by making the `$` prefix on names mandatory. This causes the script to
stop interpreting declarative element segments with the `(elem declare ...`
syntax as being named "declare", so prevent our generated output from regressing
by counting "declare" as a name in the script.
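That is:
(func "name" ;; old, non-standard form
 ...
)
(func $name (export "name") ;; new, standard form
 ...
)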
|
We had a simple rule that if we reach an expression twice then we give up (#6089).
That makes sense for, say, a block: if one allocation flows out of it, then another
can't - it would get mixed in with the other one, which is a case we don't optimize.
However, there are cases where a parent has multiple children and different
interactions with them, like a struct.set: the reference child does not escape, but
the value child does. Before this PR, if we reached the value child first we'd mark
the parent as seen, and then the reference child would see it isn't the first to get
there, and not optimize.
To fix this, reorder the code to handle this case: the manner of interaction between
the child and the parent decides whether we mark the parent as seen and to be
avoided from then on.
Noticed by the determinism fuzzer, since the order of analysis mattered here.
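For example, in a hypothetical struct.set:
(struct.set $A 0
 (local.get $ref)   ;; reference child: the allocation does not escape here
 (local.get $value) ;; value child: this value does escape, into the struct
)
The two children interact with the same parent in different ways, so reaching the
parent twice is fine here.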
|
E.g.
(local $x (ref eq))
...
(local.set $x
 (struct.new $float
  ...
 )
)
(struct.get $float 0
 (ref.cast (ref $float)
  (local.get $x)
 )
)
This PR allows us to use heap2local here, ignoring the cast, which is known to
succeed. This is similar to the existing handling of ref.as_non_null.
|
We shouldn't need to in the general case, but the fuzzer found a corner case
where we do need to; see the explanation and testcase. Basically, Heap2Local
replaces struct fields with locals, and the locals should have the same types,
but if a field was somehow less refined for some reason, then the locals could
actually be more refined. (And a field could be less refined if we read it through a
type that was under-refined due to a tee or such.)
|
Before this PR, if a call had no paths to a catch in the same function then we skipped
creating a new basic block right after it. As a result, we could have a call in the middle
of a basic block. If EH is enabled that means we might transfer control flow out of
the function from the middle of a block. But it is better to have the property that
any transfer of control flow - to another basic block, or outside of the function - can
only happen at the end of a basic block.
This causes some overhead, but a subsequent PR (#5838) will remove that as a
followup. This PR also adds a little code to pass the module and check if EH is
enabled, avoiding the overhead if not, which at least keeps the non-EH case from
regressing until that followup lands.
|
If we see (ref.cast $A) but we have inferred that a more refined type $B will
be present there at runtime, then we can refine the cast to (ref.cast $B).
We could do the same even when a cast is not present, but that would
increase code size; this optimization keeps code size constant.
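For example (a hypothetical sketch, where $B is a subtype of $A):
(ref.cast (ref $A)
 (local.get $x)
)
;; if we infer that only a $B can appear here, this becomes:
(ref.cast (ref $B)
 (local.get $x)
)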
|
Previously only WalkerPasses had access to the `getPassRunner` and
`getPassOptions` methods. Move those methods to `Pass` so all passes can use
them. As a result, the `PassRunner` passed to `Pass::run` and
`Pass::runOnFunction` is no longer necessary, so remove it.
Also update `Pass::create` to return a unique_ptr, which is more efficient than
having it return a raw pointer only to have the `PassRunner` wrap that raw
pointer in a `unique_ptr`.
Delete the unused template `PassRunner::getLast()`, which looks like it was
intended to enable retrieving previous analyses and has been in the code base
since 2015 but is not implemented anywhere.
|
An overview of this is in the README in the diff here (conveniently, it is near the
top of the diff). Basically, we fix up nn locals after each pass, by default. This keeps
things easy to reason about - what validates is what is valid wasm - but there are
some minor nuances as mentioned there, in particular, we ignore nameless blocks
(which are commonly added by various passes; ignoring them means we can keep
more locals non-nullable).
The key addition here is LocalStructuralDominance which checks which local
indexes have the "structural dominance" property of 1a, that is, that each get has
a set in its block or an outer block that precedes it. I optimized that function quite
a lot to reduce the overhead of running that logic after each pass. The overhead
is something like 2% on J2Wasm and 0% on Dart (0%, because in this mode we
shrink code size, so there is less work actually, and it balances out).
Since we run fixups after each pass, this PR removes logic to manually call the
fixup code from various places we used to call it (like eh-utils and various passes).
Various passes are now marked as requiresNonNullableLocalFixups => false.
That lets us skip running the fixups after them, which we normally do automatically.
This helps avoid overhead. Most passes still need the fixups, though - any pass
that adds a local, or a named block, or moves code around, likely does.
This removes a hack in SimplifyLocals that is no longer needed. Before we
worked to avoid moving a set into a try, as it might not validate. Now, we just do it
and let fixups happen automatically if they need to: in the common code they
probably don't, so the extra complexity seems not worth it.
Also removes a hack from StackIR. That hack tried to avoid the roundtrip adding a
non-defaultable local. But we have the logic to fix that up now, and opts will
likely keep it non-nullable as well.
Various tests end up updated here because now a local can be non-nullable -
previous fixups are no longer needed.
Note that this doesn't remove the gc-nn-locals feature. That has been useful for
testing, and may still be useful in the future - it basically just allows nn locals in
all positions (that can't read the null default value at the entry). We can consider
removing it separately.
Fixes #4824
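A sketch of property 1a (hypothetical code, with $x non-nullable):
(local $x (ref $A))
(local.set $x (..))
(drop (local.get $x)) ;; ok: a set precedes the get in the same block
(block
 (local.set $x (..))
)
(drop (local.get $x)) ;; not ok: that set's block has ended, so this local
                      ;; needs a fixup (it is made nullable)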
|
RTTs were removed from the GC spec, and if they are added back in the future,
they will be heap types rather than value types, as they are in our implementation.
Updating our implementation to have RTTs be heap types would have been more work
than deleting them, for questionable benefit, since we don't know how long it will
be before they are specced again.
|
When the allocation we optimize away flows through a loop, then just like
with a block we must change the type to be nullable, since we are replacing
the allocation with a null.
Fixes #4287
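For example (a hypothetical sketch):
(loop (result (ref $A))
 (struct.new_default $A)
)
must become, once the allocation is replaced by a null:
(loop (result (ref null $A))
 (ref.null $A)
)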
|
A SmallSet starts with fixed storage that it uses in the simplest
possible way (linear scan, no sorting). If it exceeds that size then it
starts using a normal std::set. So for small amounts of data it
avoids allocation and any other overhead.
This adds a unit test and also uses it in LocalGraph which provides
a large amount of additional coverage.
I also changed an unrelated data structure from std::map to
std::unordered_map which I noticed while doing profiling in
LocalGraph. (And a tiny bit of additional refactoring there.)
This makes LocalGraph-using passes like ssa-nomerge and
precompute-propagate 10-15% faster on a bunch of real-world
codebases I tested.
|
See #4149
This modifies the test added in #4163, which used static casts on
dynamically-created structs and arrays. That was technically not
valid (as we don't want users to "mix" the two forms). This makes that
test 100% static, which both fixes the test and gives test coverage
to the new instructions added here.
|
This finishes the refactoring started in #4115 by making the same
change - passing a Module into EffectAnalyzer instead of features. To do
so, this refactors the fallthrough API and a few other small things. After
those changes, this PR removes the old feature-based constructor of
EffectAnalyzer entirely.
This requires a small breaking change in the C API, changing
BinaryenExpressionGetSideEffects's feature param to a
module. That makes this change not NFC, but otherwise it is.
|
Given
(local $x (ref $foo))
(local.set $x ..)
(.. (local.get $x))
If we remove the local.set but not the get, then we end up with
(local $x (ref $foo))
(.. (local.get $x))
It looks like the local.get reads the initial value of a non-nullable local,
which is not allowed.
In practice, this would crash in precompute-propagate which would
try to propagate the initial value to the get. Add an assertion there with
a clear message, as until we have full validation of non-nullable locals
(and the spec for that is in flux), that pass is where bugs will end up
being noticed.
To fix this, replace the get as well. We can replace it with a null
for simplicity; it will never be used anyhow.
This also uncovered a small bug with reached not containing all
the things we reached - it was missing local.gets.
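That is, with the fix the get is replaced as well, leaving something like:
(local $x (ref $foo))
(.. (ref.null $foo))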
|
Previously we would try to stop using the allocation as much as possible,
for example not writing it to locals any more, and leaving it to other passes
to actually remove it (and remove gets of those locals etc.). This seemed
simpler and more modular, but does not actually work in some cases, as
the fuzzer found. Specifically, if we stop writing our allocation to
locals, then a (ref.as_non_null (local.get ..)) of it will trap on the
null present in the local.
Instead, this changes our rewriting to do slightly more work, but it is
simpler in the end. We replace the allocation with a null, and replace
all the places that use it accordingly, for example, updating types to be
nullable, and removing RefAsNonNulls, etc. This literally gets rid of the
allocation and all the places it flows to (leaving less for other passes to
do later).
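A sketch of the new rewriting (hypothetical code):
;; before the pass:
(local.set $x
 (struct.new_default $A)
)
(drop
 (ref.as_non_null
  (local.get $x)
 )
)
;; after: the allocation becomes a null, the local's type is made
;; nullable as needed, and the RefAsNonNull is removed:
(local.set $x
 (ref.null $A)
)
(drop
 (local.get $x)
)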
|
If we branch to a block, and there are no other branches or a final
value on the block either, then there is no mixing, and we may be able
to optimize the allocation. Before this PR, all branches stopped us.
To do this, add some helpers in BranchUtils.
The main flow logic in Heap2Local used to stop when we reached a
child for the second time. With branches, however, a child can flow both to
its immediate parent, and to branch targets, and so the proper thing to look
at is when we reach a parent for the second time (which would definitely
indicate mixing).
Tests are added for the new functionality. Note that some existing tests
already covered some things we should not optimize, and so no tests
were needed for them. The existing ones are: $get-through-block,
$branch-to-block.
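For example, we can now handle something like this (a hypothetical sketch):
(struct.get $A 0
 (block $out (result (ref $A))
  (br $out
   (struct.new_default $A)
  )
 )
)
The allocation flows out of the block only via the br, and no other branch or
final value mixes in.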
|
If we allocate some GC data, and do not let the reference escape, then we can
replace the allocation with locals - basically one local for each field in the
allocation. This avoids the allocation, and also allows us to optimize the locals
further.
On the Dart DeltaBlue benchmark, this is a 24% speedup (making it faster than
the JS version, incidentally), and also a 6% reduction in code size.
The tests are not the best way to show what this does, as the pass assumes
other passes will clean up after it. Here is an example to clarify. First, in pseudocode:
ref = new Int(42)
do {
  ref.set(ref.get() + 1)
} while (import(ref.get()))
That is, we allocate an int on the heap and use it as a counter. Unnecessarily,
as it could be a normal int on the stack.
Wat:
(module
 ;; A boxed integer: an entire struct just to hold an int.
 (type $boxed-int (struct (field (mut i32))))
 (import "env" "import" (func $import (param i32) (result i32)))
 (func "example"
  (local $ref (ref null $boxed-int))
  ;; Allocate a boxed integer of 42 and save the reference to it.
  (local.set $ref
   (struct.new_with_rtt $boxed-int
    (i32.const 42)
    (rtt.canon $boxed-int)
   )
  )
  ;; Increment the integer in a loop, looking for some condition.
  (loop $loop
   (struct.set $boxed-int 0
    (local.get $ref)
    (i32.add
     (struct.get $boxed-int 0
      (local.get $ref)
     )
     (i32.const 1)
    )
   )
   (br_if $loop
    (call $import
     (struct.get $boxed-int 0
      (local.get $ref)
     )
    )
   )
  )
 )
)
Before this pass, the optimizer could do essentially nothing with this.
Even with this pass, running -O1 has no effect, as the pass is only
used in -O2+. However, running --heap2local -O1 leads to this:
(func $0
 (local $0 i32)
 (local.set $0
  (i32.const 42)
 )
 (loop $loop
  (br_if $loop
   (call $import
    (local.tee $0
     (i32.add
      (local.get $0)
      (i32.const 1)
     )
    )
   )
  )
 )
)
All the GC heap operations have been removed, and we just
have a plain int now, allowing a bunch of other opts to run. That
output is basically the optimal code, I think.