path: root/src/parser/lexer.cpp
Commit message | Author | Age | Files | Lines
* Require string-style identifiers to be UTF-8 (#6941) | Thomas Lively | 2024-09-16 | 1 | -0/+10
  In the WebAssembly text format, strings can generally be arbitrary bytes, but identifiers must be valid UTF-8. Check for UTF-8 validity when parsing string-style identifiers in the lexer. Update StringLowering to generate valid UTF-8 global names even for strings that may not be valid UTF-8 and test that text round tripping works correctly after StringLowering. Fixes #6937.
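To illustrate the kind of check the commit above describes, here is a minimal sketch of a strict UTF-8 validator. The function name `isValidUTF8` is hypothetical and this is not Binaryen's actual code; it only shows the rules a validator typically enforces (correct continuation bytes, no overlong encodings, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper, not Binaryen's implementation. Returns true if `s`
// is well-formed UTF-8.
bool isValidUTF8(std::string_view s) {
  size_t i = 0;
  while (i < s.size()) {
    uint8_t lead = static_cast<uint8_t>(s[i]);
    if (lead < 0x80) {
      // ASCII byte: always valid on its own.
      ++i;
      continue;
    }
    size_t len = 0;   // total bytes in this sequence
    uint32_t cp = 0;  // decoded code point
    uint32_t min = 0; // smallest code point allowed for this length
    if ((lead & 0xE0) == 0xC0) {
      len = 2; cp = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3; cp = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4; cp = lead & 0x07; min = 0x10000;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (s.size() - i < len) {
      return false; // truncated sequence
    }
    for (size_t j = 1; j < len; ++j) {
      uint8_t c = static_cast<uint8_t>(s[i + j]);
      if ((c & 0xC0) != 0x80) {
        return false; // not a continuation byte
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong forms, surrogates, and out-of-range code points.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;
    }
    i += len;
  }
  return true;
}
```

Under this sketch, a lexer would accept a `$"..."` identifier only if its decoded bytes pass the check, while ordinary string literals would skip it.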
* Rewrite wasm-shell to use new wast parser (#6601) | Thomas Lively | 2024-05-17 | 1 | -0/+4
  Use the new wast parser to parse a full script up front, then traverse the parsed script data structure and execute the commands. wasm-shell had previously used the new wat parser for top-level modules, but it now uses the new parser for module assertions as well. Fix various bugs this uncovered. After this change, wasm-shell supports all the assertions used in the upstream spec tests (although not new kinds of assertions introduced in any proposals). Uncomment various `assert_exhaustion` tests that we can now execute. Other kinds of assertions remain commented out in our tests: wasm-shell now supports `assert_unlinkable`, but the interpreter does not eagerly check for the existence of imports, so those tests do not pass. Tests that check for NaNs also remain commented out because they do not yet use the standard syntax that wasm-shell now supports for canonical and arithmetic NaN results, and our interpreter would not pass all of those tests even if they did use the standard syntax.
* [Parser][NFC] Clean up the lexer index/pos API (#6553) | Thomas Lively | 2024-04-29 | 1 | -17/+17
  The lexer previously had both `getPos` and `getIndex` APIs that did different things, but after a recent refactoring there is no difference between the index and the position. Deduplicate the API surface.
* [Parser] Do not eagerly lex numbers (#6544) | Thomas Lively | 2024-04-25 | 1 | -189/+132
  Lex integers and floats on demand to avoid wasted work. Remove `Token` completely now that all kinds of tokens are lexed on demand.
* [Parser] Do not eagerly lex strings (#6543) | Thomas Lively | 2024-04-25 | 1 | -22/+16
  Lex them on demand instead to avoid wasted work.
* [Parser] Do not eagerly lex IDs (#6542) | Thomas Lively | 2024-04-25 | 1 | -20/+21
  Lex them on demand instead to avoid wasted work.
* [Parser] Do not eagerly lex keywords (#6541) | Thomas Lively | 2024-04-25 | 1 | -6/+51
  Lex them on demand instead to avoid wasted work.
* [Parser] Do not eagerly lex parens (#6540) | Thomas Lively | 2024-04-25 | 1 | -27/+27
  The lexer currently lexes tokens eagerly and stores them in a `Token` variant ahead of when they are actually requested by the parser. It is wasteful, however, to classify tokens before they are requested by the parser because it is likely that the next token will be precisely the kind the parser requests. The work of checking and rejecting other possible classifications ahead of time is not useful. To make incremental progress toward removing `Token` completely, lex parentheses on demand instead of eagerly.
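The on-demand approach described above can be sketched roughly as follows. `MiniLexer` and its methods are hypothetical names chosen for illustration, not Binaryen's API: each `take*` method checks only for the one token kind the caller asks for at the current position, so nothing is classified ahead of time.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical miniature lexer, for illustration only.
class MiniLexer {
  std::string_view buffer;
  size_t pos = 0;

  static bool isSpace(char c) { return c == ' ' || c == '\n' || c == '\t'; }

  void skipSpace() {
    while (pos < buffer.size() && isSpace(buffer[pos])) {
      ++pos;
    }
  }

  // Consume `c` if it is the next non-space character.
  bool takeChar(char c) {
    skipSpace();
    if (pos < buffer.size() && buffer[pos] == c) {
      ++pos;
      return true;
    }
    return false;
  }

public:
  explicit MiniLexer(std::string_view in) : buffer(in) {}

  bool takeLParen() { return takeChar('('); }
  bool takeRParen() { return takeChar(')'); }

  // Consume `expected` only if it appears next and ends at a delimiter.
  bool takeKeyword(std::string_view expected) {
    skipSpace();
    if (buffer.substr(pos, expected.size()) != expected) {
      return false;
    }
    size_t end = pos + expected.size();
    if (end < buffer.size()) {
      char c = buffer[end];
      if (!isSpace(c) && c != '(' && c != ')') {
        return false; // `expected` is only a prefix of a longer token
      }
    }
    pos = end;
    return true;
  }

  bool empty() {
    skipSpace();
    return pos == buffer.size();
  }
};
```

With this shape, a parser expecting `(module` would call `takeLParen()` and then `takeKeyword("module")`; a failed `take*` call leaves the token unconsumed so the parser can try another kind.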
* [Parser][NFC] Improve performance of idchar lexing (#6515) | Thomas Lively | 2024-04-19 | 1 | -30/+18
  The parsing of idchars was hot enough to show up while profiling the parsing of a very large module. Optimize it to speed up the overall parse by about 16% in a very unscientific measurement.
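The commit message above does not say how the optimization was done. One common way to make a hot character-class test cheap is a 256-entry lookup table built at compile time, sketched below. This is an assumption about a possible technique, not a description of the actual change; the names `idcharTable`, `isIdchar`, and `idcharPrefixLength` are hypothetical. The character set follows the `idchar` production of the WebAssembly text format.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Build a table with `true` at every byte value that is an idchar.
constexpr std::array<bool, 256> makeIdcharTable() {
  std::array<bool, 256> table{};
  for (int c = '0'; c <= '9'; ++c) {
    table[c] = true;
  }
  for (int c = 'A'; c <= 'Z'; ++c) {
    table[c] = true;
  }
  for (int c = 'a'; c <= 'z'; ++c) {
    table[c] = true;
  }
  std::string_view punct = "!#$%&'*+-./:<=>?@\\^_`|~";
  for (char c : punct) {
    table[static_cast<unsigned char>(c)] = true;
  }
  return table;
}

constexpr std::array<bool, 256> idcharTable = makeIdcharTable();

// A single indexed load replaces a chain of comparisons.
constexpr bool isIdchar(char c) {
  return idcharTable[static_cast<unsigned char>(c)];
}

// Number of leading idchars in `s`.
constexpr size_t idcharPrefixLength(std::string_view s) {
  size_t n = 0;
  while (n < s.size() && isIdchar(s[n])) {
    ++n;
  }
  return n;
}
```

The sketch requires C++17 for the `constexpr` table construction.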
* [Strings] Represent string values as WTF-16 internally (#6418) | Thomas Lively | 2024-03-22 | 1 | -19/+2
  WTF-16, i.e. arbitrary sequences of 16-bit values, is the encoding of Java and JavaScript strings, and using the same encoding makes the interpretation of string operations trivial, even when accounting for non-ascii characters. Specifically, use little-endian WTF-16. Re-encode string constants from WTF-8 to WTF-16 in the parsers, then back to WTF-8 in the writers. Update the constructor for string `Literal`s to interpret the string as WTF-16 and store a sequence of WTF-16 code units, i.e. 16-bit integers. Update `Builder::makeConstantExpression` accordingly to convert from the new `Literal` string representation back to a WTF-16 string. Update the interpreter to remove the logic for detecting non-ascii characters and bailing out. The naive implementations of all the string operations are correct now that our string encoding matches the JS string encoding.
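The WTF-8 to little-endian WTF-16 re-encoding mentioned above can be sketched as below. The function names are hypothetical and this is not Binaryen's implementation. The sketch assumes the input has no overlong encodings (it does not check for them), and it passes lone surrogates through as single code units, which is what distinguishes WTF-8 from strict UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Append one 16-bit code unit in little-endian byte order.
void writeUnitLE(std::string& out, uint16_t unit) {
  out.push_back(static_cast<char>(unit & 0xFF));
  out.push_back(static_cast<char>(unit >> 8));
}

// Hypothetical converter: returns the WTF-16LE bytes, or nullopt if the
// input is structurally malformed.
std::optional<std::string> wtf8ToWtf16LE(std::string_view in) {
  std::string out;
  size_t i = 0;
  while (i < in.size()) {
    uint8_t lead = static_cast<uint8_t>(in[i]);
    size_t len = 0;
    uint32_t cp = 0;
    if (lead < 0x80) {
      len = 1; cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
      len = 2; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4; cp = lead & 0x07;
    } else {
      return std::nullopt; // invalid lead byte
    }
    if (in.size() - i < len) {
      return std::nullopt; // truncated sequence
    }
    for (size_t j = 1; j < len; ++j) {
      uint8_t c = static_cast<uint8_t>(in[i + j]);
      if ((c & 0xC0) != 0x80) {
        return std::nullopt; // not a continuation byte
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp > 0x10FFFF) {
      return std::nullopt; // beyond the Unicode range
    }
    i += len;
    if (cp < 0x10000) {
      // One code unit. This includes lone surrogates, which WTF-8 permits.
      writeUnitLE(out, static_cast<uint16_t>(cp));
    } else {
      // Supplementary code point: split into a surrogate pair.
      cp -= 0x10000;
      writeUnitLE(out, static_cast<uint16_t>(0xD800 | (cp >> 10)));
      writeUnitLE(out, static_cast<uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
  }
  return out;
}
```

Returning the code units as raw bytes in a `std::string` is a choice made for this sketch; a real implementation might store `uint16_t` values directly.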
* [Parser] Parse annotations, including source map comments (#6345) | Thomas Lively | 2024-02-26 | 1 | -3/+150
  Parse annotations using the standards-track `(@annotation ...)` format as well as the `;;@ source-map:0:1` format. Have the lexer implicitly collect annotations while it skips whitespace and add lexer APIs to access the annotations since the last token was parsed. Collect annotations before parsing each instruction and pass the annotations explicitly to the parser and parser context functions for instructions. Add an API to `IRBuilder` to set a debug location to be attached to the next visited or created instruction and use it from the parser.
* [Parser] Support string-style identifiers (#6278) | Thomas Lively | 2024-02-06 | 1 | -21/+60
  In addition to normal identifiers, support parsing identifiers of the format `$"..."`. This format is not yet allowed by the standard, but it is a popular proposed extension (see https://github.com/WebAssembly/spec/issues/617 and https://github.com/WebAssembly/annotations/issues/21). Binaryen has historically allowed a similar format and has supported arbitrary non-standard identifier characters, so it's much easier to support this extended syntax than to fix everything to use the restricted standard syntax.
* [Parser] Parse v128.const (#6275) | Thomas Lively | 2024-02-05 | 1 | -0/+5
* [Parser] Templatize lexing of integers (#6272) | Thomas Lively | 2024-02-05 | 1 | -48/+23
  Have a single implementation for lexing each of unsigned, signed, and uninterpreted integers, each generic over the bit width of the integer. This reduces duplication in the existing code and it will make it much easier to support lexing more 8- and 16-bit integers.
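The idea of one implementation per signedness, generic over bit width, can be sketched as follows. The names `parseMagnitude`, `parseUnsigned`, and `parseSigned` are hypothetical, not Binaryen's API. For brevity the sketch handles decimal digits only and omits the hex forms, `_` separators, and uninterpreted integers that the text format also allows.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>
#include <type_traits>

// Parse decimal digits into a 64-bit magnitude. Returns nullopt on empty
// input, a non-digit character, or 64-bit overflow.
std::optional<uint64_t> parseMagnitude(std::string_view s) {
  if (s.empty()) {
    return std::nullopt;
  }
  uint64_t n = 0;
  for (char c : s) {
    if (c < '0' || c > '9') {
      return std::nullopt;
    }
    uint64_t digit = static_cast<uint64_t>(c - '0');
    if (n > (std::numeric_limits<uint64_t>::max() - digit) / 10) {
      return std::nullopt; // n * 10 + digit would overflow
    }
    n = n * 10 + digit;
  }
  return n;
}

// One unsigned implementation for every bit width.
template<typename T> std::optional<T> parseUnsigned(std::string_view s) {
  static_assert(std::is_unsigned_v<T>, "T must be an unsigned integer type");
  auto n = parseMagnitude(s);
  if (!n || *n > std::numeric_limits<T>::max()) {
    return std::nullopt;
  }
  return static_cast<T>(*n);
}

// One signed implementation for every bit width; accepts an optional sign.
template<typename T> std::optional<T> parseSigned(std::string_view s) {
  static_assert(std::is_signed_v<T>, "T must be a signed integer type");
  bool negative = false;
  if (!s.empty() && (s[0] == '+' || s[0] == '-')) {
    negative = s[0] == '-';
    s.remove_prefix(1);
  }
  auto n = parseMagnitude(s);
  if (!n) {
    return std::nullopt;
  }
  uint64_t maxPositive = static_cast<uint64_t>(std::numeric_limits<T>::max());
  // The negative range reaches one further than the positive range.
  if (*n > (negative ? maxPositive + 1 : maxPositive)) {
    return std::nullopt;
  }
  if (!negative || *n == 0) {
    return static_cast<T>(*n);
  }
  // Compute -(n) as -(n - 1) - 1 so the minimum value never overflows.
  return static_cast<T>(-static_cast<int64_t>(*n - 1) - 1);
}
```

Adding 8- or 16-bit support then only requires instantiating the templates with `uint8_t`, `int16_t`, and so on, rather than writing new functions.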
* [NFC] Split the new wat parser into multiple files (#5960) | Thomas Lively | 2023-09-19 | 1 | -0/+1038
  And put the new files in a new source directory, "parser". This is a rough split and is not yet expected to dramatically improve compile times. The exact organization of the new files is subject to change, but this splitting should be enough to make further parser development more pleasant.