path: root/src/parser/lexer.h
Commit message | Author | Age | Files | Lines
* Validate that names are valid UTF-8 (#6682) | Thomas Lively | 2024-06-19 | 1 | -4/+5
  Add an `isUTF8` utility and use it in both the text and binary parsers. Add missing checks for overlong encodings and overlarge code points in our WTF8 reader, which the new utility uses. Re-enable the spec tests that test UTF-8 validation.
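The checks named in this message (overlong encodings and overlarge code points) can be illustrated with a small standalone validator. This is a minimal sketch, not Binaryen's `isUTF8` or its WTF8 reader: the function name `isValidUTF8` is hypothetical, and the sketch assumes strict UTF-8, so it also rejects surrogate code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper for illustration; not Binaryen's actual `isUTF8`.
// Returns true if `s` is well-formed UTF-8.
bool isValidUTF8(std::string_view s) {
  std::size_t i = 0;
  while (i < s.size()) {
    auto lead = static_cast<std::uint8_t>(s[i]);
    std::size_t len = 0;   // total bytes in this sequence
    std::uint32_t cp = 0;  // decoded code point
    std::uint32_t min = 0; // smallest code point that needs `len` bytes
    if (lead < 0x80) {
      len = 1; cp = lead; min = 0;
    } else if ((lead & 0xE0) == 0xC0) {
      len = 2; cp = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3; cp = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4; cp = lead & 0x07; min = 0x10000;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (s.size() - i < len) {
      return false; // truncated sequence
    }
    for (std::size_t k = 1; k < len; ++k) {
      auto b = static_cast<std::uint8_t>(s[i + k]);
      if ((b & 0xC0) != 0x80) {
        return false; // not a continuation byte
      }
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min) {
      return false; // overlong encoding
    }
    if (cp > 0x10FFFF) {
      return false; // overlarge code point
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      return false; // surrogates are not valid in strict UTF-8
    }
    i += len;
  }
  return true;
}
```

For example, `isValidUTF8("\xC0\x80")` is false because the two-byte form encodes a code point that fits in one byte, and `isValidUTF8("\xF4\x90\x80\x80")` is false because it decodes to 0x110000.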
* [Parser][NFC] Clean up the lexer index/pos API (#6553) | Thomas Lively | 2024-04-29 | 1 | -10/+8
  The lexer previously had both `getPos` and `getIndex` APIs that did different things, but after a recent refactoring there is no difference between the index and the position. Deduplicate the API surface.
* [Parser] Do not eagerly lex numbers (#6544) | Thomas Lively | 2024-04-25 | 1 | -104/+9
  Lex integers and floats on demand to avoid wasted work. Remove `Token` completely now that all kinds of tokens are lexed on demand.
* [Parser] Do not eagerly lex strings (#6543) | Thomas Lively | 2024-04-25 | 1 | -27/+9
  Lex them on demand instead to avoid wasted work.
* [Parser] Do not eagerly lex IDs (#6542) | Thomas Lively | 2024-04-25 | 1 | -23/+2
  Lex them on demand instead to avoid wasted work.
* [Parser] Do not eagerly lex keywords (#6541) | Thomas Lively | 2024-04-25 | 1 | -79/+5
  Lex them on demand instead to avoid wasted work.
* [Parser] Do not eagerly lex parens (#6540) | Thomas Lively | 2024-04-25 | 1 | -38/+9
  The lexer currently lexes tokens eagerly and stores them in a `Token` variant ahead of when they are actually requested by the parser. It is wasteful, however, to classify tokens before they are requested by the parser because it is likely that the next token will be precisely the kind the parser requests. The work of checking and rejecting other possible classifications ahead of time is not useful.
  To make incremental progress toward removing `Token` completely, lex parentheses on demand instead of eagerly.
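The on-demand approach described in this message can be sketched with a toy lexer in which each `take*` method checks only for the one kind of token the caller asks for, instead of classifying the next token up front. This is an illustration under stated assumptions, not Binaryen's `Lexer`: the class name `LazyLexer` and its methods are hypothetical, and a keyword is simplified here to any run of characters up to whitespace or a parenthesis.

```cpp
#include <cstddef>
#include <string_view>

// Illustrative only: a toy on-demand lexer, not Binaryen's `Lexer` class.
class LazyLexer {
  std::string_view buffer;
  std::size_t pos = 0;

  static bool isSpace(char c) {
    return c == ' ' || c == '\n' || c == '\t' || c == '\r';
  }
  static bool isDelimiter(char c) {
    return isSpace(c) || c == '(' || c == ')';
  }
  void skipSpace() {
    while (pos < buffer.size() && isSpace(buffer[pos])) {
      ++pos;
    }
  }
  // Consume `expected` if it is the next non-space character.
  bool takeChar(char expected) {
    skipSpace();
    if (pos < buffer.size() && buffer[pos] == expected) {
      ++pos;
      return true;
    }
    return false;
  }

public:
  explicit LazyLexer(std::string_view in) : buffer(in) {}

  // Each method tests only for the requested token kind; nothing is
  // classified or stored ahead of time.
  bool takeLParen() { return takeChar('('); }
  bool takeRParen() { return takeChar(')'); }

  // Consume the next word only if it is exactly `expected`.
  bool takeKeyword(std::string_view expected) {
    skipSpace();
    std::size_t end = pos;
    while (end < buffer.size() && !isDelimiter(buffer[end])) {
      ++end;
    }
    if (buffer.substr(pos, end - pos) != expected) {
      return false;
    }
    pos = end;
    return true;
  }

  bool empty() {
    skipSpace();
    return pos == buffer.size();
  }
};
```

Given the input `(module (func))`, a parser would call `takeLParen()`, then `takeKeyword("module")`, and so on. A failed request such as `takeRParen()` at the start leaves the position unchanged, so the parser can try another token kind.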
* [Parser] Use the new parser in wasm-shell and wasm-as (#6529) | Thomas Lively | 2024-04-24 | 1 | -1/+2
  Updating just one or the other of these tools would cause the tests spec/import-after-*.fail.wast to fail, since only the updated tool would correctly fail to parse its contents. To avoid this, update both tools at once. (The tests erroneously pass before this change because check.py does not ensure that .fail.wast tests fail, only that failing tests end in .fail.wast.)
  In wasm-shell, to minimize the diff, only use the new parser to parse modules and instructions. Continue using the legacy parsing based on s-expressions for the other wast commands. Updating the parsing of the other commands to use `Lexer` instead of `SExpressionParser` is left as future work. The boundary between the two parsing styles is somewhat hacky, but it is worth it to enable incremental development.
  Update the tests to fix incorrect wast rejected by the new parser. Many of the spec/old_* tests use non-standard forms from before Wasm MVP was standardized, so fixing them would have been onerous. All of these tests have non-old_* variants, so simply delete them.
* [Parser] Parse annotations, including source map comments (#6345) | Thomas Lively | 2024-02-26 | 1 | -8/+20
  Parse annotations using the standards-track `(@annotation ...)` format as well as the `;;@ source-map:0:1` format. Have the lexer implicitly collect annotations while it skips whitespace and add lexer APIs to access the annotations since the last token was parsed. Collect annotations before parsing each instruction and pass the annotations explicitly to the parser and parser context functions for instructions.
  Add an API to `IRBuilder` to set a debug location to be attached to the next visited or created instruction and use it from the parser.
* [Parser][NFC] Remove `Token` from lexer interface (#6333) | Thomas Lively | 2024-02-22 | 1 | -34/+38
  Replace the general `peek` method that returned a `Token` with specific peek methods that look for (but do not consume) specific kinds of tokens. This change is a prerequisite for simplifying the lexer implementation by removing `Token` entirely.
* [Parser][NFC] Remove parser/input.h (#6332) | Thomas Lively | 2024-02-22 | 1 | -0/+9
  Remove the layer of abstraction sitting between the parser and the lexer now that the lexer has an interface the parser can use directly.
* [Parser] Simplify the lexer interface (#6319) | Thomas Lively | 2024-02-20 | 1 | -32/+214
  The lexer was previously an iterator over tokens, but that expressivity is not actually used in the parser. Instead, we have `input.h` that adapts the token iterator interface into an interface that is actually useful. As a first step toward simplifying the lexer implementation to no longer be an iterator over tokens, update its interface by moving the adaptation from input.h to the lexer itself. This requires extensive changes to the lexer unit tests, which will not have to change further when we actually simplify the lexer implementation.
* [Parser] Support string-style identifiers (#6278) | Thomas Lively | 2024-02-06 | 1 | -8/+8
  In addition to normal identifiers, support parsing identifiers of the format `$"..."`. This format is not yet allowed by the standard, but it is a popular proposed extension (see https://github.com/WebAssembly/spec/issues/617 and https://github.com/WebAssembly/annotations/issues/21). Binaryen has historically allowed a similar format and has supported arbitrary non-standard identifier characters, so it's much easier to support this extended syntax than to fix everything to use the restricted standard syntax.
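The two identifier forms can be told apart by the character that follows the `$`. The sketch below is a simplified illustration, not Binaryen's lexer: `lexIdent` and `isIdChar` are hypothetical names, the plain-identifier character set follows the WebAssembly text format's `idchar` rule, and escape sequences inside the quoted form are deliberately not handled.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Characters allowed in a plain `$name` identifier (the text format's
// `idchar` set). Hypothetical helper for illustration.
bool isIdChar(char c) {
  if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
      (c >= 'A' && c <= 'Z')) {
    return true;
  }
  return std::string_view("!#$%&'*+-./:<=>?@\\^_`|~").find(c) !=
         std::string_view::npos;
}

// Lex an identifier at the start of `in` and return its name without the
// leading `$` (and without the quotes for the string-style form).
// Assumption: no escape sequences appear inside the quotes.
std::optional<std::string_view> lexIdent(std::string_view in) {
  if (in.size() < 2 || in[0] != '$') {
    return std::nullopt;
  }
  if (in[1] == '"') {
    // String-style form: $"..."
    std::size_t close = in.find('"', 2);
    if (close == std::string_view::npos) {
      return std::nullopt; // unterminated string
    }
    return in.substr(2, close - 2);
  }
  // Plain form: `$` followed by one or more idchars.
  std::size_t end = 1;
  while (end < in.size() && isIdChar(in[end])) {
    ++end;
  }
  if (end == 1) {
    return std::nullopt;
  }
  return in.substr(1, end - 1);
}
```

With this sketch, `lexIdent("$foo")` yields `foo`, while `lexIdent("$\"hello world\"")` yields `hello world`, a name that the plain form could not express because it contains a space.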
* [Parser] Templatize lexing of integers (#6272) | Thomas Lively | 2024-02-05 | 1 | -6/+4
  Have a single implementation for lexing each of unsigned, signed, and uninterpreted integers, each generic over the bit width of the integer. This reduces duplication in the existing code and it will make it much easier to support lexing more 8- and 16-bit integers.
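As a rough illustration of what "generic over the bit width" can look like, the sketch below parses an unsigned decimal integer into any unsigned type and rejects values that do not fit. It covers only the unsigned case; `parseUnsigned` is a hypothetical name rather than the function in lexer.h, and the hex digits and `_` separators permitted by the text format are left out.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>
#include <type_traits>

// Hypothetical sketch: one implementation shared by every bit width.
// Parses decimal digits only and returns nullopt on overflow or bad input.
template<typename T>
std::optional<T> parseUnsigned(std::string_view digits) {
  static_assert(std::is_unsigned_v<T>, "T must be an unsigned integer type");
  if (digits.empty()) {
    return std::nullopt;
  }
  constexpr std::uint64_t max = std::numeric_limits<T>::max();
  std::uint64_t value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') {
      return std::nullopt;
    }
    std::uint64_t d = static_cast<std::uint64_t>(c - '0');
    // value * 10 + d <= max  is equivalent to  value <= (max - d) / 10,
    // which avoids overflowing the accumulator.
    if (value > (max - d) / 10) {
      return std::nullopt;
    }
    value = value * 10 + d;
  }
  return static_cast<T>(value);
}
```

The same template then serves `parseUnsigned<std::uint8_t>("255")`, `parseUnsigned<std::uint16_t>("65535")`, and wider types, with the range check derived from `T` instead of being written once per width.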
* [NFC] Split the new wat parser into multiple files (#5960) | Thomas Lively | 2023-09-19 | 1 | -0/+227
  And put the new files in a new source directory, "parser". This is a rough split and is not yet expected to dramatically improve compile times. The exact organization of the new files is subject to change, but this splitting should be enough to make further parser development more pleasant.