Diffstat (limited to 'doc/lispref/parsing.texi')
-rw-r--r-- | doc/lispref/parsing.texi | 1515 |
1 files changed, 1515 insertions, 0 deletions
diff --git a/doc/lispref/parsing.texi b/doc/lispref/parsing.texi
new file mode 100644
index 00000000000..3784531fe59
--- /dev/null
+++ b/doc/lispref/parsing.texi
@@ -0,0 +1,1515 @@
+@c -*- mode: texinfo; coding: utf-8 -*-
+@c This is part of the GNU Emacs Lisp Reference Manual.
+@c Copyright (C) 2021 Free Software Foundation, Inc.
+@c See the file elisp.texi for copying conditions.
+@node Parsing Program Source
+@chapter Parsing Program Source
+
+Emacs provides various ways to parse program source text and produce a
+@dfn{syntax tree}.  In a syntax tree, text is no longer a
+one-dimensional stream but a structured tree of nodes, where each node
+represents a piece of text.  Thus a syntax tree can enable
+interesting features like precise fontification, indentation,
+navigation, structured editing, etc.
+
+Emacs has a simple facility for parsing balanced expressions
+(@pxref{Parsing Expressions}).  There is also the SMIE library for
+generic navigation and indentation (@pxref{SMIE}).
+
+Emacs also provides integration with the tree-sitter library
+(@uref{https://tree-sitter.github.io/tree-sitter}) if compiled with
+it.  The tree-sitter library implements an incremental parser and
+supports a wide range of programming languages.
+
+@defun treesit-available-p
+This function returns non-nil if tree-sitter features are available
+for this Emacs instance.
+@end defun
+
+For tree-sitter integration with existing Emacs features, see
+@ref{Parser-based Font Lock}, @ref{Parser-based Indentation}, and
+@ref{List Motion}.
+
+To access the syntax tree of the text in a buffer, we need to first
+load a language definition and create a parser with it.  Next, we can
+query the parser for specific nodes in the syntax tree.  Then, we can
+access various information about the node, and we can pattern-match a
+node with a powerful syntax.  Finally, we explain how to work with
+source files that mix multiple languages.  The following sections
+explain how to do each of the tasks in detail.
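+
+As a rough sketch of that workflow, the following uses only functions
+documented in this chapter.  It assumes that Emacs was built with
+tree-sitter and that a C language definition is installed; the C
+snippet is made up for illustration.
+
+@example
+@group
+(when (and (treesit-available-p)
+           (treesit-language-available-p 'c))
+  (with-temp-buffer
+    (insert "int a = 1 + 2;")
+    ;; Create a C parser for the temporary buffer, then ask it
+    ;; for the root node of the syntax tree.
+    (let* ((parser (treesit-parser-create 'c))
+           (root (treesit-parser-root-node parser)))
+      ;; Return the type of the root node as a string.
+      (treesit-node-type root))))
+@end group
+@end example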
+
+@menu
+* Language Definitions::     Loading tree-sitter language definitions.
+* Using Parser::             Introduction to parsers.
+* Retrieving Node::          Retrieving node from syntax tree.
+* Accessing Node::           Accessing node information.
+* Pattern Matching::         Pattern matching with query patterns.
+* Multiple Languages::       Parse text written in multiple languages.
+* Tree-sitter C API::        Compare the C API and the ELisp API.
+@end menu
+
+@node Language Definitions
+@section Tree-sitter Language Definitions
+
+@heading Loading a language definition
+
+Tree-sitter relies on language definitions to parse text in a given
+language.  In Emacs, a language definition is represented by a symbol.
+For example, the C language definition is represented as @code{c}, and
+@code{c} can be passed to tree-sitter functions as the @var{language}
+argument.
+
+@vindex treesit-extra-load-path
+@vindex treesit-load-language-error
+@vindex treesit-load-suffixes
+Tree-sitter language definitions are distributed as dynamic libraries.
+In order to use a language definition in Emacs, you need to make sure
+that the dynamic library is installed on the system.  Emacs looks for
+language definitions under load paths in
+@code{treesit-extra-load-path}, @code{user-emacs-directory}/tree-sitter,
+and system default locations for dynamic libraries, in that order.
+Emacs tries each extension in @code{treesit-load-suffixes}.  If Emacs
+cannot find the library or has a problem loading it, Emacs signals
+@code{treesit-load-language-error}.  The signal data is a list of
+specific error messages.
+
+@defun treesit-language-available-p language
+This function checks whether the dynamic library for @var{language} is
+present on the system, and returns non-nil if it is.
+@end defun
+
+@vindex treesit-load-name-override-list
+By convention, the dynamic library for @var{language} is
+@code{libtree-sitter-@var{language}.@var{ext}}, where @var{ext} is the
+system-specific extension for dynamic libraries.
+Also by convention,
+the function provided by that library is named
+@code{tree_sitter_@var{language}}.  If a language definition doesn't
+follow this convention, you should add an entry
+
+@example
+(@var{language} @var{library-base-name} @var{function-name})
+@end example
+
+to @code{treesit-load-name-override-list}, where
+@var{library-base-name} is the base filename for the dynamic library
+(conventionally @code{libtree-sitter-@var{language}}), and
+@var{function-name} is the function provided by the library
+(conventionally @code{tree_sitter_@var{language}}).  For example,
+
+@example
+(cool-lang "libtree-sitter-coool" "tree_sitter_cooool")
+@end example
+
+for a language too cool to abide by conventions.
+
+@defun treesit-language-version &optional min-compatible
+The tree-sitter library has a @dfn{language version}; a language
+definition's version needs to match this version to be compatible.
+
+This function returns the tree-sitter library's language version.  If
+@var{min-compatible} is non-nil, it returns the minimal compatible
+version.
+@end defun
+
+@heading Concrete syntax tree
+
+A syntax tree is what a parser generates.  In a syntax tree, each node
+represents a piece of text, and nodes are connected to each other by
+parent-child relationships.
+For example, if the source text is
+
+@example
+1 + 2
+@end example
+
+@noindent
+its syntax tree could be
+
+@example
+@group
+                  +--------------+
+                  | root "1 + 2" |
+                  +--------------+
+                         |
+        +--------------------------------+
+        |       expression "1 + 2"       |
+        +--------------------------------+
+           |             |            |
++------------+   +--------------+   +------------+
+| number "1" |   | operator "+" |   | number "2" |
++------------+   +--------------+   +------------+
+@end group
+@end example
+
+We can also represent it as an s-expression:
+
+@example
+(root (expression (number) (operator) (number)))
+@end example
+
+@subheading Node types
+
+@cindex tree-sitter node type
+@anchor{tree-sitter node type}
+@cindex tree-sitter named node
+@anchor{tree-sitter named node}
+@cindex tree-sitter anonymous node
+Names like @code{root}, @code{expression}, @code{number}, and
+@code{operator} are nodes' @dfn{types}.  However, not all nodes in a
+syntax tree have a type.  Nodes that don't are @dfn{anonymous nodes},
+and nodes with a type are @dfn{named nodes}.  Anonymous nodes are
+tokens with fixed spellings, including punctuation characters like
+bracket @samp{]}, and keywords like @code{return}.
+
+@subheading Field names
+
+@cindex tree-sitter node field name
+@anchor{tree-sitter node field name} To make the syntax tree easier to
+analyze, many language definitions assign @dfn{field names} to child
+nodes.  For example, a @code{function_definition} node could have a
+@code{declarator} and a @code{body}:
+
+@example
+@group
+(function_definition
+ declarator: (declaration)
+ body: (compound_statement))
+@end group
+@end example
+
+@deffn Command treesit-inspect-mode
+This minor mode displays the node that @emph{starts} at point in the
+mode line.  The mode line will display
+
+@example
+@var{parent} @var{field-name}: (@var{child} (@var{grand-child} (...)))
+@end example
+
+@var{child}, @var{grand-child}, and @var{grand-grand-child}, etc., are
+nodes that have their beginning at point.
+@var{parent} is the parent of @var{child}.
+
+If there is no node that starts at point, i.e., point is in the middle
+of a node, then the mode line only displays the smallest node that
+spans point, and its immediate parent.
+
+This minor mode doesn't create parsers on its own.  It simply uses the
+first parser in @code{(treesit-parser-list)} (@pxref{Using Parser}).
+@end deffn
+
+@heading Reading the grammar definition
+
+Authors of language definitions define the @dfn{grammar} of a
+language, and this grammar determines how a parser constructs a
+concrete syntax tree out of the text.  In order to use the syntax
+tree effectively, we need to read the @dfn{grammar file}.
+
+The grammar file is usually @code{grammar.js} in a language
+definition's project repository.  The link to a language definition's
+home page can be found on tree-sitter's homepage
+(@uref{https://tree-sitter.github.io/tree-sitter}).
+
+The grammar is written in JavaScript syntax.  For example, the rule
+matching a @code{function_definition} node looks like
+
+@example
+@group
+function_definition: $ => seq(
+  $.declaration_specifiers,
+  field('declarator', $.declaration),
+  field('body', $.compound_statement)
+)
+@end group
+@end example
+
+The rule is represented by a function that takes a single argument
+@var{$}, representing the whole grammar.  The function itself is
+constructed by other functions: the @code{seq} function puts together a
+sequence of children; the @code{field} function annotates a child with
+a field name.
+If we write the above definition in BNF syntax, it
+would look like
+
+@example
+@group
+function_definition :=
+  <declaration_specifiers> <declaration> <compound_statement>
+@end group
+@end example
+
+@noindent
+and the node returned by the parser would look like
+
+@example
+@group
+(function_definition
+  (declaration_specifiers)
+  declarator: (declaration)
+  body: (compound_statement))
+@end group
+@end example
+
+Below is a list of functions that one will see in a grammar
+definition.  Each function takes other rules as arguments and returns
+a new rule.
+
+@itemize @bullet
+@item
+@code{seq(rule1, rule2, ...)} matches each rule one after another.
+
+@item
+@code{choice(rule1, rule2, ...)} matches one of the rules in its
+arguments.
+
+@item
+@code{repeat(rule)} matches @var{rule} @emph{zero or more} times.
+This is like the @samp{*} operator in regular expressions.
+
+@item
+@code{repeat1(rule)} matches @var{rule} @emph{one or more} times.
+This is like the @samp{+} operator in regular expressions.
+
+@item
+@code{optional(rule)} matches @var{rule} @emph{zero or one} time.
+This is like the @samp{?} operator in regular expressions.
+
+@item
+@code{field(name, rule)} assigns field name @var{name} to the child
+node matched by @var{rule}.
+
+@item
+@code{alias(rule, alias)} makes nodes matched by @var{rule} appear as
+@var{alias} in the syntax tree generated by the parser.  For example,
+
+@example
+alias(preprocessor_call_exp, call_expression)
+@end example
+
+makes any node matched by @code{preprocessor_call_exp} appear as
+@code{call_expression}.
+@end itemize
+
+Below are grammar functions less interesting for a reader of a
+language definition.
+
+@itemize
+@item
+@code{token(rule)} marks @var{rule} to produce a single leaf node.
+That is, instead of generating a parent node with individual child
+nodes under it, everything is combined into a single leaf node.
+
+@item
+Normally, grammar rules ignore preceding whitespace;
+@code{token.immediate(rule)} changes @var{rule} to match only when
+there is no preceding whitespace.
+
+@item
+@code{prec(n, rule)} gives @var{rule} a level @var{n} precedence.
+
+@item
+@code{prec.left([n,] rule)} marks @var{rule} as left-associative,
+optionally with level @var{n}.
+
+@item
+@code{prec.right([n,] rule)} marks @var{rule} as right-associative,
+optionally with level @var{n}.
+
+@item
+@code{prec.dynamic(n, rule)} is like @code{prec}, but the precedence
+is applied at runtime instead.
+@end itemize
+
+The tree-sitter project talks about writing a grammar in more detail:
+@uref{https://tree-sitter.github.io/tree-sitter/creating-parsers}.
+Read especially ``The Grammar DSL'' section.
+
+@node Using Parser
+@section Using Tree-sitter Parser
+@cindex Tree-sitter parser
+
+This section describes how to create and configure a tree-sitter
+parser.  In Emacs, each tree-sitter parser is associated with a
+buffer.  As we edit the buffer, the associated parser and the syntax
+tree are automatically kept up-to-date.
+
+@defvar treesit-max-buffer-size
+This variable contains the maximum size of buffers in which
+tree-sitter can be activated.  Major modes should check this value
+when deciding whether to enable tree-sitter features.
+@end defvar
+
+@defun treesit-can-enable-p
+This function checks whether the current buffer is suitable for
+activating tree-sitter features.  It basically checks
+@code{treesit-available-p} and @code{treesit-max-buffer-size}.
+@end defun
+
+@cindex Creating tree-sitter parsers
+@defun treesit-parser-create language &optional buffer no-reuse
+To create a parser, we provide a @var{buffer} and the @var{language}
+to use (@pxref{Language Definitions}).  If @var{buffer} is nil, the
+current buffer is used.
+
+By default, this function reuses a parser if one already exists for
+@var{language} in @var{buffer}; if @var{no-reuse} is non-nil, this
+function always creates a new parser.
+@end defun
+
+Given a parser, we can query information about it:
+
+@defun treesit-parser-buffer parser
+Returns the buffer associated with @var{parser}.
+@end defun
+
+@defun treesit-parser-language parser
+Returns the language that @var{parser} uses.
+@end defun
+
+@defun treesit-parser-p object
+Checks if @var{object} is a tree-sitter parser.  Returns non-nil if it
+is, and nil otherwise.
+@end defun
+
+There is no need to explicitly parse a buffer, because parsing is done
+automatically and lazily.  A parser only parses when we query for a
+node in its syntax tree.  Therefore, when a parser is first created,
+it doesn't parse the buffer; it waits until we query for a node for
+the first time.  Similarly, when some change is made in the buffer, a
+parser doesn't re-parse immediately.
+
+@vindex treesit-buffer-too-large
+When a parser does parse, it checks the size of the buffer.
+Tree-sitter can only handle buffers no larger than about 4GB.  If the
+size exceeds that, Emacs signals @code{treesit-buffer-too-large}
+with signal data being the buffer size.
+
+Once a parser is created, Emacs automatically adds it to the
+internal parser list.  Every time a change is made to the buffer,
+Emacs updates parsers in this list so they can update their syntax
+tree incrementally.
+
+@defun treesit-parser-list &optional buffer
+This function returns the parser list of @var{buffer}.
+@var{buffer} defaults to the current buffer.
+@end defun
+
+@defun treesit-parser-delete parser
+This function deletes @var{parser}.
+@end defun
+
+@cindex tree-sitter narrowing
+@anchor{tree-sitter narrowing} Normally, a parser ``sees'' the whole
+buffer, but when the buffer is narrowed (@pxref{Narrowing}), the
+parser will only see the visible region.  As far as the parser can
+tell, the hidden region has been deleted.
+When the buffer is later
+widened, the parser thinks text is inserted at the beginning and at
+the end.  Although parsers respect narrowing, narrowing shouldn't be
+the means to handle a multi-language buffer; instead, set the ranges in
+which a parser should operate.  @xref{Multiple Languages}.
+
+Because a parser parses lazily, when we narrow the buffer, the parser
+is not affected immediately; as long as we don't query for a node
+while the buffer is narrowed, the parser is oblivious of the
+narrowing.
+
+@cindex tree-sitter parse string
+@defun treesit-parse-string string language
+Besides creating a parser for a buffer, we can also just parse a
+string.  Unlike a buffer, parsing a string is a one-time deal, and
+there is no way to update the result.
+
+This function parses @var{string} with @var{language}, and returns the
+root node of the generated syntax tree.
+@end defun
+
+@node Retrieving Node
+@section Retrieving Node
+
+@cindex tree-sitter find node
+@cindex tree-sitter get node
+Before we continue, let's go over some conventions of tree-sitter
+functions.
+
+We talk about a node being ``smaller'' or ``larger'', and ``lower'' or
+``higher''.  A smaller and lower node is lower in the syntax tree and
+therefore spans a smaller piece of text; a larger and higher node is
+higher up in the syntax tree, containing many smaller nodes as its
+children, and therefore spans a larger piece of text.
+
+When a function cannot find a node, it returns nil.  For the
+convenience of function chaining, all the functions that take a node
+as argument and return a node also accept nil as the node; in that
+case, the function just returns nil.
+
+@vindex treesit-node-outdated
+Nodes are not automatically updated when the associated buffer is
+modified, and there is no way to update a node once it is retrieved.
+Using an outdated node signals a @code{treesit-node-outdated} error.
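+
+The two conventions above can be sketched as follows.  Here
+@code{node} is a hypothetical variable holding a previously retrieved
+node (or nil), and the sketch assumes that
+@code{treesit-node-outdated} can be caught with
+@code{condition-case} like any other error condition.
+
+@example
+@group
+(condition-case nil
+    ;; Chaining is safe: if an inner call returns nil, the
+    ;; outer call returns nil as well.
+    (let ((grandparent
+           (treesit-node-parent (treesit-node-parent node))))
+      (and grandparent (treesit-node-type grandparent)))
+  ;; Signaled if the buffer changed after NODE was retrieved.
+  (treesit-node-outdated nil))
+@end group
+@end example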
+
+@heading Retrieving node from syntax tree
+
+@defun treesit-node-at point &optional parser-or-lang named
+This function returns the @emph{smallest} node that starts at or after
+@var{point}.  In other words, the start of the node is greater than or
+equal to @var{point}.
+
+When @var{parser-or-lang} is nil, this function uses the first parser
+in @code{(treesit-parser-list)} in the current buffer.  If
+@var{parser-or-lang} is a parser object, it uses that parser; if
+@var{parser-or-lang} is a language, it finds the first parser using
+that language in @code{(treesit-parser-list)} and uses that.
+
+If @var{named} is non-nil, this function looks for a named node
+only (@pxref{tree-sitter named node, named node}).
+
+Example:
+@example
+@group
+;; Find the node at point in a C parser's syntax tree.
+(treesit-node-at (point) 'c)
+    @c @result{} #<treesit-node from 1 to 4 in *scratch*>
+@end group
+@end example
+@end defun
+
+@defun treesit-node-on beg end &optional parser-or-lang named
+This function returns the @emph{smallest} node that covers the span
+from @var{beg} to @var{end}.  In other words, the start of the node is
+less than or equal to @var{beg}, and the end of the node is greater
+than or equal to @var{end}.
+
+@emph{Beware} that calling this function on an empty line that is not
+inside any top-level construct (function definition, etc.) most
+probably will give you the root node, because the root node is the
+smallest node that covers that empty line.  Most of the time, you want
+to use @code{treesit-node-at}.
+
+When @var{parser-or-lang} is nil, this function uses the first parser
+in @code{(treesit-parser-list)} in the current buffer.  If
+@var{parser-or-lang} is a parser object, it uses that parser; if
+@var{parser-or-lang} is a language, it finds the first parser using
+that language in @code{(treesit-parser-list)} and uses that.
+
+If @var{named} is non-nil, this function looks for a named node only
+(@pxref{tree-sitter named node, named node}).
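+
+A possible use, sketched under the assumption that the current buffer
+already has a C parser and that the region is active:
+
+@example
+@group
+;; Find the smallest named node covering the active region.
+(when (use-region-p)
+  (treesit-node-on (region-beginning) (region-end) 'c t))
+@end group
+@end example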
+@end defun
+
+@defun treesit-parser-root-node parser
+This function returns the root node of the syntax tree generated by
+@var{parser}.
+@end defun
+
+@defun treesit-buffer-root-node &optional language
+This function finds the first parser that uses @var{language} in
+@code{(treesit-parser-list)} in the current buffer, and returns the
+root node of that parser's syntax tree.  If it cannot find an
+appropriate parser, nil is returned.
+@end defun
+
+Once we have a node, we can retrieve other nodes from it, or query for
+information about this node.
+
+@heading Retrieving node from other nodes
+
+@subheading By kinship
+
+@defun treesit-node-parent node
+This function returns the immediate parent of @var{node}.
+@end defun
+
+@defun treesit-node-child node n &optional named
+This function returns the @var{n}'th child of @var{node}.  If
+@var{named} is non-nil, then it only counts named nodes
+(@pxref{tree-sitter named node, named node}).  For example, in a node
+that represents a string: @code{"text"}, there are three child
+nodes: the opening quote @code{"}, the string content @code{text}, and
+the closing quote @code{"}.  Among these nodes, the first child is
+the opening quote @code{"}, and the first named child is the string
+content @code{text}.
+@end defun
+
+@defun treesit-node-children node &optional named
+This function returns all of @var{node}'s children in a list.  If
+@var{named} is non-nil, then it only retrieves named nodes.
+@end defun
+
+@defun treesit-next-sibling node &optional named
+This function finds the next sibling of @var{node}.  If @var{named} is
+non-nil, it finds the next named sibling.
+@end defun
+
+@defun treesit-prev-sibling node &optional named
+This function finds the previous sibling of @var{node}.  If
+@var{named} is non-nil, it finds the previous named sibling.
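+
+The kinship functions can be combined as in the sketch below.
+@code{node} is a hypothetical variable holding a node, and the sketch
+assumes that children are indexed from zero, which this section does
+not state explicitly.
+
+@example
+@group
+(let* ((first (treesit-node-child node 0 t))
+       (second (treesit-next-sibling first t)))
+  ;; Collect the first two named children and their parent.
+  ;; Missing nodes simply show up as nil.
+  (list first second (treesit-node-parent first)))
+@end group
+@end example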
+@end defun
+
+@subheading By field name
+
+To make the syntax tree easier to analyze, many language definitions
+assign @dfn{field names} to child nodes (@pxref{tree-sitter node field
+name, field name}).  For example, a @code{function_definition} node
+could have a @code{declarator} and a @code{body}.
+
+@defun treesit-child-by-field-name node field-name
+This function finds the child of @var{node} that has @var{field-name}
+as its field name.
+
+@example
+@group
+;; Get the child that has "body" as its field name.
+(treesit-child-by-field-name node "body")
+    @c @result{} #<treesit-node from 3 to 11 in *scratch*>
+@end group
+@end example
+@end defun
+
+@subheading By position
+
+@defun treesit-first-child-for-pos node pos &optional named
+This function finds the first child of @var{node} that extends beyond
+@var{pos}.  ``Extends beyond'' means the end of the child node is
+greater than or equal to @var{pos}.  This function only looks for
+immediate children of @var{node}, and doesn't look in its
+grandchildren.  If @var{named} is non-nil, it only looks for named
+children (@pxref{tree-sitter named node, named node}).
+@end defun
+
+@defun treesit-node-descendant-for-range node beg end &optional named
+This function finds the @emph{smallest} descendant of
+@var{node} that spans the range from @var{beg} to @var{end}.  It is
+similar to @code{treesit-node-at}.  If @var{named} is non-nil, it only
+looks for named descendants.
+@end defun
+
+@heading Searching for node
+
+@defun treesit-search-subtree node predicate &optional all backward limit
+This function traverses the subtree of @var{node} (including
+@var{node}), and matches @var{predicate} against each node along the
+way.  @var{predicate} is a regexp that matches (case-insensitively)
+against each node's type, or a function that takes a node and returns
+nil/non-nil.  If a node matches, that node is returned; if no node
+ever matches, nil is returned.
+
+By default, this function only traverses named nodes; if @var{all} is
+non-nil, it traverses all nodes.  If @var{backward} is non-nil, it
+traverses backwards.  If @var{limit} is non-nil, it only traverses
+that number of levels down in the tree.
+@end defun
+
+@defun treesit-search-forward start predicate &optional all backward up
+This function is somewhat similar to @code{treesit-search-subtree}.
+It also traverses the parse tree and matches each node with
+@var{predicate} (except for @var{start}), where @var{predicate} can be
+a (case-insensitive) regexp or a function.  For a tree like the one
+below, where @var{start} is marked 1, this function traverses as
+numbered:
+
+@example
+@group
+              o
+              |
+     3--------4-----------8
+     |        |           |
+o--o-+--1  5--+--6    9---+-----12
+|  |    |        |    |         |
+o  o    2        7  +-+-+    +--+--+
+                    |   |    |  |  |
+                    10  11   13 14 15
+@end group
+@end example
+
+As in @code{treesit-search-subtree}, this function only searches
+for named nodes by default.  But if @var{all} is non-nil, it searches
+for all nodes.  If @var{backward} is non-nil, it searches backwards.
+
+If @var{up} is non-nil, this function will only traverse to siblings
+and parents.  In that case, only 1 3 4 8 would be traversed.
+@end defun
+
+@defun treesit-search-forward-goto predicate side &optional all backward up
+This function jumps to the start or end of the next node in the buffer
+that matches @var{predicate}.  Parameters @var{predicate}, @var{all},
+@var{backward}, and @var{up} are the same as in
+@code{treesit-search-forward}.  @var{side} controls which side of
+the matched node to stop at; it can be @code{start} or @code{end}.
+@end defun
+
+@defun treesit-induce-sparse-tree root predicate &optional process-fn limit
+This function creates a sparse tree from @var{root}'s subtree.
+
+Basically, it takes the subtree under @var{root}, and combs it so only
+the nodes that match @var{predicate} are left, like picking out grapes
+on the vine.
+Like previous functions, @var{predicate} can be a regexp
+string that matches against each node's type case-insensitively, or a
+function that takes a node and returns nil/non-nil.
+
+For example, for a subtree on the left that consists of both numbers
+and letters, if @var{predicate} is ``letter only'', the returned tree
+is the one on the right.
+
+@example
+@group
+    a                 a                   a
+    |                 |                   |
++---+---+         +---+---+           +---+---+
+|   |   |         |   |   |           |   |   |
+b   1   2         b   |   |           b   c   d
+    |   |     =>      |   |      =>       |
+    c   +--+          c   +               e
+    |   |  |          |   |
+ +--+   d  4       +--+   d
+ |  |              |
+ e  5              e
+@end group
+@end example
+
+If @var{process-fn} is non-nil, instead of returning the matched
+nodes, this function passes each node to @var{process-fn} and uses the
+returned value instead.  If non-nil, @var{limit} is the number of
+levels to go down from @var{root}.
+
+Each node in the returned tree looks like @code{(@var{tree-sitter
+node} . (@var{child} ...))}.  The @var{tree-sitter node} of the root
+of this tree will be nil if @var{root} doesn't match @var{predicate}.
+If no node matches @var{predicate}, this function returns nil.
+@end defun
+
+@heading More convenient functions
+
+@defun treesit-filter-child node pred &optional named
+This function finds immediate children of @var{node} that satisfy
+@var{pred}.
+
+Function @var{pred} takes the child node as the argument and should
+return non-nil to indicate keeping the child.  If @var{named} is
+non-nil, this function only searches for named nodes.
+@end defun
+
+@defun treesit-parent-until node pred
+This function repeatedly finds the parent of @var{node}, and returns
+the parent if it satisfies @var{pred} (which takes the parent as the
+argument).  If no parent satisfies @var{pred}, this function returns
+nil.
+@end defun
+
+@defun treesit-parent-while node pred
+This function repeatedly finds the parent of @var{node}, and keeps
+doing so as long as the parent satisfies @var{pred} (which takes the
+parent as the single argument).  I.e., this function returns the
+farthest parent that still satisfies @var{pred}.
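+
+As an illustration, the sketch below climbs out of nested expressions.
+@code{node} is a hypothetical variable holding a node, and the node
+types @code{binary_expression} and @code{function_definition} are
+assumed to exist in the language definition in use.
+
+@example
+@group
+;; The outermost parent in an unbroken chain of binary expressions.
+(treesit-parent-while
+ node
+ (lambda (parent)
+   (equal (treesit-node-type parent) "binary_expression")))
+
+;; For comparison, the nearest enclosing function definition.
+(treesit-parent-until
+ node
+ (lambda (parent)
+   (equal (treesit-node-type parent) "function_definition")))
+@end group
+@end example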
+@end defun
+
+@node Accessing Node
+@section Accessing Node Information
+
+Before going further, make sure you have read the basic conventions
+about tree-sitter nodes in the previous section.
+
+@heading Basic information
+
+Every node is associated with a parser, and that parser is associated
+with a buffer.  The following functions let you retrieve them.
+
+@defun treesit-node-parser node
+This function returns @var{node}'s associated parser.
+@end defun
+
+@defun treesit-node-buffer node
+This function returns @var{node}'s parser's associated buffer.
+@end defun
+
+@defun treesit-node-language node
+This function returns @var{node}'s parser's associated language.
+@end defun
+
+Each node represents a piece of text in the buffer.  The functions
+below find relevant information about that text.
+
+@defun treesit-node-start node
+Returns the start position of @var{node}.
+@end defun
+
+@defun treesit-node-end node
+Returns the end position of @var{node}.
+@end defun
+
+@defun treesit-node-text node &optional object
+Returns the buffer text that @var{node} represents.  (If @var{node} is
+retrieved from parsing a string, it will be text from that string.)
+@end defun
+
+Here are some basic checks on tree-sitter nodes.
+
+@defun treesit-node-p object
+Checks if @var{object} is a tree-sitter syntax node.
+@end defun
+
+@defun treesit-node-eq node1 node2
+Checks if @var{node1} and @var{node2} are the same node in a syntax
+tree.
+@end defun
+
+@heading Property information
+
+In general, nodes in a concrete syntax tree fall into two categories:
+@dfn{named nodes} and @dfn{anonymous nodes}.  Whether a node is named
+or anonymous is determined by the language definition
+(@pxref{tree-sitter named node, named node}).
+
+@cindex tree-sitter missing node
+Apart from being named/anonymous, a node can have other properties.
+A node can be ``missing'': missing nodes are inserted by the parser in
+order to recover from certain kinds of syntax errors, i.e., something
+should probably be there according to the grammar, but is not there.
+
+@cindex tree-sitter extra node
+A node can be ``extra'': extra nodes represent things like comments,
+which can appear anywhere in the text.
+
+@cindex tree-sitter node that has changes
+A node ``has changes'' if the buffer has changed since the node was
+retrieved, i.e., the node is outdated.
+
+@cindex tree-sitter node that has error
+A node ``has error'' if the text it spans contains a syntax error.
+Either the node itself has an error, or one of its descendants has an
+error.
+
+@defun treesit-node-check node property
+This function checks if @var{node} has @var{property}.  @var{property}
+can be @code{'named}, @code{'missing}, @code{'extra},
+@code{'has-changes}, or @code{'has-error}.
+@end defun
+
+
+@defun treesit-node-type node
+Named nodes have ``types'' (@pxref{tree-sitter node type, node type}).
+For example, a named node can be a @code{string_literal} node, where
+@code{string_literal} is its type.
+
+This function returns @var{node}'s type as a string.
+@end defun
+
+@heading Information as a child or parent
+
+@defun treesit-node-index node &optional named
+This function returns the index of @var{node} as a child node of its
+parent.  If @var{named} is non-nil, it only counts named nodes
+(@pxref{tree-sitter named node, named node}).
+@end defun
+
+@defun treesit-node-field-name node
+A child of a parent node could have a field name (@pxref{tree-sitter
+node field name, field name}).  This function returns the field name
+of @var{node} as a child of its parent.
+@end defun
+
+@defun treesit-node-field-name-for-child node n
+This function returns the field name of the @var{n}'th child of
+@var{node}.
+@end defun
+
+@defun treesit-child-count node &optional named
+This function finds the number of children of @var{node}.
+If @var{named} is non-nil, it only counts named children
+(@pxref{tree-sitter named node, named node}).
+@end defun
+
+@node Pattern Matching
+@section Pattern Matching Tree-sitter Nodes
+
+Tree-sitter lets us pattern match with a small declarative language.
+Pattern matching consists of two steps: first tree-sitter matches a
+@dfn{pattern} against nodes in the syntax tree, then it @dfn{captures}
+specific nodes in that pattern and returns the captured nodes.
+
+We describe first how to write the most basic query pattern and how to
+capture nodes in a pattern, then the pattern-match function, and
+finally more advanced pattern syntax.
+
+@heading Basic query syntax
+
+@cindex Tree-sitter query syntax
+@cindex Tree-sitter query pattern
+A @dfn{query} consists of multiple @dfn{patterns}.  Each pattern is an
+s-expression that matches a certain node in the syntax tree.  A
+pattern has the following shape:
+
+@example
+(@var{type} @var{child}...)
+@end example
+
+@noindent
+For example, a pattern that matches a @code{binary_expression} node that
+contains @code{number_literal} child nodes would look like
+
+@example
+(binary_expression (number_literal))
+@end example
+
+To @dfn{capture} a node in the query pattern above, append
+@code{@@capture-name} after the node pattern you want to capture.  For
+example,
+
+@example
+(binary_expression (number_literal) @@number-in-exp)
+@end example
+
+@noindent
+captures @code{number_literal} nodes that are inside a
+@code{binary_expression} node with capture name @code{number-in-exp}.
+
+We can capture the @code{binary_expression} node too, with capture
+name @code{biexp}:
+
+@example
+(binary_expression
+ (number_literal) @@number-in-exp) @@biexp
+@end example
+
+@heading Query function
+
+Now we can introduce the query functions.
+
+@defun treesit-query-capture node query &optional beg end node-only
+This function matches the patterns in @var{query} against @var{node}.
+Parameter @var{query} can be a string, an s-expression, or a
+compiled query object.  For now, we focus on the string syntax;
+s-expression syntax and compiled query are described at the end of the
+section.
+
+Parameter @var{node} can also be a parser or a language symbol.  A
+parser means using its root node; a language symbol means finding or
+creating a parser for that language in the current buffer, and using
+its root node.
+
+The function returns all captured nodes in a list of
+@code{(@var{capture_name} . @var{node})}.  If @var{node-only} is
+non-nil, a list of nodes is returned instead.  If @var{beg} and
+@var{end} are both non-nil, this function only pattern matches nodes
+in that range.
+
+@vindex treesit-query-error
+This function signals a @code{treesit-query-error} if @var{query} is
+malformed.  The signal data contains a description of the specific
+error.  You can use @code{treesit-query-validate} to debug the query.
+@end defun
+
+For example, suppose @var{node}'s content is @code{1 + 2}, and
+@var{query} is
+
+@example
+@group
+(setq query
+      "(binary_expression
+        (number_literal) @@number-in-exp) @@biexp")
+@end group
+@end example
+
+Running that query would return
+
+@example
+@group
+(treesit-query-capture node query)
+    @result{} ((biexp . @var{<node for "1 + 2">})
+       (number-in-exp . @var{<node for "1">})
+       (number-in-exp . @var{<node for "2">}))
+@end group
+@end example
+
+As we mentioned earlier, a @var{query} could contain multiple
+patterns.  For example, it could have two top-level patterns:
+
+@example
+@group
+(setq query
+      "(binary_expression) @@biexp
+       (number_literal)  @@number @@biexp")
+@end group
+@end example
+
+@defun treesit-query-string string query language
+This function parses @var{string} with @var{language}, pattern matches
+its root node with @var{query}, and returns the result.
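+
+For instance, the following sketch queries a made-up C snippet.  It
+assumes that a C language definition is installed and that its grammar
+uses the node types @code{binary_expression} and
+@code{number_literal}.
+
+@example
+@group
+;; Capture a binary expression and the number literals in it.
+(treesit-query-string
+ "int a = 1 + 2;"
+ "(binary_expression (number_literal) @@num) @@expr"
+ 'c)
+@end group
+@end example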
+@end defun
+
+@heading More query syntax
+
+Besides node types and captures, tree-sitter's query syntax can express
+anonymous nodes, field names, wildcards, quantification, grouping,
+alternation, anchors, and predicates.
+
+@subheading Anonymous node
+
+An anonymous node is written verbatim, surrounded by quotes. A
+pattern matching (and capturing) the keyword @code{return} would be
+
+@example
+"return" @@keyword
+@end example
+
+@subheading Wild card
+
+In a query pattern, @samp{(_)} matches any named node, and @samp{_}
+matches any node, named or anonymous. For example, to capture any
+named child of a @code{binary_expression} node, the pattern would be
+
+@example
+(binary_expression (_) @@in_biexp)
+@end example
+
+@subheading Field name
+
+We can capture child nodes that have specific field names:
+
+@example
+@group
+(function_definition
+  declarator: (_) @@func-declarator
+  body: (_) @@func-body)
+@end group
+@end example
+
+We can also capture a node that doesn't have a certain field, say, a
+@code{function_definition} without a @code{body} field:
+
+@example
+(function_definition !body) @@func-no-body
+@end example
+
+@subheading Quantify node
+
+Tree-sitter recognizes the quantification operators @samp{*}, @samp{+} and
+@samp{?}. Their meanings are the same as in regular expressions:
+@samp{*} matches the preceding pattern zero or more times, @samp{+}
+matches one or more times, and @samp{?} matches zero or one time.
+
+For example, this pattern matches @code{type_declaration} nodes
+that have @emph{zero or more} @code{long} keywords:
+
+@example
+(type_declaration "long"*) @@long-type
+@end example
+
+And this pattern matches a type declaration that has zero or one
+@code{long} keyword:
+
+@example
+(type_declaration "long"?) @@long-type
+@end example
+
+@subheading Grouping
+
+Similar to groups in regular expressions, we can bundle patterns into a
+group and apply quantification operators to it.
For example, to
+express a comma-separated list of identifiers, one could write
+
+@example
+(identifier) ("," (identifier))*
+@end example
+
+@subheading Alternation
+
+Again, similar to regular expressions, we can express ``match any one
+of this group of patterns'' in the query pattern. The syntax is a
+list of patterns enclosed in square brackets. For example, to capture
+some keywords in C, the query pattern would be
+
+@example
+@group
+[
+  "return"
+  "break"
+  "if"
+  "else"
+] @@keyword
+@end group
+@end example
+
+@subheading Anchor
+
+The anchor operator @samp{.} can be used to enforce juxtaposition,
+i.e., to require two things to be directly next to each other. The
+two ``things'' can be two nodes, or a child and the end of its parent.
+For example, to capture the first child, the last child, or two
+adjacent children:
+
+@example
+@group
+;; Anchor the child with the end of its parent.
+(compound_expression (_) @@last-child .)
+
+;; Anchor the child with the beginning of its parent.
+(compound_expression . (_) @@first-child)
+
+;; Anchor two adjacent children.
+(compound_expression
+ (_) @@prev-child
+ .
+ (_) @@next-child)
+@end group
+@end example
+
+Note that the enforcement of juxtaposition ignores any anonymous
+nodes.
+
+@subheading Predicate
+
+We can add predicate constraints to a pattern. For example, if we use
+the following query pattern
+
+@example
+@group
+(
+ (array . (_) @@first (_) @@last .)
+ (#equal @@first @@last)
+)
+@end group
+@end example
+
+@noindent
+then tree-sitter only matches arrays where the first element equals
+the last element. To attach a predicate to a pattern, we need to
+group them together. A predicate always starts with a @samp{#}.
+Currently there are two predicates, @code{#equal} and @code{#match}.
+
+@deffn Predicate equal arg1 arg2
+Matches if @var{arg1} equals @var{arg2}. Arguments can be either a
+string or a capture name. Capture names represent the text that the
+captured node spans in the buffer.
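+
+A minimal sketch of using this predicate from Lisp follows. The
+function name @code{my-identifiers-named-main} is hypothetical, and the
+@code{identifier} node type is an assumption about the grammar in use;
+the argument can be any node, parser, or language symbol accepted by
+@code{treesit-query-capture}.
+
+@example
+@group
+(defun my-identifiers-named-main (node)
+  "Return the `identifier' nodes under NODE whose text is \"main\"."
+  ;; The pattern and its predicate are grouped in one pair of
+  ;; parentheses, so the predicate applies to that pattern. The
+  ;; last argument (NODE-ONLY) asks for bare nodes rather than
+  ;; (CAPTURE-NAME . NODE) pairs.
+  (treesit-query-capture
+   node
+   "((identifier) @@id (#equal @@id \"main\"))"
+   nil nil t))
+@end group
+@end example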
+@end deffn
+
+@deffn Predicate match regexp capture-name
+Matches if the text that @var{capture-name}’s node spans in the buffer
+matches regular expression @var{regexp}. Matching is case-sensitive.
+@end deffn
+
+Note that a predicate can only refer to capture names that appear in the
+same pattern. Indeed, it makes little sense to refer to capture names
+in other patterns anyway.
+
+@heading S-expression patterns
+
+Besides strings, Emacs provides an s-expression based syntax for query
+patterns. It largely resembles the string-based syntax. For example,
+the following query
+
+@example
+@group
+(treesit-query-capture
+ node "(addition_expression
+        left: (_) @@left
+        \"+\" @@plus-sign
+        right: (_) @@right) @@addition
+
+       [\"return\" \"break\"] @@keyword")
+@end group
+@end example
+
+@noindent
+is equivalent to
+
+@example
+@group
+(treesit-query-capture
+ node '((addition_expression
+         left: (_) @@left
+         "+" @@plus-sign
+         right: (_) @@right) @@addition
+
+        ["return" "break"] @@keyword))
+@end group
+@end example
+
+Most pattern syntax can be written directly as strange but
+nevertheless valid s-expressions. Only a few constructs need
+modification:
+
+@itemize
+@item
+Anchor @samp{.} is written as @code{:anchor}.
+@item
+@samp{?} is written as @samp{:?}.
+@item
+@samp{*} is written as @samp{:*}.
+@item
+@samp{+} is written as @samp{:+}.
+@item
+@code{#equal} is written as @code{:equal}. In general, predicates
+change their @samp{#} to @samp{:}.
+@end itemize
+
+For example,
+
+@example
+@group
+"(
+  (compound_expression .
(_) @@first (_)* @@rest)
+  (#match \"love\" @@first)
+  )"
+@end group
+@end example
+
+@noindent
+is written in s-expression syntax as
+
+@example
+@group
+'((
+   (compound_expression :anchor (_) @@first (_) :* @@rest)
+   (:match "love" @@first)
+   ))
+@end group
+@end example
+
+@heading Compiling queries
+
+If a query will be used repeatedly, especially in tight loops, it is
+important to compile that query, because a compiled query is much
+faster than an uncompiled one. A compiled query can be used anywhere
+a query is accepted.
+
+@defun treesit-query-compile language query
+This function compiles @var{query} for @var{language} into a compiled
+query object and returns it.
+
+This function signals a @code{treesit-query-error} if @var{query} is
+malformed. The signal data contains a description of the specific
+error. You can use @code{treesit-query-validate} to debug the query.
+@end defun
+
+@defun treesit-query-expand query
+This function expands the s-expression @var{query} into a string
+query.
+@end defun
+
+@defun treesit-pattern-expand pattern
+This function expands the s-expression @var{pattern} into a string
+pattern.
+@end defun
+
+Finally, the tree-sitter project's documentation about
+pattern matching can be found at
+@uref{https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries}.
+
+@node Multiple Languages
+@section Parsing Text in Multiple Languages
+
+Sometimes the source of one programming language can contain sources
+of other languages; HTML + CSS + JavaScript is one example. In that
+case, we need to assign individual parsers to text segments written in
+different languages. Traditionally this is achieved by using
+narrowing. While tree-sitter works with narrowing (@pxref{tree-sitter
+narrowing, narrowing}), the recommended way is to set ranges in which
+a parser will operate.
+
+@defun treesit-parser-set-included-ranges parser ranges
+This function sets the ranges of @var{parser} to @var{ranges}.
Then
+@var{parser} will only read the text covered by each range. Each
+range in @var{ranges} is a cons @code{(@var{beg} . @var{end})}.
+
+The ranges in @var{ranges} must come in order and must not overlap.
+That is, the following check must return non-nil:
+
+@example
+@group
+(cl-loop for idx from 1 to (1- (length ranges))
+         for prev = (nth (1- idx) ranges)
+         for next = (nth idx ranges)
+         always (<= (car prev) (cdr prev)
+                    (car next) (cdr next)))
+@end group
+@end example
+
+@vindex treesit-range-invalid
+If @var{ranges} violates this constraint, or something else goes
+wrong, this function signals a @code{treesit-range-invalid}. The
+signal data contains a specific error message and the ranges we are
+trying to set.
+
+This function can also be used for disabling ranges. If @var{ranges}
+is nil, the parser is set to parse the whole buffer.
+
+Example:
+
+@example
+@group
+(treesit-parser-set-included-ranges
+ parser '((1 . 9) (16 . 24) (24 . 25)))
+@end group
+@end example
+@end defun
+
+@defun treesit-parser-included-ranges parser
+This function returns the ranges set for @var{parser}. The return
+value has the same form as the @var{ranges} argument of
+@code{treesit-parser-set-included-ranges}: a list of conses
+@code{(@var{beg} . @var{end})}. If @var{parser} doesn't have any
+ranges set, the return value is nil.
+
+@example
+@group
+(treesit-parser-included-ranges parser)
+    @result{} ((1 . 9) (16 . 24) (24 . 25))
+@end group
+@end example
+@end defun
+
+@defun treesit-set-ranges parser-or-lang ranges
+Like @code{treesit-parser-set-included-ranges}, this function sets
+the ranges of @var{parser-or-lang} to @var{ranges}. Conveniently,
+@var{parser-or-lang} can be either a parser or a language. If it is
+a language, this function looks for the first parser in
+@code{(treesit-parser-list)} for that language in the current buffer,
+and sets the ranges for it.
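+
+A minimal sketch of checking ranges before setting them is shown
+below. The helper @code{my-ranges-valid-p} is hypothetical (it is not
+part of Emacs), and the example assumes that a parser for the language
+symbol @code{css} already exists in the current buffer.
+
+@example
+@group
+(require 'cl-lib)
+
+(defun my-ranges-valid-p (ranges)
+  "Return non-nil if RANGES are in order and do not overlap."
+  ;; Walk successive tails of RANGES: PREV is the current range and
+  ;; NEXT the one after it (nil for the last range).
+  (cl-loop for (prev next) on ranges
+           always (and (<= (car prev) (cdr prev))
+                       (or (null next)
+                           (<= (cdr prev) (car next))))))
+
+(let ((ranges '((1 . 9) (16 . 24) (24 . 25))))
+  (when (my-ranges-valid-p ranges)
+    (treesit-set-ranges 'css ranges)))
+@end group
+@end example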
+@end defun
+
+@defun treesit-get-ranges parser-or-lang
+This function returns the ranges of @var{parser-or-lang}, like
+@code{treesit-parser-included-ranges}. And like
+@code{treesit-set-ranges}, @var{parser-or-lang} can be a parser or
+a language symbol.
+@end defun
+
+@defun treesit-query-range source query &optional beg end
+This function matches @var{source} with @var{query} and returns the
+ranges of captured nodes. The return value has the same shape as that
+of the other range functions: a list of @code{(@var{beg} . @var{end})}.
+
+For convenience, @var{source} can be a language symbol, a parser, or a
+node. If a language symbol, this function matches in the root node of
+the first parser using that language; if a parser, this function
+matches in the root node of that parser; if a node, this function
+matches in that node.
+
+Parameter @var{query} is the query used to capture nodes
+(@pxref{Pattern Matching}). The capture names don't matter. Parameters
+@var{beg} and @var{end}, if both non-nil, limit the range in which
+this function queries.
+
+Like other query functions, this function signals a
+@code{treesit-query-error} if @var{query} is malformed.
+@end defun
+
+@defun treesit-language-at point
+This function tries to figure out which language is responsible for
+the text at @var{point}. It goes over each parser in
+@code{(treesit-parser-list)} and checks whether that parser's range covers
+@var{point}.
+@end defun
+
+@defvar treesit-range-functions
+A list of range functions. Font-locking and indenting code uses
+functions in this list to set correct ranges for a language parser
+before using it.
+
+The signature of each function should be
+
+@example
+(@var{start} @var{end} &rest @var{_})
+@end example
+
+where @var{start} and @var{end} mark the region that is about to be
+used. A range function only needs to (but is not limited to) update
+ranges in that region.
+
+Each function in the list is called in order.
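+
+As a sketch, a range function for the CSS embedded in an HTML buffer
+might look like the following. The function name is hypothetical, the
+@code{style_element} and @code{raw_text} node types are assumptions
+about the HTML grammar, and the code assumes that parsers for
+@code{html} and @code{css} already exist in the buffer. Whether the
+variable should be set globally or buffer-locally is left open here.
+
+@example
+@group
+(defun my-html-update-css-ranges (_start _end &rest _)
+  "Set the CSS parser's ranges to the contents of <style> elements."
+  ;; For simplicity this recomputes the ranges for the whole buffer
+  ;; instead of only the region between _START and _END.
+  (let ((ranges (treesit-query-range
+                 'html
+                 "(style_element (raw_text) @@capture)")))
+    ;; Nil ranges would presumably make the parser cover the whole
+    ;; buffer, so leave the parser alone when nothing was captured.
+    (when ranges
+      (treesit-set-ranges 'css ranges))))
+
+(add-to-list 'treesit-range-functions #'my-html-update-css-ranges)
+@end group
+@end example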
+@end defvar
+
+@defun treesit-update-ranges &optional start end
+This function is used by font-lock and indentation code to update ranges
+before using any parser. Each range function in
+@code{treesit-range-functions} is called in order. Arguments
+@var{start} and @var{end} are passed to each range function.
+@end defun
+
+@heading An example
+
+Normally, in a set of languages that can be mixed together, there is a
+major language and several embedded languages. We first parse the
+whole document with the major language’s parser, set ranges for the
+embedded languages, then parse the embedded languages.
+
+Suppose we want to parse a very simple document that mixes HTML, CSS
+and JavaScript:
+
+@example
+@group
+<html>
+  <script>1 + 2</script>
+  <style>body @{ color: "blue"; @}</style>
+</html>
+@end group
+@end example
+
+We first parse with HTML, then set ranges for CSS and JavaScript:
+
+@example
+@group
+;; Create parsers.
+(setq html (treesit-get-parser-create 'html))
+(setq css (treesit-get-parser-create 'css))
+(setq js (treesit-get-parser-create 'javascript))
+
+;; Set CSS ranges.
+(setq css-range
+      (treesit-query-range
+       'html
+       "(style_element (raw_text) @@capture)"))
+(treesit-parser-set-included-ranges css css-range)
+
+;; Set JavaScript ranges.
+(setq js-range
+      (treesit-query-range
+       'html
+       "(script_element (raw_text) @@capture)"))
+(treesit-parser-set-included-ranges js js-range)
+@end group
+@end example
+
+We use a query pattern @code{(style_element (raw_text) @@capture)} to
+find CSS nodes in the HTML parse tree. For how to write query
+patterns, @pxref{Pattern Matching}.
+
+@node Tree-sitter C API
+@section Tree-sitter C API Correspondence
+
+Emacs' tree-sitter integration doesn't expose every feature
+tree-sitter's C API provides. Missing features include:
+
+@itemize
+@item
+Creating a tree cursor and navigating the syntax tree with it.
+@item
+Setting a timeout and a cancellation flag for a parser.
+@item
+Setting the logger for a parser.
+@item
+Printing a DOT graph of the syntax tree to a file.
+@item
+Copying and modifying a syntax tree. (Emacs doesn't expose a tree
+object.)
+@item
+Using (row, column) coordinates as position.
+@item
+Updating a node with changes. (In Emacs, retrieve a new node instead
+of updating the existing one.)
+@item
+Querying statistics of a language definition.
+@end itemize
+
+In addition, Emacs makes some changes to the C API to make the API more
+convenient and idiomatic:
+
+@itemize
+@item
+Instead of using byte positions, the ELisp API uses character
+positions.
+@item
+Null nodes are converted to nil.
+@end itemize
+
+Below is the correspondence between all C API functions and their
+ELisp counterparts. Sometimes one ELisp function corresponds to
+multiple C functions, and many C functions don't have an ELisp
+counterpart.
+
+@example
+ts_parser_new                            treesit-parser-create
+ts_parser_delete
+ts_parser_set_language
+ts_parser_language                       treesit-parser-language
+ts_parser_set_included_ranges            treesit-parser-set-included-ranges
+ts_parser_included_ranges                treesit-parser-included-ranges
+ts_parser_parse
+ts_parser_parse_string                   treesit-parse-string
+ts_parser_parse_string_encoding
+ts_parser_reset
+ts_parser_set_timeout_micros
+ts_parser_timeout_micros
+ts_parser_set_cancellation_flag
+ts_parser_cancellation_flag
+ts_parser_set_logger
+ts_parser_logger
+ts_parser_print_dot_graphs
+ts_tree_copy
+ts_tree_delete
+ts_tree_root_node
+ts_tree_language
+ts_tree_edit
+ts_tree_get_changed_ranges
+ts_tree_print_dot_graph
+ts_node_type                             treesit-node-type
+ts_node_symbol
+ts_node_start_byte                       treesit-node-start
+ts_node_start_point
+ts_node_end_byte                         treesit-node-end
+ts_node_end_point
+ts_node_string                           treesit-node-string
+ts_node_is_null
+ts_node_is_named                         treesit-node-check
+ts_node_is_missing                       treesit-node-check
+ts_node_is_extra                         treesit-node-check
+ts_node_has_changes                      treesit-node-check
+ts_node_has_error                        treesit-node-check
+ts_node_parent                           treesit-node-parent
+ts_node_child                            treesit-node-child
+ts_node_field_name_for_child             treesit-node-field-name-for-child
+ts_node_child_count                      treesit-node-child-count
+ts_node_named_child                      treesit-node-child
+ts_node_named_child_count                treesit-node-child-count
+ts_node_child_by_field_name              treesit-node-by-field-name
+ts_node_child_by_field_id
+ts_node_next_sibling                     treesit-next-sibling
+ts_node_prev_sibling                     treesit-prev-sibling
+ts_node_next_named_sibling               treesit-next-sibling
+ts_node_prev_named_sibling               treesit-prev-sibling
+ts_node_first_child_for_byte             treesit-first-child-for-pos
+ts_node_first_named_child_for_byte       treesit-first-child-for-pos
+ts_node_descendant_for_byte_range        treesit-descendant-for-range
+ts_node_descendant_for_point_range
+ts_node_named_descendant_for_byte_range  treesit-descendant-for-range
+ts_node_named_descendant_for_point_range
+ts_node_edit
+ts_node_eq                               treesit-node-eq
+ts_tree_cursor_new
+ts_tree_cursor_delete
+ts_tree_cursor_reset
+ts_tree_cursor_current_node
+ts_tree_cursor_current_field_name
+ts_tree_cursor_current_field_id
+ts_tree_cursor_goto_parent
+ts_tree_cursor_goto_next_sibling
+ts_tree_cursor_goto_first_child
+ts_tree_cursor_goto_first_child_for_byte
+ts_tree_cursor_goto_first_child_for_point
+ts_tree_cursor_copy
+ts_query_new
+ts_query_delete
+ts_query_pattern_count
+ts_query_capture_count
+ts_query_string_count
+ts_query_start_byte_for_pattern
+ts_query_predicates_for_pattern
+ts_query_step_is_definite
+ts_query_capture_name_for_id
+ts_query_string_value_for_id
+ts_query_disable_capture
+ts_query_disable_pattern
+ts_query_cursor_new
+ts_query_cursor_delete
+ts_query_cursor_exec                     treesit-query-capture
+ts_query_cursor_did_exceed_match_limit
+ts_query_cursor_match_limit
+ts_query_cursor_set_match_limit
+ts_query_cursor_set_byte_range
+ts_query_cursor_set_point_range
+ts_query_cursor_next_match
+ts_query_cursor_remove_match
+ts_query_cursor_next_capture
+ts_language_symbol_count
+ts_language_symbol_name
+ts_language_symbol_for_name
+ts_language_field_count
+ts_language_field_name_for_id
+ts_language_field_id_for_name
+ts_language_symbol_type
+ts_language_version
+@end example