Next: Using Tree-sitter Parser, Up: Parsing Program Source [Contents][Index]
Tree-sitter relies on language definitions to parse text in that
language. In Emacs, A language definition is represented by a symbol.
For example, C language definition is represented as c
, and
c
can be passed to tree-sitter functions as the language
argument.
Tree-sitter language definitions are distributed as dynamic libraries.
In order to use a language definition in Emacs, you need to make sure
that the dynamic library is installed on the system. Emacs looks for
language definitions under load paths in
treesit-extra-load-path
, user-emacs-directory
/tree-sitter,
and system default locations for dynamic libraries, in that order.
Emacs tries each extensions in treesit-load-suffixes
. If Emacs
cannot find the library or has problem loading it, Emacs signals
treesit-load-language-error
. The signal data is a list of
specific error messages.
This function checks whether the dynamic library for language is present on the system, and return non-nil if it is.
By convention, the dynamic library for language is
libtree-sitter-language.ext
, where ext is the
system-specific extension for dynamic libraries. Also by convention,
the function provided by that library is named
tree_sitter_language
. If a language definition doesn’t
follow this convention, you should add an entry
(language library-base-name function-name)
to treesit-load-name-override-list
, where
library-base-name is the base filename for the dynamic library
(conventionally libtree-sitter-language
), and
function-name is the function provided by the library
(conventionally tree_sitter_language
). For example,
(cool-lang "libtree-sitter-coool" "tree_sitter_cooool")
for a language too cool to abide by conventions.
Tree-sitter library has a language version, a language definition’s version needs to match this version to be compatible.
This function returns tree-sitter library’s language version. If min-compatible is non-nil, it returns the minimal compatible version.
A syntax tree is what a parser generates. In a syntax tree, each node represents a piece of text, and is connected to each other by a parent-child relationship. For example, if the source text is
1 + 2
its syntax tree could be
+--------------+ | root "1 + 2" | +--------------+ | +--------------------------------+ | expression "1 + 2" | +--------------------------------+ | | | +------------+ +--------------+ +------------+ | number "1" | | operator "+" | | number "2" | +------------+ +--------------+ +------------+
We can also represent it in s-expression:
(root (expression (number) (operator) (number)))
Names like root
, expression
, number
,
operator
are nodes’ type. However, not all nodes in a
syntax tree have a type. Nodes that don’t are anonymous nodes,
and nodes with a type are named nodes. Anonymous nodes are
tokens with fixed spellings, including punctuation characters like
bracket ‘]’, and keywords like return
.
To make the syntax tree easier to
analyze, many language definitions assign field names to child
nodes. For example, a function_definition
node could have a
declarator
and a body
:
(function_definition declarator: (declaration) body: (compound_statement))
This minor mode displays the node that starts at point in mode-line. The mode-line will display
parent field-name: (child (grand-child (...)))
child, grand-child, and grand-grand-child, etc, are nodes that have their beginning at point. And parent is the parent of child.
If there is no node that starts at point, i.e., point is in the middle of a node, then the mode-line only displays the smallest node that spans point, and its immediate parent.
This minor mode doesn’t create parsers on its own. It simply uses the
first parser in (treesit-parser-list)
(see Using Tree-sitter Parser).
Authors of language definitions define the grammar of a language, and this grammar determines how does a parser construct a concrete syntax tree out of the text. In order to use the syntax tree effectively, we need to read the grammar file.
The grammar file is usually grammar.js
in a language
definition’s project repository. The link to a language definition’s
home page can be found in tree-sitter’s homepage
(https://tree-sitter.github.io/tree-sitter).
The grammar is written in JavaScript syntax. For example, the rule
matching a function_definition
node looks like
function_definition: $ => seq( $.declaration_specifiers, field('declarator', $.declaration), field('body', $.compound_statement) )
The rule is represented by a function that takes a single argument
$, representing the whole grammar. The function itself is
constructed by other functions: the seq
function puts together a
sequence of children; the field
function annotates a child with
a field name. If we write the above definition in BNF syntax, it
would look like
function_definition := <declaration_specifiers> <declaration> <compound_statement>
and the node returned by the parser would look like
(function_definition (declaration_specifier) declarator: (declaration) body: (compound_statement))
Below is a list of functions that one will see in a grammar definition. Each function takes other rules as arguments and returns a new rule.
seq(rule1, rule2, ...)
matches each rule one after another.
choice(rule1, rule2, ...)
matches one of the rules in its
arguments.
repeat(rule)
matches rule for zero or more times.
This is like the ‘*’ operator in regular expressions.
repeat1(rule)
matches rule for one or more times.
This is like the ‘+’ operator in regular expressions.
optional(rule)
matches rule for zero or one time.
This is like the ‘?’ operator in regular expressions.
field(name, rule)
assigns field name name to the child
node matched by rule.
alias(rule, alias)
makes nodes matched by rule appear as
alias in the syntax tree generated by the parser. For example,
alias(preprocessor_call_exp, call_expression)
makes any node matched by preprocessor_call_exp
to appear as
call_expression
.
Below are grammar functions less interesting for a reader of a language definition.
token(rule)
marks rule to produce a single leaf node.
That is, instead of generating a parent node with individual child
nodes under it, everything is combined into a single leaf node.
token.immediate(rule)
changes rule to match only when
there is no preceding whitespaces.
prec(n, rule)
gives rule a level n precedence.
prec.left([n,] rule)
marks rule as left-associative,
optionally with level n.
prec.right([n,] rule)
marks rule as right-associative,
optionally with level n.
prec.dynamic(n, rule)
is like prec
, but the precedence
is applied at runtime instead.
The tree-sitter project talks about writing a grammar in more detail: https://tree-sitter.github.io/tree-sitter/creating-parsers. Read especially “The Grammar DSL” section.
Next: Using Tree-sitter Parser, Up: Parsing Program Source [Contents][Index]