

37.6 Parsing Text in Multiple Languages

Sometimes the source code of one programming language contains snippets of other languages; HTML with embedded CSS and JavaScript is one example. In that case, text segments written in different languages need to be assigned to different parsers. Traditionally, this is achieved by narrowing. While tree-sitter works with narrowing (see narrowing), the recommended way is instead to specify the regions of buffer text on which each parser operates.

Function: treesit-parser-set-included-ranges parser ranges

This function sets up parser to operate on ranges. The parser will only read the text of the specified ranges. Each range in ranges is a cons cell of the form (beg . end).

The ranges in ranges must come in order and must not overlap. That is, the following form must return non-nil:

(require 'cl-lib)
(cl-loop for idx from 1 to (1- (length ranges))
         for prev = (nth (1- idx) ranges)
         for next = (nth idx ranges)
         always (<= (car prev) (cdr prev)
                    (car next) (cdr next)))

If ranges violates this constraint, or if something else goes wrong, this function signals the treesit-range-invalid error. The signal data contains a specific error message and the ranges that were being set.
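
Ranges collected from several sources may arrive unsorted or overlapping. The following is a minimal sketch, not part of the tree-sitter API, of one way to normalize them before setting them and to handle the treesit-range-invalid error if they are still rejected. The names my-normalize-ranges and my-set-ranges-safely are hypothetical helpers introduced only for illustration.

```elisp
(defun my-normalize-ranges (ranges)
  "Return RANGES sorted by start, with overlapping ranges merged.
RANGES is a list of (BEG . END) cons cells; the input is not modified.
This is a hypothetical helper, not part of the tree-sitter API."
  (let ((sorted (sort (copy-sequence ranges)
                      (lambda (a b) (< (car a) (car b)))))
        result)
    (dolist (range sorted)
      (let ((prev (car result)))
        (if (and prev (< (car range) (cdr prev)))
            ;; RANGE overlaps the previous one: extend the previous one.
            (setcar result (cons (car prev)
                                 (max (cdr prev) (cdr range))))
          ;; Otherwise start a new range (ranges that merely touch are kept).
          (push (cons (car range) (cdr range)) result))))
    (nreverse result)))

(defun my-set-ranges-safely (parser ranges)
  "Set the normalized RANGES on PARSER.
Return t on success, or nil if the ranges are still rejected."
  (condition-case err
      (progn
        (treesit-parser-set-included-ranges
         parser (my-normalize-ranges ranges))
        t)
    (treesit-range-invalid
     ;; (cdr err) is the signal data: the message and the ranges.
     (message "Could not set ranges: %S" (cdr err))
     nil)))
```

For instance:

(my-normalize-ranges '((16 . 24) (1 . 9) (5 . 12)))
    ⇒ ((1 . 12) (16 . 24))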

This function can also be used for disabling ranges. If ranges is nil, the parser is set to parse the whole buffer.

Example:

(treesit-parser-set-included-ranges
 parser '((1 . 9) (16 . 24) (24 . 25)))

Function: treesit-parser-included-ranges parser

This function returns the ranges set for parser. The return value has the same form as the ranges argument of treesit-parser-set-included-ranges: a list of cons cells of the form (beg . end). If parser doesn’t have any ranges, the return value is nil.

(treesit-parser-included-ranges parser)
    ⇒ ((1 . 9) (16 . 24) (24 . 25))

Function: treesit-set-ranges parser-or-lang ranges

Like treesit-parser-set-included-ranges, this function sets the ranges of parser-or-lang to ranges. For convenience, parser-or-lang can be either a parser or a language symbol. If it is a language symbol, this function looks for the first parser in (treesit-parser-list) for that language in the current buffer, and sets the ranges for it.

Function: treesit-get-ranges parser-or-lang

This function returns the ranges of parser-or-lang, like treesit-parser-included-ranges. And like treesit-set-ranges, parser-or-lang can be a parser or a language symbol.

Function: treesit-query-range source query &optional beg end

This function matches source with query and returns the ranges of captured nodes. The return value is a list of cons cells of the form (beg . end), where beg and end specify the beginning and the end of a region of text.

For convenience, source can be a language symbol, a parser, or a node. If it’s a language symbol, this function matches in the root node of the first parser using that language; if a parser, this function matches in the root node of that parser; if a node, this function matches in that node.

The argument query is the query used to capture nodes (see Pattern Matching Tree-sitter Nodes). The capture names don’t matter. The arguments beg and end, if both non-nil, limit the range in which this function queries.

Like other query functions, this function signals the treesit-query-error error if query is malformed.
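
As an illustration of the optional beg and end arguments, here is a small sketch that collects the ranges of embedded code within a region. It assumes the current buffer already has an HTML parser and that the grammar uses the node types style_element, script_element and raw_text; my-html-embedded-ranges is a hypothetical name.

```elisp
(defun my-html-embedded-ranges (beg end)
  "Return an alist of (LANGUAGE . RANGES) for the text between BEG and END.
Assumes the current buffer has a parser for the `html' language."
  (list (cons 'css
              (treesit-query-range
               'html "(style_element (raw_text) @capture)" beg end))
        (cons 'javascript
              (treesit-query-range
               'html "(script_element (raw_text) @capture)" beg end))))
```

A caller could then pass each RANGES list to treesit-set-ranges for the corresponding language.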

Variable: treesit-range-functions

This variable holds the list of range functions. Font-locking and indenting code use functions in this list to set correct ranges for a language parser before using it.

The signature of each function in the list should be:

(start end &rest _)

where start and end specify the region that is about to be used. A range function only needs to (but is not limited to) update ranges in that region.

The functions in the list are called in order.
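
A range function for the CSS parts of an HTML buffer might look like the following sketch. It assumes that both an HTML parser and a CSS parser already exist in the current buffer; the name my-html-update-css-ranges is hypothetical. For simplicity it ignores the region and rescans the whole parse tree, which the contract above permits.

```elisp
(defun my-html-update-css-ranges (_start _end &rest _)
  "Recompute the ranges of the CSS parser from the HTML parse tree.
The region arguments are ignored: this sketch rescans the whole tree."
  (let ((ranges (treesit-query-range
                 'html "(style_element (raw_text) @capture)")))
    (treesit-set-ranges
     'css
     ;; A nil RANGES would make the parser cover the whole buffer, so
     ;; fall back to an empty range when there is no <style> element.
     (or ranges (list (cons (point-min) (point-min)))))))

;; Typically done in the body of a major mode.
(setq-local treesit-range-functions
            (list #'my-html-update-css-ranges))
```

With this in place, calling treesit-update-ranges before using the CSS parser refreshes its ranges.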

Function: treesit-update-ranges &optional start end

This function is used by font-lock and indentation to update ranges before using any parser. Each range function in treesit-range-functions is called in order. The arguments start and end are passed to each range function.

Function: treesit-language-at pos

This function tries to determine which language is responsible for the text at buffer position pos. Under the hood, it simply calls the function stored in treesit-language-at-point-function.

Various Lisp programs use this function. For example, the indentation code uses it to determine which language’s rules to apply in a multi-language buffer. It is therefore important to provide treesit-language-at-point-function in a multi-language major mode.
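
One possible way to implement such a function is to consult the ranges already assigned to the embedded-language parsers. The sketch below assumes an HTML buffer with CSS and JavaScript parsers whose ranges are kept up to date, and that a range (beg . end) excludes the position end; my-pos-in-ranges-p and my-html-language-at-point are hypothetical names.

```elisp
(require 'seq)

(defun my-pos-in-ranges-p (pos ranges)
  "Return non-nil if POS falls inside one of RANGES.
RANGES is a list of (BEG . END) cons cells; END is treated as exclusive."
  (seq-some (lambda (range)
              (and (<= (car range) pos) (< pos (cdr range))))
            ranges))

(defun my-html-language-at-point (pos)
  "Return the language symbol responsible for the text at POS."
  (cond ((my-pos-in-ranges-p pos (treesit-get-ranges 'css)) 'css)
        ((my-pos-in-ranges-p pos (treesit-get-ranges 'javascript))
         'javascript)
        ;; Anything outside the embedded ranges belongs to the host language.
        (t 'html)))

;; Typically done in the body of a major mode.
(setq-local treesit-language-at-point-function
            #'my-html-language-at-point)
```

An alternative design would inspect the HTML node at pos instead of the ranges; the range-based version is used here because it relies only on the functions described in this section.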

An example

Normally, in a set of languages that can be mixed together, there is a major language and several embedded languages. A Lisp program usually first parses the whole document with the major language’s parser, sets ranges for the embedded languages, and then parses the embedded languages.

Suppose we need to parse a very simple document that mixes HTML, CSS and JavaScript:

<html>
  <script>1 + 2</script>
  <style>body { color: blue; }</style>
</html>

We first parse with HTML, then set ranges for CSS and JavaScript:

;; Create parsers.
(setq html (treesit-get-parser-create 'html))
(setq css (treesit-get-parser-create 'css))
(setq js (treesit-get-parser-create 'javascript))

;; Set CSS ranges.
(setq css-range
      (treesit-query-range
       'html
       "(style_element (raw_text) @capture)"))
(treesit-parser-set-included-ranges css css-range)

;; Set JavaScript ranges.
(setq js-range
      (treesit-query-range
       'html
       "(script_element (raw_text) @capture)"))
(treesit-parser-set-included-ranges js js-range)

We use a query pattern (style_element (raw_text) @capture) to find CSS nodes in the HTML parse tree. For how to write query patterns, see Pattern Matching Tree-sitter Nodes.

