diff options
Diffstat (limited to 'admin/notes/tree-sitter/starter-guide')
-rw-r--r-- | admin/notes/tree-sitter/starter-guide | 442 |
1 files changed, 442 insertions, 0 deletions
diff --git a/admin/notes/tree-sitter/starter-guide b/admin/notes/tree-sitter/starter-guide new file mode 100644 index 00000000000..6cf8cf8a236 --- /dev/null +++ b/admin/notes/tree-sitter/starter-guide @@ -0,0 +1,442 @@ +STARTER GUIDE ON WRITTING MAJOR MODE WITH TREE-SITTER -*- org -*- + +This document guides you on adding tree-sitter support to a major +mode. + +TOC: + +- Building Emacs with tree-sitter +- Install language definitions +- Setup +- Font-lock +- Indent +- Imenu +- Navigation +- Which-func +- More features? +- Common tasks (code snippets) +- Manual + +* Building Emacs with tree-sitter + +You can either install tree-sitter by your package manager, or from +source: + + git clone https://github.com/tree-sitter/tree-sitter.git + cd tree-sitter + make + make install + +Then pull the tree-sitter branch (or the master branch, if it has +merged) and rebuild Emacs. + +* Install language definitions + +Tree-sitter by itself doesn’t know how to parse any particular +language. We need to install language definitions (or “grammars”) for +a language to be able to parse it. There are a couple of ways to get +them. + +You can use this script that I put together here: + + https://github.com/casouri/tree-sitter-module + +You can also find them under this directory in /build-modules. + +This script automatically pulls and builds language definitions for C, +C++, Rust, JSON, Go, HTML, Javascript, CSS, Python, Typescript, +and C#. Better yet, I pre-built these language definitions for +GNU/Linux and macOS, they can be downloaded here: + + https://github.com/casouri/tree-sitter-module/releases/tag/v2.1 + +To build them yourself, run + + git clone git@github.com:casouri/tree-sitter-module.git + cd tree-sitter-module + ./batch.sh + +and language definitions will be in the /dist directory. You can +either copy them to standard dynamic library locations of your system, +eg, /usr/local/lib, or leave them in /dist and later tell Emacs where +to find language definitions by setting ‘treesit-extra-load-path’. + +Language definition sources can be found on GitHub under +tree-sitter/xxx, like tree-sitter/tree-sitter-python. The tree-sitter +organization has all the "official" language definitions: + + https://github.com/tree-sitter + +* Setting up for adding major mode features + +Start Emacs, and load tree-sitter with + + (require 'treesit) + +Now check if Emacs is built with tree-sitter library + + (treesit-available-p) + +For your major mode, first create a tree-sitter switch: + +#+begin_src elisp +(defcustom python-use-tree-sitter nil + "If non-nil, `python-mode' tries to use tree-sitter. +Currently `python-mode' can utilize tree-sitter for font-locking, +imenu, and movement functions." + :type 'boolean) +#+end_src + +Then in other places, we decide on whether to enable tree-sitter by + +#+begin_src elisp +(and python-use-tree-sitter + (treesit-can-enable-p)) +#+end_src + +* Font-lock + +Tree-sitter works like this: You provide a query made of patterns and +capture names, tree-sitter finds the nodes that match these patterns, +tag the corresponding capture names onto the nodes and return them to +you. The query function returns a list of (capture-name . node). For +font-lock, we use face names as capture names. And the captured node +will be fontified in their capture name. The capture name could also +be a function, in which case (START END NODE) is passed to the +function for font-lock. START and END is the start and end the +captured NODE. + +** Query syntax + +There are two types of nodes, named, like (identifier), +(function_definition), and anonymous, like "return", "def", "(", +"}". Parent-child relationship is expressed as + + (parent (child) (child) (child (grand_child))) + +Eg, an argument list (1, "3", 1) could be: + + (argument_list "(" (number) (string) (number) ")") + +Children could have field names in its parent: + + (function_definition name: (identifier) type: (identifier)) + +Match any of the list: + + ["true" "false" "none"] + +Capture names can come after any node in the pattern: + + (parent (child) @child) @parent + +The query above captures both parent and child. + + ["return" "continue" "break"] @keyword + +The query above captures all the keywords with capture name +"keyword". + +These are the common syntax, see all of them in the manual +("Parsing Program Source" section). + +** Query references + +But how do one come up with the queries? Take python for an +example, open any python source file, evaluate + + (treesit-parser-create 'python) + +so there is a parser available, then enable ‘treesit-inspect-mode’. +Now you should see information of the node under point in +mode-line. Move around and you should be able to get a good +picture. Besides this, you can consult the grammar of the language +definition. For example, Python’s grammar file is at + + https://github.com/tree-sitter/tree-sitter-python/blob/master/grammar.js + +Neovim also has a bunch of queries to reference: + + https://github.com/nvim-treesitter/nvim-treesitter/tree/master/queries + +The manual explains how to read grammar files in the bottom of section +"Tree-sitter Language Definitions". + +** Debugging queires + +If your query has problems, it usually cannot compile. In that case +use ‘treesit-query-validate’ to debug the query. It will pop a buffer +containing the query (in text format) and mark the offending part in +red. + +** Code + +To enable tree-sitter font-lock, set ‘treesit-font-lock-settings’ +buffer-locally and call ‘treesit-font-lock-enable’. For example, see +‘python--treesit-settings’ in python.el. Below I paste a snippet of +it. + +Note that like the current font-lock, if the to-be-fontified region +already has a face (ie, an earlier match fontified part/all of the +region), the new face is discarded rather than applied. If you want +later matches always override earlier matches, use the :override +keyword. + +#+begin_src elisp +(defvar python--treesit-settings + (treesit-font-lock-rules + :language 'python + :override t + `(;; Queries for def and class. + (function_definition + name: (identifier) @font-lock-function-name-face) + + (class_definition + name: (identifier) @font-lock-type-face) + + ;; Comment and string. + (comment) @font-lock-comment-face + + ...))) +#+end_src + +Then in ‘python-mode’, enable tree-sitter font-lock: + +#+begin_src elisp +(treesit-parser-create 'python) +;; This turns off the syntax-based font-lock for comments and +;; strings. So it doesn’t override tree-sitter’s fontification. +(setq-local font-lock-keywords-only t) +(setq-local treesit-font-lock-settings + python--treesit-settings) +(treesit-font-lock-enable) +#+end_src + +Concretely, something like this: + +#+begin_src elisp +(define-derived-mode python-mode prog-mode "Python" + ... + + (treesit-parser-create 'python) + + (if (and python-use-tree-sitter + (treesit-can-enable-p)) + ;; Tree-sitter. + (progn + (setq-local font-lock-keywords-only t) + (setq-local treesit-font-lock-settings + python--treesit-settings) + (treesit-font-lock-enable)) + ;; No tree-sitter + (setq-local font-lock-defaults ...)) + + ...) +#+end_src + +You’ll notice that tree-sitter’s font-lock doesn’t respect +‘font-lock-maximum-decoration’, major modes are free to set +‘treesit-font-lock-settings’ based on the value of +‘font-lock-maximum-decoration’, or provide more fine-grained control +through other mode-specific means. + +* Indent + +Indent works like this: We have a bunch of rules that look like this: + + (MATCHER ANCHOR OFFSET) + +At the beginning point is at the BOL of a line, we want to know which +column to indent this line to. Let NODE be the node at point, we pass +this node to the MATCHER of each rule, one of them will match the node +("this node is a closing bracket!"). Then we pass the node to the +ANCHOR, which returns a point, eg, the BOL of the previous line. We +find the column number of that point (eg, 4), add OFFSET to it (eg, +0), and that is the column we want to indent the current line to (4 + +0 = 4). + +For MATHCER we have + + (parent-is TYPE) + (node-is TYPE) + (query QUERY) => matches if querying PARENT with QUERY + captures NODE. + + (match NODE-TYPE PARENT-TYPE NODE-FIELD + NODE-INDEX-MIN NODE-INDEX-MAX) + + => checks everything. If an argument is nil, don’t match that. Eg, + (match nil nil TYPE) is the same as (parent-is TYPE) + +For ANCHOR we have + + first-sibling => start of the first sibling + parent => start of parent + parent-bol => BOL of the line parent is on. + prev-sibling + no-indent => don’t indent + prev-line => same indent as previous line + +There is also a manual section for indent: "Parser-based Indentation". + +When writing indent rules, you can use ‘treesit-check-indent’ to +check if your indentation is correct. To debug what went wrong, set +‘treesit--indent-verboase’ to non-nil. Then when you indent, Emacs +tells you which rule is applied in the echo area. + +#+begin_src elisp +(defvar typescript-mode-indent-rules + (let ((offset typescript-indent-offset)) + `((typescript + ;; This rule matches if node at point is "}", ANCHOR is the + ;; parent node’s BOL, and offset is 0. + ((node-is "}") parent-bol 0) + ((node-is ")") parent-bol 0) + ((node-is "]") parent-bol 0) + ((node-is ">") parent-bol 0) + ((node-is ".") parent-bol ,offset) + ((parent-is "ternary_expression") parent-bol ,offset) + ((parent-is "named_imports") parent-bol ,offset) + ((parent-is "statement_block") parent-bol ,offset) + ((parent-is "type_arguments") parent-bol ,offset) + ((parent-is "variable_declarator") parent-bol ,offset) + ((parent-is "arguments") parent-bol ,offset) + ((parent-is "array") parent-bol ,offset) + ((parent-is "formal_parameters") parent-bol ,offset) + ((parent-is "template_substitution") parent-bol ,offset) + ((parent-is "object_pattern") parent-bol ,offset) + ((parent-is "object") parent-bol ,offset) + ((parent-is "object_type") parent-bol ,offset) + ((parent-is "enum_body") parent-bol ,offset) + ((parent-is "arrow_function") parent-bol ,offset) + ((parent-is "parenthesized_expression") parent-bol ,offset) + ...)))) +#+end_src + +Then you set ‘treesit-simple-indent-rules’ to your rules, and set +‘indent-line-function’: + +#+begin_src elisp +(setq-local treesit-simple-indent-rules typescript-mode-indent-rules) +(setq-local indent-line-function #'treesit-indent) +#+end_src + +* Imenu + +Not much to say except for utilizing ‘treesit-induce-sparse-tree’. +See ‘python--imenu-treesit-create-index-1’ in python.el for an +example. + +Once you have the index builder, set ‘imenu-create-index-function’. + +* Navigation + +Mainly ‘beginning-of-defun-function’ and ‘end-of-defun-function’. +You can find the end of a defun with something like + +(treesit-search-forward-goto "function_definition" 'end) + +where "function_definition" matches the node type of a function +definition node, and ’end means we want to go to the end of that +node. + +Something like this should suffice: + +#+begin_src elisp +(defun xxx-beginning-of-defun (&optional arg) + (if (> arg 0) + ;; Go backward. + (while (and (> arg 0) + (treesit-search-forward-goto + "function_definition" 'start nil t)) + (setq arg (1- arg))) + ;; Go forward. + (while (and (< arg 0) + (treesit-search-forward-goto + "function_definition" 'start)) + (setq arg (1+ arg))))) + +(setq-local beginning-of-defun-function #'xxx-beginning-of-defun) +#+end_src + +And the same for end-of-defun. + +* Which-func + +You can find the current function by going up the tree and looking for +the function_definition node. See ‘python-info-treesit-current-defun’ +in python.el for an example. Since Python allows nested function +definitions, that function keeps going until it reaches the root node, +and records all the function names along the way. + +#+begin_src elisp +(defun python-info-treesit-current-defun (&optional include-type) + "Identical to `python-info-current-defun' but use tree-sitter. +For INCLUDE-TYPE see `python-info-current-defun'." + (let ((node (treesit-node-at (point))) + (name-list ()) + (type nil)) + (cl-loop while node + if (pcase (treesit-node-type node) + ("function_definition" + (setq type 'def)) + ("class_definition" + (setq type 'class)) + (_ nil)) + do (push (treesit-node-text + (treesit-node-child-by-field-name node "name") + t) + name-list) + do (setq node (treesit-node-parent node)) + finally return (concat (if include-type + (format "%s " type) + "") + (string-join name-list "."))))) +#+end_src + +* More features? + +Obviously this list is just a starting point, if there are features in +the major mode that would benefit a parse tree, adding tree-sitter +support for that would be great. But in the minimal case, just adding +font-lock is awesome. + +* Common tasks + +How to... + +** Get the buffer text corresponding to a node? + +(treesit-node-text node) + +BTW ‘treesit-node-string’ does different things. + +** Scan the whole tree for stuff? + +(treesit-search-subtree) +(treesit-search-forward) +(treesit-induce-sparse-tree) + +** Move to next node that...? + +(treesit-search-forward-goto) + +** Get the root node? + +(treesit-buffer-root-node) + +** Get the node at point? + +(treesit-node-at (point)) + +* Manual + +I suggest you read the manual section for tree-sitter in Info. The +section is Parsing Program Source. Typing + + C-h i d m elisp RET g Parsing Program Source RET + +will bring you to that section. You can also read the HTML version +under /html-manual in this directory. I find the HTML version easier +to read. You don’t need to read through every sentence, just read the +text paragraphs and glance over function names. |