<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.8, https://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- This is the GNU Emacs Lisp Reference Manual
corresponding to Emacs version 29.0.50.

Copyright © 1990-1996, 1998-2022 Free Software Foundation,
Inc.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being "GNU General Public License," with the
Front-Cover Texts being "A GNU Manual," and with the Back-Cover
Texts as in (a) below.  A copy of the license is included in the
section entitled "GNU Free Documentation License."

(a) The FSF's Back-Cover Text is: "You have the freedom to copy and
modify this GNU manual.  Buying copies from the FSF supports it in
developing GNU and promoting software freedom." -->
<title>Language Definitions (GNU Emacs Lisp Reference Manual)</title>

<meta name="description" content="Language Definitions (GNU Emacs Lisp Reference Manual)">
<meta name="keywords" content="Language Definitions (GNU Emacs Lisp Reference Manual)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta name="viewport" content="width=device-width,initial-scale=1">

<link href="index.html" rel="start" title="Top">
<link href="Index.html" rel="index" title="Index">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Parsing-Program-Source.html" rel="up" title="Parsing Program Source">
<link href="Using-Parser.html" rel="next" title="Using Parser">
<style type="text/css">
<!--
a.copiable-anchor {visibility: hidden; text-decoration: none; line-height: 0em}
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
span:hover a.copiable-anchor {visibility: visible}
ul.no-bullet {list-style: none}
-->
</style>
<link rel="stylesheet" type="text/css" href="./manual.css">


</head>

<body lang="en">
<div class="section" id="Language-Definitions">
<div class="header">
<p>
Next: <a href="Using-Parser.html" accesskey="n" rel="next">Using Tree-sitter Parser</a>, Up: <a href="Parsing-Program-Source.html" accesskey="u" rel="up">Parsing Program Source</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html" title="Index" rel="index">Index</a>]</p>
</div>
<hr>
<span id="Tree_002dsitter-Language-Definitions"></span><h3 class="section">37.1 Tree-sitter Language Definitions</h3>

<span id="Loading-a-language-definition"></span><h3 class="heading">Loading a language definition</h3>

<p>Tree-sitter relies on a language definition to parse text in that
language.  In Emacs, a language definition is represented by a symbol.
For example, the C language definition is represented as <code>c</code>, and
<code>c</code> can be passed to tree-sitter functions as the <var>language</var>
argument.
</p>
<span id="index-treesit_002dextra_002dload_002dpath"></span>
<span id="index-treesit_002dload_002dlanguage_002derror"></span>
<span id="index-treesit_002dload_002dsuffixes"></span>
<p>Tree-sitter language definitions are distributed as dynamic libraries.
In order to use a language definition in Emacs, you need to make sure
that the dynamic library is installed on the system.  Emacs looks for
language definitions under the load paths in
<code>treesit-extra-load-path</code>, <code>user-emacs-directory</code>/tree-sitter,
and the system default locations for dynamic libraries, in that order.
Emacs tries each extension in <code>treesit-load-suffixes</code>.  If Emacs
cannot find the library or has a problem loading it, Emacs signals
<code>treesit-load-language-error</code>.  The signal data is a list of
specific error messages.
</p>
<dl class="def">
<dt id="index-treesit_002dlanguage_002davailable_002dp"><span class="category">Function: </span><span><strong>treesit-language-available-p</strong> <em>language</em><a href='#index-treesit_002dlanguage_002davailable_002dp' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>This function checks whether the dynamic library for <var>language</var> is
present on the system, and returns non-nil if it is.
</p></dd></dl>

<span id="index-treesit_002dload_002dname_002doverride_002dlist"></span>
<p>By convention, the dynamic library for <var>language</var> is
<code>libtree-sitter-<var>language</var>.<var>ext</var></code>, where <var>ext</var> is the
system-specific extension for dynamic libraries. Also by convention,
the function provided by that library is named
<code>tree_sitter_<var>language</var></code>.  If a language definition doesn&rsquo;t
follow this convention, you should add an entry
</p>
<div class="example">
<pre class="example">(<var>language</var> <var>library-base-name</var> <var>function-name</var>)
</pre></div>

<p>to <code>treesit-load-name-override-list</code>, where
<var>library-base-name</var> is the base filename for the dynamic library
(conventionally <code>libtree-sitter-<var>language</var></code>), and
<var>function-name</var> is the function provided by the library
(conventionally <code>tree_sitter_<var>language</var></code>). For example,
</p>
<div class="example">
<pre class="example">(cool-lang &quot;libtree-sitter-coool&quot; &quot;tree_sitter_cooool&quot;)
</pre></div>

<p>for a language too cool to abide by conventions.
</p>
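<p>The naming rule above can be pictured with a small JavaScript sketch.
This is not how Emacs implements the lookup; the helper
<code>resolveNames</code> and its array-based override list are hypothetical
and only mirror the shape of the entries described above.
</p>
<div class="example">
<pre class="example">// Hypothetical helper: resolve the library base name and the function
// name for a language, consulting an override list first.
function resolveNames(language, overrideList) {
  for (const entry of overrideList) {
    // Each entry mirrors (language library-base-name function-name).
    if (entry[0] === language) {
      return { library: entry[1], func: entry[2] };
    }
  }
  // No override found: fall back to the naming convention.
  return {
    library: 'libtree-sitter-' + language,
    func: 'tree_sitter_' + language,
  };
}

const overrides = [
  ['cool-lang', 'libtree-sitter-coool', 'tree_sitter_cooool'],
];
const c = resolveNames('c', overrides);
const cool = resolveNames('cool-lang', overrides);
console.log(c.library, c.func);
console.log(cool.library, cool.func);
</pre></div>

<p>In this sketch <code>c</code> falls back to the conventional names, while
<code>cool-lang</code> takes both names from its override entry.
</p>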
<dl class="def">
<dt id="index-treesit_002dlanguage_002dversion"><span class="category">Function: </span><span><strong>treesit-language-version</strong> <em>&amp;optional min-compatible</em><a href='#index-treesit_002dlanguage_002dversion' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>The tree-sitter library has a <em>language version</em>; a language
definition&rsquo;s version needs to match this version to be compatible.
</p>
<p>This function returns the tree-sitter library’s language version.  If
<var>min-compatible</var> is non-nil, it returns the minimal compatible
version.
</p></dd></dl>

<span id="Concrete-syntax-tree"></span><h3 class="heading">Concrete syntax tree</h3>

<p>A syntax tree is what a parser generates.  In a syntax tree, each node
represents a piece of text, and nodes are connected to each other by
parent-child relationships.  For example, if the source text is
</p>
<div class="example">
<pre class="example">1 + 2
</pre></div>

<p>its syntax tree could be
</p>
<div class="example">
<pre class="example">                  +--------------+
                  | root &quot;1 + 2&quot; |
                  +--------------+
                         |
        +--------------------------------+
        |       expression &quot;1 + 2&quot;       |
        +--------------------------------+
           |             |            |
+------------+   +--------------+   +------------+
| number &quot;1&quot; |   | operator &quot;+&quot; |   | number &quot;2&quot; |
+------------+   +--------------+   +------------+
</pre></div>

<p>We can also represent it as an s-expression:
</p>
<div class="example">
<pre class="example">(root (expression (number) (operator) (number)))
</pre></div>
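
<p>As a rough illustration of how the s-expression follows from the tree,
here is a minimal JavaScript sketch.  The node shape (an object with
<code>type</code> and <code>children</code>) and the helpers <code>toSexp</code> and
<code>leaf</code> are assumptions made for this example, not part of
tree-sitter or Emacs.
</p>
<div class="example">
<pre class="example">// Hypothetical node shape: { type: string, children: array of nodes }.
function toSexp(node) {
  const parts = [node.type];
  for (const child of node.children) {
    // Each child is rendered recursively inside its parent's parentheses.
    parts.push(toSexp(child));
  }
  return '(' + parts.join(' ') + ')';
}

// Build a node that has no children.
function leaf(type) {
  return { type: type, children: [] };
}

// The tree for the source text "1 + 2" shown in the diagram.
const tree = {
  type: 'root',
  children: [
    {
      type: 'expression',
      children: [leaf('number'), leaf('operator'), leaf('number')],
    },
  ],
};

console.log(toSexp(tree));
// prints (root (expression (number) (operator) (number)))
</pre></div>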

<span id="Node-types"></span><h4 class="subheading">Node types</h4>

<span id="index-tree_002dsitter-node-type"></span>
<span id="tree_002dsitter-node-type"></span><span id="index-tree_002dsitter-named-node"></span>
<span id="tree_002dsitter-named-node"></span><span id="index-tree_002dsitter-anonymous-node"></span>
<p>Names like <code>root</code>, <code>expression</code>, <code>number</code>, and
<code>operator</code> are the nodes&rsquo; <em>types</em>.  However, not all nodes in a
syntax tree have a type.  Nodes that don&rsquo;t are <em>anonymous nodes</em>,
and nodes with a type are <em>named nodes</em>.  Anonymous nodes are
tokens with fixed spellings, including punctuation characters like
the bracket &lsquo;<samp>]</samp>&rsquo;, and keywords like <code>return</code>.
</p>
<span id="Field-names"></span><h4 class="subheading">Field names</h4>

<span id="index-tree_002dsitter-node-field-name"></span>
<span id="tree_002dsitter-node-field-name"></span><p>To make the syntax tree easier to
analyze, many language definitions assign <em>field names</em> to child
nodes.  For example, a <code>function_definition</code> node could have a
<code>declarator</code> and a <code>body</code>:
</p>
<div class="example">
<pre class="example">(function_definition
 declarator: (declaration)
 body: (compound_statement))
</pre></div>
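
<p>The benefit of field names is that a child can be found by name rather
than by position.  The following JavaScript sketch illustrates the idea
under assumed data structures: the node shape and the helpers
<code>toSexpWithFields</code> and <code>childByFieldName</code> are hypothetical
and are not tree-sitter or Emacs functions.
</p>
<div class="example">
<pre class="example">// Hypothetical node shape: { type, children }, where each child is
// { field, node } and field is null for a child without a field name.
function toSexpWithFields(node) {
  const parts = [node.type];
  for (const child of node.children) {
    const text = toSexpWithFields(child.node);
    parts.push(child.field === null ? text : child.field + ': ' + text);
  }
  return '(' + parts.join(' ') + ')';
}

// Look up a child by its field name instead of by its position.
function childByFieldName(node, name) {
  for (const child of node.children) {
    if (child.field === name) {
      return child.node;
    }
  }
  return null;
}

const definition = {
  type: 'function_definition',
  children: [
    { field: 'declarator', node: { type: 'declaration', children: [] } },
    { field: 'body', node: { type: 'compound_statement', children: [] } },
  ],
};

console.log(toSexpWithFields(definition));
console.log(childByFieldName(definition, 'body').type);
</pre></div>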

<dl class="def">
<dt id="index-treesit_002dinspect_002dmode"><span class="category">Command: </span><span><strong>treesit-inspect-mode</strong><a href='#index-treesit_002dinspect_002dmode' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>This minor mode displays the node that <em>starts</em> at point in the
mode-line.  The mode-line will display
</p>
<div class="example">
<pre class="example"><var>parent</var> <var>field-name</var>: (<var>child</var> (<var>grand-child</var> (...)))
</pre></div>

<p><var>child</var>, <var>grand-child</var>, <var>grand-grand-child</var>, etc., are
nodes that begin at point, and <var>parent</var> is the
parent of <var>child</var>.
</p>
<p>If there is no node that starts at point, i.e., point is in the middle
of a node, then the mode-line only displays the smallest node that
spans point, and its immediate parent.
</p>
<p>This minor mode doesn&rsquo;t create parsers on its own.  It simply uses the
first parser in <code>(treesit-parser-list)</code> (see <a href="Using-Parser.html">Using Tree-sitter Parser</a>).
</p></dd></dl>

<span id="Reading-the-grammar-definition"></span><h3 class="heading">Reading the grammar definition</h3>

<p>Authors of language definitions define the <em>grammar</em> of a
language, and this grammar determines how a parser constructs a
concrete syntax tree out of the text.  In order to use the syntax
tree effectively, we need to read the <em>grammar file</em>.
</p>
<p>The grammar file is usually <code>grammar.js</code> in a language
definition’s project repository.  The link to a language definition’s
home page can be found on tree-sitter’s homepage
(<a href="https://tree-sitter.github.io/tree-sitter">https://tree-sitter.github.io/tree-sitter</a>).
</p>
<p>The grammar is written in JavaScript syntax.  For example, the rule
matching a <code>function_definition</code> node looks like
</p>
<div class="example">
<pre class="example">function_definition: $ =&gt; seq(
  $.declaration_specifiers,
  field('declarator', $.declaration),
  field('body', $.compound_statement)
)
</pre></div>

<p>The rule is represented by a function that takes a single argument
<var>$</var>, representing the whole grammar.  The body of the function is
built from other functions: the <code>seq</code> function puts together a
sequence of children; the <code>field</code> function annotates a child with
a field name.  If we write the above definition in BNF syntax, it
would look like
</p>
<div class="example">
<pre class="example">function_definition :=
  &lt;declaration_specifiers&gt; &lt;declaration&gt; &lt;compound_statement&gt;
</pre></div>

<p>and the node returned by the parser would look like
</p>
<div class="example">
<pre class="example">(function_definition
  (declaration_specifiers)
  declarator: (declaration)
  body: (compound_statement))
</pre></div>

<p>Below is a list of functions that one will see in a grammar
definition.  Each function takes other rules as arguments and returns
a new rule.
</p>
<ul>
<li> <code>seq(rule1, rule2, ...)</code> matches each rule one after another.

</li><li> <code>choice(rule1, rule2, ...)</code> matches one of the rules in its
arguments.

</li><li> <code>repeat(rule)</code> matches <var>rule</var> <em>zero or more</em> times.
This is like the &lsquo;<samp>*</samp>&rsquo; operator in regular expressions.

</li><li> <code>repeat1(rule)</code> matches <var>rule</var> <em>one or more</em> times.
This is like the &lsquo;<samp>+</samp>&rsquo; operator in regular expressions.

</li><li> <code>optional(rule)</code> matches <var>rule</var> <em>zero or one</em> time.
This is like the &lsquo;<samp>?</samp>&rsquo; operator in regular expressions.

</li><li> <code>field(name, rule)</code> assigns field name <var>name</var> to the child
node matched by <var>rule</var>.

</li><li> <code>alias(rule, alias)</code> makes nodes matched by <var>rule</var> appear as
<var>alias</var> in the syntax tree generated by the parser.  For example,

<div class="example">
<pre class="example">alias(preprocessor_call_exp, call_expression)
</pre></div>

<p>makes any node matched by <code>preprocessor_call_exp</code> appear as
<code>call_expression</code>.
</p></li></ul>
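
<p>To make the idea of &ldquo;functions that take rules and return a new rule&rdquo;
concrete, here is a toy JavaScript model of a few of the combinators above.
It is only a sketch under simplifying assumptions: a rule is modeled as a
function over a list of tokens, matching is greedy with no backtracking,
and no syntax tree is built.  The helpers <code>tok</code> and
<code>matchesAll</code> are hypothetical, and none of this reflects how
tree-sitter itself is implemented.
</p>
<div class="example">
<pre class="example">// A rule is modeled as a function (tokens, pos) that returns the
// position after the match, or null when it does not match.
function tok(text) {
  return function (tokens, pos) {
    return tokens[pos] === text ? pos + 1 : null;
  };
}

// Match each rule one after another.
function seq(...rules) {
  return function (tokens, pos) {
    for (const rule of rules) {
      pos = rule(tokens, pos);
      if (pos === null) return null;
    }
    return pos;
  };
}

// Match the first rule that succeeds.
function choice(...rules) {
  return function (tokens, pos) {
    for (const rule of rules) {
      const next = rule(tokens, pos);
      if (next !== null) return next;
    }
    return null;
  };
}

// Match rule zero or more times.
function repeat(rule) {
  return function (tokens, pos) {
    for (;;) {
      const next = rule(tokens, pos);
      // Stop when the rule fails or consumes nothing.
      if (next === null || next === pos) return pos;
      pos = next;
    }
  };
}

// Match rule zero or one time.
function optional(rule) {
  return function (tokens, pos) {
    const next = rule(tokens, pos);
    return next === null ? pos : next;
  };
}

// True when rule matches the whole token list.
function matchesAll(rule, tokens) {
  return rule(tokens, 0) === tokens.length;
}

// A number, then any count of "+ number", then an optional ";".
const number = choice(tok('1'), tok('2'));
const expression = seq(
  number,
  repeat(seq(tok('+'), number)),
  optional(tok(';'))
);

console.log(matchesAll(expression, ['1', '+', '2']));
console.log(matchesAll(expression, ['1', '+', '2', ';']));
console.log(matchesAll(expression, ['1', '+']));
</pre></div>

<p>In this model the first two token lists match and the third does not,
because the trailing &lsquo;<samp>+</samp>&rsquo; is not followed by a number.
</p>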

<p>Below are grammar functions that are of less interest to a reader of a
language definition.
</p>
<ul>
<li> <code>token(rule)</code> marks <var>rule</var> to produce a single leaf node.
That is, instead of generating a parent node with individual child
nodes under it, everything is combined into a single leaf node.

</li><li> Normally, grammar rules ignore preceding whitespace;
<code>token.immediate(rule)</code> changes <var>rule</var> to match only when
there is no preceding whitespace.

</li><li> <code>prec(n, rule)</code> gives <var>rule</var> a precedence of level <var>n</var>.

</li><li> <code>prec.left([n,] rule)</code> marks <var>rule</var> as left-associative,
optionally with level <var>n</var>.

</li><li> <code>prec.right([n,] rule)</code> marks <var>rule</var> as right-associative,
optionally with level <var>n</var>.

</li><li> <code>prec.dynamic(n, rule)</code> is like <code>prec</code>, but the precedence
is applied at runtime instead.
</li></ul>

<p>The tree-sitter project documents how to write a grammar in more detail:
<a href="https://tree-sitter.github.io/tree-sitter/creating-parsers">https://tree-sitter.github.io/tree-sitter/creating-parsers</a>.
See especially the &ldquo;The Grammar DSL&rdquo; section.
</p>
</div>
<hr>
<div class="header">
<p>
Next: <a href="Using-Parser.html">Using Tree-sitter Parser</a>, Up: <a href="Parsing-Program-Source.html">Parsing Program Source</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html" title="Index" rel="index">Index</a>]</p>
</div>



</body>
</html>