diff --git a/doc/src/README.md b/doc/src/README.md
index a005426..2fdc81f 100644
--- a/doc/src/README.md
+++ b/doc/src/README.md
@@ -1,14 +1,19 @@
 # Presentation
 
-This book will introduce you to parsing and transliteration, using Beans. Beans is written in
-[Rust](https://www.rust-lang.org), and henceforth this book will assume familiarity with this
-language. However, this book makes no assumptions on prior knowledge on parsing techniques. The
-end goal is to allow someone who has never written or used a parser to quickly become productive
-at writing and using parsing libraries.
+This book will introduce you to parsing and transliteration, using
+Beans. Beans is written in [Rust](https://www.rust-lang.org), and this
+book will assume familiarity with that language. However, it makes no
+assumptions about prior knowledge of parsing techniques. The end goal
+is to allow someone who has never written or used a parser to quickly
+become productive at writing and using parsing libraries.
 
-Beans aims at being a general-purpose parser and lexer library, providing both enough
-performance so that you should never *need* something faster (even though these options exist),
-and enough expressiveness so that you never get stuck while using your parser. See the
-[tradeoffs](details/tradeoff.md) section for more details.
+Beans aims to be a general-purpose parser and lexer library, providing
+both enough performance that you should never *need* something faster
+(even though faster options exist), and enough expressiveness that you
+never get stuck while using your parser. See the
+[tradeoffs](details/tradeoff.md) section for more details.
 
-Beans is free and open source, dual licensed MIT or GPL3+, at your choice.
+Beans is free and open source, dual-licensed under MIT or GPL3+, at
+your choice.
diff --git a/doc/src/concepts/README.md b/doc/src/concepts/README.md
index e4252fe..e9545fc 100644
--- a/doc/src/concepts/README.md
+++ b/doc/src/concepts/README.md
@@ -1,18 +1,22 @@
 # Common Concepts
 
-When parsing with Beans, as with most other similar tools, three steps are performed, in this
-order:
+When parsing with Beans, as with most other similar tools, three steps
+are performed, in this order:
 * [Lexing](lexer.md)
 * [Parsing](parser.md)
 * [Syntax tree building](ast.md)
 
-The first step, lexing, operates directly on plain text inputs, while the last is in charge of
-producing the abstract syntax tree. For more details on the operations that can be performed on
-the latter, please refer to the [Rewriting the AST Chapter](ast/README.md).
+The first step, lexing, operates directly on plain text input, while
+the last is in charge of producing the abstract syntax tree. For more
+details on the operations that can be performed on the latter, please
+refer to the chapter [Rewriting the AST](ast/README.md).
 
 # Simple arithmetic expression
 
-Throughout the explanation of the core concepts of parsing, some simple grammars will be written
-to allow parsing a language of simple arithmetic expressions, consisting of numbers or binary
-operations (addition, multiplication, subtraction and division) on expressions. All the grammars
-will be available at https://github.com/jthulhu/beans, in the directory `doc/examples/arith`.
+Throughout the explanation of the core concepts of parsing, some
+simple grammars will be written to allow parsing a language of simple
+arithmetic expressions, consisting of numbers or binary operations
+(addition, multiplication, subtraction and division) on
+expressions.
+All the grammars will be available at
+https://github.com/jthulhu/beans, in the directory
+`doc/examples/arith`.
diff --git a/doc/src/concepts/grammars.md b/doc/src/concepts/grammars.md
index 00e1ef7..b2d294d 100644
--- a/doc/src/concepts/grammars.md
+++ b/doc/src/concepts/grammars.md
@@ -1 +1,2 @@
 # Grammars
+
diff --git a/doc/src/concepts/lexer.md b/doc/src/concepts/lexer.md
index a8336cd..aaaa1a8 100644
--- a/doc/src/concepts/lexer.md
+++ b/doc/src/concepts/lexer.md
@@ -2,62 +2,76 @@
 ## What does a lexer do?
 
-A lexer performs the initial, important step of grouping together characters that couldn't be
-morphologically split, while removing useless ones. For instance, in most programming languages,
-spaces are only useful to split words, they do not have any intrinsic meaning. Therefore, they
-should be dumped by the lexer, whereas all the characters that form an identifier or a keyword
-should be grouped together to form a single *token*.
+A lexer performs the initial, important step of grouping together
+characters that belong to a single indivisible unit of the input,
+while discarding useless ones. For instance, in most programming
+languages, spaces are only useful to separate words; they have no
+intrinsic meaning. Therefore, they should be dropped by the lexer,
+whereas all the characters that form an identifier or a keyword should
+be grouped together to form a single *token*.
 
-> Note: a *token*, also called a *terminal symbol* or more shortly a *terminal*, is a minimal
-> span of text of the input with an identified meaning. For instance, any identifier, keyword
-> or operator would be considered a token.
+> Note: a *token*, also called a *terminal symbol* or, more shortly, a
+> *terminal*, is a minimal span of the input text with an identified
+> meaning. For instance, any identifier, keyword or operator would be
+> considered a token.
 
-Both the parser and the lexer in Beans use online algorithms, meaning that they will consume
-their input as they process it. Beans' lexer will consume the input string one unicode character
-at a time. The lexer might backtrack, but this is, in practice, very rare. Non-degenerate
-grammars will never trigger such backtracking.
+Both the parser and the lexer in Beans use *online* algorithms, in the
+algorithmic sense of the word: they consume their input incrementally,
+as they process it. Beans' lexer will consume the input string one
+Unicode character at a time. The lexer might backtrack, but this is,
+in practice, very rare. Non-degenerate grammars will never trigger
+such backtracking.
 
-As the lexer reads the input, it will produce tokens. Sometimes (as with whitespace), it will
-discard them. Other times, it might forget what the exact characters where, it will just remember
-which token has been read.
+As the lexer reads the input, it will produce tokens. Sometimes (as
+with whitespace), it will discard them. Other times, it will forget
+what the exact characters were and just remember which token has been
+read.
 
 ## Regular expression
 
-Each terminal in Beans is recognized by matching its associated regular expression. Prior
-knowledge of regular expressions is assumed. Since regular expressions have loads of different
-specifications, here is an exhaustive list of features allowed in Beans regular expressions,
-besides the usual disjunction operator `|`, character classes `[...]` or `[^...]` and repetition
-with `+`, `*` and `?`.
+Each terminal in Beans is recognized by matching its associated
+regular expression.
+Prior knowledge of regular expressions is assumed. Since regular
+expressions have many different specifications, here is an exhaustive
+list of features allowed in Beans regular expressions, besides the
+usual disjunction operator `|`, character classes `[...]` or `[^...]`
+and repetition with `+`, `*` and `?`.
+
+> Note: `ϵ` denotes the empty string; a pattern that matches `ϵ`
+> consumes no input.
 
 | Escaped character | Name           | Meaning                                                                    |
 |-------------------|----------------|----------------------------------------------------------------------------|
-| `\b`              | Word bounary   | matches `ϵ` if the previous or the next character are not word characters |
+| `\b`              | Word boundary  | matches `ϵ` if the previous or the next character is not a word character |
 | `\w`              | Word character | equivalent to [a-zA-Z0-9]                                                  |
 | `\t`              | Tabulation     | matches a tabulation                                                       |
-| `\Z` or `\z`      | End of file    | matches `ϵ` at the end of the line                                        |
+| `\Z` or `\z`      | End of file    | matches `ϵ` at the end of the file                                        |
 | `\d`              | Digit          | equivalent to [0-9]                                                        |
 | `\n`              | Newline        | matches an end of line                                                     |
 | `\s`              | Whitespace     | matches whatever unicode considers whitespace                              |
-|                   |                |                                                                            |
 
 # Simple arithmetic lexer
 
-Let's try to write a lexer grammar for the simple arithmetic expression language. Ideally, we
-would like to parse expressions such as `1+2*3`. So let's start by defining an integer token.
-In `arith.lx`, write
+Let's try to write a lexer grammar for the simple arithmetic
+expression language. Ideally, we would like to parse expressions such
+as `1+2*3`. So let's start by defining an integer token. In
+`arith.lx`, write
 ```beans-lx
 INTEGER ::= \d+
 ```
-Let's get through this first definition. `INTEGER` is the name of the terminal, whereas what is
-on the right side of `::=` is the regular expression used to match it.
+Let's go through this first definition. `INTEGER` is the name of the
+terminal, whereas what is on the right side of `::=` is the regular
+expression used to match it.
 
-> Note: spaces between `::=` and the start of the regular expression are ignored, but every other
-> space will be taken into account, including trailing ones, which are easy to overlook. If the
-> regular expression starts with a space, you can always wrap it in a singleton class `[ ]`.
+> Note: spaces between `::=` and the start of the regular expression
+> are ignored, but every other space will be taken into account,
+> including trailing ones, which are easy to overlook. If the regular
+> expression starts with a space, you can always wrap it in a
+> singleton class `[ ]`.
 
-> Note: terminals are always SCREAMING CASED. While this is not very readable nor practical to
-> type, it is coherent with the literature, and will allow you to distinguish between variables
-> (which will be snake_cased), non terminals (which will be Pascal Cased) and terminals later on.
+> Note: terminals are always SCREAMING CASED. While this is neither
+> very readable nor practical to type, it is consistent with the
+> literature, and will allow you to distinguish between variables
+> (which will be snake_cased), non terminals (which will be
+> PascalCased) and terminals later on.
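+
+To make the escaped characters from the table above concrete, here is
+a hypothetical pair of terminals, written in the same style as
+`INTEGER` (these names are purely illustrative and not part of the
+arithmetic grammar we are building):
+```beans-lx
+IF ::= if\b
+IDENTIFIER ::= \w+
+```
+Because `\b` only matches when the next character is not a word
+character, the input `iffy` cannot start with an `IF` token, and is
+instead lexed as a single `IDENTIFIER`.
+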
 We can also add the terminals for the four other operators
 ```beans-lx
@@ -66,7 +80,8 @@ MULTIPLY ::= \*
 SUBTRACT ::= -
 DIVIDE ::= /
 ```
-If we were to try to lex a file `input` containing the expression `1+2*3`, we would get
+If we were to try to lex a file `input` containing the expression
+`1+2*3`, we would get
 ```bash
 $ beans lex --lexer arith.lx input
 INTEGER
@@ -77,12 +92,14 @@ INTEGER
 Error: Could not lex anything in file input, at character 5 of line 1.
 $
 ```
-This is bad for two reasons. The first is, of course, that we get an error. This is because our
-file ended with a newline `\n`, and that there is no terminal that matches it. In fact, we would
-also have a problem if we tried to lex `1 + 2*3`, because no terminal can read spaces. However,
-we also *don't* want to produce any token related to such spaces: `1+2*3` and `1 + 2*3` should
-be lexed indentically. Thus we will introduce a `SPACE` token with the `ignore` flag, telling
-the lexer not to output it. Similarly for `NEWLINE`.
+This is bad for two reasons. The first is, of course, that we get an
+error. This is because our file ended with a newline `\n`, and there
+is no terminal that matches it. In fact, we would also have a problem
+if we tried to lex `1 + 2*3`, because no terminal can read
+spaces. However, we also *don't* want to produce any token related to
+such spaces: `1+2*3` and `1 + 2*3` should be lexed identically. Thus
+we will introduce a `SPACE` token with the `ignore` flag, telling the
+lexer not to output it. Similarly for `NEWLINE`.
 ```beans-lx
 ignore SPACE ::= \s+
 ignore NEWLINE ::= \n+
@@ -99,15 +116,16 @@ $
 ```
 Nice!
 
-However, we now face the second issue: it was probably wise to forget the specific character that
-was lexed to `ADD` or `MULTIPLY`, because we don't care; but we don't want to forget the actual
-integer we lexed. To correct this, we will use regex groups. In `arith.lx`, we will replace the
-definition of `INTEGER` with
+However, we now face the second issue: it was probably wise to forget
+the specific character that was lexed to `ADD` or `MULTIPLY`, because
+we don't care; but we don't want to forget the actual integer we
+lexed. To correct this, we will use regex groups. In `arith.lx`, we
+will replace the definition of `INTEGER` with
 ```beans-lx
 INTEGER ::= (\d+)
 ```
-This will create a group that will contain everything that `\d+` will match, and this information
-will be passed with the created token.
+This will create a group that will contain everything that `\d+`
+matches, and this information will be passed along with the created
+token.
 ```bash
 $ beans lex --lexer arith.lx input
 INTEGER {0: 1}
@@ -118,4 +136,3 @@ INTEGER {0: 3}
 $
 ```
 We will see in the next section how to manipulate a stream of tokens.
-
diff --git a/doc/src/concepts/parser.md b/doc/src/concepts/parser.md
index ec1c09a..2b23a3c 100644
--- a/doc/src/concepts/parser.md
+++ b/doc/src/concepts/parser.md
@@ -1,12 +1,13 @@
 # Parser
 
-The parser is given a stream of tokens, which is a "flat" representation of the input, in the
-sense that every part of it is at the same leve, and should transform it into a Concrete Syntax
-Tree (CST).
-
-> Note: a CST is a tree whose leaves are terminals, and no inner node is a terminal. It
-> represents a way the input was understood. For instance, given the input `1+2*3`, a CST could
-> be
+The parser is given a stream of tokens, which is a "flat"
+representation of the input, in the sense that every part of it is at
+the same level, and should transform it into a Concrete Syntax Tree
+(CST).
+
+> Note: a CST is a tree whose leaves are terminals, and no inner node
+> is a terminal. It represents the way the input was understood. For
+> instance, given the input `1+2*3`, a CST could be
 > ```
 > Expression
 > ┌───────┘│└───────┐
@@ -16,12 +17,13 @@ Tree (CST).
 > ```
 > Inner nodes of a syntax tree are called *non terminals*.
 
- In a Concrete Syntax Tree, every single token is remembered. This can be annoying,
-as we usually want to forget tokens: if a given token held some information, we can extract that
-information before dumping the token, but having the actual token is not very useful.
+In a Concrete Syntax Tree, every single token is remembered. This can
+be annoying, as we usually want to forget tokens: if a given token
+held some information, we can extract that information before dumping
+the token, but having the actual token is not very useful.
 
-After having pruned the CST from tokens (while still having extracted the useful information),
-we get an Abstract Syntax Tree (AST).
+After having pruned the tokens from the CST (while still having
+extracted the useful information), we get an Abstract Syntax Tree
+(AST).
 
 > The AST for the input `1+2*3` might look like
 > ```
@@ -31,28 +33,33 @@ we get an Abstract Syntax Tree (AST).
 > ┌───────┘ └───────┐
 > 2 3
 > ```
-> All tokens have disappeared. From `INTEGER` tokens, we have extracted the actual number that
-> was lexed, and we have remember that each `Expression` corresponds to a certain operation, but
-> the token corresponding to that operation has also been forgotten.
+> All tokens have disappeared. From `INTEGER` tokens, we have
+> extracted the actual number that was lexed, and we have remembered
+> that each `Expression` corresponds to a certain operation, but the
+> token corresponding to that operation has been forgotten.
 
-Similarly to terminals, non terminals are defined by telling Beans how to recognise them. Regulax
-expressions, however, are not powerful enough for general parsing. Therefore, non terminals use
-production rules instead.
+Similarly to terminals, non terminals are defined by telling Beans how
+to recognise them. Regular expressions, however, are not powerful
+enough for general parsing. Therefore, non terminals use production
+rules instead.
 
 # Production rules
 
-Production rules are at the core of the recognition and syntax-tree building steps of every
-parser, but there are several (equivalent) ways of understanding them. These different point of
-view in turn produce very different parsing algorithms.
+Production rules are at the core of the recognition and syntax-tree
+building steps of every parser, but there are several (equivalent)
+ways of understanding them. These different points of view in turn
+produce very different parsing algorithms.
 
 ## Production rules as recognisers (bottom-up)
 
-A production rule is of the form `A -> A_1 ... A_n`, and means that the non terminal `A` can be
-recognised if `A_1` through `A_n` where recognised before.
+A production rule is of the form `A -> A_1 ... A_n`, and means that
+the non terminal `A` can be recognised if `A_1` through `A_n` were
+recognised before.
-For instance, for our simple arithmetic expression language, we could define a single non
-terminal `Expression` with the following production rules
+For instance, for our simple arithmetic expression language, we could
+define a single non terminal `Expression` with the following
+production rules
 ```
 Expression -> Expression ADD Expression
 Expression -> Expression MULTIPLY Expression
@@ -60,93 +67,115 @@ terminal `Expression` with the following production rules
 Expression -> Expression DIVIDE Expression
 Expression -> INTEGER
 ```
-This matches the definition of an expression we gave earlier
-> [expressions are] numbers or binary operations (addition, multiplication, subtraction and
-> division) on expressions.
-
-On the input `1+2*3`, which has been lexed to `INTEGER ADD INTEGER MULTIPLY INTEGER` (note that,
-at this step, we don't care about information that tokens hold, such as the actual value of an
-integer; these don't come into play when doing a syntaxic analysis), a parser could analyze it
-in the following way.
-
-1. Every `INTEGER` token is a valid `Expression`, so we can replace them by the `Expression` non
-   terminal.
-   We get `Expression ADD Expression MULTIPLY Expression`.
-> The operation of finding a place in the input that matches the right-hand side of a production
-> rule and replacing it with its non terminal on the left-hand size is called a *reduction*.
-> The place in the input where the reduction occurs is called a *handle*.
-2. `Expression MULTIPLY Expression` is a *handle* for `Expression`, so we *reduce it*.
-   We get `Expression ADD Expression`.
-3. Finally, `Expression ADD Expression` is a handle `Expression` too, so after reduction we have
-   left `Expression`.
-
-Here, our recognition ends successfully: the input `1+2*3` is an arithmetic expression, or at
-least according to our definition.
+
+This matches the definition of an expression we gave in the
+[chapter introduction](README.md):
+> [expressions are] numbers or binary operations (addition,
+> multiplication, subtraction and division) on expressions.
+
+On the input `1+2*3`, which has been lexed to `INTEGER ADD INTEGER
+MULTIPLY INTEGER` (note that, at this step, we don't care about
+information that tokens hold, such as the actual value of an integer;
+these don't come into play when doing a syntactic analysis), a parser
+could analyze it in the following way.
+
+1. Every `INTEGER` token is a valid `Expression`, so we can replace
+   them with the `Expression` non terminal. We get `Expression ADD
+   Expression MULTIPLY Expression`.
+> The operation of finding a place in the input that matches the
+> right-hand side of a production rule and replacing it with the non
+> terminal on its left-hand side is called a *reduction*. The place
+> in the input where the reduction occurs is called a *handle*.
+2. `Expression MULTIPLY Expression` is a *handle* for `Expression`, so
+   we *reduce it*. We get `Expression ADD Expression`.
+3. Finally, `Expression ADD Expression` is a handle for `Expression`
+   too, so after reduction we are left with `Expression`.
+
+Here, our recognition ends successfully: the input `1+2*3` is an
+arithmetic expression, at least according to our definition.
 
 There are several things to note on this example.
 
-> Note 1: at step 2., an `Expression` could have been recognised in different places in the
-> partially-recognised input `Expression ADD Expression MULTIPLY Expression`. These recognition
-> point are called *handles*. There is a very important difference between choose
-> `Expression ADD Expression` as then handle to perform the recognition of `Expression`, and
-> choosing `Expression MULTIPLY Expression`, because one would end up with a tree that matches
-> the parenthesing of `(1+2)*3` and the other `1+(2*3)`. If we were to, say, evaluate these
-> expression, we wouldn't get the same result.
-> So, for this grammar, Beans would have to choose between which rule to apply, and this decision
-> is crucial in the result. We will see later on how to instruct Beans to apply the "good" rule
-> (which, in this case, would be the one that leads to parsing as `1+(2*3)`, if we want to apply
-> the usual operator precedence).
-
-> Note 2: in this example, we have limited ourselves to recognise the input, not trying to parse
-> it. It wouldn't be too hard to expand our current "algorithm" to remember which reductions have
-> been applied, and in turn construct a syntax tree from that information, but we won't try to
-> do this *yet*.
-
-The order in which we have recognised the input is called "bottom-up", because we have started
-with the terminals, and iteratively replaced them with non terminals, ending up with a single
-non terminal (if the recognition goes well). Since in the end we want to produce a syntax tree,
-and that in a tree, the root is at the top, whereas the leaves are at the bottom, we have
-effectively traversed that tree start from the bottom all the way up. But we could have done the
-opposite...
+> Note 1: at step 2, an `Expression` could have been recognised in
+> different places in the partially-recognised input `Expression ADD
+> Expression MULTIPLY Expression`. These recognition points are called
+> *handles*. There is a very important difference between choosing
+> `Expression ADD Expression` as the handle to perform the recognition
+> of `Expression`, and choosing `Expression MULTIPLY Expression`,
+> because in the first case we would end up with a tree that matches
+> the parenthesization of `(1+2)*3`; in the second one we would obtain
+> `1+(2*3)`. If we were to, say, evaluate these expressions, we
+> wouldn't get the same result. So, for this grammar, Beans would
+> have to choose which rule to apply, and this decision is crucial to
+> the result. We will see later on how to instruct Beans to apply the
+> "good" rule (which, in this case, would be the one that leads to
+> parsing as `1+(2*3)`, if we want to apply the usual operator
+> precedence).
+
+> Note 2: in this example, we have limited ourselves to recognising
+> the input, without trying to parse it. It wouldn't be too hard to
+> expand our current "algorithm" to remember which reductions have
+> been applied, and in turn construct a syntax tree from that
+> information, but we won't try to do this *yet*.
+
+The order in which we have recognised the input is called "bottom-up",
+because we have started with the terminals, and iteratively replaced
+them with non terminals, ending up with a single non terminal (if the
+recognition goes well). Since in the end we want to produce a syntax
+tree, and, in a tree, the root is at the top whereas the leaves are at
+the bottom, we have effectively traversed that tree starting from the
+bottom all the way up. But we could have done the opposite...
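+
+Before looking at that opposite direction, here is the whole bottom-up
+run condensed in one place (this trace notation is ours, for
+illustration; it is not actual Beans output):
+```
+INTEGER ADD INTEGER MULTIPLY INTEGER            lexed input
+Expression ADD Expression MULTIPLY Expression   reduce each INTEGER
+Expression ADD Expression                       reduce the MULTIPLY handle
+Expression                                      reduce the ADD handle
+```
+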
 ## Production rules as generators (top-down)
 
-So far, it might not be clear why production rules are called as such, when we have happily been
-using them as recognition rules instead; even the arrow seems in the wrong direction: when we
-apply a reduction, we transform the right-hand side into the left-hand side of a rule.
-
-Now, we will see that production rules can be used instead to *generate* valid expressions.
-Starting with a single non terminal `Expression`, we can *expand* it to
-`Expression ADD Expression` using the corresponding production rule. The first `Expression`
-can further be expanded to `INTEGER`, using the last rule, to get `INTEGER ADD Expression`.
-If we expand `Expression` with the multiplication rule, we get
-`INTEGER ADD Expression MULTIPLY Expression`. Again, by expanding all `Expression`s with
-`INTEGER`, we get `INTEGER ADD INTEGER MULTIPLY INTEGER`. Notice that this correponds to the
-input `1+2*3`, and so `1+2*3` is a valid expression!
-
-This "algorithm" might seem a little weird at first, because we have too many choices! In the
-previous one, we had only one choice, and by taking the "wrong" option we could have ended with
-the wrong parenthesing *if we decided to build a syntax tree*. Otherwise, both options were ok.
-Here, we had to choose which `Expression` to expand at each step and, more importantly, which
-rule to apply for the expansion. Note that we could easily have blocked ourselves by expanding
-`Expression` to `INTEGER` right away, or we could have kept expanding forever, only ever applying
-the `Expression -> Expression ADD Expression` rule.
-
-While this seems a lot more complicated than its bottom-up counterpart, top-down algorithms are
-usually much easier to program, mainly because it often suffices to look at a few tokens to
-"guess" what the right expansion is at any moment.
-
-Correspondingly to the bottom-up strategy, if we were to look at how we traverse the syntax
-tree while building it, this strategy would actually start by examining the root of the tree,
-and we would be visiting the leaves at the end, so we would be traversing the tree top-down.
+So far, it might not be clear why production rules are called as such,
+when we have happily been using them as recognition rules instead;
+even the arrow seems in the wrong direction: when we apply a
+reduction, we transform the right-hand side into the left-hand side of
+a rule.
+
+Now, we will see that production rules can be used instead to
+*generate* valid expressions. Starting with a single non terminal
+`Expression`, we can *expand* it to `Expression ADD Expression` using
+the corresponding production rule. The first `Expression` can further
+be expanded to `INTEGER`, using the last rule, to get `INTEGER ADD
+Expression`. If we expand `Expression` with the multiplication rule,
+we get `INTEGER ADD Expression MULTIPLY Expression`. Again, by
+expanding all `Expression`s with `INTEGER`, we get `INTEGER ADD
+INTEGER MULTIPLY INTEGER`. Notice that this corresponds to the input
+`1+2*3`, and so `1+2*3` is a valid expression!
+
+This "algorithm" might seem a little weird at first, because we have
+too many choices! In the previous one, we had only one choice, and by
+taking the "wrong" option we could have ended up with the wrong
+parenthesization *if we decided to build a syntax tree*. Otherwise,
+both options were fine. Here, we had to choose which `Expression` to
+expand at each step and, more importantly, which rule to apply for the
+expansion.
+Note that we could easily have blocked ourselves by expanding
+`Expression` to `INTEGER` right away, or we could have kept expanding
+forever, only ever applying the `Expression -> Expression ADD
+Expression` rule.
+
+While this seems a lot more complicated than its bottom-up
+counterpart, top-down algorithms are usually much easier to implement,
+mainly because it often suffices to look at a few tokens to "guess"
+what the right expansion is at any moment.
+
+Just as with the bottom-up strategy, we can ask how this strategy
+traverses the syntax tree while building it: it starts by examining
+the root of the tree and only reaches the leaves at the very end, so
+it traverses the tree from the top down, hence the name.
 
 ## Production rules in Beans
 
-Before going further, let's try to write a parser grammar for Beans to recognise simple
-arithmetic expressions. Beans' syntax is a little different from production rules, because the
-parser does not only recognise, it also tries to build up a syntax tree; since we are not (yet)
-interested in doing that, we will ignore some syntax quirks that will appear. In `arith.gr`,
-write
+Before going further, let's try to write a parser grammar for Beans to
+recognise simple arithmetic expressions. Beans' syntax is a little
+different from production rules, because the parser does not only
+recognise the input, it also tries to build up a syntax tree; since we
+are not (yet) interested in doing that, we will ignore some syntax
+quirks that will appear. In `arith.gr`, write
 
 ```beans-gr
 @Expression ::= Expression ADD Expression <>
@@ -155,37 +184,43 @@ write
 Expression DIVIDE Expression <>
 INTEGER <>;
 ```
-We have define the non terminal `Expression` with five production rules. Each production rule
-ends with `<>` (you can ignore this for now), and the whole definition ends with a semicolon.
-
-Furthermore, `Expression` is tagged with `@`, which means it's an *axiom non-terminal*, or, in
-other words, it's the non terminal we are allowed to start from in a top-down stategy. Since we
-only have a single non-terminal for now, this isn't very important (but don't forget it, or it
-won't work!).
+We have defined the non terminal `Expression` with five production
+rules. Each production rule ends with `<>` (you can ignore this for
+now), and the whole definition ends with a semicolon.
+
+Furthermore, `Expression` is tagged with `@`, which means it's an
+*axiom non-terminal*, or, in other words, it's the non terminal we are
+allowed to start from in a top-down strategy. Since we only have a
+single non-terminal for now, this isn't very important (but don't
+forget it, or it won't work!).
 
 ```bash
 $ beans parse --lexer arith.lx --parser arith.gr input
-AST
+Expression
 $
 ```
-Yay! It works. Well, the output isn't very impressive, because Beans prints the syntax tree we
-have produced, but we currently have no rules that manipulate the syntax tree, and in particular
-we don't add any node or leaves to it.
+Yay! It works. Well, the output isn't very impressive, because Beans
+prints the syntax tree we have produced, but we currently have no
+rules that manipulate the syntax tree, and in particular we don't add
+any nodes or leaves to it.
 
-You can also try it on wrong inputs, for example `1+2*` or `1+2*3 4` to check it fails as it
-should.
+You can also try it on wrong inputs, for example `1+2*` or `1+2*3 4`,
+to check that it fails as it should.
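+
+For reference, here is the complete `arith.lx` built up over the
+previous section, collected in one place (the `ADD` definition was
+never shown explicitly; we assume it escapes `+` the same way
+`MULTIPLY` escapes `*`, since `+` is a repetition operator):
+```beans-lx
+INTEGER ::= (\d+)
+ADD ::= \+
+MULTIPLY ::= \*
+SUBTRACT ::= -
+DIVIDE ::= /
+ignore SPACE ::= \s+
+ignore NEWLINE ::= \n+
+```
+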
 # Building a syntax tree
 
-Checking if a string is a valid arithmetic expression is a bit boring. We would like to get more
-information than just a certain string is valid or not. Furthermore, as pointed earlier, our
-grammar is currently ambiguous, meaning that the expression `1+2*3` could be understood in two
-different ways, and it would be interesting to see how Beans solves that ambiguity.
-
-To do so, we need to expand our grammar a little bit. First of all, we might want to bind
-expressions that we use to recognise further expressions. For instance, when we have a
-`Expression ADD Expression` and we recognise an `Expression` there, we would like to remember
-the two sub expressions. To do so, we will add `@name` to every element of a rule that we would
-like to remember under the name `name`.
+Checking if a string is a valid arithmetic expression is a bit
+boring. We would like to get more information than just whether a
+certain string is valid or not. Furthermore, as pointed out earlier,
+our grammar is currently ambiguous, meaning that the expression
+`1+2*3` could be understood in two different ways, and it would be
+interesting to see how Beans solves that ambiguity.
+
+To do so, we need to expand our grammar a little bit. First of all, we
+might want to bind the expressions that we use to recognise further
+expressions. For instance, when we have an `Expression ADD Expression`
+and we recognise an `Expression` there, we would like to remember the
+two subexpressions. To do so, we will add `@name` to every element of
+a rule that we would like to remember under the name `name`.
 
 ```beans-gr
 @Expression ::= Expression@left ADD Expression@right <>
@@ -195,20 +230,22 @@ like to remember under the name `name`.
 INTEGER@value <>;
 ```
-As said in the introduction to this chapter, the goal is also to extract information from tokens,
-and then dump these. The only token that holds some information is `INTEGER`, which has a single
-group (labeled `0`). We can therefore bind that group, instead of the whole token, by accessing
-it with a field-like syntax.
+As said in the introduction to this chapter, the goal is also to
+extract information from tokens, and then dump the tokens
+themselves. The only token that holds some information is `INTEGER`,
+which has a single group (labeled `0`). We can therefore bind that
+group, instead of the whole token, by accessing it with a field-like
+syntax.
 
 ```beans-gr
 @Expression ::= ...
 INTEGER.0@value <>;
 ```
-Finally, we need to remember what kind of expression each expression is. This is very similar to
-naming variants of enumerations: here, each rule bound to `Expression` is a constructor of
-`Expression`, and when we will match on `Expression`, we will need to distinguish how that
-particular instance of `Expression` was constructed.
+Finally, we need to remember what kind of expression each expression
+is. This is very similar to naming the variants of an enumeration:
+here, each rule bound to `Expression` is a constructor of
+`Expression`, and when we match on `Expression`, we will need to
+distinguish how that particular instance of `Expression` was
+constructed.
 
 ```beans-gr
 @Expression ::= Expression@left ADD Expression@right <Add>
@@ -218,6 +255,8 @@ particular instance of `Expression` was constructed.
 INTEGER.0@value <Literal>
 ;
 ```
+> Note: depending on your version of Beans, the children of a node may
+> be printed in a different order than shown below (for instance,
+> `left` before `right`, or the reverse); the structure of the tree is
+> what matters.
+
 Let's see what the tree looks like now.
 ```bash
 $ beans parse --lexer arith.lx --parser arith.gr input
@@ -231,38 +270,46 @@ Expression(Mult)
 └─ value: 2
 $
 ```
-Well, it works, but... if you stare at that syntax tree long enough, you'll realize that it was
-parsed like `(1+2)*3`, not like `1+(2*3)`. We will see in the next section how to solve this
-issue, and how ambiguity is handled in general.
+Well, it works, but... if you stare at that syntax tree long enough,
+you'll realize that it was parsed like `(1+2)*3`, not like
+`1+(2*3)`. We will see in the next section how to solve this issue,
+and how ambiguity is handled in general.
 
 # Ambiguity
 
-A grammar is said to be *ambiguous* when there exists an input that can be parsed in two
-different ways, that is, there are two *derivation tree* for that input. Most of the time, an
-ambiguity in the grammar is symptomatic of a semantic ambiguity, that is, the language that we
-are trying to parse is somehow ill defined.
-
-This is the case, for instance, of our simple arithmetic expressions. Our plain-text, intuitive
-definition of what is an arithmetic expression is *bad* because it doesn't say which of `(1+2)*3`
-or `1+(2*3)` should be understood when reading `1+2*3`, that is, it contains no operator priority
-information. But it also lacks something else, as we will see.
-
-> Note: one might wonder why Beans did not report this. After all, if it's often an actual mistake to
-> define an ambiguous grammar, it would make sense for Beans to at least warn you about that. In
-> fact, there is some work being done in that direction, but there is a fundamental issue:
-> ambiguity is undecidable, that is, there *can't exist* an algorithm which, given a grammar, tells
-> us whether it's ambiguous or not.
-> Actually, Beans will perform much better if the grammar in unambiguous, and even better if it
-> belongs to a more restrictive class of grammars called `LR(k)`. If you have ever used tools
-> like Bison, Menhir or Yacc, and you are trying to port grammars from them to Beans, good new!
-> These tools force your grammars to be in such a restricted class (to have excellent
+A grammar is said to be *ambiguous* when there exists an input that
+can be parsed in two different ways, that is, there are two
+*derivation trees* for that input. Most of the time, an ambiguity in
+the grammar is symptomatic of a semantic ambiguity, that is, the
+language that we are trying to parse is somehow ill-defined.
+
+This is the case, for instance, of our simple arithmetic
+expressions. Our plain-text, intuitive definition of an arithmetic
+expression is *bad* because it doesn't say which of `(1+2)*3` or
+`1+(2*3)` should be understood when reading `1+2*3`, that is, it
+contains no operator priority information. But it also lacks
+something else, as we will see.
+
+> Note: one might wonder why Beans did not report this. After all,
+> since defining an ambiguous grammar is often an actual mistake, it
+> would make sense for Beans to at least warn you about it. In fact,
+> there is some work being done in that direction, but there is a
+> fundamental issue: ambiguity is undecidable, that is, there *can't
+> exist* an algorithm which, given a grammar, tells us whether it's
+> ambiguous or not. Actually, Beans will perform much better if the
+> grammar is unambiguous, and even better if it belongs to a more
+> restrictive class of grammars called `LR(k)`. If you have ever used
+> tools like Bison, Menhir or Yacc, and you are trying to port
+> grammars from them to Beans, good news!
+> These tools force your grammars to be in that restricted class
+> (for the sake of
 > performance), and so will also lead to fast parsing with Beans.
 
 ## Priority
 
-The first issue is operator priority. Beans has a very simple rule to determine priority: rules
-that come first have higher priority. So, simply moving the division and multiplication rules up
-will patch our example:
+The first issue is operator priority. Beans has a very simple rule to
+determine priority: rules that come first have higher priority. So,
+simply moving the division and multiplication rules up will patch our
+example:
 
 ```beans-gr
 @Expression ::= Expression@left MULTIPLY Expression@right <Mult>
@@ -282,11 +329,13 @@ Expression(Add)
 └─ right: Expression(Literal)
    └─ value: 3
 ```
-Much better! However, we now have a more subtle issue. Usually, multiplication and division have
-the same priority, and the leftmost operator is chosen (same for addition and subtraction).
-However, as is, multiplication will be prioritized over division: `1/2*3` should be parsed
-`(1/2)*3` but will be parsed as `1/(2*3)`. To solve this, we need to merge the multiplication
-and division rules, by introducing other non terminals.
+Much better! However, we now have a more subtle issue. Usually,
+multiplication and division have the same priority, and the leftmost
+operator is chosen (same for addition and subtraction). However, as
+is, multiplication will be prioritized over division: `1/2*3` should
+be parsed as `(1/2)*3` but will be parsed as `1/(2*3)`. To solve this,
+we need to merge the multiplication and division rules by introducing
+other non terminals.
 
 ```beans-gr
 @Expression ::=
@@ -331,8 +380,13 @@ $
 ```
 They correspond, respectively, to `(1/2)*3` and `(1*2)/3`. Victory!
 
-> Note that makes the information a little more nested, which is fine for now, but will make some
-> pretty ugly pattern matching in the future. In fact, this technique produces some artifacts of
+> Note that this makes the information a little more nested, which is
+> fine for now, but will make for some pretty ugly pattern matching in
+> the future. In fact, this technique produces some artifacts of its
+> own, as the following excerpt of the merged grammar shows.
 
 ```beans-gr
 @Expression ::=
diff --git a/doc/src/details/tradeoff.md b/doc/src/details/tradeoff.md
index 07c1423..999d131 100644
--- a/doc/src/details/tradeoff.md
+++ b/doc/src/details/tradeoff.md
@@ -1,37 +1,44 @@
 # Tradeoffs
 
-Several tradeoffs have been made while developping Beans. You can find here some I remembered to
-write down.
+Several tradeoffs have been made while developing Beans. This section
+documents the main ones.
 
 # Scannerless parsing
 
-There are some parsers, called
-[scannerless parsers](https://en.wikipedia.org/wiki/Scannerless_parsing), that do not rely on a
-lexer. Indeed, a parser is *more powerful* than a lexer, meaning that anything
-that a lexer could do, a parser could also do. So, in fact, one might wonder why Beans bothers
-having a lexer at all. There are several reasons for this.
+There are some parsers, called [scannerless
+parsers](https://en.wikipedia.org/wiki/Scannerless_parsing), that do
+not rely on a lexer. Indeed, a parser is *more powerful* than a lexer,
+meaning that anything that a lexer could do, a parser could also
+do. So, in fact, one might wonder why Beans bothers having a lexer at
+all. There are several reasons for this.
 
 ## Performance
 
-The first reason for this separation is *performance*. Parsers could do what lexers do, but
-because lexing is simpler than parsing, there is more space for specific optimizations. In fact,
-Beans ships its own regex library which is tailored for the lexing use case.
+The first reason for this separation is *performance*. Parsers could
+do what lexers do, but because lexing is simpler than parsing, there
+is more room for specific optimizations. In fact, Beans ships its own
+regex library, which is tailored for the lexing use case.
 
 ## Error reporting
 
-Usually, lexing errors are very much different than syntax errors. It's
-quite rare to encounter a lexing error in practice, because it's quite hard to write invalid
-token. This means that lexing errors should be reported differently than syntax errors, and
-this would be harder (if not impossible) in scannerless parsers.
+Lexing errors are usually quite different from syntax errors. It's
+rare to encounter a lexing error in practice, because it's quite hard
+to write invalid tokens. This means that lexing errors should be
+reported differently from syntax errors, and this would be harder (if
+not impossible) in a scannerless parser.
 
-An other aspect to be taken into account is that parsers may have a recovery mode, which triggers
-when encountering a syntax error. In this special mode, the parser cannot fully understand the
-input but will try to guess how to correct the input so that it can provide better user
-feedback. This is much easier to perform if the parser works on tokens, rather than characters.
+Another aspect to be taken into account is that parsers may have a
+recovery mode, which triggers when encountering a syntax error. In
+this special mode, the parser cannot fully understand the input but
+will try to guess how to correct it so that it can provide better user
+feedback. This is much easier to do if the parser works on tokens
+rather than characters.
 
 ## Logical separation
 
-Parsing and lexing are two logically distinct steps, even though there is quite some interleaving
-in Beans. Having them kept as different steps make it easier to debug a grammar one is writing,
-as it's easier to see what happens step by step, where each step is easier than the whole parsing
-operation.
+Parsing and lexing are two logically distinct steps, even though there
+is quite some interleaving in Beans. Keeping them as separate steps
+makes it easier to debug a grammar one is writing, as it's easier to
+see what happens step by step, where each step is simpler than the
+whole parsing operation.
diff --git a/doc/src/getting-started/README.md b/doc/src/getting-started/README.md
index 68c0e0d..76d9429 100644
--- a/doc/src/getting-started/README.md
+++ b/doc/src/getting-started/README.md
@@ -1,8 +1,8 @@
 # Getting Started
 
-In order to get started with beans, you will need to following:
+In order to get started with Beans, you will need the following:
 * having Beans installed as a helper tool
 * learning Beans' concepts
- * write an interface between what other parts of your program expect your parser to give, and what
-   Beans actually provides you
+ * writing an interface between the data structures the rest of your
+   program expects the parser to produce, and the syntax tree that
+   Beans actually provides you
diff --git a/doc/src/getting-started/compile.md b/doc/src/getting-started/compile.md
index 8f4ca31..8aa1bd9 100644
--- a/doc/src/getting-started/compile.md
+++ b/doc/src/getting-started/compile.md
@@ -1,37 +1,52 @@
 # Compilation
 
-Beans can be used in two ways, which are very much related but, in practice, will require entierly
-different approaches.
-
-Beans can be used as a on-the-fly parser-generator, meaning that you expect your
-end user to give you a grammar for a language they just though of, and you have to parse files
-written in that language. This is mainly useful for
-[domain-specific languages](https://en.wikipedia.org/wiki/Domain-specific_language). An example of
-this is Beans itself, which has to parse the grammars you feed it. Since this aspect of Beans is not
-as mature as the other one, it's not the one this book will focus on.
-
-The other purpose of Beans is to be used as a regular parser-generator (think
-[Yacc](https://en.wikipedia.org/wiki/Yacc), [Bison](https://fr.wikipedia.org/wiki/GNU_Bison),
-[Menhir](http://gallium.inria.fr/~fpottier/menhir/), ...). The main difference is that, unlike these tools, Beans will
-never generate Rust code to be compiled alongside your code. Instead, it does its own compilation: it
-compiles a grammar to a binary blob, which is then included in the final binary. This means that you
-need to compile Beans grammars "by hand", using `beans`. `beans` is also useful for debugging
-purposes, as it can give you helpful insights or advices on your grammars.
+Beans can be used in two ways, which are very much related but, in
+practice, will require entirely different approaches.
+
+First, Beans can be used as an on-the-fly parser-generator. In this
+scenario, you are the developer of a program that embeds Beans, and
+your end user hands that program a grammar for a language they just
+thought of; your program then has to parse files written in that
+language at runtime. This is mainly useful for [domain-specific
+languages](https://en.wikipedia.org/wiki/Domain-specific_language). An
+example of this is Beans itself, which has to parse the grammars you
+feed it. Since this aspect of Beans is not as mature as the other one,
+it's not the one this book will focus on.
+
+The other purpose of Beans is to be used as a regular parser-generator
+(think [Yacc](https://en.wikipedia.org/wiki/Yacc),
+[Bison](https://fr.wikipedia.org/wiki/GNU_Bison),
+[Menhir](http://gallium.inria.fr/~fpottier/menhir/), ...). The main
+difference is that, unlike these tools, Beans will never generate Rust
+code to be compiled alongside your code. Instead, it does its own
+compilation: it compiles a grammar to a binary blob, which is then
+included in the final binary. This means that you need to compile
+Beans grammars "by hand", using `beans`.
+`beans` is also useful for debugging purposes, as it can give you
+helpful insights and advice on your grammars.
 
 # The grammars
 
-Beans contains two kind of grammars: the lexer grammars (extension `.lx`), and the parser grammars (extension `.gr`).
-They are written in different languages, and are compiled separatly, although the parser grammar relies on the lexer
-grammar, because the terminals defined in the lexer grammar are used in the parser grammar.
+Beans contains two kinds of grammars: the lexer grammars (extension
+`.lx`), and the parser grammars (extension `.gr`). They are written
+in different languages, and are compiled separately, although the
+parser grammar relies on the lexer grammar, because the terminals
+defined in the lexer grammar are used in the parser grammar.
 
 ## Lexer grammars
 
-The lexer grammar can be compiled with `beans compile lexer path/to/grammar.lx`. It will produce a binary blob at
+The lexer grammar can be compiled with `beans compile lexer
+path/to/grammar.lx`. It will produce a binary blob at
 `path/to/grammar.clx`.
 
 ## Parser grammars
 
-The parser grammar can be compiled with `beans compile parser --lexer path/to/grammar.clx path/to/grammar.gr`. It will
-produce a binary blob at `path/to/grammar.cgr`. Note that we had to provide a lexer grammar (so that Beans can find
-the definitions of the terminals used in the parser grammar), and in this case it was a *compiled* lexer grammar.
-A non-compiled lexer grammar will also be accepted, but the process will be slower because Beans has to interpret it.
+The parser grammar can be compiled with `beans compile parser --lexer
+path/to/grammar.clx path/to/grammar.gr`. It will produce a binary blob
+at `path/to/grammar.cgr`. Note that we had to provide a lexer grammar
+(so that Beans can find the definitions of the terminals used in the
+parser grammar), and in this case it was a *compiled* lexer grammar.
+A non-compiled lexer grammar will also be accepted, but the process
+will be slower because Beans has to interpret it.
diff --git a/doc/src/getting-started/install.md b/doc/src/getting-started/install.md
index da0c3ec..4b6e5f4 100644
--- a/doc/src/getting-started/install.md
+++ b/doc/src/getting-started/install.md
@@ -1,17 +1,22 @@
 # Installation
 
-Using Beans as a library in Rust is done as with any other library: adding it in the dependencies in the Cargo
-manifest is enough. However, as explained in the [Compiling Section](compile.md), a command line tool is also
-required to use Beans. It allows compilation of Beans grammars, some introspection and debugging information.
-There are ways to install `beans`:
 * Installing with [Nix](https://nixos.org). This is the preferred way if you already have nix.
 * Installing with Cargo. This is the preferred way is you don't have nix.
+Using Beans as a library in Rust is done as with any other library:
+adding it to the dependencies of the Cargo manifest is
+enough. However, as explained in the [Compiling Section](compile.md),
+its command line tool is also required to use Beans. It handles the
+compilation of Beans grammars, and provides some introspection and
+debugging information. There are several ways to install `beans`:
+ * Installing with [Nix](https://nixos.org). This is the preferred way
+   if you already have nix.
+ * Installing with Cargo. This is the preferred way if you don't have
+   nix.
 * Installing manually.
 
 # Nix installation
 
-Beans is flake-packaged for nix. You can find the appropriate flake at
-[Beans' repo](https://github.com/jthulhu/beans). The actual installation procedure depends on how you use nix.
+Beans is flake-packaged for nix. You can find the appropriate flake at
+[Beans' repo](https://github.com/jthulhu/beans).
+The actual
+installation procedure depends on how you use nix.
 
 # Cargo installation
 
@@ -23,15 +28,18 @@ cargo install beans
 ```
 
 # Manual compilation and installation
 
-Beans has three dependencies: Cargo, the latest rustc compiler and make. Optionally, having git makes it easy to
-download the source code.
+Beans has three dependencies: Cargo, the latest rustc compiler and
+make. Optionally, having git makes it easy to download the source
+code.
 
 ## Downloading the source code
 
-If you have git, you can `git clone https://github.com/jthulhu/beans` which will download a copy of the source code
-in the directory `beans`.
+If you have git, you can run `git clone https://github.com/jthulhu/beans`,
+which will download a copy of the source code into the directory
+`beans`.
 
-Otherwise, you need to download it from the [git repository](https://github.com/jthulhu/beans).
+Otherwise, you need to download it from the [git
+repository](https://github.com/jthulhu/beans).
 
 ## Building and installing the source code
 
@@ -39,5 +47,9 @@ Once the `beans` directory entered, you simply need to run
 ```bash
 make install
 ```
-This will install a single binary at `/usr/local/bin/beans`. You can overwrite the target destination using the
-environment variables `DESTDIR` and `PREFIX`.
+This will install a single binary at `/usr/local/bin/beans`. You can
+override the target destination using the environment variable
+`PREFIX`, e.g.:
+```bash
+make PREFIX=$HOME install
+```
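+
+Once installed, check that the binary is reachable from your `PATH`.
+A Makefile `PREFIX` conventionally maps to `$PREFIX/bin` for binaries,
+so with `PREFIX=$HOME` as above, `beans` should land in `$HOME/bin`
+(which may need to be added to your `PATH`):
+```bash
+which beans
+```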