## GENERAL NOTES ABOUT GRAMMARS AND NOM

General notes about formal language grammars and nom.

In the Medieval period in Europe the four pillars of education were Rhetoric, Logic, Theology and Grammar. Today this seems a strange classification of knowledge and study, but Logic and Grammar are the basis of computer science. I should state here that I am not a great formal-language theorist or high-level mathematician who can give insights into grammar theory. This page is just a set of heuristic notes about things I have discovered while parsing languages with the pep/nom system.

### LOOK AHEAD

You can use the begins-with and ends-with modifiers to achieve token lookahead in [nom]. The scripts /eg/maths.parse.pss and /eg/maths.tolatex.pss have good examples of using lookahead. The document /doc/howto/nom.lookahead.html has detailed information about how to write lookahead into a nom script. There is also the pep://peep register, which provides a single character of look-ahead and which is not used directly by the script writer. This register is used by the nom://while and nom://whilenot commands to read the input-stream only while the next character matches (or does not match) the given criteria.

### ASSOCIATIVITY

This is how we parse "a + b + c". Do we parse this as "(a + b) + c" (left-associative) or as "a + (b + c)" (right-associative)? With nom we need to build this associativity into the grammar. See the "arithmetic parser" example nomsf://eg/maths.parse.pss for an example of associativity (with +/- signs which associate right).

### PRECEDENCE

This is how we parse, for example, "a + b * c". We say that certain operators have *precedence* over others. Do we parse this as "(a + b) * c" ('+' has precedence) or as "a + (b * c)" ('*' has precedence)? Again, this has to be factored into the nom grammar. It's not that difficult.
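As a language-neutral illustration (a Python sketch, not pep/nom code), the two associativity choices and the precedence choice above can be shown by building bracketed parse strings from a flat token list. The function names here are made up for this example:

```python
# Hypothetical illustration of associativity and precedence decisions.
# Input is a flat token list like ["a", "+", "b", "+", "c"].

def parse_left(tokens):
    """Left-associative: fold operators as soon as possible."""
    tree = tokens[0]
    i = 1
    while i < len(tokens):
        tree = f"({tree} {tokens[i]} {tokens[i+1]})"
        i += 2
    return tree

def parse_right(tokens):
    """Right-associative: recurse on the rest of the list first."""
    if len(tokens) == 1:
        return tokens[0]
    return f"({tokens[0]} {tokens[1]} {parse_right(tokens[2:])})"

def parse_prec(tokens, prec={"+": 1, "*": 2}):
    """Precedence climbing: '*' binds tighter than '+'."""
    def expr(min_prec, pos):
        lhs, pos = tokens[pos], pos + 1
        while pos < len(tokens) and prec[tokens[pos]] >= min_prec:
            op = tokens[pos]
            rhs, pos = expr(prec[op] + 1, pos + 1)
            lhs = f"({lhs} {op} {rhs})"
        return lhs, pos
    return expr(1, 0)[0]

print(parse_left(["a", "+", "b", "+", "c"]))   # ((a + b) + c)
print(parse_right(["a", "+", "b", "+", "c"]))  # (a + (b + c))
print(parse_prec(["a", "+", "b", "*", "c"]))   # (a + (b * c))
```

The same decisions have to be encoded somewhere in a nom grammar: which token sequences reduce first determines both associativity and precedence.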
### PARSER GENERATORS

A "parser generator" wp://comparison.of.parser.generators is basically a way to avoid having to write a "recursive descent parser" wp://recursive.descent.parser (and compiler), or some other style of hand-coded compiler. Lex, Yacc, Bison, Antlr and many others are examples of parser generators, and they seem to be widely used for simple languages. An old example would be the *EQN* language for formatting mathematical expressions.

### KLEENE STAR

The *kleene star* is a grammar construct which basically means "zero or more of something". This exists in regular expressions.

* example kleene star in a sed regular expression
>> s/ab*c//

This means: 'a' followed by zero or more 'b's followed by 'c'. The same construct is used in [ebnf] grammars (check!). The rule below means that a 'statement' is a 'keyword' followed by zero or more 'parameters' followed by a semicolon.

* kleene star in ebnf grammar rule
>> statement = keyword parameter* ';' ;

How can we translate this into [nom]? We cannot create a single nomsyn://block which represents the rule above, but by using several blocks we can create a *nom* fragment which represents it. The code below is a complete "lexer" and "recogniser" for statements such as those shown below. Notice that white-space is completely ignored except in parameters.

* example of statements recognised below
----
say 'good'; fizz;
bang 'crash' ;
whirr 'ok' 'okay';
grow'big' 'small' 'tiny' ;
bird 'tweet' 'fly ' 'nest' ;
,,,,

The script that follows is a *recogniser* because it doesn't actually compile/transpile/transform the input text; it just determines whether the input text consists of valid "statements". Also notice that the single [ebnf] rule with a *kleene star* is broken up into 3 [nom] "blocks" nomsyn://block (statements between braces).
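For comparison, the sed pattern above can be checked with an ordinary regex engine; here is the same `ab*c` pattern in Python's `re` module:

```python
import re

# The sed pattern /ab*c/ : 'a', then zero or more 'b's, then 'c'.
pattern = re.compile(r"ab*c")

print(bool(pattern.search("ac")))     # True  (zero 'b's)
print(bool(pattern.search("abbbc")))  # True  (many 'b's)
print(bool(pattern.search("axc")))    # False
```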
* equivalent ebnf rules as used by nom
-----
statement = keyword ';' ;
statement = keyword parameter ';' ;
statement = keyword parameterset ';' ;
parameterset = parameter parameter ;
parameterset = parameterset parameter ;
,,,,

Although there are 5 ebnf rules above, [nom] only requires 3 blocks because it can use *OR logic* concatenation in the tests (using the ',' comma character).

* nom script for the ebnf rule with a kleene star
-----
# "lexing"
read;
# literal token, the character is also the parse-token
';' { add "*"; push; }
[:space:] { while [:space:]; clear; }
[:alpha:] { while [:alpha:]; put; clear; add "keyword*"; push; }
"'" { until "'"; put; clear; add "parameter*"; push; }
!"" { put; clear; add "?? bad char ["; get; add "]\n"; print; quit; }

# The nom version of the ebnf rule
#   statement = keyword parameter* ';' ;
parse>

# show the stack reductions for debugging
# add "line "; lines; add " char "; chars; add ": "; print; clear;
# unstack; print; stack; add "\n"; print; clear;

pop; pop;
"keyword*;*" {
  clear; add "statement*"; push; .reparse
}
"parameter*parameter*","paramset*parameter*" {
  clear; add "paramset*"; push; .reparse
}
pop;
"keyword*parameter*;*","keyword*paramset*;*" {
  clear; add "statement!\n"; print;
  clip; clip; add "*"; push; .reparse
}
push;push;push;
,,,,

### MORE KLEENE STAR ANALYSIS

Take a simple regular expression and attempt to translate it into nom.

* a 'zero or more' asterisk in a regular expression
>> ab*c

This needs to be recognised with a much more verbose nom script. The script below will no doubt seem ridiculously verbose, but the strength of pep/nom is not in parsing regular languages.
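For readers more comfortable with a general-purpose language, the same statement grammar (`statement = keyword parameter* ';'`) can be recognised with a small hand-written Python state machine. This is a hypothetical sketch, not part of pep/nom; the regex, function name and states are invented for this illustration:

```python
import re

# Tokens, as in the nom lexer: keywords are alphabetic runs, parameters
# are single-quoted strings, ';' ends a statement, white-space ignored.
TOKEN = re.compile(r"\s*(?:(?P<keyword>[A-Za-z]+)|(?P<parameter>'[^']*')|(?P<semi>;))")

def recognise(text):
    pos, state = 0, "start"            # "start": expecting a keyword
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            # only trailing white-space may remain
            return text[pos:].strip() == ""
        pos = m.end()
        kind = m.lastgroup
        if state == "start" and kind == "keyword":
            state = "params"           # zero or more parameters may follow
        elif state == "params" and kind == "parameter":
            pass                       # stay in "params": the kleene star
        elif state == "params" and kind == "semi":
            state = "start"            # statement complete
        else:
            return False
    return state == "start"

print(recognise("say 'good'; fizz; bang 'crash' ;"))  # True
print(recognise("say 'good'"))                        # False (no semicolon)
```

The "stay in the same state" self-loop on parameters is exactly what the kleene star expresses, and is what the nom version has to simulate with its `paramset*` reductions.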
* the equivalent in nom
-----
read;
"a" { put; clear; add "A*"; push; }
"b" { put; clear; add "B*"; push; }
"c" { put; clear; add "C*"; push; }
!"" { clear; add "no"; print; quit; }

parse>
pop;pop;
"B*B*" { clear; get; ++; get; --; put; clear; add "B*"; push; }
"A*C*" { clear; get; ++; get; --; put; clear; add "pattern*"; push; .reparse }
pop;
"A*B*C*" {
  clear; get; ++; get; ++; get; --; --; put; clear;
  add "pattern*"; push; .reparse
}
(eof) {
  !"pattern*" { clear; add "no"; print; quit; }
  "pattern*" { clear; add "yes"; print; quit; }
}
push;push;push;
,,,,

### PARSE TOKEN MATCHING OR RECOGNISING

Nom can only match or recognise a fixed number of parse-tokens at any point in the script. This is because it needs to nom://pop the parse-tokens off the pep://stack before it can match parse-token sequences. This is why ebnf rules of the form

-----
a = b+ c ;
a = c b* ;
,,,,

need to be broken up into several [nom] blocks: b+ and b* represent a variable number of parse tokens.

### RECURSIVE DESCENT PARSING

LL grammars (same?). Although grammar theorists talk about all sorts of different types of parsing, in practice the type of parser written by jobbing programmers seems to be *recursive descent*. There is a very surprising statement on Robert Nystrom's blog that all compilers for modern languages in common use are recursive descent. Mr. Nystrom is involved in creating (designing?) the [dart] language and has written a great "book" www.craftinginterpreters.com about parsing and compiling, so I am pretty convinced he knows what he is talking about. Recursive descent is also the type of parser/compiler that seems to be taught in university Comp-Sci courses. Niklaus Wirth of Pascal fame was a strong proponent of this type of parser/compiler. I don't wish to criticise recursive descent parsers, because they obviously work, but it does seem odd to me that this is still the main way to recognise and translate formal languages.
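To make the contrast with nom's shift-reduce style concrete, here is a minimal recursive-descent sketch in Python (hypothetical, not from Nystrom's book or from pep/nom) for a small expression grammar. Each grammar rule becomes one function that consumes tokens left to right and calls the functions for its sub-rules:

```python
# Grammar (with '*' and '/' binding tighter than '+' and '-'):
#   expr   = term   (('+'|'-') term)*
#   term   = factor (('*'|'/') factor)*
#   factor = NAME | '(' expr ')'

def parse_expr(tokens):
    tree, rest = _expr(list(tokens))
    assert not rest, f"trailing tokens: {rest}"
    return tree

def _expr(t):
    node, t = _term(t)
    while t and t[0] in "+-":
        op, t = t[0], t[1:]
        rhs, t = _term(t)
        node = (op, node, rhs)          # left-associative fold
    return node, t

def _term(t):
    node, t = _factor(t)
    while t and t[0] in "*/":
        op, t = t[0], t[1:]
        rhs, t = _factor(t)
        node = (op, node, rhs)
    return node, t

def _factor(t):
    if t[0] == "(":
        node, t = _expr(t[1:])
        return node, t[1:]              # skip the closing ')'
    return t[0], t[1:]                  # a NAME token

print(parse_expr(["a", "+", "b", "*", "c"]))  # ('+', 'a', ('*', 'b', 'c'))
```

Notice how precedence falls out of the call structure (`_expr` calls `_term` calls `_factor`), whereas nom expresses the same thing through the order and shape of its stack reductions.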
[nom] does not do recursive descent parsing: in fact I wrote *nom* because I wanted a way to understand the grammars of languages that was more "natural" for my way of thinking. The [pep] & [nom] system does google:"shift reduce parsing" but it does it purely as a (unix-style) text-stream filter.

The pep/nom lexing/parsing/compiling process is as follows. In the *lexical analysis* phase of the nom script, pep/nom "reads" nom://read the input stream (more or less) one Unicode character at a time, and creates parse-tokens and their "attributes" in a "text buffer" pep://workspace. The parse-tokens are "pushed" nom://push onto a pep://stack and the "attributes" (the partially compiled/translated text) are nom://put into the pep://tape.

Then, during the parsing or "shift-reduce" phase of the script, the parse-tokens are "popped" nom://pop off the stack back into the pep://workspace buffer. If a particular parse-token sequence is *matched* or recognised, then the *workspace* is "cleared" nom://clear and the token attributes are "got" nom://get from the pep://tape, compiled, and then *put* again onto the *tape*. Then the *workspace* (string) buffer is "cleared" nom://clear again and the new parse-token is created in the workspace with the nom://add command. Finally, the new parse-token or tokens are "pushed" nom://push onto the *stack*.

This entire process is *text-oriented*, meaning that everything is a "string", including the parse-tokens; the parse-tokens and their attributes are manipulated in the same virtual pep://machine with the same commands.

### RECURSION IN GRAMMARS

* Left recursive grammars
-----
E := E + E
E := T
,,,,

### GRAMMAR AND SCRIPT CONSTRUCTION

My knowledge of formal language grammar theory is quite limited. I am more interested in practical techniques. But there is a reasonably close correlation between bnf-type grammar rules and pep/nom script construction.
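The shift-reduce cycle described above can be sketched in a few lines of Python. This is a toy model, not the real pep/nom machine: tokens are strings on a stack, their attributes ride along as a second field (standing in for the tape), and a reduction pops a matched token sequence and pushes the new token:

```python
# Hypothetical toy shift-reduce machine. RULES maps a right-hand-side
# token sequence to the left-hand-side token it reduces to.
RULES = {
    ("article", "noun"): "nounphrase",   # nounphrase ::= article noun
}

def shift_reduce(tokens):
    stack = []
    for tok, attr in tokens:
        stack.append((tok, attr))        # "shift": push the new token
        reduced = True
        while reduced:                   # keep reducing, like .reparse
            reduced = False
            for rhs, lhs in RULES.items():
                n = len(rhs)
                if tuple(t for t, _ in stack[-n:]) == rhs:
                    attrs = " ".join(a for _, a in stack[-n:])
                    del stack[-n:]       # "pop" the matched tokens
                    stack.append((lhs, attrs))  # push the reduction
                    reduced = True
    return stack

print(shift_reduce([("article", "the"), ("noun", "cat")]))
# [('nounphrase', 'the cat')]
```

The difference in real pep/nom is that the stack, tape and workspace are registers of one text-oriented virtual machine, and the "rules" are the quoted tests before brace blocks.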
The right-hand-side of a (E)BNF grammar rule is represented by the quoted text before a brace block, and the left-hand-side correlates to the new token pushed onto the stack.

* the rule "&lt;nounphrase&gt; ::= &lt;article&gt; &lt;noun&gt; ;" in a parse script
>> "article*noun*" { clear; add "nounphrase*"; push; }

### TERMINOLOGY

Terminology: productions (same as rules?), terminals, non-terminals, tokens. Tokens are the same as terminals? BNF: Backus-Naur form.

### AUTOMATONS

There is a relationship between virtual machines, automatons, grammars and formal languages. wp://Deterministic_finite_automaton
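As a small illustration of that relationship, here is a hypothetical deterministic finite automaton (DFA), written in Python, for the language ab*c discussed earlier; the state numbers and table are invented for this sketch:

```python
# DFA for ab*c.  States: 0 = start, 1 = seen 'a' (plus any 'b's),
# 2 = accept.  Missing table entries mean "reject".
DELTA = {
    (0, "a"): 1,
    (1, "b"): 1,   # the kleene star: 'b' loops on state 1
    (1, "c"): 2,
}

def accepts(s):
    state = 0
    for ch in s:
        state = DELTA.get((state, ch))
        if state is None:
            return False
    return state == 2

print(accepts("ac"))     # True
print(accepts("abbbc"))  # True
print(accepts("abcb"))   # False
```

A regular expression, this transition table, and a right-linear grammar are three notations for the same language; nom's stack makes it strictly more powerful than a DFA, which is why it can also handle nested (context-free) constructs.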