Yet Another Parser Parser

Overview

YAPP XSLT is a lexical scanner and recursive descent parser generator, implemented in XSLT. No language extensions or non-standard features are used apart from the nodeset() function. Grammars are expressed in XML form and transformed by the generator stylesheet into another XSLT. A lexical scanner may also be generated from the same grammar.

The generated parser stylesheet has named templates for each construct in the grammar. It also incorporates any xslt code that was declared in the grammar, and can thus be used to directly process input files containing unparsed text.

The package also consists of a grammar for BNF that can be compiled into a parser capable of creating YAPP grammars in XML form from BNF notation, and a stylesheet that performs left-recursion elimination.

The Grammar

A YAPP XML grammar takes the (informal) form:

<grammar> <import href="..."/> * <ignore char="."/> * <terminal name="..."> * ... </terminal> <construct name="..."> * <option> * <part name="..."/> * </option> </construct> <templates> * ... </templates> </grammar>

Terminals are declared both for the lexical scanner and parser generator.

The ignore element is an instruction to the lexical scanner to skip all occurences of a single character.

A construct is a grammar production, with the option elements being production rules.

The templates and import elements are used to control the resulting stylesheet (the parser). Any child nodes of a templates element will be copied straight into the top level of the parser stylesheet. All import declarations will be inserted at the head of the stylesheet as xsl:import elements.

Construct options (production rules) are made up from part elements that simply name another construct or terminal.

Here's a simple example from arithmetics:

<grammar> <terminal name="number">...</terminal> <terminal name="plus">...</terminal> <terminal name="minus">...</terminal> <construct name="AdditiveExpression"> <option> <part name"number"/> </option> <option> <part name="AdditiveExpression"/> <part name"plus"/> <part name"number"/> </option> <option> <part name="AdditiveExpression"/> <part name"minus"/> <part name"number"/> </option> </construct> </grammar>

Terminals and lexical scanners

Terminals can be defined in the following ways: as a set of characters, a specific string, as a group of characters delimited by given start and end patterns, or as the end terminal. The end terminal is returned by the lexical scanner when no more input is available.

<terminal> <end/> ? <equals>text</equals> * <any>chars</any> * <delimited> * <begin>text</begin> <end>text</end> </delimited> </terminal>

To generate a lexical scanner from the grammar given in the first example, replace the empty terminal definitions with:

<terminal name="number"> <any>.0123456789</any> </terminal> <terminal name="plus"> <equals>+</equals> </terminal> <terminal name="minus"> <equals>-</equals> </terminal>

Now add an end terminal and a start construct:

<terminal name="end"> </end> </terminal> <construct name="Expression"> <option> <part name="AdditiveExpression"/> <part name="end"/> </option> </construct>

The resulting grammar file is complete and can be compiled into a lexical scanner and parser. To call the parser from an XSL stylesheet, use something like:

<xsl:call-template name="p:Expression"> <xsl:with-param name="in" select="'123 + 456 - 789'"/> </xsl:call-template>

The problem with left-recursion

However, our arithmetic parser will never succeed in parsing any additive expressions, because of the way we wrote the rule using left-recursion: The first part of an AdditiveExpression might be.. an AdditiveExpression. Hence the recursive descent parser will be stuck in infinite recursion.

Luckily, left-recursive rules can be replaced with right-recursive counterparts. To save you the hassle of changing your grammars, YAPP comes with an XSL stylesheet that will do it for you.

Parser output

The generated parser will produce a balanced expression tree, with each terminal or non-terminal as an element called term:

<term name="..."> <term name="...">* </term>

Terminals come with a text value of that individual token, whereas non-terminals will contain all terms (if any) that made up the production. This is what an additive expression might look like:

<term name="Expression"> <term name="AdditiveExpression"> <term name="number">123</term> <term name="AdditiveExpression"> <term name="plus">+</term> <term name="number">456</term> </term> </term> <term name="end"> </term>

The file bnf-grammar.xml has a couple of generic templates that will turn term elements into something more readable:

<Expression> <AdditiveExpression> <number>123</number> <AdditiveExpression> <plus>+</plus> <number>456</number> </AdditiveExpression> </AdditiveExpression> <end/> </Expression>

BNF parser

BNF (Backus Naur Form) is a very convenient way of expressing grammars. The arithmetic example above could be expressed in BNF form as follows:

plus ::= '+'; minus ::= '-'; number ::= [.0123456789]; Expression ::= AdditiveExpression end; AdditiveExpression ::= number | AdditiveExpression minus number | AdditiveExpression plus number;

Most people will probably find this much easier to both read and write than the YAPP XML format. To make it really easy to write grammars, YAPP comes with a grammar for BNF that will produce a parser that reads BNF grammars, parses and translates into YAPP XML form.

Using YAPP

YAPP consists of the following parts:

generator.xsl

the parser generator

tokenizer.xsl

lexical scanner generator

eliminator.xsl

left-recursion eliminator

bnf-grammar.xml

BNF grammar in YAPP XML form

To run the stylesheets, you will need an XSLT engine. The stylesheets are written using the Xalan nodeset extension (most XSLT engines have an equivalent nodeset function). To find more information about or download Xalan, please see xml.apache.org.

The stylesheets generator.xsl and tokenizer.xsl both require YAPP XML grammars as input, and produce as output a parser and lexical scanner, respectively.

To run the parser generator, run Xalan (or any other XSLT engine) from the command line with something similar to this (instead of 'xalan' you might have to write 'java -jar xalan.jar'):

xalan -XSL generator.xsl -IN bnf-grammar.xml -OUT bnf-parser.xsl

This will create a file called bnf-parser.xsl that is a top-down parser of the language expressed in bnf-grammar.xml, ie BNF. Next you will probably want to create the lexical scanner:

xalan -XSL tokenizer.xsl -IN bnf-grammar.xml -OUT bnf-lexer.xsl

Now you've got both a parser and a scanner, you can run them like this:

xalan -XSL bnf-parser.xsl -IN MyGrammar.bnf

Provided that MyGrammar.bnf contains a valid BNF grammar (inside a bnf element) this will produce a YAPP grammar as output. Of course, while here the result is another grammar the procedure is the same regardless what grammar you use as input.

Download and Licence

YAPP XSLT is free software and is published under the GNU General Public Licence which comes as part of the download. To download a copy please see the download page.

Known Issues and Problems

The parsers generated by YAPP XSLT are not capable of any error reporting. This is a common problem with recursive descent parsers.

The generated lexical scanners operate on a 'first-match' basis when identifying tokens. The consequence of which is that when a character pattern (any) competes with a direct match (equals), the direct match will win even if the pattern matched token would have been longer. This may be resolved in a future release.

BNF productions can't mix terminals and terms. That is because all terminals have to have names - a terminal declared as part of a composite production would be 'anonymous'.