Parsing absolutely anything in JavaScript using Earley algorithm
Let me start by saying — I was surprised how easy it was to write grammar for an Earley parser. I have been using regular expressions for over a decade. And I am used to parse things using regular expressions. Its fragile, its not always possible, etc. But it is fast and for the most part it serves the purpose.
Familiarising with parsing algorithms changed this attitude forever.
This is a long article. Therefore, I have used the He Man meme to keep you entertained throughout the journey. I promise you an all-powerful weapon at the end of the article.
The reason for writing a parser
I am working on a declarative HTML scraper. The syntax of the scraper depends on a custom DSL, which is an extension of the CSS3 selector specification.
Here is an example of a scraper manifest used to declare what to parse, and how to validate and format the resulting data:
selector: body
properties:
title: title
articles:
selector: article {0,}
properties:
body: .body::property(innerHTML)
summary: .body p {0,}[0]
imageUrl: img::attribute(src)
title: .title::text()::test(/foo/)::format(upperCase)
There is a bunch of stuff going on there that isn’t part of the CSS spec:
- A quantifier expression (
{0,}
) - An accessor expression (
[0]
) - An XPath-esque function invocation (
::test(/foo/)::format(upperCase)
)
I needed a way to parse it.
Picking the right tool for the job
My first thought was to use regular expressions. In fact, I have used regular expressions to write a prototype of the parser. After all, in the early stages of the program design, you need to be able to quickly prototype the solution; the prototype stage is not the time to be anal about the edge cases.
This is not to say that regexp cannot be used as a parser. Regular expressions can be used to parse regular languages; CSS selectors are context-free.
By the way, if terms “context-free” or “regular language” do not make much sense, I recommend reading The true power of regular expressions (5 minutes worth of reading).
However, for the production releases, I needed a strict, extendable parser.
I have started looking for parsers in JavaScript and I found Jison and PEG.js. However, neither of the algorithms support left-recursion. I wanted a parser that supports left-recursions!
I kid you not — I didn’t even know what left-recursion is at the time of making this decision. However, I found it odd that it was emphasised that these algorithms to do not support it. Nevertheless, it was a good hunch — as I have learned later, left-recursion allows to keep the parser grammar simple and it can be a lot more performant.
Long story short, on the second page of Google search for “JavaScript parser” I found http://nearley.js.org/, an implementation of Earley parsing algorithm.
The author describes it as:
Earley parsers are awesome, because they will parse anything you give them. Depending on the algorithm specified, popular parsers such as lex/yacc, flex/bison, Jison, PEGjs, and Antlr will break depending on the grammar you give it. And by break, I mean infinite loops caused by left recursion, crashes, or stubborn refusals to compile because of a “shift-reduce error”.
– Better Earley than never (http://hardmath123.github.io/earley.html)
It sounded like a skill I want to learn.
So I continued reading.
Setup
Install nearley
package.
$ npm install nearley
nearley
consists of the main package (the parser API) and several CLI programs:
$ ls -1 ./node_modules/.bin
nearley-railroad
nearley-test
nearley-unparse
nearleyc
These programs are:
nearley-railroad
is used to generate railroad diagrams.nearley-test
is used to test an arbitrary input against the compiled grammar.nearley-unparse
is used to generate random strings that satisfy the grammar.nearleyc
is used to compile Nearley grammar to JavaScript script.
To make these programs available to your shell, add
./node_modules/.bin
to your$PATH
(export PATH=./node_modules/.bin:$PATH
) or installnearley
with--global
option.
We are only going to be using nearleyc
and nearley-test
.
Parsing “1+2+3”
A parser needs a grammar to parse the input.
The Earley algorithm parses a string based on a grammar in Backus-Naur Form (BNF). A BNF grammar consists of a set of production rules, which are expansions of nonterminals.
A grammar to parse “1+2+3” input is:
expression -> "1+2+3"
In layman terms, this grammar says: match “1+2+3” as “expression”.
A nonterminal is a construction of the language. A nonterminal has a name (expression
) and a list of production rules. A production rule defines what to match. A production rule consists of series of either other nonterminals or strings (1+2+3
is a production rule consisting of a single terminal).
Note:
expression
is an arbitrary name. It does not have a semantic meaning.
Testing grammar
To test it, compile the grammar using nearleyc
:
$ cat <<'EOF' > ./grammar.ne
expression -> "1+2+3"
EOF
$ nearleyc ./grammar.ne --out ./grammar.js
Instruct nearley-test
to use the resulting ./grammar.js
to parse an input:
nearley-test ./grammar.js --input '1+2+3'
Table length: 6
Number of parses: 1
Parse Charts
Chart: 0
0: {expression → ● expression$string$1}, from: 0
1: {expression$string$1 → ● "1" "+" "2" "+" "3"}, from: 0Chart: 1
0: {expression$string$1 → "1" ● "+" "2" "+" "3"}, from: 0Chart: 2
0: {expression$string$1 → "1" "+" ● "2" "+" "3"}, from: 0Chart: 3
0: {expression$string$1 → "1" "+" "2" ● "+" "3"}, from: 0Chart: 4
0: {expression$string$1 → "1" "+" "2" "+" ● "3"}, from: 0Chart: 5
0: {expression$string$1 → "1" "+" "2" "+" "3" ● }, from: 0
1: {expression → expression$string$1 ● }, from: 0Parse results:
[ [ '1+2+3' ] ]
Hooray! our program parsed a string literal “1+2+3”.
Table of partial parsings
It is important to understand the output of nearley-test
, because this is the tool you will be using to debug the grammars.
Earley works by producing a table of partial parsings, i.e. the nth column of the table contains all possible ways to parse s[:n]
, the first n
characters of s
.
Chart: 0
0: {expression → ● expression$string$1}, from: 0
1: {expression$string$1 → ● "1" "+" "2" "+" "3"}, from: 0
In the above example, “Chart: 0” is the first column. It shows that we have a single nonterminal expression
that can be equal to expression$string$1
.
As far as I understand,
expression$string$1
is just a temporary variable used to represent terminal structures to avoid repeating them. More about that later.
•
is a marker used to denote how far we have parsed. At the moment, we are at the beginning of the string.
As we progress, we continue to match the terminal character by character.
Chart: 1
0: {expression$string$1 → "1" ● "+" "2" "+" "3"}, from: 0Chart: 2
0: {expression$string$1 → "1" "+" ● "2" "+" "3"}, from: 0Chart: 3
0: {expression$string$1 → "1" "+" "2" ● "+" "3"}, from: 0Chart: 4
0: {expression$string$1 → "1" "+" "2" "+" ● "3"}, from: 0Chart: 5
0: {expression$string$1 → "1" "+" "2" "+" "3" ● }, from: 0
If the entire terminal is matched, the program produces a token for the match.
1: {expression → expression$string$1 ● }, from: 0Parse results:
[ [ '1+2+3' ] ]
To broaden your understanding, add another terminal to the expression
production rule and use nearley-test
to test different inputs, e.g.
expression ->
"1+2+0"
| "1+2+3"
Multiple production rules (nonterminal expansions) are separated using the vertical bar (
|
) character.
If you are still feeling lost on the subject, then Better Earley than never article takes you through an equivalent journey documenting every step.
Postprocessors
If you are paying a close attention, you have noticed that the result of the program (the “1+2+3”) is contained in an array, which itself is contained in an array.
[
[
'1+2+3'
]
]
There are two reasons for this.
The first reason is that each production rule of a nonterminal is a list itself. In the original example, this list contains a single terminal “1+2+3”. We could have written it as:
expression ->
"1" "+" "2" "+" "3"
Note: white spaces surrounding the nonterminals and terminals have no meaning in the grammar declaration.
This grammar is equivalent to the original example. However, the result is different:
[
[
'1',
'+',
'2',
'+',
'3'
]
]
This is because Earley algorithm operates on characters. Internally, nearley
converts strings longer than one character into a parser rule declared as a series of symbols (string literals) that comprise the original string. If this parser rule is matched, the resulting series of symbols are joined back into the original string.
Tip: Have a look at the compiled grammar file (
grammar.js
) to further understand the internal format of the parser rules.
We can use a postprocessor to do this ourself.
Seriously, whats a postprocessor?
A postprocessor is a JavaScript function that is used to process the result of a production rule. A postprocessor can format the result or it can reject the match. A postprocessor rule is declared after the production rule surrounded by {% … %}
, e.g.
expression ->
"1" "+" "2" "+" "3" {% d => d.join('') %}
EOF
A postprocessor is invoked with 3 parameters. The first parameter is a list of all the matches. In the above example, thats: 1
, +
, 2
, +
, 3
.
The result of the postprocessor becomes the value of the production rule. In the above example, a parser executed with “1+2+3” input will produce:
[
'1+2+3'
]
We are going to come back to the other two parameters later in the article. If you are feeling impatient, refer to the official documentation.
Using a nonterminal inside a nonterminal
Lets return to the definition of nonterminal I gave earlier:
A nonterminal is a construction of the language. A nonterminal has a name and a list of production rules. A production rule defines what to match. A production rule consists of series of either other nonterminals or strings
That means we can declare a nonterminal for matching the numbers and a nonterminal to match the mathematical symbols.
expression ->
N MS N MS N {% d => d.join('') %}MS ->
"+" {% d => d[0] %}
| "-" {% d => d[0] %}N ->
"1" {% d => d[0] %}
| "2" {% d => d[0] %}
| "3" {% d => d[0] %}
| "4" {% d => d[0] %}
| "5" {% d => d[0] %}
| "6" {% d => d[0] %}
| "7" {% d => d[0] %}
| "8" {% d => d[0] %}
| "9" {% d => d[0] %}
| "0" {% d => d[0] %}
Hopefully, this is self-explanatory.
Recursive patterns
As it is, the N
nonterminal from the previous example can parse only single digit numbers. However, we can refer toN
itself in the production rules.
Once or more
N ->
"0" {% d => d[0] %}
| "1" {% d => d[0] %}
| "2" {% d => d[0] %}
| "3" {% d => d[0] %}
| "4" {% d => d[0] %}
| "5" {% d => d[0] %}
| "6" {% d => d[0] %}
| "7" {% d => d[0] %}
| "8" {% d => d[0] %}
| "9" {% d => d[0] %}
| "0" N {% d => d[0] + d[1] %}
| "1" N {% d => d[0] + d[1] %}
| "2" N {% d => d[0] + d[1] %}
| "3" N {% d => d[0] + d[1] %}
| "4" N {% d => d[0] + d[1] %}
| "5" N {% d => d[0] + d[1] %}
| "6" N {% d => d[0] + d[1] %}
| "7" N {% d => d[0] + d[1] %}
| "8" N {% d => d[0] + d[1] %}
| "9" N {% d => d[0] + d[1] %}
This is simply telling the parser that an N
is one of [0–9]
or N
is one of [0–9]
followed by N
.
Zero or more
Similar to the above, we can implement a zero or more quantifier.
To match nothing we use epsilon rule. An epsilon rule is denoted using null
.
N ->
null
| "0" N {% d => d[0] + d[1] %}
| "1" N {% d => d[0] + d[1] %}
| "2" N {% d => d[0] + d[1] %}
| "3" N {% d => d[0] + d[1] %}
| "4" N {% d => d[0] + d[1] %}
| "5" N {% d => d[0] + d[1] %}
| "6" N {% d => d[0] + d[1] %}
| "7" N {% d => d[0] + d[1] %}
| "8" N {% d => d[0] + d[1] %}
| "9" N {% d => d[0] + d[1] %}
This is telling the parser that an N
is nothing or one of [0-9]
followed by N
.
These are examples of a BNF recursive repeat construct.
Character sets and quantifiers
If you are thinking [0–9]
and quantifiers then you are in luck: nearley
supports character sets and quantifiers (+*?
).
Note: This is
nearley
specific feature.nearley
compiles character sets and quantifiers into BNF.
Therefore, the above once or more example can be rewritten as:
N ->
[0-9]:+ {% d => d[0].join('') %}
The above zero or more example can be rewritten as:
N ->
[0-9]:* {% d => d[0].join('') %}
Recap
Parsing a literal “1+2+3” has taught us:
- what is a nonterminal
- what is a terminal
- what is a production rule
- how to debug grammar
- what is a
nearley
postprocessor - what are character sets
- what are quantifiers
- what is a BNF recursive repeat construct
I know what you are thinking: all this to parse a literal string “1+2+3”? I can parse it using regex (\d+\+?)
.
Parsing (and evaluating) mathematical expressions
Here is a grammar to parse any valid mathematical expression that consists of addition, subtraction, multiplication and division actions.
main -> AS {% d => d[0] %}# Parentheses
P -> "(" AS ")" {% d => d[1] %}
| N {% d => d[0] %}# Multiplication and division
MD -> MD "*" P {% d => d[0]*d[2] %}
| MD "/" P {% d => d[0]/d[2] %}
| P {% d => d[0] %}# Addition and subtraction
AS ->
AS "+" MD {% d => d[0]+d[2] %}
| AS "-" MD {% d => d[0]-d[2] %}
| MD {% d => d[0] %}N ->
[0-9]:+ {% d => parseInt(d[0].join(''), 10) %}
You already know all components that make this grammar. Understanding it is only a matter of compiling the grammar and using nearley-test
to debug several expressions, e.g. 10/((1+4)*8)*4
yields 1
.
Ambiguity
When I introduced the postprocessors, I talked about how the result is contained in an array of arrays.
[
[
'1+2+3'
]
]
I have explained one level of nesting – its because a production rule is a list.
However, after accommodating the list handling, we are still left with the parsing result being an array.
[
'1+2+3'
]
This is how neatley
informs you about grammar ambiguity. If there is more than one result, that means that the grammar is ambiguous, i.e. there is more than one route to parse the input. This is a bad sign.
Ambiguity will present performance issues (multiple routes need to be analysed to complete parsing). It will also lead to hard to catch bugs – different routes can produce different parsing results.
Here is an example of an ambiguous grammar:
attributeName ->
[_a-zA-Z]:* [_a-zA-Z0-9-]:* {% d => d[0].join('') + d[1].join('') %}
Provided an input that matches the first and the second character sets, the output of the parser is:
[
'foobar',
'foobar'
]
This is because both [_a-zA-Z]:*
and [_a-zA-Z0–9-]:*
match the “foobar” string to completion.
Whenever you encounter ambiguity, use nearley-test
program to trace the parsing route and see where the parser diverges.
The above example can be fixed by simply ensuring that the first pattern appears exactly once.
attributeName ->
[_a-zA-Z] [_a-zA-Z0-9-]:* {% d => d[0] + d[1].join('') %}
Rejecting a match
When I really introduced the postprocessor functions, I have said that the postprocessor function is invoked with three parameters:
- The first parameter is a list of the matched tokens.
- The second is the index at which the rule was found.
- The third is a sentinel value used to reject the match.
When I read the nearley
documentation the first time, it left me puzzled – when would I want to reject a valid match? I have come across at least one scenario when working on the CSS parser I have mentioned earlier.
Consider this example:
selectorBody ->
typeSelector:?
idSelector:?
classSelector:*
attributeValueSelector:*
attributePresenceSelector:*
pseudoClassSelector:* {%d flatten %d}
Note: This is a single production rule. There are no
|
.
A CSS selector can consist of:
- a type selector (optional)
- an id selector (optional)
- multiple class selectors (optional)
- multiple attribute value selectors (optional)
- multiple presence selectors and (optional)
- multiple pseudo class electors (optional)
Notice the pattern? All of them are optional. This means that a CSS selector must have at least one of the above components, but neither is required.
My first and naïve attempt was:
Note: Using abbreviations only for formatting purposes.
selectorBody ->
TS IS:? CS:* AVS:* APS:* PCS:* {% flatten %}
| TS:? IS CS:* AVS:* APS:* PCS:* {% flatten %}
| TS:? IS:? CS AVS:* APS:* PCS:* {% flatten %}
| TS:? IS:? CS:* AVS APS:* PCS:* {% flatten %}
| TS:? IS:? CS:* AVS:* APS PCS:* {% flatten %}
| TS:? IS:? CS:* AVS:* APS:* PCS {% flatten %}
This ensures that at least one selector is present. The problem is that (you have guessed it!) this can produce more than one result (ambiguous grammar).
This is where the rejection sentinel comes in.
We can filter out null
results (when a quantifier does not match anything, it produces null
; remember how we have written quantifiers using BNF). And if there are no matches, we return the rejection sentinel to reject the match.
selectorBody ->
TS:? IS:? CS:* AVS:* APS:* PCS:* {% (d, i, reject) => { const tokens = d.filter((token) => { return token !== null; }); if (!tokens.length) return reject; return flatten(tokens); } %}
Note: This parser can still produce ambiguous results if either of the nonterminals used in the expression merge into one another; just like in the earlier example that demonstrated ambiguity.
Helper functions
It is not good for code readabilty to write postprocessor functions after each production rule. Use @{% … %}
syntax to declare helper functions. In fact, the above code example is already using a predefined flatten
function.
@{%
const flatten = d => {
return d.reduce(
(a, b) => {
return a.concat(b);
},
[]
);
};const filter = d => {
return d.filter((token) => {
return token !== null;
});
};const selectorBody = (d, i, reject) => {
const tokens = filter(d);if (!tokens.length) {
return reject;
}return flatten(tokens);
}
%}selectorBody ->
TS:? IS:? CS:* AVS:* APS:* PCS:* {% selectorBody %}
Note: All code written between
@{% … %}
is hoisted to the top of the generated grammar file. Therefore, where you write them does not matter. However, I prefer to define all helpers at the top of the file.
Parsing and tokenising
Now that you know how to parse a string, how do you tokenise it?
The answer is simple: a postprocessor can return anything. Just like we used a postprocessor to convert strings to numbers in the calculator example, we can return objects describing the tokens and delegate the math logic to whatever program that understands those tokens, e.g.
N ->
[0-9]:+ {% d => {type: 'NUMBER', value: parseInt(d[0].join(''), 10)} %}
Bonus: IDE grammar
There are plugins for different IDEs that enable Nearley syntax highlighting:
- Atom users can write nearley grammars with this plugin by Bojidar Marinov.
- Sublime Text users can write nearley grammars with this syntax by liam4.
- Vim users can use this plugin by Andrés Arana.
The all-powerful weapon!
I have promised you an all-powerful weapon: you have it! Now you have a skill to parse absolutely any string!
The CSS selector parser
At the beginning I have said that I have started learning about parsing because I needed to build a CSS selector parser. I have built it using nearley
and I am pretty sure it is the most thorough CSS selector parser available to JavaScript.
Meet Scalpel, a CSS selector parser.
If you are looking for more examples of using Earley algorithm to parse context-free grammar, I encourage you to take a look at the source code of Scalpel.
Good luck! :-)