Parsing absolutely anything in JavaScript using Earley algorithm

Enjoy the journey ;-)

The reason for writing a parser

selector: body
properties:
title: title
articles:
selector: article {0,}
properties:
body: .body::property(innerHTML)
summary: .body p {0,}[0]
imageUrl: img::attribute(src)
title: .title::text()::test(/foo/)::format(upperCase)

Picking the right tool for the job

The shark-like tank, with powerful jaws.
Who needs all of these other parsers when there is He-Man.

Setup

$ npm install nearley
$ ls -1 ./node_modules/.bin
nearley-railroad
nearley-test
nearley-unparse
nearleyc

Parsing “1+2+3”

expression -> "1+2+3"

Testing grammar

$ cat <<'EOF' > ./grammar.ne
expression -> "1+2+3"
EOF
$ nearleyc ./grammar.ne --out ./grammar.js
nearley-test ./grammar.js --input '1+2+3'
Table length: 6
Number of parses: 1
Parse Charts
Chart: 0
0: {expression → ● expression$string$1}, from: 0
1: {expression$string$1 → ● "1" "+" "2" "+" "3"}, from: 0
Chart: 1
0: {expression$string$1 → "1" ● "+" "2" "+" "3"}, from: 0
Chart: 2
0: {expression$string$1 → "1" "+" ● "2" "+" "3"}, from: 0
Chart: 3
0: {expression$string$1 → "1" "+" "2" ● "+" "3"}, from: 0
Chart: 4
0: {expression$string$1 → "1" "+" "2" "+" ● "3"}, from: 0
Chart: 5
0: {expression$string$1 → "1" "+" "2" "+" "3" ● }, from: 0
1: {expression → expression$string$1 ● }, from: 0
Parse results:
[ [ '1+2+3' ] ]

Table of partial parsings

Chart: 0
0: {expression → ● expression$string$1}, from: 0
1: {expression$string$1 → ● "1" "+" "2" "+" "3"}, from: 0
Chart: 1
0: {expression$string$1 → "1" ● "+" "2" "+" "3"}, from: 0
Chart: 2
0: {expression$string$1 → "1" "+" ● "2" "+" "3"}, from: 0
Chart: 3
0: {expression$string$1 → "1" "+" "2" ● "+" "3"}, from: 0
Chart: 4
0: {expression$string$1 → "1" "+" "2" "+" ● "3"}, from: 0
Chart: 5
0: {expression$string$1 → "1" "+" "2" "+" "3" ● }, from: 0
1: {expression → expression$string$1 ● }, from: 0Parse results:
[ [ '1+2+3' ] ]
expression ->
"1+2+0"
| "1+2+3"

Postprocessors

[
[
'1+2+3'
]
]
expression ->
"1" "+" "2" "+" "3"
[
[
'1',
'+',
'2',
'+',
'3'
]
]

Seriously, whats a postprocessor?

expression ->
"1" "+" "2" "+" "3" {% d => d.join('') %}
EOF
[
'1+2+3'
]

Using a nonterminal inside a nonterminal

expression ->
N MS N MS N {% d => d.join('') %}
MS ->
"+" {% d => d[0] %}
| "-" {% d => d[0] %}
N ->
"1" {% d => d[0] %}
| "2" {% d => d[0] %}
| "3" {% d => d[0] %}
| "4" {% d => d[0] %}
| "5" {% d => d[0] %}
| "6" {% d => d[0] %}
| "7" {% d => d[0] %}
| "8" {% d => d[0] %}
| "9" {% d => d[0] %}
| "0" {% d => d[0] %}

Recursive patterns

Once or more

N ->
"0" {% d => d[0] %}
| "1" {% d => d[0] %}
| "2" {% d => d[0] %}
| "3" {% d => d[0] %}
| "4" {% d => d[0] %}
| "5" {% d => d[0] %}
| "6" {% d => d[0] %}
| "7" {% d => d[0] %}
| "8" {% d => d[0] %}
| "9" {% d => d[0] %}
| "0" N {% d => d[0] + d[1] %}
| "1" N {% d => d[0] + d[1] %}
| "2" N {% d => d[0] + d[1] %}
| "3" N {% d => d[0] + d[1] %}
| "4" N {% d => d[0] + d[1] %}
| "5" N {% d => d[0] + d[1] %}
| "6" N {% d => d[0] + d[1] %}
| "7" N {% d => d[0] + d[1] %}
| "8" N {% d => d[0] + d[1] %}
| "9" N {% d => d[0] + d[1] %}

Zero or more

N ->
null
| "0" N {% d => d[0] + d[1] %}
| "1" N {% d => d[0] + d[1] %}
| "2" N {% d => d[0] + d[1] %}
| "3" N {% d => d[0] + d[1] %}
| "4" N {% d => d[0] + d[1] %}
| "5" N {% d => d[0] + d[1] %}
| "6" N {% d => d[0] + d[1] %}
| "7" N {% d => d[0] + d[1] %}
| "8" N {% d => d[0] + d[1] %}
| "9" N {% d => d[0] + d[1] %}

Character sets and quantifiers

N ->
[0-9]:+ {% d => d[0].join('') %}
N ->
[0-9]:* {% d => d[0].join('') %}

Recap

  • what is a nonterminal
  • what is a terminal
  • what is a production rule
  • how to debug grammar
  • what is a nearley postprocessor
  • what are character sets
  • what are quantifiers
  • what is a BNF recursive repeat construct
(\d+\+?)

Parsing (and evaluating) mathematical expressions

main -> AS {% d => d[0] %}# Parentheses
P -> "(" AS ")" {% d => d[1] %}
| N {% d => d[0] %}
# Multiplication and division
MD -> MD "*" P {% d => d[0]*d[2] %}
| MD "/" P {% d => d[0]/d[2] %}
| P {% d => d[0] %}
# Addition and subtraction
AS ->
AS "+" MD {% d => d[0]+d[2] %}
| AS "-" MD {% d => d[0]-d[2] %}
| MD {% d => d[0] %}
N ->
[0-9]:+ {% d => parseInt(d[0].join(''), 10) %}
Match this using your regular expression.

Ambiguity

[
[
'1+2+3'
]
]
[
'1+2+3'
]
attributeName ->
[_a-zA-Z]:* [_a-zA-Z0-9-]:* {% d => d[0].join('') + d[1].join('') %}
[
'foobar',
'foobar'
]
attributeName ->
[_a-zA-Z] [_a-zA-Z0-9-]:* {% d => d[0] + d[1].join('') %}

Rejecting a match

  • The first parameter is a list of the matched tokens.
  • The second is the index at which the rule was found.
  • The third is a sentinel value used to reject the match.
selectorBody ->
typeSelector:?
idSelector:?
classSelector:*
attributeValueSelector:*
attributePresenceSelector:*
pseudoClassSelector:* {%d flatten %d}
  • a type selector (optional)
  • an id selector (optional)
  • multiple class selectors (optional)
  • multiple attribute value selectors (optional)
  • multiple presence selectors and (optional)
  • multiple pseudo class electors (optional)
selectorBody ->
TS IS:? CS:* AVS:* APS:* PCS:* {% flatten %}
| TS:? IS CS:* AVS:* APS:* PCS:* {% flatten %}
| TS:? IS:? CS AVS:* APS:* PCS:* {% flatten %}
| TS:? IS:? CS:* AVS APS:* PCS:* {% flatten %}
| TS:? IS:? CS:* AVS:* APS PCS:* {% flatten %}
| TS:? IS:? CS:* AVS:* APS:* PCS {% flatten %}
selectorBody ->
TS:? IS:? CS:* AVS:* APS:* PCS:* {% (d, i, reject) => { const tokens = d.filter((token) => { return token !== null; }); if (!tokens.length) return reject; return flatten(tokens); } %}

Helper functions

@{%
const flatten = d => {
return d.reduce(
(a, b) => {
return a.concat(b);
},
[]
);
};
const filter = d => {
return d.filter((token) => {
return token !== null;
});
};
const selectorBody = (d, i, reject) => {
const tokens = filter(d);
if (!tokens.length) {
return reject;
}
return flatten(tokens);
}
%}
selectorBody ->
TS:? IS:? CS:* AVS:* APS:* PCS:* {% selectorBody %}

Parsing and tokenising

N ->
[0-9]:+ {% d => {type: 'NUMBER', value: parseInt(d[0].join(''), 10)} %}

Bonus: IDE grammar

  • Atom users can write nearley grammars with this plugin by Bojidar Marinov.
  • Sublime Text users can write nearley grammars with this syntax by liam4.
  • Vim users can use this plugin by Andrés Arana.

The all-powerful weapon!

The all-powerful weapon of parsing!

The CSS selector parser

--

--

--

Tech / Product Founder — building https://contra.com/

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Setup Behat with Symfony 5 Rest API

10 Random topics from React.js

PHP VS NODE.JS — Which One is Better for You?

Introduction to ImmutableJs

How to use a single instance of Socket.IO in your React app

How JavaScript Engines Work

Let’s build a React app in TDD

Organize NodeJS code using custom modules

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Gajus Kuizinas

Gajus Kuizinas

Tech / Product Founder — building https://contra.com/

More from Medium

LeetCode’s Combinations Problem- Recursive Backtracking in JavaScript

Typescript Quick Introduction

HackerRank-Easy-Staircase(Typescript)

.map(), .filter() and .reduce() in JavaScript