Gajus Kuizinas
Jan 22, 2017

There are many problems that large-scale parsing projects present which simply do not exist for small scrapers.

In the most general case, I agree with you. If you are writing a program to parse a dozen websites, cheerio.js will do. However, a large-scale project imposes new requirements.

Let's define a large-scale project. When I say a “large-scale project” I am referring to the project that I have been working on for the past 3 years, and the problems that I have observed along the way. https://applaudience.com/ is collecting showtime data from 2,000+ cinema websites. That's a lot of code, a lot of freelancers working on the individual scrapers, and a lot of bugs to catch.

The first problem is inconsistent code style. In a large project, there are many developers working on the same code base at the same time. A loose code style is bad: it increases the churn rate, PRs take longer to approve, and so on. This can be solved in part with ESLint. However, there are still a thousand ways to write the same logic. Using a custom DSL gives full control over the style of the parser schema.
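
To make the style argument concrete, here is a minimal sketch contrasting imperative cheerio code with a declarative schema. The schema format below is invented purely for illustration; it is not the actual DSL.

```js
// Hypothetical example; the schema format is invented for illustration.
const cheerio = require('cheerio');

// Imperative cheerio code: every developer structures this slightly differently.
const parseShowtimes = (html) => {
  const $ = cheerio.load(html);

  return $('.showtime').map((index, node) => {
    return {
      movie: $(node).find('.movie-title').text().trim(),
      time: $(node).find('time').attr('datetime')
    };
  }).get();
};

// Declarative schema: there is essentially one way to express the same extraction,
// so the style is uniform regardless of who writes it.
const showtimeSchema = {
  select: '.showtime',
  many: true,
  properties: {
    movie: {select: '.movie-title', extract: 'text'},
    time: {select: 'time', extract: {attribute: 'datetime'}}
  }
};
```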

The second problem is separation of concerns. Ideally, you want to separate data fetching, data parsing (extracting data from the document) and parsing logic (rules that dictate how one piece of information is used to get another piece of information, e.g. how a movie name and a location name can be used to get showtimes). A clear separation allows for simple, easy-to-test code and, most importantly, it allows you to separate developer roles. Now I can have a team that works only on data fetching and parsing logic, separate from the data parsing team. This is important because the two tasks require different skills, and the cost of those skills is very different.
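
As a rough sketch of that separation (the function names below are mine, not from the actual codebase), the three concerns can live in entirely separate modules:

```js
// A rough sketch of the separation; all names are hypothetical.
const {JSDOM} = require('jsdom');

// 1. Data fetching: retrieves documents, knows nothing about what will be extracted.
//    (Assumes Node.js 18+, where fetch is available globally.)
const fetchDocument = async (url) => {
  const response = await fetch(url);

  return new JSDOM(await response.text()).window.document;
};

// 2. Data parsing: extracts raw values from a document; no network access, no business rules.
const parseMovieNames = (document) => {
  return [...document.querySelectorAll('.movie-title')].map((node) => {
    return node.textContent.trim();
  });
};

// 3. Parsing logic: rules describing how one piece of information is used to get another,
//    e.g. how a movie name and a location name produce a showtimes URL.
const buildShowtimesUrl = (baseUrl, movieName, locationName) => {
  return baseUrl + '/showtimes?movie=' + encodeURIComponent(movieName) +
    '&location=' + encodeURIComponent(locationName);
};
```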

The third problem is scalability. This goes back to the previous point about separation of concerns: in the current setup, a single program is responsible for doing all three things (fetching, data parsing, parsing logic). The most resource-intensive parts are fetching and parsing. This restricts us quite a bit in terms of what we can do on a single machine at once, e.g. it is why we chose jsdom over a headless browser. Being able to simply send a manifest instruction to a worker agent and get a JSON response back makes scaling a lot simpler.
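
For example, the worker protocol could be as small as the sketch below. The endpoint, manifest shape, and schema format are assumptions for illustration, not the actual protocol.

```js
// Hypothetical worker protocol; the manifest shape is an assumption.
const manifest = {
  url: 'https://example-cinema.com/showtimes',
  schema: {
    select: '.showtime',
    many: true,
    properties: {
      movie: {select: '.movie-title', extract: 'text'},
      time: {select: 'time', extract: {attribute: 'datetime'}}
    }
  }
};

const runOnWorker = async (workerUrl, manifest) => {
  const response = await fetch(workerUrl, {
    method: 'POST',
    headers: {'content-type': 'application/json'},
    body: JSON.stringify(manifest)
  });

  // The coordinator only ever sees JSON. The expensive fetching and DOM parsing
  // happened on the worker, so adding capacity means adding worker agents.
  return response.json();
};
```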

The last problem is debuggability. Having a script that works in the browser and in Node.js and simply interprets instructions in the form of JSON makes it a lot easier for developers to write and debug the parsers.
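
A tiny sketch of why that is convenient (the instruction format here is made up for illustration): the interpreter only depends on a standard DOM Document, so the same function runs against jsdom in Node.js and can be pasted into the browser console to try an instruction on the live page.

```js
// The instruction format is invented for illustration.
const interpret = (instruction, document) => {
  const nodes = [...document.querySelectorAll(instruction.select)];

  return nodes.map((node) => {
    return instruction.attribute ?
      node.getAttribute(instruction.attribute) :
      node.textContent.trim();
  });
};

// Node.js: interpret({select: 'time', attribute: 'datetime'}, new JSDOM(html).window.document)
// Browser: interpret({select: 'time', attribute: 'datetime'}, document)
```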

Regarding:

Its not clear what happens. [..]

All in-built features are strict, i.e. if anything unexpected happens, the parser breaks.

However, developers will be able to define their own custom methods to use in CSS selectors.
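
As a rough sketch of both points, using hypothetical helper names (the real API is not shown in this post):

```js
// Illustration only; the real API may differ.
// Strict by default: anything unexpected throws and the parser breaks immediately.
const selectOne = (document, selector) => {
  const nodes = document.querySelectorAll(selector);

  if (nodes.length !== 1) {
    throw new Error('Expected exactly 1 match for "' + selector + '", got ' + nodes.length + '.');
  }

  return nodes[0];
};

// User-defined extraction methods registered by name, so that a schema (or a selector
// expression) can refer to them in addition to the in-built extractors.
const customMethods = {
  priceInCents: (node) => {
    return Math.round(parseFloat(node.textContent.replace(/[^0-9.]/g, '')) * 100);
  }
};
```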

Hope this answers your questions!
