Scraping data using NodeJS and PhantomJS

If you are interested in data analysis or machine learning and have taken at least your first steps into this fascinating area, one of the very first questions you tend to ask yourself is: where do I get data from?

On many occasions, the data you are looking for may be freely available on the Internet, and as it turns out, there are well-known methods of extracting such data programmatically. One of those methods is provided by NodeJS and PhantomJS.

In this case, we chose these technologies for the following reasons:

  • Familiarity
  • PhantomJS allows you to inject jQuery into the scraped web page, which can then be used to query its contents.
  • NodeJS provides a great set of open source tools, readily available to extend and support your application; e.g. we can add features like caching or attach a UI (nw.js).
  • PhantomJS runs a headless WebKit browser: it acts as if you were actually browsing the page, but with no UI. This can be useful for hiding your tracks, since other methods, like plain GET requests, tend to be rejected by some of the scraped web servers. A disadvantage, though, is that because PhantomJS runs as a headless WebKit browser in its own process, the jQuery code injected from Node tends to be more difficult to debug.

A workaround for this is to write and test the actual jQuery code on the page itself, using a tool like the jQuerify Chrome extension, and then copy/paste that code into Node.
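For example, for the Wikipedia case used later in this post, you could first verify a selector like the one below directly in the browser console once jQuerify has injected jQuery (Wikipedia renders the article title in an h1 element with the id firstHeading):

  // Test in the browser console before pasting into Node;
  // #firstHeading is the id Wikipedia uses for the article title
  $('#firstHeading').text(); // e.g. "Spain"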

The following image shows jQuerify in action:

In order to bridge NodeJS and PhantomJS we use phantomjs-node: https://github.com/sgentle/phantomjs-node
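As a rough illustration, here is a minimal sketch of how the bridge works, assuming the callback-based API of the sgentle module linked above (newer forks of the module use promises instead). The jQuery CDN URL and the selector are just placeholders for this example:

  var phantom = require('phantom');

  phantom.create(function (ph) {
    ph.createPage(function (page) {
      page.open('https://en.wikipedia.org/wiki/Spain', function (status) {
        // Inject jQuery into the scraped page so we can query its contents
        page.includeJs('https://code.jquery.com/jquery-1.11.3.min.js', function () {
          // This function runs inside PhantomJS, not Node -- which is
          // exactly why it is harder to debug from the Node side
          page.evaluate(function () {
            return $('#firstHeading').text();
          }, function (title) {
            console.log('Scraped title: ' + title);
            ph.exit();
          });
        });
      });
    });
  });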

You can get the source code of this project using git from: https://github.com/electronicbits/scraper
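For instance:

> git clone https://github.com/electronicbits/scraper.git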

What the provided code does is actually very simple; however, it gives an extensible foundation for a larger API.

Once the code has been cloned locally from the repo and you have made sure NodeJS is installed and accessible from your local PATH, execute the following command in the directory where the code is:

> npm install

After making sure all npm modules are installed correctly, you can run the web service as follows:

>  node server.js
scraper listening at http://[::]:8080

The current implementation, as mentioned before, is a basic example: it extracts the title from Wikipedia pages. My initial plan was to provide an actual API that scrapes real estate data from a range of websites given an Australian postcode, but I wanted to avoid any legal issues concerning the IP rights those websites may hold. The purpose of this article is to demonstrate a way to extract data for personal use, not to provide such an API.

Finally, after running the server, open a web browser and go to:

http://localhost:8080/wikipedia/Spain

And the result of this query should be in JSON format:

{"country":"Spain"}

If you want to extend this application to suit your needs, you can add more modules, each with its respective URL and jQuery code, following the example given in the wikipedia.js module. Furthermore, additional routes need to be provided for the new modules; the routes live in the server.js file.
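As an illustration only, a new module might look something like the sketch below. The export shape, the imdb.js name, and the selector are all assumptions made up for this example, so check wikipedia.js in the repo for the actual convention used:

  // imdb.js -- hypothetical module, mirroring the idea behind wikipedia.js
  module.exports = {
    // Build the URL to scrape from the route parameter
    url: function (title) {
      return 'http://www.imdb.com/find?q=' + encodeURIComponent(title);
    },
    // jQuery code to run inside the scraped page, via page.evaluate()
    extract: function () {
      return $('.result_text').first().text();
    }
  };

The "scraper listening at http://[::]:8080" log line suggests a restify server; assuming that is the case, the matching route in server.js could look like this, where scrape() stands in for whatever helper the project uses to drive PhantomJS:

  server.get('/imdb/:title', function (req, res, next) {
    scrape(imdb, req.params.title, function (result) {
      res.send({ title: result });
      return next();
    });
  });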

Later on, we will be adding support for caching using MongoDB, as scraping is a costly process, especially if data is requested on a frequent basis. But that's left for another post.