China Map Project - Part 1: Nokogiri, Regular Expressions, and a JSON API

Over the past couple of months I’ve been working on a side project to create an interactive, data-rich map of China. (Check out the source code on GitHub!) To accomplish this goal, I made a Ruby on Rails app that scrapes Wikipedia for data about all of the regions in China, stores this information in a database, and outputs the data as a JSON API. On the frontend, I used JavaScript to create a vector map of China and populate the map with the data consumed from the API. I saw this project as something that could both challenge me technically and bring together two of my main interests: China and coding.

Gathering the Data

I used the Nokogiri gem to do all of the web scraping in this project. Starting on the main China page on Wikipedia, I was able to scrape the links to the individual pages for every province, autonomous region, municipality, and special administrative region in China. On each page I was able to take in some basic data about the region, including its population, population density, and GDP per capita.

Nokogiri and OpenURI make the process of web scraping very simple. In my Gemfile, I required the Nokogiri gem:

gem 'nokogiri'

and in the file where I did the scraping (db/seeds.rb), I required the OpenURI module of the Ruby standard library:

require 'open-uri'

(As a side note, I have been unable to find out if there is a better, more centralized place to put the OpenURI requirement in a Rails application. It is a module in Ruby’s standard library, so it can’t be placed in the Gemfile, but it still seems somewhat un-Railslike to just throw the requirement into whatever file you happen to be using it in.)

Once these requirements were declared, it was straightforward to use OpenURI to open the URL I was targeting and use Nokogiri to capture the HTML contents of the page.

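# URI.open is the open-uri entry point on Ruby 2.5+; as of Ruby 3.0, Kernel#open no longer accepts URLs.
china_main_page = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/China"))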

In the case of my app, I first scraped the contents of the China page on Wikipedia and stored the links to each region listed on that page into an array named region_links. I then iterated over these links (skipping one as there were two links to Taiwan) to create new Region objects.

def make_regions
  region_links.each_with_index do |url, i|
    next if i == 22
    region = Region.new.tap {|r| r.url = url }
    region.save
  end
end
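Skipping index 22 works only as long as Wikipedia keeps its links in the same order. Here is a minimal sketch of an order-independent alternative, assuming the two Taiwan links point to the same URL (the post does not say whether they do). The array below is a placeholder, not the actual scraped list:

```ruby
# Hypothetical stand-in for the scraped region_links array.
region_links = [
  "http://en.wikipedia.org/wiki/Hainan",
  "http://en.wikipedia.org/wiki/Taiwan",
  "http://en.wikipedia.org/wiki/Taiwan",
  "http://en.wikipedia.org/wiki/Hong_Kong"
]

# Array#uniq keeps the first occurrence of each URL and preserves order,
# so no hardcoded position is needed to drop the duplicate.
unique_links = region_links.uniq

unique_links.each { |url| puts url }
```

If the duplicate links turn out to differ in their URLs, this would not catch them and some other check (for example, on the derived region name) would be needed.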

Behind the scenes in app/models/region.rb, I used the last part of the URL to assign names to each region using the before_create hook.

class Region < ActiveRecord::Base
  before_create :assign_name_from_url

  def assign_name_from_url
    self.name = url[29..-1].split('_').join(' ')
    self.name.sub!(' Autonomous Region', '') if self.name.ends_with?(' Autonomous Region')
  end
end
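The slice url[29..-1] relies on the prefix "http://en.wikipedia.org/wiki/" being exactly 29 characters long, so it would break if the links were ever stored with https. As a rough sketch of the same idea without the magic number, here is a plain-Ruby version. The helper name region_name_from_url is my own invention and is not part of the Region model:

```ruby
require 'uri'

# Hypothetical helper, not part of the original app.
def region_name_from_url(url)
  # Take the last path segment, e.g. "Tibet_Autonomous_Region".
  slug = File.basename(URI.parse(url).path)
  # Underscores become spaces; String#delete_suffix needs Ruby 2.5+
  # and leaves the string unchanged when the suffix is absent.
  slug.tr('_', ' ').delete_suffix(' Autonomous Region')
end

puts region_name_from_url("http://en.wikipedia.org/wiki/Tibet_Autonomous_Region")
puts region_name_from_url("https://en.wikipedia.org/wiki/Hong_Kong")
```

Because it only looks at the path, the helper gives the same result for http and https URLs.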

Once I had all of the regions and their Wikipedia URLs stored in the database, the next step was to iterate over the regions and use Nokogiri to scrape the HTML contents of each region’s page.

def scrape_all_regions
  Region.all.each do |region|
    page = Nokogiri::HTML(URI.open(region.url))
    # ...
  end
end

I wanted to use best practices and make the code adhere to object-oriented design principles, so I put the above methods into a class called ChinaScraper, the outline of which roughly looks like this:

class ChinaScraper
  def run
    scrape_index
    make_regions
    scrape_all_regions
  end

  def scrape_index
    # ...
  end

  def make_regions
    # ...
  end

  def scrape_all_regions
    # ...
  end
end

Finally, in db/seeds.rb I seeded the database simply by instantiating a new ChinaScraper object and calling its run method.

ChinaScraper.new.run

Parsing the Data

If you look at the Wikipedia articles of a few Chinese provinces, you will notice that the structure of each article is fairly similar. Each page has a sidebar on the right with some basic data regarding the specific region: its capital, governor, latitude and longitude, GDP in US dollars and Chinese yuan, etc. These data are all seemingly laid out in the same format, but actually the CSS selectors are slightly different in different articles.

At first, I constructed a large conditional statement that chose the correct CSS selector to use given some hardcoded information to identify the region. Eventually this hardcoding didn’t sit right with me, so I found a more elegant solution: regular expressions.

One good example of this was with the data on the page indicating a region’s area and population density. Given the page of a particular region, I isolated all tr HTML elements with the CSS mergedrow class that contained the text km2 and stored this into a local variable.

area_info = page.search("tr.mergedrow").select { |t| t.text.match(/km2/i) }

I then split the appropriate string from the area_info array into an array of strings using a regular expression, selected the string with the relevant info, removed all commas from the string, and converted it to an integer.

region.area_km_sq         = area_info.first.text.
                              split(/\s| /)[3].
                              gsub(',', '').to_i
region.population_density = area_info.last.text.
                              split(/\s| |\//)[3].
                              gsub(',', '').to_i
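Picking element [3] out of the split array depends on exactly how many tokens come before the number. As a sketch of a different technique (not what the app does), a single regexp capture can pull out the number that sits directly before km2. The helper name and both sample strings are made up for illustration and only loosely resemble the infobox text:

```ruby
# Hypothetical helper: returns the integer that directly precedes "km2"
# (optionally separated by whitespace or a "/"), or nil when nothing matches.
def number_before_km2(text)
  # [[:space:]] also matches non-ASCII whitespace such as U+00A0 and U+3000.
  match = text.match(/([\d,]+)(?:\.\d+)?[[:space:]]*\/?km2/i)
  # Drop thousands separators before converting, as the original code does.
  match && match[1].delete(',').to_i
end

area_text    = "Area Total 166,900\u00A0km2 (64,400 sq mi)"  # made-up sample
density_text = "Density 280/km2 (730/sq mi)"                 # made-up sample

puts number_before_km2(area_text)
puts number_before_km2(density_text)
```

The trade-off is that the regexp is a little denser to read, but it no longer cares how many words precede the figure.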

I originally used only \s to split the string on whitespace, but found that a few whitespace characters were still showing up in the resulting array of strings! After some digging, I figured out that whitespace characters from the Chinese character set were included in the text of some of the Wikipedia articles, and they were not picked up by \s, which in Ruby matches only ASCII whitespace.

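I was able to solve this problem by adding the Chinese whitespace character as one of the alternatives in the above regexp (it follows the first | and can look like an ordinary space). However, when I tested this program on a different computer that did not have a Chinese language package installed, rake db:seed blew up on this line. After some more digging, I was able to resolve this problem by adding one magic comment to the top of the db/seeds.rb file, which tells Ruby to read the source as UTF-8 (Ruby 1.9 assumes US-ASCII unless told otherwise; Ruby 2.0 and later default to UTF-8):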

# encoding: UTF-8
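The whitespace behavior described above can be reproduced in isolation. The sketch below uses \u escapes so that the snippet itself contains no non-ASCII characters, and compares \s with the POSIX bracket class [[:space:]], which is Unicode-aware in Ruby. Using [[:space:]] is an alternative to pasting the character into the regexp, not what the app does, and the sample string is invented:

```ruby
samples = {
  "U+3000 ideographic space" => "\u3000",
  "U+00A0 no-break space"    => "\u00A0"
}

samples.each do |label, ch|
  # String#match? requires Ruby 2.4+.
  puts "#{label}: \\s matches? #{ch.match?(/\s/)}, " \
       "[[:space:]] matches? #{ch.match?(/[[:space:]]/)}"
end

text = "166,900\u3000km2"                    # made-up sample
by_ascii_class = text.split(/\s/)           # \s does not see U+3000
by_posix_class = text.split(/[[:space:]]/)  # splits at the ideographic space

p by_ascii_class.length
p by_posix_class
```

Since the regexp contains only ASCII characters, this variant also sidesteps the source-encoding problem.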

Outputting the Data as a JSON API

Once the data had been scraped from Wikipedia, parsed using Nokogiri and regular expressions, and persisted in the database, it was then just a matter of outputting the data as a JSON API for easy consumption by the JavaScript frontend. The app/controllers/regions_controller.rb file is as follows:

class RegionsController < ApplicationController
  def index
    @regions = Region.all
    render json: @regions
  end
end
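For a sense of what the frontend receives, here is a rough sketch of the payload shape using only Ruby's json library. The attribute names come from the snippets above. The Struct is a stand-in for the ActiveRecord model, the values are placeholders rather than scraped data, and a real Rails response would typically also include columns such as id and the timestamps:

```ruby
require 'json'

# Stand-in for the ActiveRecord model; not the app's actual class.
RegionRow = Struct.new(:name, :url, :area_km_sq, :population_density)

regions = [
  # Placeholder values, not real data.
  RegionRow.new("Example Region", "http://en.wikipedia.org/wiki/Example", 1000, 50)
]

# In Rails, render json: @regions serializes each record's attributes;
# Struct#to_h plays that role in this sketch.
payload = JSON.generate(regions.map(&:to_h))
puts payload
```

The result is a JSON array with one object per region, which the JavaScript frontend can loop over directly.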

Closing Thoughts

Though this was not my first web scraping project, it was challenging given the HTML/CSS inconsistencies across different Wikipedia articles. After solving the problem in a brute-force fashion using hardcoded region data, I was able to bring the code to a more abstract level and learn a lot about parsing text with regular expressions in the process.

Check out the next post in this series where I talk about the JavaScript frontend and the third and fourth posts where I talk about refactoring the db/seeds.rb file!