Benjamin Johnston - Prolog Web-scraping

In 2005, on comp.lang.prolog, I claimed that Prolog is excellent for screen-scraping web-pages. Last week - almost decade later - I received an email asking, “Why?”

The internet never forgets!

My claim was based on Prolog’s powerful pattern matching features. In Prolog, pattern matching is referred to as unification.

Assume we have a variable called Page whose value is as follows:

Page = html(body(h1('Sydney Weather'), p('Sunny')))

Then if we have a variable called Forecast we can match it to a value using the following command:

html(body(_, p(Forecast))) = Page

Prolog will ‘unify’ both sides and output the result that Forecast = 'Sunny'.

Pattern matching (or unification) is central to screen-scraping. In the remainder of this article I’ll show how to combine unification with a HTTP and HTML parser. I’ll then look briefly at the potential for natural language parsing.

Prolog

I use SWI-Prolog for these examples.

I assume you already know Prolog. For example, you may have studied it briefly at university. My objective is to show a practical application of Prolog.

If you are not familiar with Prolog, then this is not the place to learn it. I hope that the examples below will inspire you to learn Prolog. To be frank, it isn’t an easy language for beginners. However, it will change the way you think about programming for the better. Two good books are Prolog Programming for Artificial Intelligence and Prolog Programming in Depth (the authors provide a free download).

Loading and Parsing HTML

Assume we have the following HTML (sydney.html):

<html>
  <body>
    <h1>Sydney Forecast</h1>
    <p>Forecast for Thursday:</p>
    <div class="max-temp">Max 25</div>
    <div class="description">Sunny</div>
  </body>
</html>

The following Prolog query reads the file and parses it as HTML:

load_structure('sydney.html', Page, []).

Alternately, the following query will retrieve the file directly from the internet:

use_module(library(http/http_open)).

http_open('http://www.benjaminjohnston.com.au/downloads/sydney.html', DataStream, []),
load_structure(DataStream, Page, []).

If you run this query yourself, you'll realize that I simplified things in the introduction. In fact, the true value of the Page variable is the following:

Page = [
  element(html, [], [
    element(body, [], [
      element(h1, [], ['Sydney Forecast']), 
      element(p, [], ['Forecast for Thursday:']), 
      element(div, [class='max-temp'], ['Max 25']), 
      element(div, [class=description], ['Sunny'])
    ])
  ])
].

Simple Pattern Matching

Now that Prolog has loaded and parsed the HTML file, we can use unification to extract information. I can extract the forecast from Page using this query:

load_structure('sydney.html', Page, []),
[element(_, _, [element(_, _, [_, _, _, element(_, _, [Forecast])])])] = Page.

The result is Forecast = 'Sunny'.

This kind of pattern-matching might be suitable for extracting data from a simple page that regularly updates.

More Sophisticated Matching

Prolog can perform more sophisticated pattern matching. Prolog allows for recursive queries that can search deep inside pages. These queries may express complex logical conditions to perform intelligent searches.

Assume we have a slightly more complex HTML page (australia.html):

<html>
  <head>
    <title>Forecasts</title>
  </head>
  <body>
    <h1>Australian Capital Cities Forecast</h1>
    <table>
      <tr><th>City</th><th>Description</th></tr>
      <tr><td>Adelaide</td><td>21</td></tr>
      <tr><td>Brisbane</td><td>28</td></tr>
      <tr><td>Canberra</td><td>16</td></tr>
      <tr><td>Darwin</td><td>33</td></tr>
      <tr><td>Hobart</td><td>11</td></tr>
      <tr><td>Melbourne</td><td>19</td></tr>
      <tr><td>Perth</td><td>21</td></tr>
      <tr><td>Sydney</td><td>25</td></tr>
    </table>
  </body>
</html>

Instead of writing a fixed query, the following program can be used to search the entire tree:

weather([H|_], Out) :- weather(H, Out).
weather([_|T], Out) :- weather(T, Out).
weather(element(_, _, Children), Out) :- weather(Children, Out).

weather(element(tr, _, [element(td, _, [City]), element(td, _, [Forecast])]), City-Forecast).

The left hand side of the ‘:-’ symbol is similar to a function declaration in other programming languages. Notice that pattern matching can be used even in the declaration.

The first three lines are a common pattern that encode the following rules:

If you encounter a list, then check the first element of the list.
If you encounter a list, then check the rest of the list.
If you encounter an element that has children, then check its list of children.

Note that if a list is encountered, both the first and second condition will apply. The Prolog search procedure will automatically try all valid rules. Both rules will be executed.

The last line of the program provides a matching condition. It looks for pairs of td inside a tr.

Now, the program can be executed with this query:

load_structure('australia.html', Page, []),
weather(Page, City-Forecast).

Prolog will apply all the rules in the program to find city and forecast pairs. It will generate the following output:

City = 'Adelaide', Forecast = 21;
City = 'Brisbane', Forecast = 28;
City = 'Canberra', Forecast = 16;
City = 'Darwin', Forecast = 33;
City = 'Hobart', Forecast = 11;
City = 'Melbourne', Forecast = 19;
City = 'Perth', Forecast = 21;
City = 'Sydney', Forecast = 25.

As an aside, an equivalent XSLT program would be very similar. However, Prolog gives you far greater power than XSLT and cleaner syntax. In fact, you can use the full expressive power of Prolog to guide the search. This includes mathematical calculations, queries to other data sources and even (as we’ll see in a moment) natural language processing.

Parsing Natural Language

Prolog also has built-in syntax for creating custom parsers. A custom parser can be used to interpret the text in HTML elements. In Prolog, this built-in feature is called a definite clause grammar (DCG).

For example, consider strings similar to the following:

Sydney: min 15 celsius, max 25 celsius

A classic BNF (Backus-Naur Form) specification of its grammar might look like this:

<Weather-Sentence> ::= <City> ":" "min" <Temperature> "," "max" <Temperature>

<Weather-Sentence> ::= <City> ":" "max" <Temperature> "," "min" <Temperature>

<Temperature> ::= <Number> "celsius"

<Temperature> ::= <Number> "fahrenheit"

With Prolog we can translate the grammar directly into a program. The Prolog parser is very similar to the BNF specification:

weather(Place, Min, Max) --> 
    [Place, ':', min], 
    temperature(Min), 
    [',', max], 
    temperature(Max).

weather(Place, Min, Max) --> 
    [Place, ':', max], 
    temperature(Max), 
    [',', min], 
    temperature(Min).

temperature(Celsius) --> 
    [Celsius, celsius].

temperature(Celsius) --> 
    [Fahrenheit, fahrenheit], 
    {Celsius is (Fahrenheit - 32) * 5/9}.

If this DCG is entered into a program file (.pl) and loaded, the following query will work:

Input = 'Sydney: min 15 celsius, max 25 celsius',
tokenize_atom(Input, Tokens),
weather(Place, Min, Max, Tokens, []).

The output will be the following:

Place = Sydney, Min = 15, Max = 25.

The program performs appropriate conversions if the input is in fahrenheit:

Input = 'Sydney: min 59 fahrenheit, max 77 fahrenheit',
tokenize_atom(Input, Tokens),
weather(Place, Min, Max, Tokens, []).

It can also deal with a reversal of the min/max order and mixed units:

Input = 'Sydney: max 77 fahrenheit, min 15 celsius',
tokenize_atom(Input, Tokens),
weather(Place, Min, Max, Tokens, []).

In both cases, the output is the same:

Place = Sydney, Min = 15, Max = 25.

Prolog can be used to parse more complex grammars. Open-ended natural language processing is still a research problem. However, Prolog is ideal when the language used is simple and predictable. For example, most weather forecasts in a region folow a predictable template: meteorologists are not poets. It should be relatively easy to automatically interpret the text of weather forecasts. Most introductory books, including the two that I listed above, explain how to write such natural language parsers.

Summary

The low-level requirements of a screen-scraper are not complex. Most languages, including SWI-Prolog include HTTP client libraries.

The interesting challenge of screen-scraping is pattern matching and parsing. This is where Prolog excels.