Sample wordcount streaming job using PHP on Commoncrawl dataset.

The easiest way to start working with the Commoncrawl dataset is probably Amazon’s own hadoop framework called Elastic MapReduce (EMR). To use it you need to sign in to amazonaws.com services, and be aware that EMR is not free: you can’t use the free tier of micro EC2 instances when running an EMR job via the amazonaws console, which is basically a user friendly GUI for those who don’t have the time or skills to set up their own hadoop cluster.
This sample job will cost 2x $0.06 (m1.small regular price) + 2x $0.015 (EMR surcharge) = $0.15, or roughly 2x $0.007 (depends on the current market price) + 2x $0.015 ≈ $0.05 when using spot instances, which are significantly cheaper; I’ve been using them for every job, including this example. The price seems extremely low, but imagine running this example on the roughly 200k other files in the dataset to get the complete picture.
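
To put “the complete picture” into rough numbers, here is a back-of-envelope sketch in php. This is my own illustrative estimate built only from the prices and timings mentioned in this article; the file count, the per-file runtime and the batching assumption are all approximations, not exact AWS billing:

<?php
// rough, illustrative cost projection; inputs are the hourly prices and
// timings quoted in this article, so treat the result as an order of magnitude
$files          = 200000;                 // rough number of textData files in the dataset
$minutesPerFile = 6;                      // the mapreduce part of this example took ~6 minutes
$clusterPerHour = 2 * 0.007 + 2 * 0.015;  // 2 spot m1.small nodes + EMR surcharge, per hour

// assumes files are batched into long-running jobs, so the ~11 minute
// cluster setup is paid only once and can be ignored here
$hours = $files * $minutesPerFile / 60;
printf("~%d cluster-hours, roughly \$%d with this 2-node spot cluster\n", $hours, $hours * $clusterPerHour);
?>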

The mapper/reducer scripts plus the output files have to be stored in your own Amazon S3 bucket (think of it as a folder), which you can create easily via the amazonaws console; the first 5 GB of storage is free and 15 GB of free outbound transfer is included.
So here is the sample php mapper script, which I named wcmapper.php – you need to save it as a file and upload it to your S3 bucket:

#!/usr/bin/php
<?php
//sample mapper for hadoop streaming job
$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
   // remove leading and trailing whitespace and lowercase
   $line = strtolower(trim($line));
   // split the line into words while removing any empty string
   $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
   // increase counters
   foreach ($words as $word) {
       // initialise the count on first occurrence to avoid an undefined index notice
       $word2count[$word] = isset($word2count[$word]) ? $word2count[$word] + 1 : 1;
   }
}

// write the results to STDOUT (standard output)

foreach ($word2count as $word => $count) {
   // tab-delimited
   echo "$word\t$count\n";
}

?>

We also need a reducer script named wcreducer.php; save it as a file and upload it to your S3 bucket the same way:

#!/usr/bin/php
<?php
//reducer script for sample hadoop job
$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // parse the input we got from mapper.php
    list($word, $count) = explode("\t", $line);
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts (initialise on first occurrence to avoid an undefined index notice)
    if ($count > 0) {
        $word2count[$word] = isset($word2count[$word]) ? $word2count[$word] + $count : $count;
    }
}

ksort($word2count);  // sort the words alphabetically

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}

?>
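
Before spending any money on EMR you can sanity-check both scripts locally on any machine with the php CLI installed; piping a small text file through them simulates the map, sort and reduce phases (the file name here is just a placeholder):

cat some_local_textfile | php wcmapper.php | sort | php wcreducer.php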

The scripts are taken from here; I just made cosmetic fixes to them.
It is important to say that these sample scripts are good only for small test runs: they keep a counter for every distinct word in a PHP array, so on larger datasets they will crash due to memory issues.
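
Just to illustrate what a memory-friendlier variant could look like (this is my own sketch, not part of the original scripts): the mapper can emit "word\t1" immediately for every word instead of aggregating counts in a PHP array, and since hadoop streaming hands the reducer its input already grouped and sorted by key, the reducer only needs to remember the word it is currently summing:

#!/usr/bin/php
<?php
// sketch of a streaming-friendly mapper: emit "word<TAB>1" right away,
// so memory use stays constant regardless of input size
while (($line = fgets(STDIN)) !== false) {
    $words = preg_split('/\W/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        echo "$word\t1\n";
    }
}
?>

And the matching reducer, relying on the keys arriving sorted:

#!/usr/bin/php
<?php
// sketch of a streaming-friendly reducer: keep only the current word
// and its running total in memory
$current = null;
$sum = 0;
while (($line = fgets(STDIN)) !== false) {
    $parts = explode("\t", trim($line));
    if (count($parts) != 2) continue;              // skip malformed lines
    list($word, $count) = $parts;
    if ($word === $current) {
        $sum += intval($count);
    } else {
        if ($current !== null) echo "$current\t$sum\n";
        $current = $word;
        $sum = intval($count);
    }
}
if ($current !== null) echo "$current\t$sum\n";    // flush the last word
?>

The trade-off is more intermediate data between mapper and reducer (normally reduced again with a combiner), but for a first test on a single textData file the simple versions above are perfectly fine.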

Let’s go to the setup:

  1. From the aws console go to Elastic Map Reduce.
  2. On the EMR homepage, in the upper right corner next to your account name, choose the proper region to work in. The region should be US-East, because the Commoncrawl files are stored there too; otherwise you will be charged for data transfer between regions.
  3. Create a new job flow and fill in the form:
    emr1
  4. On the next page you’ll have to enter the location of the source data and scripts. For the input location I just used the text file mentioned in the Commoncrawl wiki (note that the file locations should be entered without ‘s3://’)
    aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112

    Beware: the sample php scripts shown will work ONLY with textData files; they will not work with raw arc.gz files.

  5. The other fields should be filled in as shown, of course with your own scripts’ location on S3 (I named my bucket “wctest”). Do not create the output folder; it will be created by the mapreduce job automatically, and entering an existing folder as the output folder will cause the job to crash.
    Extra args contains the line
    -inputformat SequenceFileAsTextInputFormat

    which just tells the job which format the input file is stored in so it can be read properly (see the command-line sketch after this list).
    emr2

  6. On the next page we need to define the desired cluster. We will create the smallest and cheapest cluster possible, using 1 m1.small master node and 1 m1.small core node, both as spot instances. For the spot bid price I entered $0.06, which is above the regular price, to make sure the instance will never be terminated due to spot price changes:
    emr3
  7. On the Advanced Options page we will just enable debugging; hadoop’s extensive logs are very useful when fixing possible issues. I created a ‘logs’ folder on my s3 bucket as the target location for them:
    Advanced Options
  8. On the Bootstrap Actions page we will choose Proceed with no Bootstrap Actions:
    Bootstrap Actions
  9. And finally, on the last setup page, we can review the whole job setup and edit previous steps where necessary. When everything is ok, we click Create Job Flow and the job will start:
    Review Page
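
For the curious: what the job flow form assembles is essentially an ordinary hadoop streaming invocation. Something along these lines (treat it as a sketch; the exact jar location and the s3 URL scheme depend on the hadoop/EMR version in use):

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112 \
  -output s3n://wctest/output \
  -mapper s3n://wctest/wcmapper.php \
  -reducer s3n://wctest/wcreducer.php \
  -inputformat SequenceFileAsTextInputFormat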

The whole job takes approximately 17 minutes to finish (11 minutes to set up the cluster, the rest the mapreduce job itself), so in this case the job would clearly have been done faster on an ordinary PC, but the point was to show how it works. After the job completes, you’ll find the resulting 5.6MB file called part-00000 in your S3 output location. The number of resulting files depends on how many reducers have been set up, which depends on the number and type of running instances – this can be found here (in our case 1 m1.small instance = 1 reducer).

After the initial optimism about how easily things go, there are challenges to fight with, given the huge number of files in the dataset (my rough estimate is about 200k), especially

  1. the costs of processing the whole set via EMR,
  2. the time needed to process the whole set.

I have already mentioned spot instances as a way of reducing costs. Choosing the right instance types helps to decrease time and costs too. A further decrease in costs can be achieved by deploying your own cluster, which will be covered in a future article.

