Technistas

Matthew D. Laudato writes about software and technology

Perl after Swine: Using Pig Streaming with Perl (Part 1)


Back in the days before Hadoop, before Big Data was a big thing, large data sets still existed, and practitioners still needed to wrest useful information from them. And for a large number of those practitioners, Perl was the tool of choice.

Fast forward to today: Hadoop has emerged as the premier platform for Big Data storage and processing. Java, Python and Ruby dominate the programming landscape. So what of the humble Perl? You can argue that Perl has gone into maintenance mode – the community around it is far less active than it once was. Yet it is still installed by default on many Linux distributions, and is ‘always on’, waiting for you to do something useful with it. That’s what this series of posts is all about: how to use Pig streaming with Perl to get the best out of both Hadoop and Perl. We’ll set up the basics today, and then go into the details in upcoming posts.

The basic idea is this: You have a large data set (say, a set of HTML documents such as emails) that you want to analyze. The data has some structure to it, but you need to parse each document to extract the relevant bits, like how many text elements, words and images it contains. Pig alone isn’t much help here – what you need is the ability to ingest each document, process it, and return the results back to Pig. Enter Perl. You can let Pig streaming do the heavy lifting of moving the data around, pass each document through a Perl program that extracts the detail data you need, and send the results back through Pig for final processing and storage in HDFS.

The basic setup looks like this:

1. Get your Perl program ready for streaming. Typically that means writing a Perl script that reads records from stdin, does something useful with them, and writes results to stdout, then creating a jar file (yes, a jar file!) that contains your Perl script, as well as any Perl libraries it needs that are not included with your Perl distribution. It also typically means creating a simple shell script that explodes the jar file and then calls the Perl script.
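To make that concrete, here is a minimal sketch of what such a script might look like. It is not the actual code from this project: it assumes Pig hands the script one tab-delimited record per line on stdin, with the raw HTML as the last field, and it emits one tab-delimited result line on stdout in the order the STREAM statement in step 3 expects. The file name count_images.pl and the regex-based counting are placeholders; a real parser would likely use a module like HTML::Parser.

#!/usr/bin/env perl
# count_images.pl: hypothetical streaming script, one tab-delimited record in, one result line out
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    # assume Pig sends: email_id, create_on, subject_line, html_body
    my ($email_id, $create_on, $subject_line, $html) = split /\t/, $line, 4;
    next unless defined $html;

    # crude counts via regexes; real code would use a proper HTML parser
    my $num_images        = () = $html =~ /<img\b/gi;
    my $num_text_elements = () = $html =~ /<(?:p|div|span|td|li)\b/gi;

    # emit fields in the order the Pig AS clause declares them
    print join("\t", $email_id, $create_on, $subject_line,
        length($subject_line), $num_text_elements, $num_images), "\n";
}

The shell wrapper that the DEFINE statement below invokes could be as simple as this, assuming the jar tool is available on the task nodes (unzip works just as well):

#!/bin/sh
# count_images.sh: unpack the shipped jar, then hand stdin to the Perl script
jar xf ta_v2.jar
exec perl count_images.pl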

2. Add a line to your Pig script to define how to execute the shell script. A simple example looks like this:

DEFINE taperl `count_images.sh` SHIP('ta_v2.jar');

There’s a lot going on in this little line. It defines an alias ‘taperl’ that you will use later in Pig to do the actual streaming. It tells Pig that the alias, when invoked, should run ‘count_images.sh’, the shell script you created. And finally, it instructs Pig to ship the jar file that you created out to all the Hadoop nodes so that Pig can find your code.

3. Use the Pig STREAM mechanism to invoke the streaming job. In our example, something like this would work:

structure_stats = STREAM email_structure THROUGH taperl AS (email_id:long, create_on:long, subject_line:chararray, subject_line_length:long, num_text_elements:long, num_images:long);

This takes an existing Pig alias called ‘email_structure’, streams it through ‘taperl’, the DEFINE’d alias we created earlier, and places the results into another Pig alias called ‘structure_stats’.
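For context, a complete Pig script built around this pattern might look roughly like the following. The input and output paths and the LOAD schema are hypothetical stand-ins for however ‘email_structure’ is actually produced; only the DEFINE and STREAM lines come from the steps above.

-- hypothetical end-to-end sketch; paths and the LOAD schema are placeholders
DEFINE taperl `count_images.sh` SHIP('ta_v2.jar');

email_structure = LOAD '/data/emails/structure' USING PigStorage('\t')
    AS (email_id:long, create_on:long, subject_line:chararray, html_body:chararray);

structure_stats = STREAM email_structure THROUGH taperl
    AS (email_id:long, create_on:long, subject_line:chararray,
        subject_line_length:long, num_text_elements:long, num_images:long);

STORE structure_stats INTO '/data/emails/structure_stats' USING PigStorage('\t');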

Intrigued? I thought you were. Next time we’ll take a look at the internals of the Perl program that does the actual HTML parsing.

Written by Matthew D. Laudato

May 12, 2015 at 4:30 pm

Posted in Big Data, Hadoop, Perl, Pig
