Matthew D. Laudato writes about software and technology


Perl after Swine: Using Pig Streaming with Perl (Part 1)


Back in the days before Hadoop, before Big Data was a big thing, large data sets still existed, and practitioners still needed to wrestle useful information from them. And for a large number of those practitioners, Perl was the tool of choice.

Fast forward to today: Hadoop has emerged as the premier platform for Big Data storage and processing. Java, Python and Ruby dominate the programming landscape. So what of the humble Perl? You can argue that Perl has gone into maintenance mode – there’s not much of an active community out there making updates for Perl. Yet, it still is installed by default on many Linux distributions, and is ‘always on’, waiting for you to do something useful with it. That’s what this series of posts is all about: how to use the Pig streaming technique with Perl, to get the best out of both Hadoop and Perl. We’ll set up the basics today, and then go into the details in upcoming posts.

The basic idea is this: You have a large data set (say, a set of HTML documents such as emails) that you want to do some analysis on. The data has some structure to it, but you need to parse those documents to extract the relevant bits, like how many text elements, words and images are in the documents. Pig alone isn’t much help here – what you need is the ability to ingest each document, process it, and return the results back to Pig. Enter Perl. You can use Pig streaming to do the heavy lifting in Pig, and then pass each document through a Perl program to extract the detail data that you need, and send the results back through Pig for final processing and storage in HDFS.

The basic setup looks like this:

1. Get your Perl program ready for streaming. Typically that means writing a Perl script that reads from stdin and does something useful, and creating a jar file (yes, a jar file!) that contains your Perl script, as well as any Perl libraries that are needed but not included with your Perl distribution. It also typically means creating a simple shell script that explodes the jar file and then calls the Perl script.

2. Add a line to your Pig script to define how to execute the shell script. A simple example looks like this:

DEFINE taperl `` SHIP('ta_v2.jar');

There’s a lot going on in this little line. It defines an alias ‘taperl’ that you will use later in Pig to do the actual streaming. It tells Pig that the alias when invoked should run ‘’, the shell script you created. And finally, it instructs Pig to ship the jar file that you created out to all the Hadoop nodes so that Pig can find your code.

3. Use the Pig STREAM mechanism to invoke the streaming job. In our example, something like this would work:

structure_stats = STREAM email_structure THROUGH taperl AS (email_id:long, create_on:long, subject_line:chararray, subject_line_length:long, num_text_elements:long, num_images:long);

This takes an existing Pig alias called ‘email_structure’, streams it through ‘taperl’, the DEFINE’d alias we created earlier, and places the results into another Pig alias called ‘structure_stats’.

Intrigued? I thought you were. Next time we’ll take a look at the internals of the Perl program that does the actual HTML parsing.

Written by Matthew D. Laudato

May 12, 2015 at 4:30 pm

Posted in Big Data, Hadoop, Perl, Pig

Working with JSON REST APIs from R


My regular readers know that I have a passion for a couple of technical areas – among them, data, analysis and APIs. In my earlier three-part series on R programming and XML REST APIs (Part 1, Part 2, Part 3) I focused on obtaining email campaign data from the leading online marketing service, Constant Contact, through their XML-based REST APIs. Talk about combining my interests! The approach was pretty straightforward, but there is no question that the hardest part was working with the XML output from the REST APIs. As APIs have evolved, however, most vendors have switched over to JSON as their main data format. Constant Contact recently released their v2 APIs, which now support JSON, so I took an hour to go back over my earlier work on using REST APIs from R and rework it to use these new JSON APIs. Here are the basics.

The overall approach is the same as before – we want to obtain campaign data by calling a REST API, and then perform some basic manipulations in R to ready the data for more sophisticated analysis. The two significant differences between the v1 and v2 APIs are:

  1. The URL format and authentication parameters are different
  2. The results are returned in a JSON document

The impact of these differences was that I needed to change how I set up the RCurl calls to the APIs, and I had to change how I parsed the results to obtain the campaign data and get it into an R data frame object. The main driver script that gets the JSON data looks like this:

campaignJSON = getURL(url = paste("", access_token, "&api_key=", api_key, sep=""))
campaign.dataframe <- getCampaignDataframeFromJSON(campaignJSON)

A couple of comments. First, I use the ‘rjson’ library to assist in JSON parsing. Also, since I plan to access my account fairly often, I wrote a simple R script called OAuthAccessToken.R that stores values for ‘api_key’ and ‘access_token’, and I just source it to make those values available. Next, the URL format is a bit different from the v1 APIs, but nothing earth-shattering. (Click for details on the new Constant Contact v2 APIs – the new developer portal is very cool!). Finally, I wrote a new R function called getCampaignDataframeFromJSON that accepts a JSON document (as returned from RCurl) and transforms it into a data frame with one row for each campaign found. Here is the source code for the new function:

getCampaignDataframeFromJSON <- function(campaignJSON) {
  namelist <- NULL
  urllist <- NULL
  statuslist <- NULL
  datelist <- NULL
  # fromJSON() comes from the 'rjson' package
  JSONList <- fromJSON(campaignJSON)
  results <- JSONList$results
  # seq_along() is safe even when no campaigns are returned
  for (i in seq_along(results)) {
    campaign <- results[[i]]
    namelist <- c(namelist, campaign$name)
    urllist <- c(urllist, paste("", campaign$id, sep="", collapse=NULL))
    statuslist <- c(statuslist, campaign$status)
    datelist <- c(datelist, campaign$modified_date)
  }
  # One row per campaign
  campaignDF = data.frame(name=namelist,url=urllist,status=statuslist,date=datelist,stringsAsFactors=FALSE)
  return(campaignDF)
}

The basic algorithm is the same as with the XML APIs, but the code is a whole lot simpler with JSON! The resulting data frame looks something like this (fake data to protect my accounts of course!):
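As a rough illustration of the same transformation, here is a small sketch that runs without calling the API. It assumes the ‘rjson’ package is installed; the sample payload is invented, and only the field names (name, id, status, modified_date) are taken from the function above. It also builds the data frame with lapply() and rbind() rather than growing four vectors, which is just one alternative way to do it, not what my function does.

```r
library(rjson)

# Hypothetical sample payload: the field names follow the function above,
# the values are made up for illustration.
sampleJSON <- '{"results":[
  {"id":"101","name":"Spring Sale","status":"SENT","modified_date":"2013-02-25T16:45:41.191Z"},
  {"id":"102","name":"Fall News","status":"DRAFT","modified_date":"2012-12-04T18:51:02.667Z"}]}'

# Parse the JSON text into nested R lists and pick out the campaign list.
results <- fromJSON(sampleJSON)$results

# Build one single-row data frame per campaign, then stack them.
rows <- lapply(results, function(campaign) {
  data.frame(name = campaign$name,
             id = campaign$id,
             status = campaign$status,
             date = campaign$modified_date,
             stringsAsFactors = FALSE)
})
campaignDF <- do.call(rbind, rows)

print(campaignDF)
```

One caveat with this variant: if a campaign is missing one of the fields, data.frame() will stop with an error, so real data may need a check for NULL values first.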

   name                                         url        status  date
1  Email Created 2013/02/18, 9:53 AM            {someid1}  SENT    2013-02-25T16:45:41.191Z
2  Email Created 2012/12/20, 1:01 PM            {someid2}  SENT    2012-12-20T18:15:06.565Z
3  APR 15 2011 Attend Help The Humane Society   {someid3}  DRAFT   2011-03-03T15:22:16.407Z
4  About Me Newsletter                          {someid4}  DRAFT   2012-12-04T18:51:02.667Z

Once you have the data frame available, it’s pretty simple to iterate over each campaign and get the detail data like sends, opens and clicks – more on that in an upcoming post!

Written by Matthew D. Laudato

April 3, 2013 at 5:03 pm

Using REST APIs from R – Campaign statistics


In my previous posts in this series, we looked at how to call REST APIs from R. Now let’s get serious and get real email campaign data from the Constant Contact APIs. To do this, we’ll write a new function to do more sophisticated parsing of the XML returned from the Constant Contact campaigns resource. The algorithm looks like this:

  • GET the campaigns collection, which provides high level XML data on all campaigns
  • Transform the raw XML into a DOM object
  • Iterate over each campaign in the DOM and make a second call to the APIs to get the detail data for that campaign, and assemble an R dataframe object that holds the data for all campaigns

To get started, here’s a simple R script that we’ll use to set up the call to our new function:

campaignXML = getURL(url = "{username}/campaigns?access_token=your_token_goes_here")
campaignDOM = xmlRoot(xmlTreeParse(campaignXML))
campaignStats = getCampaignDataframe(campaignDOM)

You should review the previous posts (here and here) if this script doesn’t make sense to you.

The heavy lifting is done by our new function, getCampaignDataframe. The code is pretty straightforward, and amounts to parsing the DOM object to extract per-campaign data:

getCampaignDataframe <- function(doc) {
  namelist <- NULL
  urllist <- NULL
  sends <- NULL
  opens <- NULL
  for (i in 1:xmlSize(doc)) {
    node <- doc[[i]]
    namelist <- c(namelist, node$children[["content"]]$children$Campaign[["Name"]]$children$text$value)
    url = node$children[["content"]]$children$Campaign$attributes[["id"]]
    # Nodes without campaign content yield a NULL url and are skipped
    if (length(url) > 0) {
      # Switch the scheme to https (anchored, so an https URL is left alone)
      url = sub("^http:", "https:", url)
      urllist <- c(urllist, url)
      # Second call: fetch the detail document for this campaign
      url = paste(url,"?access_token=your_token_goes_",sep="",collapse=NULL)
      campaignDetailXML = getURL(url = url)
      campaignDetailDOM = xmlRoot(xmlTreeParse(campaignDetailXML))
      sends <- c(sends, strtoi(campaignDetailDOM[["content"]]$children$Campaign[["Sent"]]$children$text$value))
      opens <- c(opens, strtoi(campaignDetailDOM[["content"]]$children$Campaign[["Opens"]]$children$text$value))
    }
  }
  campaignstats = data.frame(name=namelist,url=urllist,sends=sends,opens=opens,stringsAsFactors=FALSE)
  return(campaignstats)
}

The function takes a single parameter ‘doc’, which is a DOM object containing the list of campaigns – including the URLs to get to the campaign detail. The for() loop simply iterates over the DOM and extracts the campaign name and URL. It then issues a GET request to the URL to obtain the sends and opens for that campaign. Once this is done, the data is assembled into the data frame and returned. Here is an example of the dataframe contents when I run the script against one of my Constant Contact accounts (with specific identifying data for username and campaign removed):

  name                               url                                sends  opens
1 Email Created 2012/11/10, 9:03 AM  {username}/campaigns/{campaignid}  3      1
2 Created via API as UTF-8 v6        {username}/campaigns/{campaignid}  3      1
3 Created via API as UTF-8 v5        {username}/campaigns/{campaignid}  3      2

Given this raw data, now in a useful form in a dataframe, we can use R to help us calculate the open rate for emails. In R, this is as simple as:

openRate <- campaignStats$opens/campaignStats$sends
campaignStats$openRate <- openRate

This adds a new column called openRate to the dataframe that contains the calculated open rates for each email campaign. Admittedly this is a simple example, but at this point, you have all the tools you need to pull data into R from a REST API, and do some basic manipulations on it.
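One thing to watch for with this calculation is a campaign that has zero sends (a draft, say), where the division produces NaN. Here is a minimal sketch of one way to guard against that, using only base R. The data frame below is made up and simply stands in for campaignStats; the campaign names and the choice to mark unsent campaigns as NA are my own assumptions.

```r
# Made-up numbers standing in for the campaignStats data frame.
campaignStats <- data.frame(name  = c("Campaign A", "Campaign B", "Campaign C"),
                            sends = c(3L, 3L, 0L),
                            opens = c(1L, 2L, 0L),
                            stringsAsFactors = FALSE)

# Compute the open rate only where something was sent; use NA otherwise,
# since 0/0 would give NaN.
campaignStats$openRate <- ifelse(campaignStats$sends > 0,
                                 campaignStats$opens / campaignStats$sends,
                                 NA_real_)

# Sort so the best-performing campaigns come first (NA rows sort last).
ranked <- campaignStats[order(campaignStats$openRate, decreasing = TRUE), ]

print(ranked)
```

Using NA rather than 0 keeps unsent campaigns from dragging down summary statistics such as mean(campaignStats$openRate, na.rm = TRUE).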

Happy Modeling!

– Matt

Written by Matthew D. Laudato

March 13, 2013 at 4:30 pm

Using REST APIs from R – XML operations


In the first part of this series, I showed you how to make calls to REST APIs from R. In this part, we’ll look at how to work with the XML documents that the REST APIs return. I’ll stick with the Constant Contact v1 APIs, since I’m most familiar with those and since the data (campaign statistics) is appropriate for analysis in R.

Once you have the raw XML from a REST API in an R variable, you need to parse it in order to extract the data that you’re interested in. To do this, we use the ‘XML’ package in R. First, load the package with:

> library('XML')

If we start with the ‘campaignsXML’ vector from the previous post, we can easily create a DOM object that contains the XML in a form that is useful for extracting data. The ‘XML’ library in R makes this easy:

> campaignDOM = xmlRoot(xmlTreeParse(campaignsXML))

This creates a DOM object called ‘campaignDOM’ that represents the contents of ‘campaignsXML’. To get data from this object, we’ll write our own function that iterates over the nodes in the DOM object and extracts the data. As a simple example, let’s say we wanted a vector of all the campaign names, perhaps to use later as labels on campaign statistics graphs. The function to do this looks like:

getCampaignNames <- function(doc) {
  namelist <- NULL
  for (i in 1:xmlSize(doc)) {
    node <- doc[[i]]
    namelist <- c(namelist, node$children[["content"]]$children$Campaign[["Name"]]$children$text$value)
  }
  # Return the accumulated vector of campaign names
  return(namelist)
}

There are two key parts to this function. First, the for loop iterates over all nodes in the DOM object – the function xmlSize(doc) from the ‘XML’ package returns an integer representing the number of nodes in the object. Then for each node, getCampaignNames extracts the value of the campaign name and adds it to the ‘namelist’ vector, which is returned when the function completes. The syntax for how to access nodes and children can be a little daunting, but remember, it’s just XML and all you’re really doing is walking the tree. One useful fact: the node functions in the ‘XML’ package are fairly forgiving, so in our example, even though there are several nodes in the DOM object that aren’t of type ‘content’, we don’t need to do any special checking for that condition. Nodes that don’t have ‘content’ will be silently ignored and thus the c(namelist, …) function call will not push them onto the ‘namelist’ vector.
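To make the tree walk a bit more concrete, here is a small sketch that runs against an inline document instead of a live API response. It assumes the ‘XML’ package is installed. The feed below is invented and much simpler than what Constant Contact returns; it only mimics the content/Campaign/Name nesting, plus one node (title) that has no ‘content’ child. The sketch also uses xmlValue() and an explicit NULL check, which is a slightly more defensive variant of the approach above rather than the same code.

```r
library(XML)

# Invented, simplified feed: two entries with campaign content and a
# title node that has no 'content' child.
feedXML <- paste0(
  "<feed>",
  "<title>Campaigns</title>",
  "<entry><content><Campaign id=\"1\"><Name>Spring Sale</Name></Campaign></content></entry>",
  "<entry><content><Campaign id=\"2\"><Name>Fall News</Name></Campaign></content></entry>",
  "</feed>")

# asText = TRUE tells xmlTreeParse that the argument is XML text, not a file name.
doc <- xmlRoot(xmlTreeParse(feedXML, asText = TRUE))

campaignNames <- character(0)
for (i in seq_len(xmlSize(doc))) {
  # Indexing a missing child returns NULL, so non-campaign nodes fall through.
  campaign <- doc[[i]][["content"]][["Campaign"]]
  if (!is.null(campaign)) {
    campaignNames <- c(campaignNames, xmlValue(campaign[["Name"]]))
  }
}

print(campaignNames)
```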

For convenience, you should place the function definition in a file called ‘getCampaignNames.R’ and load it as needed with:

> source("getCampaignNames.R")

Putting this all together, you can get the vector of campaign names by simply calling the function:

> namelist <- getCampaignNames(campaignDOM)

If we do this on my Constant Contact account, the ‘namelist’ vector will contain the following:

> namelist
 [1] "Created via API30"    "Created via API205"   "Created via API24"
 [4] "Created via API23"    "Created via API22"    "Created via API21"
 [7] "Created via API20"    "Created via API19"    "Created via API18"
[10] "Created via API17"    "Created via API16"    "Created via API15"
[13] "Created via API13"    "Created via API12"    "Created via API11"
[16] "Created via API10"    "Created via API9"     "Created via API8"
[19] "Created via API6"     "Created via API5"     "Created via API4"
[22] "Created via API3"     "Created via API2"     "Created via API"
[25] "BlockTest 20110425"   "Social Test 20110407" "Feb 16 2011"

Overall, R combined with the RCurl and XML packages makes for a powerful system to get data from REST APIs and then process the resulting XML. In the next installment in this series, we’ll look at actual campaign statistics and use R to do some basic campaign analysis.

Happy Model Building!

– Matt

Written by Matthew D. Laudato

June 16, 2012 at 5:15 pm

Using REST APIs from R


It’s hard to read a website, blog post or even mainstream business press article without coming across the term ‘big data’. Big Data is one of those terms that means nothing and everything all at once, and for that reason alone, you should pay attention to it. When it comes to the how of big data, it’s equally hard to avoid bumping into R, the open source statistics and computational environment. If you’re going to describe and model the behavior of your customers in a big data initiative, R is one tool that you need in your software toolbox.

Data is everywhere, and increasingly, data of all kinds on customer engagement is available through REST APIs. I started a little project in my spare time to bring data available via REST interfaces into R, to set the stage for doing what I expect to be some fairly sophisticated model building. The rest of this post is a quick introduction to how to work with REST APIs in R.

At its most basic, calling a REST API to obtain data involves making an HTTP GET request to a server. If the call succeeds, you’ll have a document that contains the requested data. In R, the best way to make these requests is by using RCurl. The RCurl package is – you guessed it – an R interface to curl. Once you’ve installed it into your R environment, getting data from REST APIs is pretty straightforward.

For my project, I started with the Constant Contact API, partly because I work for Constant Contact on the Web Services team, and partly because the API makes available exactly the kind of data that you typically want to analyse in a marketing big data project – specifically, sends, clicks, opens and the like for marketing campaigns. The current v1 API returns XML, so I also installed the XML package into my R environment (though I haven’t done much with it yet). To install the packages use the following command in R:

> install.packages(c('RCurl', 'XML'))

To load these packages, use:

> library('RCurl')

> library('XML')

Once these preliminary tasks are taken care of, there are just 2 steps required to get campaign data from the API and into R:

1. Obtain an access token. Constant Contact uses OAuth 2.0 for authentication, as do many other public REST APIs. There’s no good way to get a token from inside R, so I used the client flow with a little bit of JavaScript to get the token in my browser, and then just saved it for use in R. See here for details on how to get access tokens. If you’re building an app that analyses data from multiple Constant Contact accounts, you’ll need the owners of those accounts to grant access to your app in order for you to obtain access tokens. But for now, sign up for a trial and use your own account.

2. Call an HTTP endpoint using RCurl. This is very easy. For my initial test, I wanted to get the list of available email campaigns, so that I could later iterate over the list and get the campaign statistics for analysis. The call is:

campaignsXML = getURL("{username}/campaigns?access_token={token}")

{username} : replace with the name of the account for which you have the access token

{token} : replace with the actual access token granted to you by the account owner

This issues the HTTP GET request, and puts the resulting XML response into the R vector ‘campaignsXML’, ready to be processed further. That’s all there is to it.
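If you would rather not paste the username and token into the URL by hand, a small helper can assemble it for you. The sketch below uses only base R and makes no network call. The function name buildCampaignsUrl, the base URL (a placeholder on the reserved .invalid domain) and the sample values are all hypothetical, so substitute the real base URL from the API documentation.

```r
# Hypothetical helper: assembles the campaigns URL from its parts.
# URLencode(reserved = TRUE) escapes characters such as spaces and '+'
# that would otherwise break the URL or the query string.
buildCampaignsUrl <- function(baseUrl, username, token) {
  paste0(baseUrl, "/",
         utils::URLencode(username, reserved = TRUE),
         "/campaigns?access_token=",
         utils::URLencode(token, reserved = TRUE))
}

# Placeholder values for illustration only.
campaignsUrl <- buildCampaignsUrl("https://api.example.invalid/customers",
                                  "my user", "abc+123")
print(campaignsUrl)

# With RCurl loaded, the request itself would then be:
# campaignsXML = getURL(campaignsUrl)
```

Keeping the token in a variable rather than in the URL string also makes it easier to load it from a separate file that you don’t share.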

In my next post on this topic, I’ll show you how to parse the XML to get it into a more usable form using the R XML package.

Happy Model Building!

– Matt

Written by Matthew D. Laudato

June 11, 2012 at 1:57 am