Technistas

Matthew D. Laudato writes about software and technology

Perl after Swine: Using Pig Streaming with Perl (Part 1)


Back in the days before Hadoop, before Big Data was a big thing, large data sets still existed, and practitioners still needed to wrestle useful information from them. And for a large number of those practitioners, Perl was the tool of choice.

Fast forward to today: Hadoop has emerged as the premier platform for Big Data storage and processing. Java, Python and Ruby dominate the programming landscape. So what of the humble Perl? You can argue that Perl has gone into maintenance mode – there’s not much of an active community out there making updates to Perl. Yet it is still installed by default on many Linux distributions, and is ‘always on’, waiting for you to do something useful with it. That’s what this series of posts is all about: how to use the Pig streaming technique with Perl to get the best out of both Hadoop and Perl. We’ll set up the basics today, and then go into the details in upcoming posts.

The basic idea is this: you have a large data set (say, a collection of HTML documents such as emails) that you want to analyze. The data has some structure to it, but you need to parse each document to extract the relevant bits, like how many text elements, words and images it contains. Pig alone isn’t much help here – what you need is the ability to ingest each document, process it, and return the results to Pig. Enter Perl. You can use Pig streaming to let Pig do the heavy lifting, pass each document through a Perl program to extract the detail data that you need, and send the results back through Pig for final processing and storage in HDFS.

The basic setup looks like this:

1. Get your Perl program ready for streaming. Typically that means writing a Perl script that reads from stdin and does something useful, and creating a jar file (yes, a jar file!) that contains your Perl script along with any Perl libraries you need that aren’t included with your Perl distribution. It also typically means creating a simple shell script that explodes the jar file and then calls the Perl script, as in the sketch below.
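
To make step 1 concrete, here is a minimal sketch of what such a streaming script can look like. The script name, the field layout (an id followed by the raw HTML body) and the regex-based counting are placeholder assumptions for illustration; the real HTML parsing is the subject of the next post.

#!/usr/bin/perl
# count_images.pl - hypothetical Pig streaming sketch.
# Pig writes each record to stdin as tab-delimited fields; anything we
# print to stdout goes back to Pig as tab-delimited fields.
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;

    # Assume field 1 is the email id and field 2 is the raw HTML body.
    my ($email_id, $html) = split /\t/, $line, 2;
    next unless defined $html;

    # Naive counts for illustration only; a real parser comes later.
    my $num_images        = () = $html =~ /<img\b/gi;
    my $num_text_elements = () = $html =~ /<(?:p|div|span)\b/gi;

    print join("\t", $email_id, $num_text_elements, $num_images), "\n";
}

The companion shell script usually does little more than run ‘jar xf ta_v2.jar’ to unpack your code on the Hadoop node and then invoke the Perl script, passing stdin and stdout straight through.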

2. Add a line to your Pig script to define how to execute the shell script. A simple example looks like this:

DEFINE taperl `count_images.sh` SHIP('ta_v2.jar');

There’s a lot going on in this little line. It defines an alias ‘taperl’ that you will use later in Pig to do the actual streaming. It tells Pig that the alias, when invoked, should run ‘count_images.sh’, the shell script you created. And finally, it instructs Pig to ship the jar file you created out to all the Hadoop nodes so that Pig can find your code.

3. Use the Pig STREAM mechanism to invoke the streaming job. In our example, something like this would work:

structure_stats = STREAM email_structure THROUGH taperl AS (email_id:long, create_on:long, subject_line:chararray, subject_line_length:long, num_text_elements:long, num_images:long);

This takes an existing alias in Pig called ‘email_structure’, streams it through ‘taperl’, the DEFINE’d alias we created earlier, and places the results into another Pig alias called ‘structure_stats’.

Intrigued? I thought you were. Next time we’ll take a look at the internals of the Perl program that does the actual HTML parsing.

Written by Matthew D. Laudato

May 12, 2015 at 4:30 pm

Posted in Big Data, Hadoop, Perl, Pig

Working with JSON REST APIs from R


My regular readers know that I have a passion for a couple of technical areas – among them, data, analysis and APIs. In my earlier three-part series on R programming and XML REST APIs (Part 1, Part 2, Part 3) I focused on obtaining email campaign data from the leading online marketing service, Constant Contact, through their XML-based REST APIs. Talk about combining my interests! The approach was pretty straightforward, but there is no question that the hardest part was working with the XML output from the REST APIs. As APIs have evolved, however, most vendors have switched over to using JSON as their main data format. Constant Contact recently released their v2 APIs, which now support JSON, so I took an hour to go back over the work I did on using REST APIs from R and rework it to use these new JSON APIs. Here are the basics.

The overall approach is the same as before – we want to obtain campaign data by calling a REST API, and then perform some basic manipulations in R to ready the data for more sophisticated analysis. The two significant differences between the v1 and v2 APIs are:

  1. The URL format and authentication parameters are different
  2. The results are returned in a JSON document

The impact of these differences was that I needed to change how I set up the RCurl calls to the APIs, and change how I parsed the results to get the campaign data into an R data frame object. The main driver script that gets the JSON data looks like this:


library('RCurl')
library('rjson')
source("OAuthAccessToken.R")
source("getCampaignDataframeFromJSON.R")
campaignJSON = getURL(url = paste("https://api.constantcontact.com/v2/emailmarketing/campaigns?access_token=", access_token, "&api_key=", api_key, sep=""))
campaign.dataframe <- getCampaignDataframeFromJSON(campaignJSON)

A couple of comments. First, I use the ‘rjson’ library to assist in JSON parsing. Also, since I plan to access my account fairly often, I wrote a simple R script called OAuthAccessToken.R that stores values for ‘api_key’ and ‘access_token’, and I just source it to make those values available. Next, the URL format is a bit different from the v1 APIs, but nothing earth shattering. (See the Constant Contact v2 API documentation for details – the new developer portal is very cool!) Finally, I wrote a new R function called getCampaignDataframeFromJSON that accepts a JSON document (as returned from RCurl) and transforms it into a data frame with one row for each campaign found. Here is the source code for the new function:


getCampaignDataframeFromJSON <- function(campaignJSON) {
  namelist <- NULL
  urllist <- NULL
  statuslist <- NULL
  datelist <- NULL
  JSONList <- fromJSON(campaignJSON)
  results <- JSONList$results
  # Build one column vector per field, with one entry per campaign
  for (i in 1:length(results)) {
    namelist <- c(namelist, results[i][[1]]$name)
    urllist <- c(urllist, paste("https://api.constantcontact.com/v2/emailmarketing/campaigns/", results[i][[1]]$id, sep="", collapse=NULL))
    statuslist <- c(statuslist, results[i][[1]]$status)
    datelist <- c(datelist, results[i][[1]]$modified_date)
  }
  campaignDF = data.frame(name=namelist, url=urllist, status=statuslist, date=datelist, stringsAsFactors=FALSE)
  return(campaignDF)
}

The basic algorithm is the same as with the XML APIs, but the code is a whole lot simpler with JSON! The resulting data frame looks something like this (fake data to protect my accounts of course!):


  name                                         url                                                                     status  date
1 Email Created 2013/02/18, 9:53 AM            https://api.constantcontact.com/v2/emailmarketing/campaigns/{someid1}  SENT    2013-02-25T16:45:41.191Z
2 Email Created 2012/12/20, 1:01 PM            https://api.constantcontact.com/v2/emailmarketing/campaigns/{someid2}  SENT    2012-12-20T18:15:06.565Z
3 APR 15 2011 Attend Help The Humane Society   https://api.constantcontact.com/v2/emailmarketing/campaigns/{someid3}  DRAFT   2011-03-03T15:22:16.407Z
4 About Me Newsletter                          https://api.constantcontact.com/v2/emailmarketing/campaigns/{someid4}  DRAFT   2012-12-04T18:51:02.667Z

Once you have the data frame available, it’s pretty simple to iterate over each campaign and get the detail data like sends, opens and clicks – more on that in an upcoming post!

Written by Matthew D. Laudato

April 3, 2013 at 5:03 pm

Using REST APIs from R – Campaign statistics


In my previous posts in this series, we looked at how to call REST APIs from R. Now let’s get serious and get real email campaign data from the Constant Contact APIs. To do this, we’ll write a new function to do more sophisticated parsing of the XML returned from the Constant Contact campaigns resource. The algorithm looks like this:

  • GET the campaigns collection, which provides high level XML data on all campaigns
  • Transform the raw XML into a DOM object
  • Iterate over each campaign in the DOM and make a second call to the APIs to get the detail data for that campaign, and assemble an R dataframe object that holds the data for all campaigns

To get started, here’s a simple R script that we’ll use to set up the call to our new function:

library('RCurl')
library('XML')
source("getCampaignDataframe.R")
campaignXML = getURL(url = "https://api.constantcontact.com/ws/customers/{username}/campaigns?access_token=your_token_goes_here")
campaignDOM = xmlRoot(xmlTreeParse(campaignXML))
campaignStats = getCampaignDataframe(campaignDOM)

You should review the previous two posts in this series if this script doesn’t make sense to you.

The heavy lifting is done by our new function, getCampaignDataframe. The code is pretty straightforward, and amounts to parsing the DOM object to extract per-campaign data:

getCampaignDataframe <- function(doc) {
 namelist <- NULL
 urllist <- NULL
 sends <- NULL
 opens <- NULL
 for (i in 1:xmlSize(doc)) {
    node <- doc[[i]]
    namelist <- c(namelist, node$children[["content"]]$children$Campaign[["Name"]]$children$text$value)
    url = node$children[["content"]]$children$Campaign$attributes[["id"]]
    url = sub ("http","https",url)
    urllist <- c(urllist, url)
    if (length(url) > 0) {
        url = paste(url,"?access_token=your_token_goes_here",sep="",collapse=NULL)
       campaignDetailXML = getURL(url = url)
       campaignDetailDOM = xmlRoot(xmlTreeParse(campaignDetailXML))
       sends <- strtoi(c(sends, campaignDetailDOM[["content"]]$children$Campaign[["Sent"]]$children$text$value))
       opens <- strtoi(c(opens, campaignDetailDOM[["content"]]$children$Campaign[["Opens"]]$children$text$value))
    }
 }
 campaignstats = data.frame(name=namelist,url=urllist,sends=sends,opens=opens,stringsAsFactors=FALSE)
 campaignstats
}

The function takes a single parameter ‘doc’, which is a DOM object containing the list of campaigns – including the URLs to get to the campaign detail. The for() loop simply iterates over the DOM and extracts the campaign name and URL. It then issues a GET request to the URL to obtain the sends and opens for that campaign. Once this is done, the data is assembled into the data frame and returned. Here is an example of the dataframe contents when I run the script against one of my Constant Contact accounts (with specific identifying data for username and campaign removed):

  name                                url                                                                              sends  opens
1 Email Created 2012/11/10, 9:03 AM   https://api.constantcontact.com/ws/customers/{username}/campaigns/{campaignid}  3      1
2 Created via API as UTF-8 v6         https://api.constantcontact.com/ws/customers/{username}/campaigns/{campaignid}  3      1
3 Created via API as UTF-8 v5         https://api.constantcontact.com/ws/customers/{username}/campaigns/{campaignid}  3      2

Given this raw data, now in a useful form in a dataframe, we can use R to help us calculate the open rate for emails. In R, this is as simple as:

openRate <- campaignStats$opens/campaignStats$sends
campaignStats$openRate <- openRate

This adds a new column called openRate to the dataframe that contains the calculated open rates for each email campaign. Admittedly this is a simple example, but at this point, you have all the tools you need to pull data into R from a REST API, and do some basic manipulations on it.

Happy Modeling!

– Matt

Written by Matthew D. Laudato

March 13, 2013 at 4:30 pm

Using REST APIs from R – XML operations


In the first part of this series, I showed you how to make calls to REST APIs from R. In this part, we’ll look at how to work with the XML documents that the REST APIs return.  I’ll stick with the Constant Contact v1 APIs, since I’m most familiar with those and since the data (campaign statistics) is appropriate for analysis in R.

Once you have the raw XML from a REST API in an R variable, you need to parse it in order to extract the data that you’re interested in. To do this, we use the ‘XML’ package in R. First, load the package with:

> library('XML')

If we start with the ‘campaignsXML’  vector from the previous post, we can easily create a DOM object that contains the XML in a form that is useful for extracting data. The ‘XML’ library in R makes this easy:

> campaignDOM = xmlRoot(xmlTreeParse(campaignsXML))

This creates a DOM object called ‘campaignDOM’ that represents the contents of campaignsXML. To get data from this object, we’ll write our own function that iterates over the nodes in the DOM object and extracts the data. As a simple example, let’s say we wanted a vector of all the campaign names, perhaps to use later as labels on campaign statistics graphs. The function to do this looks like:

getCampaignNames <- function(doc) {
 namelist <- NULL
 for (i in 1:xmlSize(doc)) {
    node <- doc[[i]]
    namelist <- c(namelist, node$children[["content"]]$children$Campaign[["Name"]]$children$text$value)
 }
 namelist
}

There are two key parts to this function. First, the for loop iterates over all nodes in the DOM object – the function xmlSize(doc) from the ‘XML’ package returns an integer representing the number of nodes in the object. Then for each node, getCampaignNames extracts the value of the campaign name and adds it to the ‘namelist’ vector, which is returned when the function completes. The syntax for how to access nodes and children can be a little daunting, but remember, it’s just XML and all you’re really doing is walking the tree. One useful fact: the node functions in the ‘XML’ package are fairly forgiving, so in our example, even though there are several nodes in the DOM object that aren’t of type ‘content’, we don’t need to do any special checking for that condition. Nodes that don’t have ‘content’ will be silently ignored and thus the c(namelist, …) function call will not push them onto the ‘namelist’ vector.

For convenience, you should place the function definition in a file called ‘getCampaignNames.R’ and load it as needed with:

> source("getCampaignNames.R")

Putting this all together, you can get the vector of campaign names by simply calling the function:

> namelist <- getCampaignNames(campaignDOM)

If we do this on my Constant Contact account, the ‘namelist’ vector will contain the following:

> namelist
[1] "Created via API30"    "Created via API205"   "Created via API24"
[4] "Created via API23"    "Created via API22"    "Created via API21"
[7] "Created via API20"    "Created via API19"    "Created via API18"
[10] "Created via API17"    "Created via API16"    "Created via API15"
[13] "Created via API13"    "Created via API12"    "Created via API11"
[16] "Created via API10"    "Created via API9"     "Created via API8"
[19] "Created via API6"     "Created via API5"     "Created via API4"
[22] "Created via API3"     "Created via API2"     "Created via API"
[25] "BlockTest 20110425"   "Social Test 20110407" "Feb 16 2011"

Overall, R combined with the RCurl and XML packages makes for a powerful system to get data from REST APIs and then process the resulting XML. In the next installment in this series, we’ll look at actual campaign statistics and use R to do some basic campaign analysis.

Happy Model Building!

– Matt

Written by Matthew D. Laudato

June 16, 2012 at 5:15 pm

Using REST APIs from R


It’s hard to read a website, blog post or even mainstream business press article without coming across the term ‘big data’. Big Data is one of those terms that means nothing and everything all at once, and for that reason alone, you should pay attention to it. When it comes to the how of big data, it’s equally hard to avoid bumping into R, the open source statistics and computational environment. If you’re going to describe and model the behavior of your customers in a big data initiative, R is one tool that you need in your software toolbox.

Data is everywhere, and increasingly, data of all kinds on customer engagement is available through REST APIs. I started a little project in my spare time to bring data available via REST interfaces into R, to set the stage for doing what I expect to be some fairly sophisticated model building. The rest of this post is a quick introduction to how to work with REST APIs in R.

At its most basic, calling a REST API to obtain data involves making an HTTP GET request to a server. If the call succeeds, you’ll have a document that contains the requested data. In R, the best way to make these requests is by using RCurl. The RCurl package is – you guessed it – an R interface to curl. Once you’ve installed it into your R environment, getting data from REST APIs is pretty straightforward.

For my project, I started with the Constant Contact API, partly because I work for Constant Contact on the Web Services team, and partly because the API makes available exactly the kind of data that you typically want to analyse in a marketing big data project – specifically, sends, clicks, opens and the like for marketing campaigns. The current v1 API returns XML, so I also installed the XML package into my R environment (though I haven’t done much with it yet). To install the packages, use the following command in R:

> install.packages(c('RCurl', 'XML'))

To load these packages, use:

> library('RCurl')

> library('XML')

Once these preliminary tasks are taken care of, there are just 2 steps required to get campaign data from the API and into R:

1. Obtain an access token. Constant Contact uses OAuth 2.0 for authentication, as do many other public REST APIs. There’s no good way to get a token from inside R, so I used the client flow with a little bit of JavaScript to get the token in my browser, and then just saved it for use in R. See the Constant Contact developer documentation for details on how to get access tokens. If you’re building an app that analyses data from multiple Constant Contact accounts, you’ll need the owners of those accounts to grant access to your app in order for you to obtain access tokens. But for now, sign up for a trial and use your own account.

2. Call an HTTP endpoint using RCurl. This is very easy. For my initial test, I wanted to get the list of available email campaigns, so that I could later iterate over the list and get the campaign statistics for analysis. The call is:

campaignsXML = getURL("https://api.constantcontact.com/ws/customers/{username}/campaigns?access_token={token}")

{username} : replace with the name of the account for which you have the access token

{token} : replace with the actual access token granted to you by the account owner

This issues the HTTP GET request, and puts the resulting XML response into the R vector ‘campaignsXML’, ready to be processed further. That’s all there is to it.

In my next post on this topic, I’ll show you how to parse the XML to get it into a more usable form using the R XML package.

Happy Model Building!

– Matt

Written by Matthew D. Laudato

June 11, 2012 at 1:57 am

Building an OAuth enabled website using Java


For me, it’s always been about the APIs. I write them. I use them. I have built and continue to grow my career around them. So as it became clear over the past few years that OAuth has become the de facto standard for API authentication, I took the plunge and figured out how to do practical things with OAuth-based APIs. Since Java is my preferred language, and since I’m now the Product Manager at Constant Contact for the Platform and Partner Integrations, the natural path for my latest project was to build a website using Java that lets users authenticate to Constant Contact via OAuth.

That said, here are the basics of the application and how it works. The technology stack looks like this:

Web UI: HTML, CSS, JavaScript
Application Services: Java Servlets, JavaScript (Ajax), Scribe-Java (OAuth Java library), REST calls to the Constant Contact API
Database (for storing OAuth tokens and secrets): Hibernate, MS-SQL Server

These are all tools that should be in any programmer’s toolbox, so the application also serves as a good training app and sample code if you’re trying to get familiar with this stack. The code is available on GitHub (you do have a GitHub account, don’t you?). Just fork my repo and you should be ready to go.

Since the main point of this article is building an OAuth-based website, let’s get right into it. If you want your users to display their data on your website, you as the website owner must let your users grant access to their data. In my case, the data is coming from Constant Contact, so I implemented what is known as the ‘web flow’ for authentication. The flow uses a series of callbacks and goes roughly like this:

– Get a request token from Constant Contact
– Redirect the user to Constant Contact’s OAuth endpoint and ask the end user to authenticate and grant access to their data
– Exchange the authenticated request token for a valid access token and secret and have OAuth redirect the user back to your site.

Now, most of us are not hard-core security programmers, and trust me, if you (like me) prefer to focus on the functionality of your app and not on plumbing, you should use an OAuth library. In my case I chose scribe-java, an excellent open source library for Java OAuth programmers. I submitted the Constant Contact API to the project so that the rest of you can easily authenticate to Constant Contact. You can get scribe-java from GitHub.

I handle the callbacks through two Java servlets. AuthServlet.java makes the initial request for an OAuth Request token, and if successful, redirects to Constant Contact to let the user authenticate and grant access. The basic code looks like this:


OAuthService service = new ServiceBuilder()
    .provider(ConstantContactApi.class)
    .callback("http://localhost:8080/CTCTWeb/OAuthCallbackServlet.do")
    .apiKey(apiKeyProperties.getProperty("apiKey"))
    .apiSecret(apiKeyProperties.getProperty("apiSecret"))
    .build();
httpsession.setAttribute("oauth.service", service);

Token requestToken = service.getRequestToken();
httpsession.setAttribute("oauth.request_token", requestToken);

String confirmAccessURL = service.getAuthorizationUrl(requestToken);

System.out.println(confirmAccessURL);
try {
    res.sendRedirect(res.encodeRedirectURL(confirmAccessURL));
} catch (Exception e) {
    System.out.println(e.getMessage());
}

 

The simplicity of the code is due to scribe-java. A couple of things to notice. First, we tell scribe-java what the callback URL is when we create the service object. Second, we store the request token in the httpsession object. This isn’t strictly necessary (we could just store the request token secret, and then reconstruct the request token in the callback), but it is convenient. The reason for this has to do with the scribe-java Token class – it isn’t really a token per se, it’s a token-secret pair. I suspect that the author of the library did it this way because token-secret pairs are ubiquitous in OAuth. In any case, when the callback is invoked, it winds up in OAuthCallbackServlet.java. The relevant code looks like this:


String oauth_token = req.getParameter("oauth_token");
String oauth_verifier = req.getParameter("oauth_verifier");
String username = req.getParameter("username");

if (oauth_verifier.length() > 0) {
    Verifier verifier = new Verifier(oauth_verifier);

    HttpSession httpsession = req.getSession(true);
    OAuthService service = (OAuthService) httpsession.getAttribute("oauth.service");
    Token requestToken = (Token) httpsession.getAttribute("oauth.request_token");
    httpsession.setAttribute("username", username);

    Token accessToken = service.getAccessToken(requestToken, verifier);
    httpsession.setAttribute("oauth.access_token", accessToken);

    Long accessTokenId = null;
    AccessToken at = new AccessToken();
    at.setLoginName(username);
    at.setAccessToken(accessToken.getToken());
    at.setSecret(accessToken.getSecret());
    Date dt = new Date();
    Timestamp ts = new Timestamp(dt.getTime());
    at.setModifiedDate(ts);

    Session session = HibernateUtil.getSessionFactory().openSession();
    Transaction transaction = null;
    try {
        transaction = session.beginTransaction();
        accessTokenId = (Long) session.save(at);
        transaction.commit();
    } catch (HibernateException e) {
        transaction.rollback();
        e.printStackTrace();
    } finally {
        session.close();
    }
}

Again, refer to the full source code on GitHub. There are several things going on here. First, the entry point to the callback is a servlet URL, which the Constant Contact OAuth implementation has called with three parameters. The oauth_token is the same request token that you generated earlier (but just the token, not the secret). Because I stored the request-token/secret pair earlier, I just ignore this parameter (but see the earlier comment – you could store the secret and then reconstruct the scribe-java Token object). The verifier is a string that the OAuth server creates when the user authenticates and then grants access. The magic is in the call to service.getAccessToken, where you trade in the (now verified) request token for a valid access token. Once you have the token, you need to do something with it so that the next time your user visits the website they can access their Constant Contact data. In my case, I opted to store the username, access token and token secret in a SQL database. The schema for the database is pretty simple, and Hibernate makes it fairly trivial to store and retrieve the token and secret.

That’s about it for now. Happy coding!

– Matt

Written by Matthew D. Laudato

June 28, 2011 at 9:46 pm

Posted in Java, OAuth


A Framework for Evaluating Continuous Integration Tools


For those of you interested in the methodology behind my ongoing series on CI tools, you should check out a new article that I wrote for CMCrossroads. In it I provide a framework for evaluating CI tools, and give you a checklist and ranking system to help you organize and rate your evaluation.

The 2nd installment of the hands-on tool evaluation is a few days away – stay tuned for how the four tools (Hudson, Mojo, Bamboo and TeamCity) fare on providing access to common development tools and on enabling you to assemble complex build workflows.

Happy Building!

– Matt

Written by Matthew D. Laudato

June 16, 2010 at 6:36 pm

Comparing Continuous Integration Tools, Part 1


One of the more enjoyable parts of my job at OpenMake Software is getting to examine and analyze the various build tools on the market. This is partly to see what the competition is up to, and partly to make sure that I can effectively communicate the technical bits with our customers, many of whom have multiple build tools in their environments.

To that end, I recently embarked on a continuous integration tool evaluation. I chose to look at Hudson, an open source tool commercially supported by Sun Microsystems; TeamCity, a commercial tool from JetBrains; Bamboo, a commercial tool from Atlassian; and Mojo, a freeware and commercially supported tool from OpenMake Software. My goal was to compare the tools along several vectors:

  • Installation
  • Configuration
  • Running a simple job
  • Viewing logs
  • Interacting with source control
  • Performing complex distributed build workflows

I decided to break the effort into two parts. The first part, covered in this post, is the ‘getting my feet wet’ portion of the evaluation. I tackled the first four bullets above to get a sense of how the tools were installed and configured, and to see if I could get them each to do something useful. The useful thing was to run a job that spits out the current environment, the equivalent of running the ‘set’ command from a DOS prompt in Windows.

The table below summarizes my findings, and below that, I give some general impressions about the tools and the evaluation process.

Products compared: Mojo 7.31, Bamboo 2.5.5, Hudson 1.332, TeamCity 5.1.1

Installation method: Mojo, Bamboo and TeamCity use a Windows installer; Hudson ships as an executable war file.

Download size: Mojo 50M; Bamboo 84M; Hudson 27M; TeamCity 268M.

License: Mojo, Hudson and TeamCity offer a free, unlimited single-server license; Bamboo offers a 30-day trial.

Installation notes

  • Mojo: Does not ask for the default port as part of the install; that is configured once you have started the client. The server starts as part of installation and gets installed as a Windows service. A Start Menu group and icons are installed for access to the thick and web clients.
  • Bamboo: Asks for the default port as part of the install. Does not start the server as part of the install, and when you do start the server, it does not recognize your port choice.
  • Hudson: No issues, though it is hard to figure out how to change the default ports.
  • TeamCity: No issues. Asks for the default port in the install wizard, starts the server and build agent as Windows services as part of the install, and then runs the web interface.

Initial setup

  • Mojo: None. If you like the defaults, you can create a workflow immediately through the thick client.
  • Bamboo: Asks you to ‘Create a Plan’ as the first activity. I did not like this, as it forces me to digest their meaning of the generic word ‘Plan’.
  • Hudson: None. If you like the defaults, you can create a workflow immediately.
  • TeamCity: Wants you to create projects and build configurations, but does not define exactly what these are.

Configuring a simple job (ENVPEEK – prints build server environment variables to the build log)

  • Mojo: Easy. Create a workflow, add a ‘Mojo | Execute shell command’ activity, and type in the command (‘set’).
  • Bamboo: Difficult. In order to ‘Create a Plan’ you need to go through an 8-step wizard. The second wizard screen requires you to select an SCM system and a repository location. I had to give it a repository location from my Subversion server to get past this screen. Annoying, since for this job I don’t care about SCM. The rest of the wizard was OK, but way too many steps just to set up a simple job.
  • Hudson: Easy. Create a new build job, use the ‘Execute Windows batch command’ option, and type in the command (‘set’).
  • TeamCity: Moderate. TeamCity asks you to create a project, which is pretty easy. You then have to create at least one Build Configuration. There is a web-based wizard that, like Bamboo’s, has an SCM screen, but you can choose to ignore it. You can then choose a command line Build Runner, in which you specify the ‘set’ command.

Running a simple job

  • Mojo: Easy. Open the workflow, either in the thick client or in the web interface, and press the run button. Runs successfully.
  • Bamboo: Moderate. From the Bamboo home, select the Plan, then select ‘Run Build’ from the Plan Actions menu on the right. Because of the SCM choice, even jobs that don’t require SCM will check out from Subversion. The tool is geared for building code projects – it does not appear to be a general workflow tool.
  • Hudson: Easy. Select the job and click the ‘Schedule a build’ button.
  • TeamCity: Easy. From the Projects tab, find the project that you want to run and click the Run… button.

Viewing job logs

  • Mojo: Easy. In the thick client, open the workflow, go to the History/Trends tab, then select the run you want to see and double-click. In the web interface, select the workflow, submit a query to retrieve the run information, and select the specific run you want to view.
  • Bamboo: Moderate. You have to click on the plan, then the Completed Builds tab, then the build you want, then its Logs tab. Lots of drilling down required.
  • Hudson: Easy. Click on the job name and then select any link from the Build History list.
  • TeamCity: Easy. From the Projects tab, click on the project that you want to view, then select the link for the run you want to view.

Overall, Hudson and Mojo were the easiest tools to install and use. Hudson definitely takes the cake when it comes to installation, since you don’t have to install it – you just run the executable war file from the command line. Mojo, TeamCity and Bamboo have more traditional installers, of which the Mojo install was the most straightforward, asking the fewest questions before proceeding with the install. Atlassian’s Bamboo has the most restrictive trial license, but Mojo, Hudson and TeamCity all have a more open approach – you can use them in very useful forms without any cost or special licensing.

Once the tools were installed, I next looked to do any initial configuration, which I define loosely as ‘stuff the tool requires me to do before it lets me do what I really want to do’. On this measure, I again put Mojo and Hudson in the lead, as I didn’t have to do anything – I just went straight to thinking about the job I wanted to run. TeamCity wanted me to create a project and a build configuration, which was fairly easy – but I had to figure out what they meant by ‘project’ and ‘build configuration’. Bamboo was by far the most difficult tool to configure. Any time I see an 8-step wizard just to turn the engine over and get the motor running, my initial response is ‘who wrote this thing’?

Getting the actual job configured was again easy in Mojo and Hudson. The Mojo interface is very straightforward – you select a machine to run on, and then start adding workflow steps (called activities). There is a large built-in list of activities (around 50) for interacting with commercial and open source tools. I used the ‘Execute shell script’ activity type to run the set command, and that constituted the entirety of my ‘ENVPEEK’ job. Hudson was also easy to set up. TeamCity and Bamboo were the most painful to set up for actual jobs – you are forced into their concepts, instead of just being able to think about the job at hand. The other comment on both TeamCity and Bamboo is that they are both very ‘source code biased’. By that I mean that they have an implicit assumption that your jobs require interaction with source control. In both tools I was required to specify a location in a source control tool (I used Subversion from Collabnet). Since my initial job was a codeless one, this was annoying.

Running jobs in all tools is fairly easy, as is reviewing the logs – though in Bamboo I did have to drill down quite a bit to get to my logs. Going back to my ‘source control bias’ comment, Bamboo needed to check out code from a repository location that I specified – and then ignored it, since my initial job was just to run ‘set’.

Next installment: doing actual code builds with each of the tools, and then putting together complex build processes.

Happy Building!

– Matt

Written by Matthew D. Laudato

June 7, 2010 at 4:09 pm

The Build Engineer’s Desktop


Programming environments have come a long way from when I started in this business. I can recall loading programs from cassette tape into my Timex Sinclair computer in high school, and fumbling with the VAX editor in college. By the early 80’s, I found myself in graduate school with a mix of new (a MicroVAX) and old (a military surplus Raytheon 700, on which debugging amounted to reading hex codes from lights on the front panel and literally pressing the ‘step’ switch to move through the program).

Fast forward through the 90’s and into the new century, and things have changed quite a bit. Java programmers have Eclipse and other rich programming environments. If you work with the Microsoft technologies, Visual Studio has given you an increasingly powerful and convenient desktop over the past 15 years. Even database engineers have integrated environments where they can program and manage their database deployments. It seems that no matter what your role in the software business, there is a desktop tool for you. Which brings me to the topic of today’s post – the build engineer’s desktop.

It seems to me that the build engineer has drawn the short straw from software vendors. From this engineer we expect solutions to hard problems – complex compile and link sequences, deployments to test, staging and production environments, and a great deal of programming to make it all happen. But as a build engineer, your tool set is limited. You are expected to code, debug and deploy using a plain text editor, and cobble your scripts together ad hoc, with no centralized platform or desktop environment to act as your command center.

Enter OpenMake Meister. If you’re a build engineer, sitting down at the Meister client is like stepping into the cockpit of a 747. In one powerful desktop environment, you can assemble complex compile, link and archive services, manage deployments, do dependency analysis, create distributed workflows, write reusable scripts, and fully control the build, test and deploy services that your company demands of you.

I won’t go into all the details here, but build engineers, here’s a tip for you: stop scripting and start managing your build process. Take a look at OpenMake Meister, the build engineer’s desktop.

Happy Building!

– Matt

Written by Matthew D. Laudato

March 22, 2010 at 2:00 pm

Continuous Build Automation with Subversion and Meister


I recently got a chance to work on a project using Collabnet Subversion and OpenMake Meister and put together a short demo on how to get the two tools to work together doing continuous integration. You can view it at http://www.openmakesoftware.com/flashdemo/Meister-SVN/omsvn_small/omsvn_small.html

Meister, like most CI tools, has several ways to kick off a CI build. You can do a scheduled build, or you can poll the SCM system. The third way of doing a CI build is to call the build from a Subversion hook. In the demo I show two of these methods: a scheduled build in Meister, and calling Meister from the Subversion post-commit hook.

The setup is pretty simple. I have a repository in Subversion that has working copies for developers, and what I’ll call a ‘hands off’ working copy that only the build process uses (meaning no developers are ever in that copy making changes; it receives changes strictly through an ‘svn update’ command run by the CI process). In Meister, I have a workflow that knows how to build a small DOS application from some code in the repository.

In the demo, I first show Meister running a build on a schedule. Meister updates the ‘hands off’ working copy and then compiles and links the code. In the second case, I turn off the scheduler and instead activate the post-commit hook in the Subversion repository. The hook code calls the Meister command line, which looks like this:

 

java -cp c:\openmake-meister\client\bin\omcmdline.jar com.openmake.cmdline.Main
-BUILD "WINDOWS BUILD WITH SVN"

 

The same workflow runs in both cases. The advantage of running from the hook is that you are always guaranteed that every transaction in Subversion gets built. On the other hand, setting a scheduler to run every hour is easy and might be more appropriate for shops with less frequent code changes. In both cases Meister is driving the build with its dependency analysis engine, so the builds are fast and highly parallelized.

Overall it was pretty easy both to get the Subversion repository configured, and to get the Meister workflow up and running. The Meister command line lets you do things like set environment variables (not shown above), so you can control the workflow at a fine level of detail.

Happy Building!
– Matt

Written by Matthew D. Laudato

January 22, 2010 at 7:59 pm