Saturday, July 31, 2010

Cloud Computing Gotchas

I've been using Rackspace Cloud for testing some server builds and ESB solutions and recently ran into a "gotcha". First off, it looks like the machine may have been compromised... I HOPE it was an inside job by one of my developer "friends" who happened to know the userid/password. If not, that means a default install of Ubuntu 10.04, Apache Tomcat 6, Apache 2.2, and ServiceMix can be compromised in less than 3 days when left out on the internet.

In any event, that particular problem notwithstanding, I now have a different one: Rackspace has suspended my account, and I cannot access my server or create another one until Monday. Thank god I was only using this machine to test things; I can't imagine what I would have done if I had actually been depending on it to be running.

Another problem: I cannot find any reference on Rackspace's web site to an acceptable use policy. They suspended the account for outbound ssh activity, which is pretty silly considering any sane server admin uses ssh for EVERYTHING. I'm a little concerned because without ssh, I don't really have a secure option for connecting to any other server.

Worse yet, I cannot access my log files, server images, or any other information to try to discover what happened. For a company that claims "fanatical customer service", I'm a bit disappointed that I have to wait 48 hours to get information about a problem with legal implications. It seems like it would be pretty simple to let me see at LEAST my log files, as well as get some information about WHO thinks I'm hacking. As it stands, it sounds like all anyone needs to do is call Rackspace and complain, and they will disable an account.

Wish me luck on Monday, I'm really curious about what actually happened here.

Sunday, July 18, 2010

garage sales part two (geocoding and rendering)

Early Results

Here are some early results:

Port Huron, MI
Rockford, IL

These maps show the first page of garage sale listings on craigslist with about a 50% hit rate (meaning I can find an address only about half the time). That said, it's still pretty impressive, as manually entering these things into Google Maps is.... tedious. The process takes about 60 seconds per city using the script I've written.

Back to Geocoding

A quick note: geocoding is the process of attaching geographic coordinates to data. In my case I can find a "reasonable" address in about half of the entries, meaning there is a string somewhere that looks like an address: a street number followed by a couple of words.
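On craigslist that usually shows up in the parenthesized location tacked onto the title, and occasionally in the posting body. The awk patterns below test for exactly that; here is the title check by itself on the command line (the sample titles are invented):

# the same test the first awk pattern makes: a street number plus some words in parens
echo 'Yard Garage Estate Sale (603 N 3rd St)
HUGE MOVING SALE SATURDAY AND SUNDAY' | grep -E '\([0-9]+ .+\)'
# only the first line comes through: it has a street number and a few words in parens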

To get this data and geocode it, I wrote an awk script (gawk, really, since it uses gensub and the three-argument match):
# titles like "yard sale (123 Main St)": pull the address out of the parens
$1 ~ /\([0-9]+ .+\)/ {match($1,/([0-9]+ [^\)\(]+)\)/,out); printdata( "\""out[1]"\"")}
# otherwise look for a "number word word" pattern in the body text
$1 !~ /\([0-9]+ .+\)/ && $2 ~ /[0-9]+ [a-zA-Z]+ [a-zA-Z]+ /{match($2,/([0-9]+ [a-zA-Z]+ [a-zA-Z]+)/,out); printdata( "\""out[1]"\"")}

function printdata (mydata) {
    # url-encode the spaces and ask the google geocoder for coordinates,
    # piping the xml response through geocode.xsl to get "lat,lng"
    curl = "/usr/bin/curl -s \"http://maps.google.com/maps/api/geocode/xml?address=" gensub(" ","%20","g",substr(mydata,2,length(mydata)-2)) ",+Port+Huron,+MI&sensor=false\" | /usr/bin/xmlstarlet tr geocode.xsl"
    curl | getline loc
    markers = markers "|" loc
    close(curl)
    # re-emit the row with the coordinates appended (doubling embedded quotes for csv)
    print mydata",\""gensub("\"","\"\"","g",$2)"\","loc
}

The script uses a regular expression match to find rows in the data that look like addresses, replaces spaces with %20, sends a geocoding request to http://maps.google.com/maps/api/geocode/xml, runs the XML response through an XSLT (geocode.xsl) to extract the latitude,longitude pair, then re-outputs each row with the latitude and longitude tacked onto the end.
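I didn't paste geocode.xsl above, but it's tiny; a minimal sketch, which just needs to spit out "lat,lng" for the first result, looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- emit "latitude,longitude" for the first geocoding result -->
  <xsl:template match="/">
    <xsl:value-of select="/GeocodeResponse/result[1]/geometry/location/lat"/>
    <xsl:text>,</xsl:text>
    <xsl:value-of select="/GeocodeResponse/result[1]/geometry/location/lng"/>
  </xsl:template>
</xsl:stylesheet>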

What I'm left with in my output2.csv file is some data that looks like this:

name,description,latitude,longitude
"603 N 3RD ST, ST CLAIR","Yard Garage Estate Sale....July 8 Thursday to July 11 Sunday...9AM to 6PM....We have Antiques, Tools, Furniture, Tons of stuff for EVERYONE...We plan on having a BAG SALE on Sunday with whats left....But be there before then for best choices!!!!!",42.9741483,-82.4225020


As it turns out, Google has an API (the Maps Data API) that will take a file just like this and build a nice little map. I POST the csv to that API and out comes a pretty map.
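The only fiddly bit is authentication: the API wants a GoogleLogin auth token in the Authorization header, which you get from the ClientLogin endpoint. Something along these lines (from memory, and I believe the Maps Data API's ClientLogin service name is "local"):

# trade your google credentials for an auth token; the Auth= line that comes back
# is what goes after "GoogleLogin auth=" in the POST below
curl -s https://www.google.com/accounts/ClientLogin \
  -d Email=you@example.com -d Passwd=your-password \
  -d accountType=GOOGLE -d service=local -d source=garage-sale-mapper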
The final product:
Main shell script

#!/bin/bash
# clean out artifacts from the previous run
rm -f *.html
rm -f *.xml
rm -f output.csv
# grab the craigslist search results page
wget -q -O craig1.xml "http://porthuron.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"
# pull down every posting linked from the results page
hxwls craig1.xml | grep 'http.*\.html' | wget -i -
# tidy each posting into xml, rip out the fields, and strip the PostingID label
for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes "$f" | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; sleep 1s; done;
# geocode the rows, then POST the finished csv to the Maps Data API
cat output.csv | gawk -F \",\" -f go.awk > output2.csv
curl -X POST -H "GData-Version: 2.0" -H "Content-type: text/csv" -H "Authorization: GoogleLogin auth=<the auth token you get from google>" -H "Slug: port huron craigslist garage sales" --data-binary @output2.csv http://maps.google.com/maps/feeds/maps/default/full


Additional awk stuff

BEGIN { markers = ""; print "name,description,latitude,longitude"}
$1 ~ /\([0-9]+ .+\)/ {match($1,/([0-9]+ [^\)\(]+)\)/,out); printdata( "\""out[1]"\"")}
$1 !~ /\([0-9]+ .+\)/ && $2 ~ /[0-9]+ [a-zA-Z]+ [a-zA-Z]+ /{match($2,/([0-9]+ [a-zA-Z]+ [a-zA-Z]+)/,out); printdata( "\""out[1]"\"")}

function printdata (mydata) {
    curl = "/usr/bin/curl -s \"http://maps.google.com/maps/api/geocode/xml?address=" gensub(" ","%20","g",substr(mydata,2,length(mydata)-2)) ",+Port+Huron,+MI&sensor=false\" | /usr/bin/xmlstarlet tr geocode.xsl"
    curl | getline loc
    markers = markers "|" loc
    close(curl)
    print mydata",\""gensub("\"","\"\"","g",$2)"\","loc
}
#END {print markers}


Wrapup

One thing I'll note: while this certainly works, only a few people I've ever met will be able to figure it out. I'm probably better off switching to Ruby/Python/Groovy at some point, but I wanted to get something working first.

Part of the problem with those tools is that they don't just "work"... for example, to fetch the url via Groovy, I started with this code snippet:

// adapted from an HTTPBuilder example; the response handler would still need rewriting for craigslist's markup
def http = new groovyx.net.http.HTTPBuilder( 'http://rockford.craigslist.org' )
http.get( path: '/search/',
          query: [areaID:'223', subAreaID:'', query:'garage sale', catAbb:'sss'] ) { resp, xml ->
    xml.responseData.results.each {
        println " ${it.titleNoFormatting} : ${it.visibleUrl}"
    }
}


My first problem was that my classpath was somehow screwed up and I couldn't get this to compile. Beyond that, even once I do get it working, it's much more complicated than the command line equivalent:

wget -q -O craig1.xml "http://porthuron.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"


Why would I want to write all the gobbledy-gook at the top (which adds NO value) when the bottom version just works? If I get energetic, I'll probably port this to Groovy once the pipeline is working better, since I believe that will be more compatible with my ultimate platform (Android).

Saturday, July 17, 2010

Garage sale maps

The Backstory

My wife is an avid garage sailer. She finds garage sales she thinks have promise, then cruises by them to see if there is anything of interest. She is so accomplished at this that she routinely turns a profit by snagging things that folks didn't realize had resale value, and flipping them at local consignment shops. While this doesn't pay the bills, it DOES provide enough extra cash to actually have her garage sailing at least pay for itself with a little left over.

This is, however, not without its share of problems:

First, the garage sale postings online (or on paper) are scattered to the four winds. At this point, craigslist.org is the hands-down winner for quality and quantity of posts. Local newspapers/classifieds also have a good quantity, but the few dedicated sites I found on Google are suffering terribly from a strategic chicken-and-egg problem.

Second, while craigslist has a fairly high quality set of sales listed, it has no functionality to map them, so you're on your own. Our current process is to plow through each posting, look at the address, plug the address into Google Maps (or the GPS), then lather, rinse, repeat.

The Idea


Enter the programmer husband: I saw what she was doing and said "I can do better with some software". My ultimate vision is a location-aware Android application that can tell you the nearest 5 garage sales with "interesting" things, plus an online application to solve the obvious traveling salesman problem. Getting to this (noble? foolish?) goal involves solving a huge number of other problems, but without a good solution to the two problems above, everything else is secondary.

In order to solve the "chicken and egg" problem, I simply told myself "she's using craigslist as her primary source of information". Problem solved... yes, she's probably missing hundreds if not thousands of sources of garage sales, but since she had already been using craigslist as the primary source, I couldn't see any reason to change that. As a small caveat, since we did need to do some interpretation of the data from CL, it turns out I had to build an intermediate data format anyway. This means the input data is pluggable and we aren't necessarily bound to CL as the sole source of information.

For the "mapping" problem, my initial reaction was to use Google Maps. I already happened to understand the web APIs, and it's fairly easy to use. While there may be a hundred other tools that could do the job, I didn't really look for alternate solutions.

The Build, Part 1


From a solution design perspective, I am not a big fan of boiling the ocean. This type of solution is really only a good idea if you're a consultant who is being paid to come up with a brilliant idea... it's not so good for a schmuck who's trying to build something useful.

To that end, I reached into my toolbox and asked myself, "Python, Ruby, and Java all have html processing tools; which should I use?". My initial guess was Ruby: Hpricot and Rubyful Soup are both very capable screen-scraping/html-processing libraries and showed early promise. In practice, it took a few hours to get either of them working, and they were just a bit clunkier than I was looking for.

After an hour or so, I switched to Python. The advantage Python seemed to have was that the html/http stuff was built right in. I've used Python and Jython in the past for screen scraping mainframe systems and was kicking myself for not remembering that and making it my first choice. My enthusiasm waned quickly, though, when I realized the slightly odd way the CL pages were formatted made them difficult to process with the default HTML parser.

I then went to Java (actually Groovy) and made what I'll admit was an almost halfhearted attempt at the problem. By this point I was a bit disheartened because I had spent the better part of the day trying this stuff out, and every time I changed tools my IDE (Eclipse) would require a bunch of reconfiguration to get everything working properly. In addition, the package management and syntax of these tools were so radically different that I had to spend time googling and mentally changing gears to get started again.

At this point I took a break and didn't resume until the next weekend. The next weekend I took a totally different approach. Instead of relying on "programming" languages, I asked myself the question "what is the simplest possible way to get and extract this content?".

My answer: bash.

Enter Bash

When I sat back down the following week, I realized that in my search for utility packages in various languages I had been struggling with a couple of problems. First off, Ruby/Python/Groovy (RPG) are not really geared for text or html processing out of the box; they are more like swiss army knives where you grab Yet Another Plugin (YAP) to do what you want. In my head I heard the late-night infomercial huckster say "But wait, there's more..."

What I had been ignoring is that while developing the RPG solutions, I was actually using command line utilities to verify the code was working properly. DOH!

So, simply put, I took my existing wget command line and used it to fetch the html for the posts.

wget -q -O craig1.xml "http://rockford.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"


Then I walk the links with hxwls and pull them down. I know this could have been a one-step process, but I was using the intermediate files for troubleshooting.

hxwls craig1.xml | grep 'http.*\.html' | wget -i -

At that point I had a set of html files on my drive to extract content from. My first step was to turn them into xml, then use an xslt to convert them into flat files.
HTML Tidy seemed like a good choice for cleaning up the html:

tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes


Use xmlstarlet to do the xslt

xmlstarlet tr ripper.xsl


The ripper.xsl consisted of the following:


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/><xsl:template match="/">"<xsl:value-of select="normalize-space(//h2)"/>","<xsl:for-each select="//div[@id='userbody']"><xsl:value-of select='normalize-space(text())'/>","<xsl:value-of select="normalize-space(following::text())"/></xsl:for-each>"
</xsl:template></xsl:stylesheet>
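For a single posting, that template spits out one csv line along these lines (values invented for illustration):

"Yard Garage Estate Sale (603 N 3rd St)","Antiques, tools, furniture, tons of stuff for everyone","PostingID: 1234567890"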



A little bit of sed to extract the posting id

sed 's/PostingID: //'



Put it all together:


for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes "$f" | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; done;


At this point I was pretty satisfied with the resulting csv file. In a couple of hours I had a working solution that was simple (if a bit opaque). The output.csv had a posting id, a subject, and a bunch of textual data about each posting. It was ready to be imported into mysql, geocoded, or whatever else I needed to have happen.

The entire program was two files: the xsl above and a few lines of bash, all pipelined together:


#!/bin/bash
wget -q -O craig1.xml "http://rockford.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"
hxwls craig1.xml | grep 'http.*\.html' | wget -i -
for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes "$f" | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; done;


In my Next Post, geocoding this info and building a map.


Examples of useless garage sale sites that will never succeed