Saturday, July 17, 2010

Garage sale maps

The Backstory

My wife is an avid garage sailer. She finds garage sales she thinks have promise, then cruises by them to see if there is anything of interest. She is so accomplished at this that she routinely turns a profit by snagging things that folks didn't realize had resale value, and flipping them at local consignment shops. While this doesn't pay the bills, it DOES provide enough extra cash that her garage sailing pays for itself with a little left over.

This is, however, not without its share of problems:

First, the postings online (or on paper) for garage sales are scattered to the four winds. At this point, craigslist is the hands-down winner for quality and quantity of posts. Local newspapers/classifieds also have a good quantity, but the few sites I found on google geared toward this are suffering terribly from a strategic chicken and egg problem.

Second, while craigslist has a fairly high quality set of sales established, it has no functionality to map these things, so you're on your own. Our current process is to plug through each one, look at the address, plug the address into google maps (or the gps), then lather, rinse, repeat.

The Idea

Enter the programmer husband. I saw what she was doing and said, "I can do better with some software." My ultimate vision was to have a location-aware android application that could tell one the nearest 5 garage sales that have "interesting" things, as well as an online application to solve the obvious traveling salesman problem. While getting to this (noble? foolish?) goal involves solving a huge number of other problems, without a good solution to the first two mentioned above, everything else is secondary.

In order to solve the "chicken and egg" problem, I simply told myself "she's using craigslist as her primary source of information". Problem solved... yes, she's probably missing hundreds if not thousands of sources of garage sales, but since she had already been using craigslist as the primary source, I couldn't see any reason to change that. As a small caveat, since we did need to do some interpretation of the data from CL, it turns out I had to build an intermediate data format anyway. This means the input data is pluggable and we aren't necessarily bound to CL as the sole source of information.

For the "mapping" problem, my initial reaction was to use google maps. I already happened to understand the web apis, and they're fairly easy to use. While there may be a hundred other tools that might do the job, I didn't even really look for an alternative solution.

The Build, Part 1

From a solution design perspective, I am not a big fan of boiling the ocean. This type of solution is really only a good idea if you're a consultant who is being paid to come up with a brilliant idea... it's not so good for a schmuck who's trying to build something useful.

To that end, I reached into my toolbox and asked myself "I know python, ruby, and java all have html processing tools, which should I use?". My initial guess was to use ruby: hpricot and rubyful soup are both very capable screen scraping/html processing tools and showed early promise. In practice, it took a few hours with both to get them working, and they were just a bit clunkier than I was looking for.

After an hour or so, I switched to Python. The advantage python seemed to have was that the html/http stuff was built right in. I've used python and jython in the past for screen scraping mainframe systems and was actually kicking myself for not remembering this and using it as my first choice. My initial enthusiasm began to wane quickly as I realized the slightly odd way the CL pages were formatted made them difficult to process using the default HTML parser.

I then went to java (actually groovy) and made what I would admit is an almost halfhearted attempt at the problem. By this point I was a bit disheartened because I had spent the better part of the day trying this stuff out, and every time I changed tools my IDE (eclipse) would require a bunch of reconfiguration to get everything working properly. In addition, the package management and syntax of all these things were so radically different that I had to spend time googling and mentally changing gears to get started again.

At this point I took a break and didn't resume until the next weekend. The next weekend I took a totally different approach. Instead of relying on "programming" languages, I asked myself the question "what is the simplest possible way to get and extract this content?".

My answer: bash.

Enter Bash

When I sat back down the following week I realized that in my search for utility packages in various languages I had been struggling with a couple of problems. First off, the software in Ruby/Python/Groovy (RPG) was not really geared for text processing or html processing. The languages were more like swiss army knives that needed Yet Another Plugin (YAP) to do what I wanted. In my head I heard the late night infomercial huckster say "But wait, there's more..."

What I had been ignoring is that while developing the RPG solutions, I was actually using command line utilities to verify the code was working properly. DOH!

So, simply put, I took my existing wget command line and used that to extract the html for the posts.

wget -q -O craig1.xml - ""

Then walk the links with hxwls and pull them down. I know this could have been a one-step process, but I was using the intermediate files for troubleshooting.

hxwls craig1.xml | grep 'http.*\.html' | wget -i -
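To see what the grep stage in the middle of that pipeline keeps, here is a minimal sketch that runs a quoted pattern over made-up input. The function name filter_post_links and the example.org URLs are assumptions for illustration only, and the pattern is anchored a little more tightly than the one in the command above. Quoting the pattern matters because an unquoted backslash is consumed by the shell before grep ever sees it.

```shell
#!/usr/bin/env bash
# Keep only absolute links that end in .html (one URL per line on stdin).
# filter_post_links is a hypothetical helper, not part of the original script.
filter_post_links() {
    grep -E '^https?://.*\.html$'
}

# Invented sample of what a link lister might print for an index page.
sample_links='http://example.org/gms/1234567890.html
http://example.org/about/
/relative/link.html
http://example.org/gms/index.xhtml5'

# Only the first sample line is an absolute link ending in .html.
printf '%s\n' "$sample_links" | filter_post_links
```

The filtered list can then be handed to wget -i - exactly as in the original command.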

At that point I had a set of html files on my drive that I was using to try and extract content. My first step was to change them into xml, then use an xslt to convert them into flat files.

html tidy seemed a good choice for cleaning up the html

tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes

Use xmlstarlet to do the xslt

xmlstarlet tr ripper.xsl

The ripper.xsl consisted of the following:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="">
<xsl:output method="text"/><xsl:template match="/">"<xsl:value-of select="normalize-space(//h2)"/>","<xsl:for-each select="//div[@id='userbody']"><xsl:value-of select='normalize-space(text())'/>","<xsl:value-of select="normalize-space(following::text())"/></xsl:for-each>"
</xsl:template>
</xsl:stylesheet>

A little bit of sed to extract the posting id

sed 's/PostingID: //'
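As a quick illustration of what that substitution does, here is a minimal sketch run against a single invented row. The row contents and the helper name strip_posting_label are assumptions; real rows depend on what the stylesheet pulls out of each page.

```shell
#!/usr/bin/env bash
# Remove the first "PostingID: " label on each line, leaving just the number.
# strip_posting_label is a hypothetical wrapper around the sed command above.
strip_posting_label() {
    sed 's/PostingID: //'
}

# Invented sample row, shaped like the quoted fields the stylesheet emits.
sample_row='"Huge sale","Lots of tools","PostingID: 1850000001"'

printf '%s\n' "$sample_row" | strip_posting_label
```

Without a g flag, sed replaces only the first match on each line, which is enough when a row carries a single posting id.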

Put it all together:

for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes "$f" | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; done;

At this point I was pretty satisfied with the resulting csv file. In a couple of hours I had a working solution that was simple (if a bit opaque). The output.csv had a posting id, a subject, and a bunch of textual data about the posting. It was ready to be imported to mysql, geocoded, or whatever other things I needed to have happen.
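One thing to watch before importing: the stylesheet wraps each field in double quotes but does not escape quotes that appear inside the posting text. A minimal sketch of the usual CSV convention (doubling any embedded quote) is below. The helper name csv_field and the sample text are assumptions, and this step is not part of the original pipeline.

```shell
#!/usr/bin/env bash
# Emit one value as a quoted CSV field, doubling any embedded double quotes.
# csv_field is a hypothetical helper used only for illustration.
csv_field() {
    printf '"%s"\n' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Invented posting text containing a double quote (an inch mark).
csv_field 'Vintage 12" records'
```

An importer that follows the common CSV quoting rules would read the doubled quote back as a single one.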

The entire program was two files: an xsl and a bash script of a few lines, all pipelined together:

wget -q -O craig1.xml - ""
hxwls craig1.xml | grep 'http.*\.html' | wget -i -
for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes "$f" | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; done;

In my Next Post, geocoding this info and building a map.

Examples of useless garage sale sites that will never succeed


Anonymous said...

Love the idea. We live in the middle of nowhere and I'd like to find a way not to backtrack while looking for garage sale "finds".
Somehow our gps doesn't always take the short cut??? I too have been able to make some extra cash with resale. Good luck. jmm-mom

blogger_sanyer said...

Thank you for sharing the commands. What was the problem in groovy?