Sunday, July 18, 2010

garage sales part two (geocoding and rendering)

Early Results

Here are some early results:

Port Huron, MI
Rockford, IL

These maps show the first page of garage sales on craigslist, with about a 50% hit rate (meaning I can find an address in only about half of the listings). That said, it's still pretty impressive, as manually entering these things into Google Maps is... tedious. The script I've written takes about 60 seconds per city.

Back to Geocoding

Geocoding is the process of attaching geographic coordinates to data. In my case I can find a "reasonable" address in about half of the entries. By that I mean there is a string somewhere that looks like an address: a few digits followed by a couple of words.
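The "looks like an address" test can be tried out on its own in plain shell. This is only a rough sketch: extract_address is a hypothetical helper name, and it mirrors the two heuristics loosely (a parenthesised chunk in the title first, then digits followed by two words in the description) rather than reproducing the awk rules exactly.

```shell
#!/bin/sh
# Hypothetical helper (not part of go.awk): print the first address-like
# substring found in a listing title or description, or an empty line.
extract_address() {
    title=$1
    desc=$2
    # Rule 1: a parenthesised chunk in the title that starts with a number.
    addr=$(printf '%s\n' "$title" | grep -oE '\([0-9]+ [^()]+\)' | head -n 1 | tr -d '()')
    if [ -z "$addr" ]; then
        # Rule 2: a number followed by two words in the description.
        addr=$(printf '%s\n' "$desc" | grep -oE '[0-9]+ [a-zA-Z]+ [a-zA-Z]+' | head -n 1)
    fi
    printf '%s\n' "$addr"
}

extract_address 'Huge sale (603 N 3RD ST, ST CLAIR)' 'Antiques, tools, furniture'
extract_address 'Moving sale' 'Everything must go at 1234 Main Street on Saturday'
```

The second rule is easy to fool: a description such as "July 8 Thursday to July 11 Sunday" yields "8 Thursday to", which goes some way toward explaining the hit rate.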

To get this data and geocode it, I wrote an awk script (gawk specifically, since it relies on gensub and the three-argument form of match):
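$1 ~ /\([0-9]+ .+\)/ {    # Rule 1: the title ($1) has a parenthesised chunk starting with a number
    match($1, /([0-9]+ [^\)\(]+)\)/, out)
    printdata("\"" out[1] "\"")
}
$1 !~ /\([0-9]+ .+\)/ && $2 ~ /[0-9]+ [a-zA-Z]+ [a-zA-Z]+ / {    # Rule 2: otherwise try the description ($2)
    match($2, /([0-9]+ [a-zA-Z]+ [a-zA-Z]+)/, out)
    printdata("\"" out[1] "\"")
}
function printdata(mydata) {
    # strip the surrounding quotes, encode spaces, append the city, and build the lookup command
    curl = "/usr/bin/curl -s \"http://maps.google.com/maps/api/geocode/xml?address=" gensub(" ","%20","g",substr(mydata,2,length(mydata)-2)) ",+Port+Huron,+MI&sensor=false\" | /usr/bin/xmlstarlet tr geocode.xsl"
    loc = ""              # so a failed lookup does not reuse the previous row's coordinates
    curl | getline loc    # first line of output from geocode.xsl: latitude,longitude
    markers = markers "|" loc
    close(curl)
    print mydata ",\"" gensub("\"","\"\"","g",$2) "\"," loc    # CSV row, embedded quotes doubled
}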

What this script does: it uses a regular expression match to find rows in the data that look like addresses, replaces spaces with %20, and sends a geocoding request to http://maps.google.com/maps/api/geocode/xml. It then runs the XML response through an XSLT to extract the latitude,longitude pair and re-outputs each row with the coordinates tagged on the end.
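To see what the request looks like before anything is sent, the URL-building step can be pulled out into a small shell function. This is a sketch under assumptions: geocode_url is a hypothetical name, the city is passed as a parameter instead of being hard-coded, and only spaces are encoded (the awk script does the same for the address, while writing the city suffix with "+"). It builds the URL and does not call the service.

```shell
#!/bin/sh
# Hypothetical helper: build (but do not send) the geocoding request URL
# for one street address in a given city.
geocode_url() {
    address=$1
    city=$2    # e.g. "Port Huron, MI"
    # Replace every space with %20; other reserved characters are left
    # untouched in this sketch.
    query=$(printf '%s, %s' "$address" "$city" | sed 's/ /%20/g')
    printf 'http://maps.google.com/maps/api/geocode/xml?address=%s&sensor=false\n' "$query"
}

geocode_url '603 N 3RD ST' 'Port Huron, MI'
```

Taking the city as an argument would also make it easier to run the same script for another city without editing the awk file.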

What I'm left with in my output2.csv file is some data that looks like this:

name,description,latitude,longitude
"603 N 3RD ST, ST CLAIR","Yard Garage Estate Sale....July 8 Thursday to July 11 Sunday...9AM to 6PM....We have Antiques, Tools, Furniture, Tons of stuff for EVERYONE...We plan on having a BAG SALE on Sunday with whats left....But be there before then for best choices!!!!!",42.9741483,-82.4225020


As it turns out, Google has an API that takes a file just like this and builds a nice little map. I POST the file to this API and out comes a pretty map.
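For a file like output2.csv to stay parseable, any double quotes inside the description have to be escaped; the awk script does this by doubling them. A minimal shell sketch of the same quoting, with csv_field and csv_row as hypothetical helper names:

```shell
#!/bin/sh
# Hypothetical helpers: quote text fields the way the awk script does,
# by doubling embedded double quotes and wrapping the field in quotes.
csv_field() {
    printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Emit one row in the name,description,latitude,longitude layout.
csv_row() {
    printf '%s,%s,%s,%s\n' "$(csv_field "$1")" "$(csv_field "$2")" "$3" "$4"
}

echo 'name,description,latitude,longitude'
csv_row '603 N 3RD ST, ST CLAIR' 'We plan on having a "BAG SALE" on Sunday' 42.9741483 -82.4225020
```

The coordinates are left unquoted, matching the sample row above.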
The final product:
Main shell script

#!/bin/bash
# start from a clean slate
rm -f *.html
rm -f *.xml
rm -f output.csv
# fetch the first page of search results
wget -q -O craig1.xml "http://porthuron.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"
# pull out the links to the individual listings and download each one
hxwls craig1.xml | grep 'http.*\.html' | wget -i -
# tidy each listing, extract its fields with ripper.xsl, and append a row to output.csv
for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes "$f" | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; sleep 1s; done
# find and geocode the addresses
gawk -F \",\" -f go.awk output.csv > output2.csv
# upload the CSV to the maps feed
curl -X POST -H "GData-Version: 2.0" -H "Content-type: text/csv" -H "Authorization: GoogleLogin auth=\"secret key you get from google\"" -H "Slug: port huron craigslist garage sales" --data-binary @output2.csv http://maps.google.com/maps/feeds/maps/default/full


Additional awk stuff

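BEGIN { markers = ""; print "name,description,latitude,longitude" }    # CSV header row
$1 ~ /\([0-9]+ .+\)/ {    # Rule 1: the title ($1) has a parenthesised chunk starting with a number
    match($1, /([0-9]+ [^\)\(]+)\)/, out)
    printdata("\"" out[1] "\"")
}
$1 !~ /\([0-9]+ .+\)/ && $2 ~ /[0-9]+ [a-zA-Z]+ [a-zA-Z]+ / {    # Rule 2: otherwise try the description ($2)
    match($2, /([0-9]+ [a-zA-Z]+ [a-zA-Z]+)/, out)
    printdata("\"" out[1] "\"")
}
function printdata(mydata) {
    # strip the surrounding quotes, encode spaces, append the city, and build the lookup command
    curl = "/usr/bin/curl -s \"http://maps.google.com/maps/api/geocode/xml?address=" gensub(" ","%20","g",substr(mydata,2,length(mydata)-2)) ",+Port+Huron,+MI&sensor=false\" | /usr/bin/xmlstarlet tr geocode.xsl"
    loc = ""              # so a failed lookup does not reuse the previous row's coordinates
    curl | getline loc    # first line of output from geocode.xsl: latitude,longitude
    markers = markers "|" loc
    close(curl)
    print mydata ",\"" gensub("\"","\"\"","g",$2) "\"," loc    # CSV row, embedded quotes doubled
}
#END {print markers}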


Wrapup

One thing I'll note. While this certainly works, there are only a few people I've ever met who will be able to figure it out. I'm ultimately probably better off switching to Ruby/Python/Groovy at some point, but I wanted to get something working first.

One problem I had with these tools is that they don't just "work". For example, to fetch the URL via Groovy, I started with this code snippet:

def http = new groovyx.net.http.HTTPBuilder( 'http://rockford.craigslist.org' )
http.request( groovyx.net.http.Method.GET, groovyx.net.http.ContentType.XML ) {
    uri.path = '/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss'
    response.success = { resp, xml ->
        xml.responseData.results.each {
            println " ${it.titleNoFormatting} : ${it.visibleUrl}"
        }
    }
}


My first problem was that my classpath was somehow screwed up and I couldn't get this to compile. In addition, even when I do get it to work, it's much more complicated than the command line equivalent:

wget -q -O craig1.xml "http://porthuron.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"


Why would I want to write all the gobbledygook at the top (which adds NO value) when the bottom version just works? If I get energetic, I'll probably start porting this to Groovy once I get it working better, as I believe that will be more compatible with my ultimate platform (Android).
