Monday, August 10, 2015

Please use ANSI-92 SQL Join Syntax

Way back in 1992, a standard was published for SQL. In it, the syntax for specifying joins in SQL Database was finally standardized. This was a good thing because (for folks who've been doing this a while), trying to decode the vagaries of how various Databases handled outer joins could drive you bonkers yielding different results depending on the parse order of things in the from and where clauses. Unfortunately, some DBMS vendors took a while to adopt this standard (Oracle didn't support it until 2001), but at this point if you're using a major RDBMS and NOT using it, you're probably mired in a tradition or habit that bears changing.

As an example of what I'm describing, here are some examples of what I mean.

Old School (no standard syntax + implementation and evaluation of predicates varied by vendor [and query]):

select * from customer, order
where customer.customer = order.customerid

ANSI way (agreed to standard, predicate evaluation has some rules that are followed):

select * from customer
  inner join order on customer.customerid = order.customerid

Now for the crusty folks who are going to argue "I've always done it this way and there's no quantifiable reason to switch" I give you a photo to illustrate my perspective

For folks who don't understand why it matters, I'll give a quick conceptual overview (that is slightly inaccurate, but serves to illustrate my point). In the above trivial example it IS a little arbitrary, but lets take a more involved example:

select * from customer, order, address, product
  where customer.customerid = order.customerid and customer.customerid = address.customerid and order.status = 'SHIPPED' and product.size = 'XL'

Can you spot the defect (this is still pretty trivial, but hopefully you've been burned by this a few times and will catch on. Compare this to:

select * from order
  inner join customer on order.customerid = customer.customerid
  inner join address on address.customerid = customer
  inner join product on <--wait what are we joining on here...cartesian product time...what are we actually joining on?
where order.status = 'SHIPPED' and product.color = 'WHITE' and product.size = 'XL'

While this is a quantitative evaluation and the readability of the second form is pretty subjective. If one makes the conceptual leap to say that "join" clause describes how to link the tables together and "where" clause describes how to filter the joined rows, the second form begins to make sense. The first form with 16 tables and a few out joins thrown in becomes an exercise in cognitive overload, where the second form (in more cases) is easier to mentally parse and reason about.

Monday, June 15, 2015

Amdahl's Law, Gustafson's Law, and user experience stretching to optimize parallel web design

There are two widely known laws in parallel computing that are counterpoints to each other, but with thought they can be applied together to achieve dramatic improvements in parallel computing. They two laws are Amdahl's Law and Gustafson's Law. Roughly speaking, Amdahl's says you cannot speed a system up by parallelization to be faster than the the slowest portion. Using a simple metaphor, if you have a car, a bicycle, and a turtle, when traveling in a pack, you can only move as fast as the turtle. The counterpoint to this is Gustafson's law, which roughly states that for parallel problems, adding more resources can increase the net work completed. Using are above metaphor, if we add more turtles (or cars or bicyclists) we are still increase the amount distance travelled per unit time. I do apologize to Computer Scientists as this is a very rough approximation, but this way of thinking about the situation can help us understand how to use some trickery and slight of hand to make things "impossibly" faster/better.

Using these concepts, let's talk about a real problem: Suppose we have a web application that needs to do 15 seconds of computing in order to present a visual representation of some data to a user. We happen to know that the slowest portion takes 5 seconds and must be performed serially. furthermore, let's suppose that the we have a system, that based on a user's name, can compute the following:

  • Compute user's favorite number - 1 second
  • Compute user's favorite food - 2 seconds
  • Compute user's favorite pet - 3 seconds
  • Compute user's Birthday - 4 seconds
  • Compute user's favorite color - 5 seconds

Applying Amdahl's law, we can conclude that the fastest parallel implementation should take 5 seconds to finish computing (and require 3 processors). That is, if we started all 5 computations at the same time, we would not be fully complete until 5 seconds had passed. Optimally, with three processors: 1 processor could compute favorite color, one processor could compute Birthday + favorite number, and one processor could compute favorite food + favorite pet. This would shorted the entire transaction time to 5 seconds.

Thinking about the problem in terms of Gustavson's Law, however, we also discover that adding another 2 processors gives us an opportunity to do more work (and/or support more capacity). Put another way, suppose we also know we can also do the following:

  • Compute user's favorite song - 1 second
  • Compute user's shoe size - 2 seconds
  • Compute user's favorite flavor ice cream - 3 seconds
  • Compute user's favorite drink - 4 seconds

Knowing this, by simply adding two more processors, we can return almost double the data for our users. Furthermore, applying some web trickery, pipelining, and user experience magic, we can potentially fool the user into thinking that every operation takes 1 second. For example, we know that a user will typically only look at two data items at a time and, they will typically look at each screen for 2 seconds, we can display data is up to 15 seconds old, caching the data on the client is possible, and there is no strong preference for which data item they may want to look at first (yes, these are quite a few suppositions, but not unusual), we could do the following:

  • break the display into 5 sections (in order of display):
    1. favorite song + favorite number: Reachable after 1 second
    2. shoe size + favorite food: Reachable after 3 seconds
    3. favorite flavor ice cream + favorite pet: Reachable after 5 seconds
    4. Birthday + favorite drink: Reachable after 7 seconds
    5. favorite color: Reachable after 9 seconds
  • Add more processors and schedule the work as follows (there are other combinations that also work):
    • Processor 1: favorite number, then favorite food (number is done after 1 second, food after 3 seconds...processor is idle for 4 seconds)
    • Processor 2: favorite song, then shoe size (song is done after 1 second, shoe size after 3 seconds...processor is idle for 4 seconds)
    • Processor 3: favorite pet, then Birthday (pet is done after 3 seconds, birthday after 7 seconds...processor is fully utilized
    • Processor 4: favorite ice cream, then favorite drink (ice cream is done after 3 seconds, favorite drink after 7 seconds, processor is fully utilized)
    • Processor 5: favorite color (color is done after 5 seconds, processor is idle for 2 seconds)

By fiddling with the user experience, we've stretched our window of time and amount of compute resources available for the longer running serial operations. This means we could even optimize a bit further if we wanted by moving things around in the experience. What we've effectively done is given ourselves an opportunity to give the appearance that a 15 second operation actually takes no longer than 1 second. Note, this DOES require some compromise, but the general approach is one that can yield practical benefits even when traditional logic (and thinking purely from a sound theoretical perspective) might say it isn't possible.

Friday, June 12, 2015

The Programmers Code

  1. Prior to writing code, I will search the internet, I will ask intelligent questions, and I will realize that many of the answers I get may not be correct. I will test the answers and validate them before repeating it to someone else. I will not reinvent prior art without adding value
  2. Revision control tools are my staple, I will live and die by version control tools. Distributed, Centralized, Lock and Release, or Update and Commit...they all work and they are the foundation upon which I build everything. I will strive to learn how to use these tools to manage branches, tags, and how to properly share my code with my fellow programmers which whichever tool my team happens to use
  3. On all fronts I will respect the code that already exists and seek to understand why it is the way it is
  4. Giving credit when due, I will respect my peers, those who are less skilled, and those who are more skilled. I will not assume I'm smarter than everyone else, nor will I assume that esoteric and confusing work was the work of someone more intelligent than me. If I stumble across code that looks like a rabid squirrel ran across someone's keyboard I will seek out the author and understand the reason behind the incoherence
  5. Remembering that all programming languages and cultures are different, I will learn the idioms of whichever programming language I happen to be using. I will not code Ruby in Java, FORTRAN in Assembler, nor any other "language 1" in "language 2"
  6. At no point will I name classes, frameworks, files, or other artifacts for mythical creatures, movie characters, or other seemingly clever things. I will name things appropriate for the domain I'm working in (note, if you're writing a system working with the aforementioned things...you have a special dispensation)
  7. My comments and aspirations will be obtainable. I will never insert comments like //TODO or //FIXME unless I personally take responsibility for these actions and have the ability to rectify them in the very near future
  8. Mannual testing is a minimum, I will think about how to test my code before I write it. If possible with my tools, I will write automated and repeatable tests before writing implementations. A compiler is NOT a testing tool
  9. Every line of code will be tested...every time I change it...if it's difficult see previous item
  10. Rrealizing that code is complex, I will use an IDE, vi and emacs are for gunslingers or people on vt100 terminals. If I don't know what a vt100 terminal is, I will only use vi when ssh-ing into a remote server...and then only when necessary.
  11. Sed, grep, awk, less, vi, regular expressions, bash are things I will know and understand. Even if I have no need in my job, I will know them and love them (even when I hate them)
  12. Creative code styles are forbidden, I will respect the format. Nobody wins whitespace and bracing arguments, whatever is currently in place is what I'll use. Leave these debates for internet trolls and ivory tower architects, they're the only ones who should care
  13. One language or framework is the not always the best for everything. I will refrain from trying to solve every problem with whatever my favorite tool happens to be. I will politely and strongly argue logical and rational points about merits and shortfalls of frameworks and languages. I will not wage Jihad against languages and frameworks I do not understand
  14. Daily builds and checkins are too infrequent. I will constantly commit code and modularize my changes to illustrate progress at all times
  15. Everything I write will be obsolete in four years. I will not cling to code, coding paradigms or system metaphors beyond their useful lifespan