Monday, April 25, 2016

Headless Raspberry Pi 3 install from OS X for non-noobs

Having purchased a Raspberry Pi 3 a few weeks ago, I was quite confused that almost every install reference mentions "plug in an HDMI monitor and USB keyboard" as a step. While I've found references on how to do a headless install, many of the instructions assume you've already installed and run the graphical installer. As a person coming from an Arduino/Linux server background, I really don't need X11 for my use case; I just want a powerful microcontroller that I can set up via ssh (well, USB would be better... I still don't understand why you can't do this using the USB connection as a tty, but that's a different discussion). What follows are the steps I used. NOTE: if you use the wrong disk number you will destroy potentially important information on your machine. Use at your own risk, and only do this if you understand what that means; otherwise you will likely end up with an unusable machine or, at a minimum, lose information.

First, download the raspbian lite image.

Next, plug your SD card into your Mac and check the mounted filesystems:

df

You should see an entry that corresponds to your SD card. My output had an entry similar to this (other output omitted):

/dev/disk2s1 129022 55730 73292 44% 0 0 100% /Volumes/mysdcard

Unmount the sd card:

sudo diskutil unmount /dev/disk2s1

Copy the image to the RAW device (this means /dev/rdisk2 instead of /dev/disk2s1...the disk number will quite likely be different on your machine)...

sudo dd if=2016-03-18-raspbian-jessie-lite.img of=/dev/rdisk2 bs=1m

Note: the bs=1m argument sets dd's block size to 1MB per read/write. The exact value isn't critical, but a larger block size makes the copy much faster than dd's tiny 512-byte default.

This will run for a few minutes with no feedback; you can hit ctrl-T in your terminal (which sends SIGINFO) to get a status output. Once this command has completed, you can eject the disk.

sudo diskutil eject /dev/rdisk2

Now plug the SD card into your pi, power it up (via USB), and look for the device (plugged into ethernet) on your network. Assuming you've found the device's IP address, you can then ssh into the machine with:

ssh pi@<ip address>

using 'raspberry' as the default password.
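If you're not sure which IP address the Pi picked up, one trick is to scan your machine's ARP table for the Raspberry Pi Foundation's MAC prefix (b8:27:eb, used by the on-board ethernet on Pi 1 through 3). A sketch in Python (the parsing assumes OS X style `arp -a` output):

```python
import re

# MAC address prefix (OUI) assigned to the Raspberry Pi Foundation;
# the on-board ethernet on Pi 1-3 boards uses it.
PI_OUI = "b8:27:eb"

def find_pi_addresses(arp_output: str) -> list[str]:
    """Pull IP addresses with a Raspberry Pi OUI out of `arp -a` output."""
    addresses = []
    for line in arp_output.splitlines():
        # Typical OS X line: "? (192.168.1.42) at b8:27:eb:12:34:56 on en0 ..."
        match = re.search(r"\((\d+\.\d+\.\d+\.\d+)\) at ([0-9a-f:]+)", line)
        if match and match.group(2).startswith(PI_OUI):
            addresses.append(match.group(1))
    return addresses

# To use on your Mac (after pinging around a bit so the ARP table is populated):
#   import subprocess
#   arp_text = subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout
#   print(find_pi_addresses(arp_text))
```

Note the ARP table only contains hosts your Mac has recently talked to, so you may need a ping sweep of your subnet first.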

At this point you should have a functional pi image and can continue with your configuration... My first step was to resize the root partition using raspi-config (as I have a 32GB card).

Hopefully these instructions will help 'slightly more advanced' users wade through the "Noob" clutter available on the internet.

Friday, April 15, 2016

Do you test your IT operations and business processes?

The software industry expends a lot of energy making sure software is tested. From unit testing, to system and performance testing, to manual "poke it with a stick" testing, almost no software team skips it entirely or fails to see the need. Ironically though, many places don't routinely test their IT operations and business processes. This is ironic because if those things are broken or brittle, they generally have a MUCH larger negative impact on a company than buggy software.

To clarify, I've worked with many companies that have "backups" and "disaster recovery plans", but they never TEST to see whether either of these can actually lead to a recovery in the timeframe expected. A well-known (for me at least) scenario in the IT field (related to operations) is this:

  1. "Yes we do backups"
  2. Server fails, all data is gone
  3. Build new server (this works)
  4. Restore data that was previously backed up
  5. Realize the backups were written in a way that isn't recoverable, or the backups we thought were being performed never actually worked, or someone "forgot" to enable backups for that particular server... (the list goes on and on)
  6. Weep
  7. Go out of business
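The cure is to rehearse the restore, not just the backup. A minimal sketch of what an automated restore test might look like (the function names and tar-file format are my own illustration, not any particular backup tool):

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def checksum_tree(root: Path) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def backup(source: Path, archive: Path) -> None:
    """The part everyone does: write the backup archive."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")

def restore_and_verify(archive: Path, source: Path) -> bool:
    """The part most shops skip: restore into a scratch directory and
    prove the restored tree matches the original, byte for byte."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return checksum_tree(Path(scratch)) == checksum_tree(source)
```

Run something like this on a schedule against real backups (ideally on a different machine) and step 5 above gets caught long before the server fails.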

Stretching outside the technical realm, there's another area that confounds me with its lack of testing maturity: "testing your process". Most places I've encountered are, at best, able to define and presumably follow a process of some sort, but generally unable to understand or define what happens when the process fails. As an example, many places have "human steps" in their process, but never test "what happens if the human forgets, has incorrect assumptions about what that step means, or is just plain lazy and lies about performing a step?". In general, there is too much reliance on an individual's sense of responsibility as the safeguard that the process will perform adequately.

As a very common example...if we have a software delivery process and a step is "update the API documentation", how many organizations will actually QA the process to understand how to detect and/or ensure that this step is done? More importantly, how many teams will have someone test "making a change without updating the documentation properly" to ensure that this is detected? My general answer is "a vanishingly small number".
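One cheap way to test such a step mechanically: derive the changed-file list (e.g. from `git diff --name-only`) and fail the build when API code changes arrive without an accompanying documentation change. The directory names below are hypothetical, and this check is deliberately coarse; a sketch:

```python
def docs_check(changed_files: list[str],
               api_dirs: tuple[str, ...] = ("src/api/",),
               docs_dirs: tuple[str, ...] = ("docs/api/",)) -> bool:
    """Return True if the change set is acceptable: any change under an
    API directory must be accompanied by a change under a docs directory.
    The directory layout here is illustrative, not a standard convention."""
    touched_api = any(f.startswith(api_dirs) for f in changed_files)
    touched_docs = any(f.startswith(docs_dirs) for f in changed_files)
    return touched_docs or not touched_api
```

It can't prove the documentation is *correct*, but it turns "someone forgot" from an invisible failure into a failed build, which is the point: the process failure is now detected rather than assumed away.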

Most people (in my experience), when quizzed about issues such as this, will throw out statements like "well, we pay our people well and they are 'good' people, so we don't have to worry about it". To me, this is a silly and fragile position to take: many professions have very highly paid, extremely reliable people who still have checks, double checks, and controls to ensure that the "things we said we were going to do" actually "got done". While I think the security industry is the corner of tech that has made the most progress in defining these sorts of controls, I still see that (even in that industry) most companies don't take the additional step of validating or testing that the process itself is adequate and that failure is detected in an appropriate manner, at an appropriate level, and at an appropriate time.

Tuesday, April 5, 2016

Let it crash

"Let it crash" is a watchword in the Erlang world that has a pretty specific meaning in that context, but can be seriously misapplied if taken out of context.

In the Erlang world, a "crash" is the termination of an actor in its specific context. In a well designed actor system, the actors have very specific jobs, and if they cannot complete that job they are free to fail immediately. This is a bit of a problem for folks working in the JVM world, as "crash" can be overloaded to mean things that change the semantics of a transactional system.

Real world(ish) example: Suppose you have a distributed system that accepts a message, writes it to a data store, then hands a new message off to three other components. Suppose further that the transactional semantics of the system are such that the job isn't "done" until all four operations have completed and are either #1 permanently successful, or #2 permanently failed.

The important detail here is that when performing a transfer, we want the balances of both accounts to be updated as a single transaction; we cannot be in a state where the money has left one account but has not arrived at the other. Doing this requires the concept of a distributed transaction, but without using an "out of the box" distributed transaction coordinator. To clarify, we will assume that the components described are exposed via web services and don't have access to each other's underlying transaction management system.

So, to design this, the trivial implementation (let's call it the synchronous model) is as follows:

In this model, we need to handle the situation where, if EITHER of the nested transactions fails, the requestor can roll back the OTHER transaction and report back to the client that the entire transaction has failed. This is fairly complicated, so we'll leave those details alone. The important point is that the entire transaction is synchronous and blocking. This means that the client and the requestor must hang around (typically in memory) waiting for the other components to report "success" or "failure" before reporting anything to the client. Moreover, it means that a failure of any component ultimately is a "permanent" failure (from the client's perspective), and it's up to the client to retry the transaction if it's important. Some failures might genuinely be permanent (one or the other of the accounts may not or will never exist), while other failures (connectivity to one or the other of the account updators) may only be transient and/or short lived.
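The synchronous model can be sketched in miniature like this ("updator" and "requestor" follow the post's terminology; the in-process classes are stand-ins for what would really be separate web services):

```python
class Updator:
    """One account-balance service. In the real design this would be a
    web service with its own private data store."""
    def __init__(self, balances: dict[str, int]):
        self.balances = balances

    def apply(self, account: str, delta: int) -> None:
        if account not in self.balances:
            raise KeyError(f"no such account: {account}")
        self.balances[account] += delta

def transfer(debit: Updator, credit: Updator,
             src: str, dst: str, amount: int) -> bool:
    """Synchronous requestor: apply the debit, then the credit;
    if the credit fails, compensate by rolling the debit back."""
    debit.apply(src, -amount)
    try:
        credit.apply(dst, amount)
    except Exception:
        debit.apply(src, amount)   # compensating rollback
        return False
    return True
```

Notice the fragility described above: if the requestor process dies between the debit and the compensating rollback, money has "left" one account with nothing in place to put it back.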

In many ways, this simplifies things, as it delegates responsibility for managing success or failure to the leftmost component. That having been said, there is still potential for things to go wrong if, for example, the first updator succeeds but then the requestor dies and is unable to roll back the first transaction.

When put that way, it's obvious (I hope) that there needs to be some intermediate manager that can determine whether any "partial" transactions exist if the request processor dies, and can immediately roll back those partial transactions should a failure occur. As an example, here is what this might look like.

We're still dodging some internal transaction management housekeeping, but the important detail is that between the point where the client lost track of the requestor (because it died) and the final "transaction failed" from the supervisor, the client has no idea what the state of the transaction is. It genuinely could be that the transaction succeeded, but the connectivity between the transfer requestor and the client simply failed.

So the problems in this model are twofold: #1 it's "mostly" synchronous (though the Request supervisor -> client messaging clearly isn't), and #2 it assumes that it's "OK" for the transfer requestor to simply fail should an intermittent failure in part of the system cause a partial update to have happened. Obviously this may or may not be acceptable depending on the business rules at play, but it is certainly a common model... i.e. you aren't sure if the transaction worked because, as the client, your network went down, so you get an out-of-band email from the Request supervisor at your bank confirming that it did, in fact, fail.

While this is a workable approach, it does tend to tie up more resources in highly concurrent systems, it doesn't deal with failure very well (you only have a few hard-coded strategies to choose from), and when you scale to dozens or hundreds of components, the chance of at least one failure becomes so large that you are unlikely to EVER succeed.

So what's the alternative?

The key detail (which enables more flexibility) is to assume things will intermittently fail, and to design the transaction semantics into the application protocol. This allows you to have "less robust" individual components, but adds the complexity of transaction management to the entire system. An example of how this might work:

The important details here are: #1 transaction details become persistent in the Transfer Store, #2 the Transfer Supervisor takes on the responsibility for the semantics of how the transaction strategy is managed, #3 the transaction gains the capability to be durable across transient failures of "most" components in the system, and #4 each independent component only needs to be available for smaller amounts of time. In general these are all desirable qualities, but...

The devil is in the details

Some of the negative side effects of this approach are that: #1 as the designer of the system, you are now explicitly responsible for the details of the transactional behavior, and #2 if the system is to be robust across component failures, the operations must be idempotent (repeating an invocation must not produce additional side effects). As an example of how this might be more robust, let's look at how we might implement a behavior that is durable across transient failures:

In this model, the transfer supervisor implements a simple retry strategy for when the account requestor is unavailable. While this potentially makes the system more robust and accounts for failure with much more flexibility, it's obvious that the account requestor (or the store) needs to be able to discern that it might sometimes receive duplicates and be able to handle them gracefully. Additionally, it becomes more important to know the difference between something that mutates state and something that simply acknowledges that it is in agreement with your perspective on the state of the system.
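A sketch of what the supervisor-plus-idempotency arrangement might look like (all names here are illustrative, and a real transfer store would be durable, not an in-memory set):

```python
import uuid

class AccountRequestor:
    """Accepts transfer commands; deduplicates on transfer id so that
    retried (duplicate) deliveries are harmless -- the idempotency the
    text calls for."""
    def __init__(self):
        self.applied: set[str] = set()
        self.balance = 0

    def apply(self, transfer_id: str, delta: int) -> None:
        if transfer_id in self.applied:
            return  # duplicate delivery: acknowledge, don't re-apply
        self.applied.add(transfer_id)
        self.balance += delta

class FlakyRequestor(AccountRequestor):
    """Simulates transient unavailability: the first `failures` calls raise."""
    def __init__(self, failures: int):
        super().__init__()
        self.failures = failures

    def apply(self, transfer_id: str, delta: int) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("requestor unavailable")
        super().apply(transfer_id, delta)

def supervise(requestor: AccountRequestor, delta: int, max_attempts: int = 5) -> bool:
    """Transfer supervisor: mint an id for the transfer (which would be
    persisted in the transfer store), then retry until the requestor
    acknowledges or attempts run out."""
    transfer_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            requestor.apply(transfer_id, delta)
            return True
        except ConnectionError:
            continue  # transient: retry (a real system would back off between tries)
    return False
```

Because the supervisor always resends the *same* transfer id, a retry that races with a delivery that actually succeeded cannot double-apply the money.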

More importantly, the latter approach now means we must take into account the difference between a "permanent failure" and a "transient failure", and this is often not a trivial task... i.e. is transferring between a nonexistent account and a real account a transient problem or not? If you think that's a trivial question, think about a situation where there is yet another async process that creates and destroys accounts. Is it acceptable to retry in 1 minute (in case the account was in the process of being created when you initially tried the transfer)?

In conclusion, while distributing transactions into smaller pieces adds great power, it also comes with great responsibility. This approach to designing systems is the genesis of the "let it crash" mantra bandied about by Scala and Erlang acolytes. "Let it crash" doesn't necessarily mean "relax your transaction semantics" or "don't handle errors"; it means you can delegate responsibility for recovery from failure out of a synchronous process and deal with it in more robust and novel ways.

Monday, April 4, 2016

Problems in the internet of things

Having worked with connected vehicles for a number of years now, there are some things that newcomers always seem to “get wrong”. Having worked through the “plain ol’ Internet” (POI) boom, I see parallels between the mistakes made during that period and similar mistakes in the current boom. I’ll outline the biggest few:

Failing to recognize mobility

In the POI, people used a client/server paradigm to design applications for the web. Additionally, the protocol generally chosen was one designed for document management, not application connectivity, and it took almost a decade before general purpose alternatives arose that were better suited for the types of interactions desired. Moreover, the general interaction design tended to try to replicate a desktop experience instead of designing for the new platform. With mobile devices, the problem is that the device “might not be where you think it is”, or it may even have “fallen off the network”. Without good tracking of these events, diagnosing problems with devices is a nightmare (is my Chevy in the levee, or is the battery just dead?).

Failing to design for a headless device

In the POI, folks failed to account for the fact that the client and the server were connected by a somewhat unreliable network with varying latency. This was remedied (after years of pain) by giving the user feedback and perhaps some advice (hit refresh; if that doesn’t work, call 1-888-hit-it-again)… With headless devices, there is no “refresh button”. Often clever engineers will put in logic for retrying, but in my observation they forget that, without reporting data to a user or management service (or building updatable AI into the device for managing connectivity), the rules are often too primitive or brute-force to be effective. A great one I’ve seen a number of times is a progressive fallback retry strategy that ends up with the device waiting so long (or going offline entirely) that it’s nearly impossible to account for losses.
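One way to keep a progressive fallback from backing off into oblivion is to cap the delay and add jitter. A sketch (the specific numbers are illustrative, not a recommendation):

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 300.0,
                   attempts: int = 10) -> list[float]:
    """Exponential backoff with a hard cap and 'full jitter'.
    The cap keeps the device from backing off so far it is effectively
    offline; the jitter keeps a fleet of devices from all reconnecting
    in lockstep after a backend outage."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s ... capped at 5 min
        delays.append(random.uniform(0, ceiling))
    return delays
```

Even this is only as good as the visibility you have into it: without reporting the retry state somewhere, you still can't tell a backed-off device from a dead one.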

Failing to manage embedded problems

In the original “mainframe” days, resources were fairly well managed, as they were scarce and/or expensive. As we transitioned to the POI days, the cost equation for these things changed dramatically (memory became cheap, client storage disappeared [for a while], and power was ubiquitous and distributed). In the IOT, power can become scarce and must be carefully managed (a new problem [yes, I know embedded folks ‘get it’]), while memory is a decision that can be balanced, as can storage. There are, however, a multitude of other “embedded system” problems that are now being introduced to a larger group of engineers. Historically there hasn’t been a large overlap between people who deal with network protocols, backend systems, and embedded devices in uncontrolled environments. There are many systems that have “parts” of those problems, but not very many where ALL of them must now be solved for. i.e. perhaps a warehouse management system works with embedded devices talking to backends, but it’s NOT mobile, and it’s generally a controlled environment.


Failing to design for security

This is the big, hairy, scary gorilla in the room. At the inception of the POI, security was a very secondary concern, because historically it had been handled in the server room and by tightly controlling the desktop. The POI opened this up such that the client was inherently insecure and observable. This led to many mistakes by folks who were used to being able to control both sides of the equation, not realizing that this mental model is dangerous in a highly distributed world. In the IOT world, the bigger problem is that our ways of thinking about security don’t necessarily account for the fact that when devices are moving about, they encounter many network situations that just don’t happen with a web browser or mobile phone. Depending on what sorts of sensors and capabilities the devices are designed for, the number of ways things can go wrong is multiplied many times over (versus the relatively simple problems of the POI).

This is just a short list, but hopefully it gives folks designing connected devices pause to “think about the things they might be thinking about incorrectly” when designing IOT solutions.