Friday, April 15, 2016

Do you test your IT operations and business processes?

The software industry expends a lot of energy making sure software is tested. From unit testing, to system and performance testing, to manual "poke it with a stick testing", almost no software team "doesn't do it" or fails to see the need. Ironically though, many places don't routinely test their IT operations and business processes. This is ironic because if those things are broken or brittle, it generally has a MUCH larger negative impact on a company than buggy software.

To clarify, I've worked with many companies that have a "backups" and "disaster recovery plans", but they never TEST to see if either of this can actually lead to a recovery in the timeframe expected. A well known (for me at least) scenario in the IT field (related to operations) is this:

  1. "Yes we do backups"
  2. Server fails, all data is gone
  3. Build new server (this works)
  4. Restore data that was previously backed up
  5. Realize backups actually were written in a way that isn't recoverable, the backups we thought were being performed have actually never worked, someone "forgot" to enable backups for that particular server...(the list goes on and on...)
  6. Weep
  7. Go out of business

Stretching outside the technical realm, there's another area that confounds me on its lack of testing maturity, which is: "Testing your process". Most places I've encountered are, at best, able to define and presumably follow a process of some sort, but generally unable to understand or define what happens when the process fails. As an example, many places have "human steps" in their process, but never test "what happens if the human forgets, has incorrect assumptions about what that step means, or is just plain lazy and lies about performing a step?". In general, there is too much reliance on an individual sense responsibility being the safeguard that the process will perform adequately.

As a very common example...if we have a software delivery process and a step is "update the API documentation", how many organizations will actually QA the process to understand how to detect and/or ensure that this step is done? More importantly, how many teams will have someone test "making a change without updating the documentation properly" to ensure that this is detected? My general answer is "a vanishingly small number".

Most people (in my experience) when quizzed about issues such as this will throw out statements like "well, we pay our people well and they are 'good' people, so we don't have to worry about it". To me, this is a silly and fragile position to take, many professions have very highly paid people who are extremely reliable (for people) that still have checks, double checks, and controls to ensure that the "things we said we were going to do" actually "got done". While I think the security industry is the dimension of tech that has made the most progress in "defining" these sort of controls, I still see that (even in that industry) most companies don't take the additional step to validate or test that the process itself is adequate and failure is detected at an appropriate manner, level, and time.

No comments: