Back On The Market!

After a year spent writing my book, working on Aura, speaking at conferences and user groups, advising startups, and proposing new design patterns, I am back on the market.

I’ve been writing PHP code since 1999, and in that time I’ve been everything from a junior developer to a VP of Engineering. If you have a PHP codebase that requires some attention, especially a legacy app that needs to be modernized, I’m your man. I’m also excellent as a leader, mentor, manager, and architect, on small teams and on large ones.

Resume and references available on request. Contact me by email (pmjones88 at gmail) or on Twitter @pmjones if you want to talk!

UPDATE (Tue 19 Aug): Well that was quick. I’m off the market again, and looking forward to productive efforts with my new employer. My deepest gratitude to everyone who expressed interest; I am truly humbled by the whole experience. Thank you to all.

Software Runs the World: How Scared Should We Be That So Much of It Is So Bad?

The underlying problem here is that most software is not very good. Writing good software is hard. There are thousands of opportunities to make mistakes. More importantly, it’s difficult if not impossible to anticipate all the situations that a software program will be faced with, especially when–as was the case for both UBS and Knight–it is interacting with other software programs that are not under your control. It’s difficult to test software properly if you don’t know all the use cases that it’s going to have to support.

There are solutions to these problems, but they are neither easy nor cheap. You need to start with very good, very motivated developers. You need to have development processes that are oriented toward quality, not some arbitrary measure of output. You need to have a culture where people can review each other’s work often and honestly. You need to have comprehensive testing processes — with a large dose of automation — to make sure that the thousands of pieces of code that make up a complex application are all working properly, all the time, on all the hardware you need to support. You need to have management that understands that it’s better to ship a good product late than to ship a bad product on time. Few software companies do this well, and even fewer of the large companies that write much of their software.

This is why there is so much bad software out there. In most cases we learn to live with it. Remember the blue screen of death? Ever stood at an airline counter waiting interminably for the agent to make what should be a simple switch from one flight to another? Ever been on the phone with a customer service representative who says his computer is slow or not working? That’s what living with bad software looks like.

But in our increasingly complex and interconnected financial system, it’s not clear we can live with it.

Read the whole thing. Via Software Runs the World: How Scared Should We Be That So Much of It Is So Bad? – James Kwak – The Atlantic.
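
The quoted passage puts its finger on the hard part: exercising code against systems you don't control. As one concrete, entirely invented illustration of the "large dose of automation" it calls for, here is a minimal Python sketch; OrderRouter, ExchangeError, and the stubbed exchange client are hypothetical names of mine, not anything from the article. The point is that a failure mode of an external system can be simulated and checked on every build instead of being discovered in production.

```python
# Illustration only: hypothetical names, not code from the article.
import unittest
from unittest.mock import Mock


class ExchangeError(Exception):
    """Raised when the (external) exchange rejects or times out a request."""


class OrderRouter:
    """Hypothetical component that forwards orders to an external exchange."""

    def __init__(self, exchange_client):
        self.exchange_client = exchange_client

    def submit(self, order):
        try:
            return self.exchange_client.send(order)
        except ExchangeError:
            # Fail closed rather than guessing: the counterparty is not ours to fix.
            return {"status": "rejected", "order": order}


class OrderRouterTest(unittest.TestCase):
    def test_rejects_order_when_exchange_fails(self):
        # Simulate the external system failing, so this case is exercised on
        # every build instead of being discovered in production.
        exchange = Mock()
        exchange.send.side_effect = ExchangeError("exchange unavailable")
        router = OrderRouter(exchange)

        result = router.submit({"symbol": "XYZ", "qty": 100})

        self.assertEqual(result["status"], "rejected")
        exchange.send.assert_called_once()


if __name__ == "__main__":
    unittest.main()
```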

Complex Systems and Normal Accidents

One of my favourite sections of the book was Harford’s discussion of accidents. Most of the problems Harford examines in the book are complex and “loosely coupled”, which allows experimentation with failure. But what if the system is tightly coupled, meaning that failures threaten the survival of the entire system? This concept reminded me of work by Robert May, which undermined the belief that increased network complexity led to stability.

The concept of “normal accidents”, taken from a book of that title by Charles Perrow, is compelling. If a system is complex, things will go wrong. Safety measures that increase complexity can increase the potential for problems. As such, the question changes from “how do we stop accidents?” to “how do we mitigate their damage when they inevitably occur?” This takes us to the concept of decoupling. When applied to the financial system, can financial institutions be decoupled from the broader system so that we can let them fail?

(Emphasis mine.) via Harford’s Adapt: Why Success Always Starts with Failure.
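
The decoupling question in that last quoted paragraph has a direct analogue in software. As a rough sketch of my own (not anything from Harford or the review), a circuit breaker is one way to let a component fail without dragging its callers down with it: after a few consecutive failures it stops forwarding calls and fails fast, then probes again after a cooling-off period. The class name, thresholds, and exception below are all invented for illustration.

```python
# Illustration only: a minimal circuit breaker; names and thresholds are invented.
import time


class CircuitOpenError(Exception):
    """Raised when calls are being short-circuited instead of forwarded."""


class CircuitBreaker:
    def __init__(self, call, max_failures=3, reset_after=30.0):
        self.call = call                # the downstream operation we depend on
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before probing again
        self.failures = 0
        self.opened_at = None

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast: don't couple our fate to a struggling dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Cooling-off period is over: allow a trial call through.
            self.opened_at = None
            self.failures = 0

        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping a flaky remote call as breaker = CircuitBreaker(remote_call) and then invoking breaker(args) keeps a failure local: callers get a fast, explicit error instead of hanging on a dependency that is already in trouble, which is roughly the software version of letting one institution fail without taking the whole system down with it.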

Disaster Rituals

Combine How Complex Systems Fail with Fooled By Randomness and throw in some organizational behavior models, and you get the human response to unforeseen disaster. We think we can prevent future disaster, somehow, by going through a particular set of rituals. Then Malcolm Gladwell asks, way back in 1996:

But what if the assumptions that underlie our disaster rituals aren’t true? What if these public post mortems don’t help us avoid future accidents? Over the past few years, a group of scholars has begun making the unsettling argument that the rituals that follow things like plane crashes or the Three Mile Island crisis are as much exercises in self-deception as they are genuine opportunities for reassurance. For these revisionists, high-technology accidents may not have clear causes at all. They may be inherent in the complexity of the technological systems we have created.

I think there are lessons here for, among other things, the BP oil spill. As with most of Gladwell, it’s worth your time to read the whole thing.

How Complex Systems Fail

The paper How Complex Systems Fail by Richard Cook should be required reading for anyone in programming or operations. Hell, it should be required reading for most everyone. You should read the whole paper (it’s very short at under five pages), but here are the main points:

  1. Complex systems are intrinsically hazardous systems.
  2. Complex systems are heavily and successfully defended against failure.
  3. Catastrophe requires multiple failures – single point failures are not enough.
  4. Complex systems contain changing mixtures of failures latent within them.
  5. Complex systems run in degraded mode.
  6. Catastrophe is always just around the corner.
  7. Post-accident attribution to a ‘root cause’ is fundamentally wrong.
  8. Hindsight biases post-accident assessments of human performance.
  9. Human operators have dual roles: as producers & as defenders against failure.
  10. All practitioner actions are gambles.
  11. Actions at the sharp end resolve all ambiguity.
  12. Human practitioners are the adaptable element of complex systems.
  13. Human expertise in complex systems is constantly changing.
  14. Change introduces new forms of failure.
  15. Views of ‘cause’ limit the effectiveness of defenses against future events.
  16. Safety is a characteristic of systems and not of their components.
  17. People continuously create safety.
  18. Failure free operations require experience with failure.

Points 2 and 17 are especially interesting to me. The insight is not that people can build complex systems that work; it is that people have to actively and continuously prevent those systems from failing. Once you stop maintaining the system, it begins to fail. It sounds like thermodynamics: without a constant input of energy from people, the mostly-orderly complex system descends into increasing disorder and failure.
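
To push the thermodynamics analogy into code, here is a toy Python sketch of my own (not from Cook's paper) of what point 17, people continuously creating safety, might look like mechanically: a supervision loop that keeps running checks and repairs. The check and repair callables are placeholders; the point is that the loop has to keep running, because switching it off is exactly the withdrawal of energy described above.

```python
# Illustration only: a toy supervision loop; all names here are placeholders.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("supervisor")


def supervise(checks, interval=10.0, max_cycles=None):
    """Run every (name, check, repair) triple in a loop; safety is the output
    of the loop, not a property the system keeps on its own."""
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for name, check, repair in checks:
            try:
                if not check():
                    log.warning("%s unhealthy, attempting repair", name)
                    repair()
            except Exception:
                log.exception("check for %s failed outright", name)
        cycle += 1
        time.sleep(interval)


if __name__ == "__main__":
    # Placeholder checks; a real deployment would probe queues, disks, peers.
    supervise(
        [
            ("always-healthy", lambda: True, lambda: None),
            ("always-degraded", lambda: False, lambda: log.info("patched, for now")),
        ],
        interval=1.0,
        max_cycles=2,
    )
```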

Write For Persons, Not Groups

Ryan King links to this great essay.

Anyway, I babbled at Nat along these lines for a while, predicting that, while I was sure that anyone he talked to in a corporation would tell him, “free groupware, yes, awesome!”, there was really no reason to even bother releasing something like that as open source, because there was going to be absolutely no buy-in from the “itch-scratching” crowd. With a product like that, there was going to be no teenager in his basement hacking on it just because it was cool, or because doing so made his life easier. Maybe IBM would throw some bucks at a developer or two to help out with it, because it might be cheaper to pay someone to write software than to just buy it off the shelf. But with a groupware product, nobody would ever work on it unless they were getting paid to, because it’s just fundamentally not interesting to individuals.

And there you have it in a nutshell. No **organization** ever wrote or used a program. **Individuals** write and use programs. If you want people to love your software, it has to appeal to individuals.

As a corollary, if your program sucks for individuals, it will suck for the organization. This goes along with the complex adaptive systems and emergent behavior issues I rant about from time to time.

Good idea for a project at the end of the article, too: server-less calendar sharing. Cool.