Systemic causes of single points of failure

There are any number of reasons you may find your organization at risk of having only one person with the access or knowledge required for a given task. For example, your security policy may necessitate a Keeper of the Keys situation that prevents you from distributing authority. Or maybe your staff is small and there simply aren't enough hands available to make sure every task has at least two people who are comfortable with it. Long-term employees are sometimes inadvertent single points of failure: tribal knowledge and simple consistency can obscure how much an individual is responsible for. Then they're out sick, and you suddenly realize no one else has access to update the release notes.

Siloing of information

Sometimes specialized information gets stuck in a single department or section of a product. Maybe only your backend developers understand that a change will break compatibility. Perhaps only your customer relations team knows that a certain customer's contract means a legacy operating system needs to remain supported.

In these cases, interdisciplinary work groups are your best bet for reducing conflict. SpiderOak has weekly squad meetings where cross-disciplinary and cross-departmental squad members meet and discuss ongoing changes, questions, and concerns. We use these meetings to unblock tasks as well as to make sure that the scope and impact of changes are understood at all levels and in all departments. In this way, we hope to avoid situations where personnel find out about changes only after they've happened. As a bonus, cross-disciplinary work groups have been found to bring a whole host of other benefits, including better communication, greater trust, and stronger results.

If you are unable to restructure your organization to allow interdisciplinary work groups, consider holding periodic same-paging meetings to ensure that information is flowing freely through your organization.

Security

Security policies often require that only select individuals act as the keepers of the keys. Losing those individuals, or having those individuals lose their keys, can be disastrous. Distributed authority is one way to make sure that a loss of information in one area does not result in all data being lost, and that your software's authority structure mirrors your real-world authority structure. This is why our products make use of distributed ledger (or blockchain) technology.

Another way to lessen the stress of a keeper-of-the-keys scenario without compromising on security is to build in internal ways of reclaiming keys. This can be as simple as setting up a password manager for admin keys, so that if an employee leaves the company, the password manager account can be passed on to their successor. On a more in-depth level, many products (such as Inclave, on our end) work to give users a secure way to recover a team after access has been lost.

Identifying single points of failure

Let's walk through a couple of hypotheticals and highlight ways you might turn them into user stories to audit your organization for single points of failure.

Problem: Offboarding an employee who has moved on is one of the most common times to discover your organization has a single point of failure.

Maybe an admin account's credentials left along with said admin. Maybe the departing employee is the only person who knew how to run the test to verify cert pinning. Maybe it's something as trivial as a calendar invite they owned, and now you can't make changes or cancelations without spinning up a whole new event.

In a company that onboards and offboards frequently, you may have many strategies to deal with all those little gotchas. In a company with a very low rate of turnover, it's likely another story.

Solution: List your responsibilities, and regularly practice offboarding them.

It can be hard to remember all the buttons and switches you are responsible for when you've handled them for so long that they've become automatic. One way to combat this is to encourage staff to keep a "Things Only I Do" list and review it occasionally to evaluate opportunities to train other team members on the basics. This may seem like a campaign to have employees work themselves into irrelevancy, but in reality it gives employees a chance to evaluate whether the scope of their responsibilities still fits their job, and to hand off responsibilities that no longer do.

Even more importantly, it allows staff to relax: if you're a single point of failure, it's hard to take a vacation without running the risk of panicked phone calls while you're trying to catnap on the beach. If your employees can take vacations without worrying about what's going on back at the office, both the employee and the organization benefit, with dividends in happiness, relaxation, productivity, and creativity. And yes, this type of practice offboarding will make real offboarding transitions easier.

Problem: Someone is on vacation, and they're the only one who can resolve an issue.

Your release process is super smooth: get the right folks in the right room and presto: release pushed... until the person who publishes release notes is on vacation and suddenly you realize no one else has the right permissions. Or, you've got a user reporting an issue and the specialist on your team is out sick. Do either of these situations sound familiar? We quickly get used to the way things are. As long as things are running well, we don't tend to question whether they're running optimally.

Solution: Document processes and identify critical personnel.

Documenting processes together with the names of those who own or are capable of handling each step is another great way to ferret out single-owner tasks. This is especially true for any repetitive processes you have, such as release procedures. (Bonus points if documenting it list-style helps you realize it can be automated.)

This works similarly to listing out tasks you are responsible for personally, but instead of putting a name at the top, you put a process as the title. Then, think about every step, in order, that the process requires to be successful. Next, list the name or names of the parties responsible for each individual step. If you find that any step has only a single name, think hard about training up a second-in-command and making sure they have the right credentials. These lists have multiple benefits: they can be used to document processes for training purposes and to highlight tasks that are good candidates for automation.
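If your process lists live somewhere machine-readable, the audit described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Step` structure, the `single_owner_steps` helper, and the sample release process with its step names and people are all hypothetical, not part of any SpiderOak tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a documented process and the people able to perform it."""
    name: str
    owners: list[str] = field(default_factory=list)


def single_owner_steps(steps: list[Step]) -> list[Step]:
    """Return the steps, in process order, that fewer than two distinct people can handle."""
    return [step for step in steps if len(set(step.owners)) < 2]


# Hypothetical release process; step names and people are made up for illustration.
release_process = [
    Step("Tag the release build", ["Ana", "Ben"]),
    Step("Run the cert pinning test", ["Ben"]),
    Step("Publish release notes", ["Chris"]),
    Step("Announce the release", ["Ana", "Chris"]),
]

for step in single_owner_steps(release_process):
    # A step with no listed owner at all is flagged too.
    owner = step.owners[0] if step.owners else "nobody"
    print(f"Single point of failure: {step.name} (only {owner})")
```

Each flagged step is a candidate for training a second person or, where the step is mechanical enough, for automation.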

Problem: How can we be sure we've done enough to avoid single points of failure?

Say you've gone through your personal tasks and trained up others where you think you might become a blocker. And you've documented the heck out of all your procedures and processes. You think you've addressed several weak points, but it's left you with a nagging suspicion you've only scratched the surface.

Solution: Mandatory vacations!

This may sound like a joke, but it's actually an excellent way to test your work so far. Only have one release engineer? Plan for her to be out for at least one release. Organize in advance, have her document what she is responsible for and how she does it (we're fans of "While I'm Away" documents at SpiderOak), and muddle through at least one release without her. When she's back, you can evaluate what struggles you encountered and work on a strategy to handle them better next time. Besides, given all the benefits of vacations discussed above, she'll probably come back refreshed and ready with some creative solutions of her own!

https://spideroak.com/whitepapers/