One of our key performance metrics is the rating we receive from support requests. We review our performance daily during a team stand-up - if we detect increased response times or lower satisfaction, we can move quickly to fix bottlenecks.
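As an illustration, that daily check boils down to a couple of simple aggregates over recent tickets. A minimal Python sketch - the ticket data and thresholds here are made up for illustration, not our actual numbers:

```python
from datetime import timedelta
from statistics import mean, median

# Hypothetical recent support tickets: (satisfaction rating out of 5,
# time to first response). Data and thresholds are illustrative only.
tickets = [
    (5, timedelta(minutes=12)),
    (4, timedelta(minutes=25)),
    (5, timedelta(minutes=8)),
    (3, timedelta(hours=1, minutes=5)),
]

avg_rating = mean(rating for rating, _ in tickets)
median_response = median(response for _, response in tickets)

# Flag the stand-up if either metric slips past an (illustrative) target.
needs_attention = avg_rating < 4.5 or median_response > timedelta(minutes=30)
```

Whatever tooling produces the numbers, the point is the cadence: the metrics are looked at every day, so a regression shows up within a day.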
A growing trend over recent years has been to introduce a tiered support structure. Typically, the first-response support handler will work through a script or checklist to troubleshoot common issues before escalating the issue upwards in a pyramid-style organisation.
This makes sense when the product or service is self-contained. If your washing machine won't start, there are only a limited number of variables it could be: is the power properly connected? Is the fuse blown? Did you select a washing programme? For the most part, the software installed at the factory will never change.
Running connected cloud services is, of course, something very different - there are usually many variables at play: everything from a fibre cut at a data centre, to a new bug introduced in a middleware patch somewhere in the network stack, to a misbehaving OS update that comes with side-effects. A good example is the recent Spectre security issue, which is rooted in the hardware itself; a design flaw in the CPU bubbles all the way up to affect everything.
It might not always seem like it, but in the big picture, outages at data centres are rare and the underlying technology is becoming increasingly robust through fewer points of failure, more distribution and simply learning from failure. Do you remember the last time you couldn't access Spotify or watch a movie on Netflix?
To this end, our first responders are hands-on software developers who are passionate about Django and Python - in other words, we put experts up-front.
The challenge, of course, is volume - tiers of support are usually designed to funnel access to the experts. How do you juggle demand for support against fast response times while actually solving the issue?
Our somewhat boring but honest answer is: documentation.
Investing time in documentation and, most importantly, making sure Google can index it effectively means we reduce common queries and tend only to need to answer never-seen-before issues. Users are in a hurry and want the quickest route to a solution - even with the best-designed sitemap and structure, Google will probably find the answer quicker.
As an example, a simple Google query "cache in divio" presents the following relevant articles as top results:
If you need to traverse the documentation itself, the answer is three clicks down - still easy to find, but Google gets you there faster.
A common user-experience miss is to treat documentation as part of an overall website and neglect its core function: “give me the answer now”. The last thing users want to see is a popup asking for a website review on first contact. Add a popup promoting a new, even-greater product while they have a problem with the existing one, and they will be screaming.
To this end, we keep our documentation separate. We have found that the Read the Docs platform works best for us - we can tailor the look and feel for a clean, minimalistic style and use reStructuredText to automatically keep our API documentation in sync.
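The "keep the API documentation in sync" part relies on Sphinx (which powers Read the Docs): its autodoc extension pulls reference material straight out of the Python docstrings, so the docs change when the code does. A minimal, illustrative conf.py sketch - the project name and theme are placeholders, not our actual configuration:

```python
# conf.py - minimal Sphinx configuration sketch (illustrative only).
# With autodoc enabled, an .rst page can say "automodule:: mypackage"
# and the API reference is generated from the code's own docstrings.
extensions = [
    "sphinx.ext.autodoc",    # generate docs from docstrings
    "sphinx.ext.viewcode",   # link documented objects to their source
]

project = "Example Docs"     # placeholder project name
html_theme = "alabaster"     # a clean, minimalistic built-in theme
```
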
It's something we think a lot about and even talk about at conferences - one of our most popular posts discusses it in depth.
If the problem is entirely new and the documentation has yet to cover it, then there is an investigation to be made - and often high-pressure work that needs to be done quickly.
This means bringing team members together at short notice to jump on an issue - a developer might need specialist expertise from a cloud operations engineer to help debug something in the network stack. We address availability by distributing the team and its competencies across time zones. This comes with a whole set of other challenges, but with all the modern communication tools available, we believe it works well for us. Most importantly, it means we have experts up-front at all times.
The user who needs support has one point of contact who owns the issue all the way to resolution, bringing in colleagues as necessary - in other words, this person becomes the manager of the issue. Once the issue is resolved, the notes and experiences are shared and, where relevant, the documentation team makes updates accordingly.
In order to maintain a consistent level of quality and the same tone of voice, the same core team works on documentation, translating developer discussions into something that makes sense for our users. It is all too easy to postpone documentation updates, but from experience this quickly leads to "documentation debt", so as a matter of protocol we reflect changes in the documentation as soon as possible after an issue is resolved.
In short, there is no quick fix: when users have an issue, they want it fixed as soon as possible, and they will typically turn to Google first. The documentation site, and how users (and Google) can traverse it, is a crucial part of the overall experience.
By investing in and being passionate about documentation, you have more bandwidth to put experts up-front and resolve issues more effectively, without running through multiple layers of support. This correlates directly with a better user experience and, accordingly, higher support ratings.