Incident analysis is a topic of its own, but as you have probably noticed by now, the angle I want to give this blog is a practical one, based on my own experience.
If you want to know more about this topic there are many resources, but the one that comes to mind as I write this article is “Incident Management for Operations”, a small book from O’Reilly that I have here on my bookshelf.
I very often find myself working with early-stage companies that don’t yet have an incident management strategy, and often it is not the right time to have one, because I think there is a time for everything.
If your team is busy developing and validating the product, there are probably not many incidents that require a management system, because you mainly deal with bugs or corner cases that you somehow expected to happen and that get fixed.
So what I think you need is not an incident management strategy but a good habit; you can turn it into a more structured strategy later.
What I usually do is pick a GitHub repository if you are on GitHub, create a dedicated mailing list if you are mailing-list based, or set up a category in your internal website or Notion. You need a place! Figure out where it is! It should let you add follow-up statements and comments, and it should be possible to link it to bug tracking or similar tooling.
If you have an infrastructure as code repository, I suggest you use it. Every time you see something that looks wrong, or somebody reaches out to you out of the blue with a Slack message, create a new issue and label it with something like “incident analysis”. At the beginning the title is always something like “Unknown incident”, because often you don’t know where this is going to end. Then copy and paste in all the actions and scripts you run, note what you look at, take screenshots, and share all of this material in there. Those are actions you have to do anyway to figure out what is going on, but now you have a place where you can dump everything you find. If you share what you do, you are drafting future documentation and sharing operational experience, among other things. The advantages are hard to overstate.
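If the place you picked is a GitHub repository, opening that first issue can be scripted so it costs nothing in the middle of an investigation. What follows is only a minimal sketch, not a definitive tool: it assumes the GitHub REST API endpoint for creating issues, and the names `build_incident_issue`, `build_request`, `example-org`, `infrastructure` and the notes are all made up for illustration. It also assumes the “incident analysis” label is available in the repository and that you have a token that is allowed to create issues. The script only prepares the request; sending it is left as a comment.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_incident_issue(notes, title="Unknown incident", label="incident analysis"):
    """Return the JSON payload for a new incident issue.

    `notes` is a list of strings: commands you ran, output you saw,
    links to screenshots, anything you would otherwise lose.
    """
    body = "\n\n".join(notes) if notes else "Investigation in progress."
    return {"title": title, "body": body, "labels": [label]}


def build_request(owner, repo, token, payload):
    """Prepare (but do not send) the POST request that creates the issue."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical notes collected while looking at the problem.
    payload = build_incident_issue([
        "Reported via Slack: the staging API answers very slowly.",
        "Ran: kubectl get pods -n staging",
        "Observation: one pod keeps restarting.",
    ])
    # Hypothetical owner/repo; replace <token> with a real one.
    request = build_request("example-org", "infrastructure", "<token>", payload)
    print(request.get_method(), request.full_url)
    print(json.dumps(payload, indent=2))
    # To actually create the issue:
    # urllib.request.urlopen(request)
```

Everything you find afterwards can go in as comments on the same issue, and once you know what actually happened you can rename it from “Unknown incident” to something more descriptive.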
Over time your colleagues will learn how to troubleshoot on their own, or at least see how much effort a random Slack message can generate, so they will be a bit more careful, and you will be able to justify the effort you put into learning how to troubleshoot the system you are writing.