Post mortems are short summaries of incidents. They typically describe why the incident happened, its estimated cost, and how to prevent similar incidents in the future.
The best teams write and share a post mortem after every significant incident.
To write a post mortem, add a comment containing the post mortem to the incident, like in the example below:
While writing the post mortem, you can use Markdown to format it nicely.
Here is an example of a post mortem written in Markdown:
## Summary
On April 27th, at 2:00 PM, users reported that they were unable to access the company's website. The application server had crashed due to a segmentation fault caused by a memory leak in a recent code change. The IT team was able to resolve the issue by rolling back the code change and restarting the application server. The system was back online by 2:30 PM. The estimated costs of the incident include lost productivity and revenue, as well as an impact on at least 500 users.
## Why this happened
The root cause of the incident was a recent code change that introduced a memory leak, causing the application server to crash. The memory leak was not caught during the testing phase and was deployed to production without proper validation. As a result, the application server consumed more and more memory until it crashed, causing the system outage.
## Estimated costs
The incident caused a 30-minute downtime, resulting in lost productivity and revenue. It is estimated that the outage affected at least 500 users.
## Action Plan
To prevent similar incidents in the future, the following actions will be taken:
- Implement a more rigorous code review process so that any change to the application is thoroughly validated before it is deployed to production.
- Implement automated testing to catch memory leaks and other potential issues before they reach production, using memory profiling tools and automated testing frameworks (see the first sketch after this list).
- Monitor the application server's memory usage and set up automated alerts that notify the IT team when it exceeds a defined threshold (see the second sketch after this list).
- Consider implementing redundancy or failover systems, such as load balancing and high-availability setups, so that the system remains available even in the event of a hardware or software failure.
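As a rough illustration of the automated leak testing mentioned above, the sketch below uses Python's built-in `tracemalloc` module to flag suspicious memory growth across repeated calls. The `process_request` function, the iteration count, and the 5 MB limit are placeholders, not part of the original incident; adapt them to your own code and tolerances.

```python
import tracemalloc

# Placeholder for the code path under test; replace with your own function.
def process_request():
    return [0] * 1000

def test_no_memory_growth(iterations=1000, limit_bytes=5 * 1024 * 1024):
    """Fail if repeated calls keep allocating memory (a likely leak)."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()

    for _ in range(iterations):
        process_request()

    after = tracemalloc.take_snapshot()
    tracemalloc.stop()

    # Sum the net allocation difference between the two snapshots.
    growth = sum(stat.size_diff for stat in after.compare_to(before, "lineno"))
    assert growth < limit_bytes, f"Possible memory leak: grew by {growth} bytes"

if __name__ == "__main__":
    test_no_memory_growth()
    print("No significant memory growth detected.")
```

A test like this can run in CI so that a leaky change fails the build instead of reaching production.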
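The second sketch shows one way to watch memory usage and alert on a threshold, assuming the third-party `psutil` package is available. The threshold, check interval, and `notify_it_team` helper are illustrative; in practice most teams would configure this in their existing monitoring stack rather than hand-roll a script.

```python
import time
import psutil  # third-party: pip install psutil

MEMORY_THRESHOLD_PERCENT = 85   # illustrative threshold
CHECK_INTERVAL_SECONDS = 60

def notify_it_team(message: str) -> None:
    # Placeholder: wire this up to your paging or chat tool.
    print(f"ALERT: {message}")

def watch_memory() -> None:
    """Poll system memory usage and alert when it crosses the threshold."""
    while True:
        usage = psutil.virtual_memory().percent
        if usage >= MEMORY_THRESHOLD_PERCENT:
            notify_it_team(
                f"Memory usage at {usage:.1f}% (threshold {MEMORY_THRESHOLD_PERCENT}%)"
            )
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    watch_memory()
```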
## Follow Up
The IT team will follow up on the action plan to ensure that it is implemented effectively. The team will also review the incident with stakeholders to discuss any lessons learned and identify additional areas for improvement. Finally, the IT team will continue to monitor the system to ensure that it remains stable and available to users.
Here's a shorter example of a post mortem:
# Post-Mortem: Web App Server Outage Incident
**Summary:** On January 3, 2023, from 10:00 am to 11:45 am, our web app experienced a server outage, impacting over 1000 users. The root cause was identified as a hardware failure on the primary server.
**Timeline:**
- 10:00 am: Increased error rates and declining server response times triggered an alert.
- 10:15 am: Web app server became unresponsive, causing a complete outage.
- 10:20 am: Incident reported, and investigations initiated.
- 10:30 am: Root cause identified as a hardware failure.
- 10:35 am: Attempts to resolve the issue through server restart unsuccessful.
- 10:45 am: Switched to backup server and redirected traffic to restore service.
- 11:00 am: Service fully restored, users regained access.
- 11:45 am: Post-incident analysis conducted.
**Root Cause:** Hardware failure on the primary server led to unresponsiveness and service outage.
**Resolution:** Switched to backup server and redirected traffic to restore service.
**Impact:** Over 1000 users experienced disruptions, unable to access the web app during the outage.
**Lessons Learned:**
1. Ensure robust redundancy and failover systems for quick server switching during failures.
2. Enhance monitoring and alerting mechanisms to detect hardware failures and abnormal server behavior promptly.
3. Establish clear communication channels and escalation procedures for efficient incident response and user updates.
**Future Actions:**
1. Investigate the hardware failure and implement preventive measures.
2. Enhance monitoring system to detect server issues proactively.
3. Implement a robust failover mechanism for seamless server switching.
4. Review and update incident response procedures for improved communication.
We apologize for the inconvenience caused and appreciate your understanding. Our team remains committed to providing a reliable user experience.
Need help with Markdown? Check out the Markdown cheat sheet to create the best post mortem out there.
It takes less than a minute to set up your first monitor.