Updated on March 7, 2023
Mean time to recovery, also known as mean time to restore, refers to the average duration it takes to recover from a product or system failure. It is a crucial metric in incident management as it indicates how quickly downtime incidents are resolved and systems are brought back to operational status.
While MTTR commonly stands for mean time to recovery, it can also represent other metrics within the incident management process. To avoid misunderstandings, it is recommended to either use the full names or clearly specify which metric is being referred to. The other three meanings of MTTR are mean time to respond, mean time to repair, and mean time to resolve.
MTTR is calculated by summing up the time taken to recover from all incidents and dividing it by the total number of incidents.
For example, if a system was down for a total of 20 minutes across two separate incidents during a week, the MTTR for that week would be 20 / 2 = 10 minutes.
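The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mean_time_to_recovery` and the three downtime values are assumptions made up for the example.

```python
from datetime import timedelta


def mean_time_to_recovery(downtimes):
    """Return total downtime divided by the number of incidents."""
    if not downtimes:
        raise ValueError("at least one incident is required")
    # sum() needs a timedelta start value to add timedelta objects.
    return sum(downtimes, timedelta()) / len(downtimes)


# Hypothetical downtimes for three incidents in one week.
week = [timedelta(minutes=30), timedelta(minutes=45), timedelta(minutes=15)]

mttr = mean_time_to_recovery(week)
print(mttr)  # 0:30:00
```

Here the total downtime is 90 minutes over three incidents, so the result is 30 minutes. An empty list raises an error rather than returning zero, since an average over no incidents is undefined.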
While MTTR is a commonly used metric in incident management, it has limitations. It provides a high-level overview of the entire incident management process but does not offer insights into the specific areas that consume the most time. Without more detailed data, it is challenging to identify areas for improvement.
To overcome this limitation, it is necessary to use additional metrics that focus on specific parts of the process.
Mean time to respond measures the average time from the moment the first alert is received until the system is recovered, so it excludes any delay in the alerting system. The difference between mean time to recovery and mean time to respond is therefore the average time it takes for an alert to be received after a failure begins.
To calculate mean time to respond, add up the time taken to respond to all incidents and divide it by the total number of incidents.
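The relationship between mean time to recovery and mean time to respond can be illustrated with timestamps. The sketch below assumes each incident records when the failure began, when the first alert arrived, and when service was restored. The `Incident` class, its field names, and the sample times are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime    # failure begins (assumed to be known afterwards)
    alerted_at: datetime    # first alert received
    recovered_at: datetime  # system operational again


def mean(deltas):
    """Average a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical incidents.
incidents = [
    Incident(datetime(2023, 3, 1, 10, 0), datetime(2023, 3, 1, 10, 5),
             datetime(2023, 3, 1, 10, 35)),
    Incident(datetime(2023, 3, 3, 14, 0), datetime(2023, 3, 3, 14, 3),
             datetime(2023, 3, 3, 14, 23)),
]

# Recovery counts from the start of the failure, respond from the first alert.
recovery = mean([i.recovered_at - i.started_at for i in incidents])
respond = mean([i.recovered_at - i.alerted_at for i in incidents])
alert_lag = recovery - respond

print(recovery)   # 0:29:00
print(respond)    # 0:25:00
print(alert_lag)  # 0:04:00
```

With these sample values, the four-minute gap between the two averages is the average time the alerting system took to raise the first alert.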
Mean time to repair measures the average time it takes to repair a system. Unlike mean time to respond, it starts counting from the beginning of the incident repair process.
To calculate mean time to repair, add up the time taken to repair all incidents and divide it by the total number of incidents.
Mean time to resolve measures the average time it takes to fully resolve a product or service failure. An incident counts as resolved only once its cause has been identified and fixed, preventing similar incidents in the future.
To calculate mean time to resolve, add up the time taken to resolve all incidents and divide it by the total number of incidents.
Mean time to acknowledge measures the average time it takes for the responsible team to acknowledge an incident from the moment the alert is triggered. It reflects the team's responsiveness and the effectiveness of the alerting system.
By using these metrics in combination, a more comprehensive understanding of the incident management process can be gained, enabling targeted improvements and optimizations.
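As a rough sketch of how these metrics might be computed side by side, the Python example below derives each one from a per-incident timeline. The dictionary keys, the `METRICS` mapping, and the timestamps are hypothetical. It also assumes that repair ends at the moment of recovery and that resolve is counted from the start of the failure, which the definitions above do not state explicitly.

```python
from datetime import datetime, timedelta

# Hypothetical timelines; the key names are illustrative only.
incidents = [
    {
        "started": datetime(2023, 3, 1, 10, 0),
        "alerted": datetime(2023, 3, 1, 10, 5),
        "acknowledged": datetime(2023, 3, 1, 10, 9),
        "repair_started": datetime(2023, 3, 1, 10, 15),
        "recovered": datetime(2023, 3, 1, 10, 35),
        "resolved": datetime(2023, 3, 1, 12, 0),
    },
    {
        "started": datetime(2023, 3, 3, 14, 0),
        "alerted": datetime(2023, 3, 3, 14, 3),
        "acknowledged": datetime(2023, 3, 3, 14, 5),
        "repair_started": datetime(2023, 3, 3, 14, 8),
        "recovered": datetime(2023, 3, 3, 14, 23),
        "resolved": datetime(2023, 3, 3, 15, 0),
    },
]

# Each metric is the mean of (end - start) over all incidents.
# Assumption: repair ends at recovery; resolve starts at the failure.
METRICS = {
    "acknowledge": ("alerted", "acknowledged"),
    "respond": ("alerted", "recovered"),
    "repair": ("repair_started", "recovered"),
    "recovery": ("started", "recovered"),
    "resolve": ("started", "resolved"),
}


def mean_times(incidents):
    """Return a dict mapping metric name to its mean duration."""
    result = {}
    for name, (start, end) in METRICS.items():
        total = sum((i[end] - i[start] for i in incidents), timedelta())
        result[name] = total / len(incidents)
    return result


report = mean_times(incidents)
for name, value in report.items():
    print(f"mean time to {name}: {value}")
```

Comparing the values shows where the time goes. In this made-up data, acknowledgement averages 3 minutes and repair 17.5 minutes, so the repair phase would be the more promising place to look for improvements.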