A strategic roadmap for implementing site reliability engineering practices

By Udaykumar Gupta, Vanishree Mahesh

05 Feb, 2025

Insights

SRE is essential for maintaining system stability and minimizing outages in complex cloud environments, enabling organizations to balance feature development with operational reliability.
Organizations face difficulties in adopting SRE due to fragmented approaches driven by tool vendors, short-term focus of internal teams, and a lack of understanding of SRE capabilities.
Key SRE practices such as monitoring, observability, well-defined service level objectives (SLOs), and error budgets are crucial for aligning technical metrics with business goals, ultimately enhancing customer experience.
Infosys's SRE horizon map helps organizations assess their current SRE maturity and prioritize the capabilities to develop, and it provides a clear roadmap for building reliable and resilient systems.

The demands on product development teams to deliver new features while maintaining system stability, particularly in complex distributed cloud environments, remain a significant challenge. Meanwhile, technology and infrastructure executives are under increasing pressure to minimize outages and security breaches, resolve incidents quickly and proactively, and achieve operational efficiencies.

To meet these challenges and ensure the reliability of deployed systems, SRE is emerging as a transformative practice. SRE combines automation, artificial intelligence (AI), and machine learning (ML) to address the critical goal of keeping systems up and running — or as close to 100% uptime as possible.

The challenges in adopting SRE solutions

SRE practices offer solutions to enhance reliability, including automating workflows, enabling self-healing systems, and proactively preventing incidents before they impact customers. Despite these benefits, many chief technology officers (CTOs) and infrastructure leaders struggle to design a roadmap for integrating SRE practices into their organizations. The complexity arises from multiple perspectives:

Tool vendors often push their own proprietary solutions, which can create confusion and lead to fragmented approaches.
Internal teams focus on narrow, short-term problems, making it hard to align the broader organization around a unified strategy.
Lack of understanding about SRE capabilities and their maturity stages hinders the ability to develop a cohesive strategy that scales across the enterprise.

As organizations attempt to transform their operations through automation and predictive capabilities, they face challenges in selecting the right technologies and defining the correct sequence of steps. Most companies are unsure of how to progress toward a systematic adoption of SRE practices. A clear set of guidelines and a framework to assess the current capabilities and the intended outcome will help executives integrate SRE practices into their workflows.

SRE principles and practices

By understanding and applying core foundational practices, SRE teams can align with business objectives and improve service reliability, ultimately enhancing customer experience:

Monitoring and observability: Track metrics, logs, and events to ensure systems operate within expected parameters and to detect anomalies before they escalate into incidents. This provides insights into resource usage and system health, helping identify and address issues proactively to prevent future occurrences.

Best practices for monitoring in SRE include defining clear metrics and thresholds for performance and reliability, automating monitoring and alerts to ensure immediate response to critical issues, ensuring redundancy in tools and infrastructure, and regularly reviewing and updating monitoring configurations to keep them aligned with system changes and business goals.
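As a minimal sketch of what "clear metrics and thresholds" can look like in practice, the Python example below classifies the latest reading of each metric against warning and critical limits. The metric names, limits, and the `Threshold` and `evaluate` names are hypothetical, chosen only for illustration, and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric (higher values are worse)."""
    metric: str
    warning: float
    critical: float


def evaluate(thresholds: list[Threshold], samples: dict[str, float]) -> dict[str, str]:
    """Classify the latest sample of each metric as ok, warning, critical, or no_data."""
    states = {}
    for t in thresholds:
        value = samples.get(t.metric)
        if value is None:
            states[t.metric] = "no_data"  # a silent metric is itself worth alerting on
        elif value >= t.critical:
            states[t.metric] = "critical"
        elif value >= t.warning:
            states[t.metric] = "warning"
        else:
            states[t.metric] = "ok"
    return states


# Hypothetical thresholds and readings, for illustration only.
rules = [
    Threshold("p95_latency_ms", warning=300, critical=800),
    Threshold("error_rate_pct", warning=1.0, critical=5.0),
    Threshold("disk_used_pct", warning=80, critical=95),
]
print(evaluate(rules, {"p95_latency_ms": 420, "error_rate_pct": 0.2}))
```

Keeping thresholds as data rather than scattering them through scripts makes it easier to review and update them as systems and business goals change.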

Well-defined service level objectives and error budgets: An error budget is the maximum amount of time that a technical system can fail without contractual consequences defined in service level agreements (SLAs). It is also a way of measuring how your service level indicator (SLI) has performed against your service level objective (SLO) over a period of time.

By building a mature alerting strategy for SLIs, SLOs, error budgets, and burn rates (the rate at which an error budget is being consumed), you can detect and resolve issues sooner to help avoid missing internal SLOs and your customers’ SLAs. The best practices for SLOs and error budgets include defining realistic SLOs based on user expectations, monitoring service metrics like uptime and latency, and using error budgets to prioritize reliability over new features when needed.
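The arithmetic behind error budgets and burn rates is simple enough to sketch. The Python example below assumes a time-based availability SLO measured over a 30-day window; the function names and the 99.9% target are illustrative assumptions, not values from this article.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO allows over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """Speed at which the budget is being consumed.

    1.0 means the budget runs out exactly at the end of the window;
    10.0 means it runs out in a tenth of the window.
    """
    allowed_error_ratio = 1.0 - slo_target
    return observed_error_ratio / allowed_error_ratio


# Assumed example: a 99.9% SLO while 1% of requests are currently failing.
print(f"budget: {error_budget_minutes(0.999):.1f} minutes per 30 days")
print(f"burn rate: {burn_rate(0.01, 0.999):.1f}x")
```

In this assumed case the budget is about 43.2 minutes per 30 days, and a 1% failure ratio burns it roughly ten times faster than the SLO allows, which is the kind of signal a burn-rate alert is meant to catch early.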

Automate manual tasks to reduce toil: Toil refers to manual, repetitive tasks that are time-consuming and provide little value toward achieving an organization’s objectives. These tasks often drain resources without significantly improving outcomes, making them ideal candidates for automation to enhance efficiency and free up teams for more impactful work.

Reducing toil involves identifying high-toil areas through regular assessments, conducting toil audits to pinpoint repetitive, low-impact tasks, and using monitoring systems to analyze metrics for insights. This helps uncover unnecessary complexity and manual intervention, enabling teams to focus on automation and improve operational efficiency.
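A toil audit can start as a simple ranking of manual tasks by the time they consume. The sketch below is one possible way to do that in Python; the task names and figures are invented for illustration, and ranking by hours per month is an assumption rather than a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManualTask:
    name: str
    runs_per_month: int
    minutes_per_run: float

    @property
    def hours_per_month(self) -> float:
        return self.runs_per_month * self.minutes_per_run / 60.0


def rank_toil(tasks: list[ManualTask], top_n: int = 3) -> list[ManualTask]:
    """Return the tasks that consume the most engineer time, largest first."""
    return sorted(tasks, key=lambda t: t.hours_per_month, reverse=True)[:top_n]


# Hypothetical audit data.
audit = [
    ManualTask("rotate certificates", runs_per_month=2, minutes_per_run=90),
    ManualTask("restart stuck batch job", runs_per_month=40, minutes_per_run=15),
    ManualTask("approve routine deployment", runs_per_month=60, minutes_per_run=5),
]
for task in rank_toil(audit):
    print(f"{task.name}: {task.hours_per_month:.1f} h/month")
```

The tasks at the top of such a list are usually the first candidates for automation.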

Horizon map to evaluate SRE maturity

To implement SRE practices, it’s essential to understand your organization’s current maturity level and the next steps. The horizon map framework provides a way to evaluate SRE maturity, covering automation, predictive capabilities, and incident management sophistication, helping organizations identify their current status and priorities.

The horizon map also helps organizations understand the sequence in which capabilities should be developed. It emphasizes the importance of transitioning from basic monitoring to more advanced, AI-driven automation, predictive analysis, and self-healing systems. This structured progression allows organizations to strategically adopt new technologies and build an effective SRE capability.

Horizon 1

SRE maturity at Horizon 1 (H1) starts with foundational monitoring and observability, which includes infrastructure monitoring of back-end systems, application performance monitoring for both user interactions and synthetic behaviors, and log monitoring to identify specific events or patterns leading to issues. It also involves incident management practices like manual log correlation to analyze data and detect patterns across sources.

At this level, key criteria include defining SLIs/SLOs and establishing a baseline of automation, which covers automating repetitive tasks, such as using automated runbooks with predefined steps and scripts, compliance audits, resource scaling, and alerting systems for detecting and responding to incidents.

With these core practices in place, organizations can advance to Horizon 2 (H2), where further automation and sophisticated observability strategies are implemented.

Horizon 2

In H2, SRE capabilities evolve from basic monitoring to full observability with logs, traces, and metrics. While monitoring alerts you to violations of predefined conditions, observability enables deeper investigation into anomalies using traces and logs to identify root causes.

H2 emphasizes full-stack observability, covering applications, databases, servers, and networks, along with customer journey observability and real-user experience monitoring. Alerting advances to include suppression, with noise reduction, de-duplication, and correlation to manage overwhelming alerts and improve troubleshooting.
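To illustrate de-duplication, the sketch below keeps the first alert for each service and check, and drops repeats that arrive within a suppression window. It is a simplified illustration in Python: the `Alert` fields, the fingerprint of service plus check, and the five-minute window are all assumptions, and real alerting platforms use richer correlation rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since some reference point


def deduplicate(alerts: list[Alert], window_seconds: float = 300.0) -> list[Alert]:
    """Drop repeats of the same (service, check) seen within window_seconds
    of the last alert that was kept for that pair."""
    last_kept: dict[tuple[str, str], float] = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


# Hypothetical alert stream: three repeats of one alert, one unrelated alert,
# and a later recurrence outside the suppression window.
stream = [
    Alert("api", "latency", 0),
    Alert("db", "disk", 30),
    Alert("api", "latency", 60),
    Alert("api", "latency", 120),
    Alert("api", "latency", 400),
]
print(len(stream), "alerts received,", len(deduplicate(stream)), "kept")
```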

Chaos experiments simulate failures and stress scenarios, assessing system resilience under unpredictable conditions and revealing potential weaknesses.
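To make the idea of a failure-injection experiment concrete, the toy Python sketch below wraps a dependency call so that it fails at random, then measures how many requests still succeed with and without retries. Everything here is an assumption for illustration (the 20% injected failure rate, the retry count, and the function names); real chaos tooling injects faults at the infrastructure or network level rather than inside a function.

```python
import random


def inject_failures(call, failure_rate: float, rng: random.Random):
    """Wrap a call so that it raises ConnectionError at the given rate."""
    def wrapped():
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return call()
    return wrapped


def with_retries(call, attempts: int):
    """Retry a call up to `attempts` times (attempts must be at least 1)."""
    def wrapped():
        last_error = None
        for _ in range(attempts):
            try:
                return call()
            except ConnectionError as error:
                last_error = error
        raise last_error
    return wrapped


def run_experiment(call, attempts: int, requests: int = 1000,
                   failure_rate: float = 0.2, seed: int = 42) -> float:
    """Return the fraction of requests that succeed under injected failures."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    resilient = with_retries(inject_failures(call, failure_rate, rng), attempts)
    successes = 0
    for _ in range(requests):
        try:
            resilient()
            successes += 1
        except ConnectionError:
            pass
    return successes / requests


print("success rate without retries:", run_experiment(lambda: "ok", attempts=1))
print("success rate with 3 attempts:", run_experiment(lambda: "ok", attempts=3))
```

The comparison shows the purpose of such an experiment: it tests a hypothesis about resilience (here, that retries mask most transient failures) instead of assuming it.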

The capability to correlate logs to events or traces enhances observability by linking detailed logs with higher-level events, providing a full view of system behavior. This enables teams to resolve issues faster by understanding the context behind each log entry. A US-based media and entertainment provider leveraged Datadog for tracing and correlation, allowing developers and SRE teams to access complete problem context in a single view, reducing their mean time to recovery by over 30%.
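One common way to link logs to traces is to stamp every log entry with the identifier of the trace it belongs to, then group by that identifier. The Python sketch below assumes log entries are dictionaries with hypothetical `trace_id`, `timestamp`, `service`, and `message` fields; it illustrates the idea only and is not the API of Datadog or any other product.

```python
from collections import defaultdict


def correlate_logs_by_trace(logs: list[dict]) -> dict[str, list[dict]]:
    """Group log entries by trace ID, ordered by time within each trace."""
    by_trace = defaultdict(list)
    for entry in logs:
        trace_id = entry.get("trace_id")
        if trace_id:  # entries without a trace ID cannot be correlated
            by_trace[trace_id].append(entry)
    for entries in by_trace.values():
        entries.sort(key=lambda e: e["timestamp"])
    return dict(by_trace)


# Hypothetical log entries from three services handling two requests.
logs = [
    {"timestamp": 3, "service": "payments", "trace_id": "t-1", "message": "card declined"},
    {"timestamp": 1, "service": "gateway", "trace_id": "t-1", "message": "POST /checkout"},
    {"timestamp": 2, "service": "gateway", "trace_id": "t-2", "message": "GET /catalog"},
    {"timestamp": 2, "service": "orders", "trace_id": "t-1", "message": "order created"},
    {"timestamp": 4, "service": "cron", "message": "cleanup finished"},
]
for entry in correlate_logs_by_trace(logs)["t-1"]:
    print(entry["timestamp"], entry["service"], entry["message"])
```

Reading the entries for one trace in order gives the "complete problem context in a single view" that the passage describes, without searching each service's logs separately.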

AI and ML are introduced at a basic level for anomaly detection, triaging, and self-healing, paving the way for Horizon 3 (H3). Additionally, some processes such as managing knowledgebase articles and compliance are treated as code, to accelerate workflows, reduce errors, and implement software best practices such as version control, automated testing, and deployment. Making knowledgebases accessible to AI can help deploy generative AI capabilities for contextually relevant searches.
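Basic anomaly detection does not have to begin with a trained model. As a deliberately simple statistical stand-in for the ML-based detection mentioned above, the Python sketch below flags points that deviate sharply from a rolling baseline of recent values. The window size, the threshold of three standard deviations, and the sample data are all assumptions.

```python
import statistics


def rolling_zscore_anomalies(values: list[float], window: int = 5,
                             threshold: float = 3.0) -> list[int]:
    """Return indexes of points more than `threshold` standard deviations
    away from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev == 0:
            continue  # a flat baseline gives no scale to compare against
        if abs(values[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
print(rolling_zscore_anomalies(latencies))  # -> [9]
```

Comparing each point with a recent baseline, rather than with a fixed limit, lets the detector adapt as normal behavior drifts over time.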

Horizon 3

H3 represents the highest level of SRE maturity, focusing on driving digital reliability through extensive automation powered by predictive AI/ML models, generative AI, and agentic systems. For example, in the event of a sudden traffic spike, agentic AI can autonomously scale infrastructure, adjust resources, and implement cost-saving measures such as shutting down idle instances or shifting workloads to more cost-effective environments, all without human intervention. Additionally, generative AI is leveraged extensively for tasks like issue summarization, triage, root cause analysis, and workload prediction.

In this phase, more processes are managed systematically through code, including observability-as-code, policy-as-code (PaC), and service mapping-as-code. An example of this is a large global card and payment financial services corporation that automated over 30 regulatory control and risk assessment processes using a PaC approach. This automation meant it could deliver more than 10 software releases per day, with a release vehicle that maintained a change failure rate of less than 1%.
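In a policy-as-code approach, controls are written as executable rules that a pipeline evaluates on every release, rather than as documents that people check by hand. The Python sketch below shows the general shape under stated assumptions: the three policies and the release fields are hypothetical, and production setups typically use a dedicated policy engine instead of plain functions.

```python
from typing import Callable

Policy = tuple[str, Callable[[dict], bool]]

# Hypothetical controls. Missing fields fail closed, so an incomplete
# release record is treated as a violation rather than waved through.
POLICIES: list[Policy] = [
    ("change ticket approved", lambda r: r.get("change_ticket_status") == "approved"),
    ("all tests passed", lambda r: r.get("tests_failed", 1) == 0),
    ("no critical vulnerabilities", lambda r: r.get("critical_vulnerabilities", 1) == 0),
]


def violations(release: dict, policies: list[Policy] = POLICIES) -> list[str]:
    """Return the names of the policies that the release does not satisfy."""
    return [name for name, rule in policies if not rule(release)]


release = {
    "change_ticket_status": "approved",
    "tests_failed": 0,
    "critical_vulnerabilities": 2,
}
print(violations(release))  # -> ['no critical vulnerabilities']
```

Because the rules live in version control alongside the software, they can be reviewed, tested, and audited like any other code change.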

AI-driven alert suppression and error-budget-based release gating further enhance efficiency. Release gating controls when and how software releases are deployed based on predefined criteria such as error budgets or performance thresholds. Together, these capabilities support better management of resources, streamlined operations, and proactive incident resolution. A multinational investment bank and financial services company defined and implemented error-budget-based release gating for its tier 1 applications (critical, high-priority applications essential to business operations). This strategy led to a significant improvement in key SLO adherence, raising it from 95% to 99%.
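An error-budget-based release gate can be reduced to a single check in the deployment pipeline: how much of the budget is left, and is that enough to take on the risk of a release? The Python sketch below assumes a request-based SLO and a rule that blocks releases once less than 20% of the budget remains. Both the rule and the function names are illustrative assumptions, not the method used by the company described above.

```python
def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_events * (1.0 - slo_target)
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # No traffic, or a 100% target: any failure exhausts the budget.
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


def release_allowed(slo_target: float, good_events: int, total_events: int,
                    min_budget_remaining: float = 0.2) -> bool:
    """Gate a release on the share of error budget that is left."""
    return budget_remaining(slo_target, good_events, total_events) >= min_budget_remaining


# Assumed example: a 99% SLO over 100,000 requests allows about 1,000 failures.
print(release_allowed(0.99, good_events=99_500, total_events=100_000))  # half the budget left
print(release_allowed(0.99, good_events=99_100, total_events=100_000))  # a tenth of the budget left
```

With these assumed numbers the first release proceeds and the second is held back, which shifts the team's effort from features to reliability until the budget recovers.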

Knowledgebase-as-code, introduced in H2, is augmented with generative AI to search beyond simple keywords for a more contextual retrieval and also suggest or trigger automated actions or modifications based on the search query. Also, agentic AI systems are deployed for end-to-end alert, incident and IT operations workflow automation.

An evolution in H3 is the orchestration of chaos experiments in production. These experiments are now systematically implemented in live environments, allowing SRE teams to simulate real-world failure scenarios under production conditions. One multinational chain of coffeehouses and roasteries adopted chaos experiments in production on its store systems platform running on a Kubernetes cluster architecture. This proactive approach helped improve platform resiliency and resulted in an increase in availability — from 99.5% to 99.95%.
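Availability percentages are easier to compare when converted to downtime. The short Python calculation below does that conversion for the two figures quoted above, assuming a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (100.0 - availability_pct) / 100.0


for pct in (99.5, 99.95):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} h/year")
# 99.5% -> 43.80 h/year
# 99.95% -> 4.38 h/year
```

Moving from 99.5% to 99.95% therefore cuts the implied downtime tenfold, from almost two days a year to under four and a half hours.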

Another example of implementing H3 capabilities can be seen with a financial company that invested heavily in observability and alert management. Its target state includes automating most incident management processes and ensuring that critical incidents are resolved within two hours. By using tools like anomaly detection and full-stack observability, it aims to improve operational efficiency and reach that target state, in which 99% of incidents are detected and resolved proactively.

SRE platform reference architecture

Technology leaders should adopt a platform approach to bring together H1, H2, and H3 capabilities to drive consistent adoption of SRE practices and patterns. Infosys is supporting clients to boost their SRE capabilities by helping them put together a robust platform (Figure 1) built on a modern software toolchain and industry best practices, such as the solutions offered by Infosys Cobalt and Topaz.

Figure 1. SRE platform reference architecture

Source: Infosys

SRE implementation guidelines

In today’s rapidly evolving technology landscape, adopting modern SRE practices is not just a competitive advantage — it is essential for staying ahead. A well-defined roadmap, based on a tool-agnostic view of SRE capabilities, will ensure that infrastructure leaders can navigate this transformation successfully.

CTOs and infrastructure leaders can leverage the horizon map to develop a pragmatic roadmap for adopting SRE practices based on their organization's needs and current state of maturity. The roadmap should focus on key problem areas and outline a clear sequence of actions to move toward an optimized, automated infrastructure.

To develop a successful SRE adoption strategy, leaders should consider the following steps:

Evaluate current capabilities: Use the horizon map to assess where the organization stands in terms of SRE maturity. Is it still in the early stages, relying heavily on manual intervention, or has it implemented some level of automation?
Identify key problem areas: Pinpoint the most critical challenges that need to be addressed. These could include long incident resolution times, inefficient operations, or a lack of predictive capabilities. Understanding these pain points will help prioritize efforts and investments.
Target state vision: Define the ideal state for the organization. For example, in a target state, 99% of incidents should be detected and resolved by the system before they even reach customers. Incident resolution times should be reduced to less than two hours for critical issues, and Level 1 support should be largely automated or eliminated altogether.
Actionable roadmap: Build a step-by-step plan to move from the current state to the target state. This should include both short-term actions such as improving observability or automating incident alerts, and long-term goals such as standardizing the technology stack to reduce variation and technical debt, implementing AI-driven anomaly detection and issue prediction, as well as self-healing systems.
Leverage modern tools in a platform approach: Invest in advanced tools for full-stack observability, anomaly detection, real-user monitoring and generative AI and agentic capabilities. These tools will help identify and resolve issues before customers are even aware of them, improving both operational efficiency and customer satisfaction. Develop platforms to drive consistent adoption of SRE practices and patterns.
Transform the culture: Ensure there are no silos in operations — cross-team synergy is one of the key expectations. Define objectives, key results, and key performance indicators to drive clear accountability.
Benchmark and iterate: Continuously compare performance against industry benchmarks and competitors. As customer expectations evolve, so too should the organization’s SRE capabilities. For example, response times that were once acceptable might no longer meet customer standards, requiring further refinement and investment.
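The first step above, evaluating current capabilities, can be approximated with a simple checklist. The Python sketch below is one possible way to do that: it groups a handful of the capabilities named in this article by horizon and reports the first horizon that still has gaps. The capability identifiers and the rule that a horizon counts as achieved only when every listed capability is in place are assumptions made for illustration, not part of the horizon map itself.

```python
# Illustrative subset of capabilities per horizon, drawn from the text above.
HORIZON_CAPABILITIES = {
    "H1": ["infrastructure_monitoring", "application_performance_monitoring",
           "log_monitoring", "slis_and_slos_defined", "automated_runbooks"],
    "H2": ["full_stack_observability", "alert_suppression", "log_trace_correlation",
           "chaos_experiments", "basic_anomaly_detection"],
    "H3": ["policy_as_code", "observability_as_code", "error_budget_release_gating",
           "chaos_in_production", "agentic_workflow_automation"],
}


def assess(implemented: set[str]) -> dict:
    """Report the highest fully achieved horizon and the gaps in the next one."""
    achieved = None
    for horizon, capabilities in HORIZON_CAPABILITIES.items():
        missing = [c for c in capabilities if c not in implemented]
        if missing:
            return {"achieved": achieved, "working_toward": horizon, "gaps": missing}
        achieved = horizon
    return {"achieved": achieved, "working_toward": None, "gaps": []}


# Hypothetical organization: all of H1 in place, plus one H2 capability.
current = set(HORIZON_CAPABILITIES["H1"]) | {"full_stack_observability"}
print(assess(current))
```

The list of gaps gives a starting point for the roadmap, which can then be prioritized against the key problem areas identified in the second step.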