Exploring Site Reliability Engineering Testing with OpenTelemetry

This whitepaper explores modern approaches, such as Site Reliability Engineering (SRE), to managing application performance and stability while responding to changing business needs.

Insights

  • In a constantly evolving technological landscape, striking the right balance between control, flexibility, and risk in managing IT environments has become a prime focus for enterprises.
  • This white paper examines Site Reliability Engineering (SRE) as a leading quality engineering (QE) practice and explains how it helps mitigate operational risks effectively.
  • The white paper explains how observability data plays a crucial role in SRE testing, why OpenTelemetry is the next important development in the observability space, and how SRE testing can leverage it to improve the quality of testing deliverables.
  • The white paper also includes reference use cases that utilize OpenTelemetry in SRE.

Introduction

During the global pandemic, more enterprises started moving their assets to the cloud, which helped them become more cost-efficient and agile in the way they operate their businesses. These cloud migrations are forcing enterprises to reevaluate their site reliability practices and investments. Reliability and resilience are more difficult to maintain in the cloud, and many enterprises have realized this as their lift-and-shift migration tactics hampered their ability to meet their availability and performance objectives.

Observability tools were born out of sheer necessity when traditional tools and debugging methods could not identify what software did in production. Observability refers to the ability to gain insights into the internal state of a system. It provides valuable information on when and why errors occur within a system. For SRE testing, this actionable data is crucial to ensure the security and reliability of applications.

Effective observability can help identify issues such as:

  • Low-performing CPUs or physical memory constraints.
  • Inefficient code, memory leaks, or deadlocks.
  • Poor database performance due to overuse or underuse of indexes.
  • Sudden spikes in concurrent users leading to long load times.
  • Third-party services used by the application impacting performance.

SRE testing challenges with modern observability

SRE testing teams currently face several challenges with their observability setups, which reduce their efficiency in managing both production and non-production issues.

  • Lack of a unified solution: Telemetry data plays a crucial role in helping SRE teams understand the behavior and performance of their systems. To gain a comprehensive understanding of their services and applications, teams need to instrument all their frameworks and libraries across different programming languages. However, no single instrument or tool offered by commercial vendors can collect data from all applications within an organization. This lack of a unified solution leads to data silos and other ambiguities, making it harder to troubleshoot and resolve performance issues and allowing data defects to slip into real-time environments.
  • Tons of telemetry data: Data collection plays a pivotal role in cloud-native applications, but many organizations are not adequately prepared for the volume of telemetry these applications generate.
  • Instrumentation is toil: Instrumentation is an ongoing requirement that involves significant effort. Each tool has its own instrumentation process, so adopting a new tool means repeating the work, including setting up the agent. Moreover, existing instrumentation is not compatible across vendors and products, which leads to a lack of context because not every service is instrumented with every tool.

Gartner expects that by 2026, 70% of organizations successfully applying observability will achieve shorter latency for decision making, enabling competitive advantage for target business or IT processes.

Enterprises can mitigate these challenges by adopting OpenTelemetry as their observability solution. OpenTelemetry's mission is to enable effective observability by making high-quality, portable telemetry ubiquitous. OpenTelemetry does three important things:

  • It allows enterprises to own the data that they generate.
  • It frees organizations from being locked into any specific proprietary data format or tool.
  • It allows engineers to learn a single set of APIs and conventions.

Combined, these three things give teams and organizations the flexibility they need in today's computing environments.

What is OpenTelemetry?

OpenTelemetry addresses the current challenges with observability by changing how telemetry data is managed. It gives SRE testing teams a handle on enterprise telemetry and the comprehensive visibility they need to improve their observability practices. It provides tools to collect data from across the technology stack without getting bogged down in tool-specific deliberations.

It is an open-source project that aims to provide a unified standard for collecting, processing, and exporting telemetry data from distributed systems. This solution is designed to be vendor-agnostic and flexible, allowing users to choose the tools and platforms that best meet their needs.

Figure 1: Observability with OpenTelemetry

To set up observability with OpenTelemetry, SRE testing teams instrument the application code (Java, .NET, Python, etc.) with OpenTelemetry client libraries that generate telemetry data such as metrics, events, logs, and traces (MELT). Once generated, the telemetry data can be viewed directly in the enterprise observability or monitoring setup, or it can be sent to an OpenTelemetry collector.
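To make the idea of instrumentation concrete, the following is a minimal sketch of what hand-written instrumentation records for one unit of work: a name, a trace identifier, attributes, a status, and a duration. It uses only the Python standard library and is a conceptual model, not the OpenTelemetry API. The names `start_span`, `FINISHED_SPANS`, and `place_order` are hypothetical and exist only for illustration.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink standing in for an exporter or collector endpoint.
FINISHED_SPANS = []


@contextmanager
def start_span(name, trace_id=None, **attributes):
    """Record one timed unit of work (a "span") with attributes and a status."""
    span = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # 32 hexadecimal characters
        "attributes": dict(attributes),
        "status": "OK",
    }
    started = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        # Mark the span as failed, then let the error propagate to the caller.
        span["status"] = "ERROR"
        span["attributes"]["exception.type"] = type(exc).__name__
        raise
    finally:
        # Runs on success and on failure, so every span gets a duration.
        span["duration_ms"] = (time.perf_counter() - started) * 1000.0
        FINISHED_SPANS.append(span)


def place_order(order_id, quantity):
    """Hypothetical business function instrumented by hand."""
    with start_span("place_order", order_id=order_id) as span:
        span["attributes"]["quantity"] = quantity
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        return quantity * 2
```

In a real setup the OpenTelemetry client library for the language would supply the span object and hand finished spans to an exporter; the sketch only shows the kind of data that instrumentation produces.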

The collector consists of three main components (receivers, processors, and exporters) that together create a standardized, uniform data pipeline from various sources and applications.

  • Receivers are used to get data into the collector.
  • Processors are used to do any processing required on the collected data, like data massaging, data manipulation, or any change in the data as it flows through the collector.
  • Exporters are used to export data to an observability backend.

The data collected in the observability backend can then be analyzed to understand overall system behavior.
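The receiver, processor, and exporter roles can be sketched as a small pipeline. This is a conceptual model in plain Python, not the collector's implementation or its configuration format. The record fields, the `/healthz` route, the `staging` environment label, and the `InMemoryExporter` class are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Processor = Callable[[List[Record]], List[Record]]


def receive(batches: Iterable[List[Record]]) -> List[Record]:
    """Receiver role: bring records from every source into one stream."""
    return [record for batch in batches for record in batch]


def drop_health_checks(records: List[Record]) -> List[Record]:
    """Processor role: filter out noisy records (hypothetical /healthz route)."""
    return [r for r in records if r.get("route") != "/healthz"]


def add_environment(environment: str) -> Processor:
    """Processor role: enrich every record with a common attribute."""
    def processor(records: List[Record]) -> List[Record]:
        return [{**r, "environment": environment} for r in records]
    return processor


class InMemoryExporter:
    """Exporter role: stand-in for a client of an observability backend."""

    def __init__(self) -> None:
        self.sent: List[Record] = []

    def export(self, records: List[Record]) -> None:
        self.sent.extend(records)


def run_pipeline(batches, processors, exporters) -> List[Record]:
    """Receive, apply each processor in order, then fan out to all exporters."""
    records = receive(batches)
    for process in processors:
        records = process(records)
    for exporter in exporters:
        exporter.export(records)
    return records


# Usage: two services written in different languages feed one pipeline.
java_batch = [
    {"service": "cart", "route": "/checkout", "duration_ms": 120},
    {"service": "cart", "route": "/healthz", "duration_ms": 1},
]
python_batch = [{"service": "search", "route": "/query", "duration_ms": 45}]
backend = InMemoryExporter()
exported = run_pipeline(
    [java_batch, python_batch],
    [drop_health_checks, add_environment("staging")],
    [backend],
)
```

Because exporters sit at the end of the pipeline, adding a second backend in this model means adding one more exporter to the list while the sources and processors stay unchanged.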

Features of OpenTelemetry

  • Unified observability solution
    OpenTelemetry offers a standardized format for telemetry data collection, allowing organizations to consolidate their observability data into a unified solution across different application technology stacks and underlying platforms with diverse monitoring tools.
  • Monitoring tools are decoupled from data collection
    OpenTelemetry offers a range of tools, such as the OTel collector, which enables the aggregation, enrichment, filtering, and processing of telemetry data. This helps organizations decouple their monitoring solutions from the actual collection of observability data.
  • Observability solutions become vendor neutral
    OpenTelemetry helps organizations integrate with multiple monitoring and observability platforms. This gives organizations the freedom to choose and switch between observability solutions without being locked into a specific vendor or proprietary solution.
  • Flexibility and control in data collection
    OpenTelemetry gives organizations control over what telemetry data is sent to data collection platforms. This helps them capture only the information they need, reducing unnecessary noise and the excess costs associated with it. This flexibility allows organizations to adapt their observability setups to their specific needs and requirements, optimizing data flow and reducing the resource impact on instrumented applications.
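As an illustration of controlling what is sent, the sketch below combines two common techniques: sampling a fixed fraction of traces based on the trace ID, and keeping only attributes on an allow-list. It is a simplified model and does not claim to reproduce the samplers or processors that OpenTelemetry ships. The function names, the span layout, and the attribute names are assumptions.

```python
from typing import Dict, List, Set

MAX_64 = 1 << 64


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the trace ID alone whether to keep a trace.

    Deriving the decision from the ID (rather than a random draw) means
    every service that sees the same trace reaches the same decision.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    low_64_bits = int(trace_id_hex, 16) & (MAX_64 - 1)
    return low_64_bits < round(ratio * MAX_64)


def select_for_export(
    spans: List[Dict], ratio: float, keep_attributes: Set[str]
) -> List[Dict]:
    """Keep sampled spans and strip attributes that are not on the allow-list."""
    selected = []
    for span in spans:
        if not should_sample(span["trace_id"], ratio):
            continue
        attributes = {
            key: value
            for key, value in span["attributes"].items()
            if key in keep_attributes
        }
        selected.append({**span, "attributes": attributes})
    return selected
```

Lowering `ratio` reduces data volume and cost at the price of less detail, while the allow-list keeps sensitive or low-value attributes from leaving the application.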

OpenTelemetry Adoption

Enterprises that are considering implementing OpenTelemetry in their environments can follow these tips for better results:

  • Broaden project goals beyond mere instrumentation and telemetry generation. Prioritize observability and focus on the capabilities you aim to unlock using telemetry data. For instance, consider implementing observability-driven development practices to detect performance issues early in the software development lifecycle.
  • Engage with all key stakeholders to develop project objectives and document them in a comprehensive project charter. Keep the scope limited and break down long-term vision into smaller, manageable projects.
  • Create a solid list of test use cases, particularly if transitioning from an existing instrumentation solution.
  • Avoid creating data silos by ensuring that metrics, traces, and logs are not sent to separate, disconnected tools.
  • Familiarize yourself with the current limitations of OpenTelemetry and leverage mixed instrumentation capabilities by integrating with platforms such as New Relic.
  • Recognize that OpenTelemetry extends beyond being just a specification. It also offers official implementations for various popular programming languages, including .NET, C++, Erlang/Elixir, Go, Java, JavaScript, PHP, Python, Ruby, Rust, and Swift. It is worth noting that these implementations continue to evolve, so consult the compliance matrix to determine the progress of your preferred language.
  • Explore both the manual and automatic forms of instrumentation provided by OpenTelemetry, both of which use the API (Application Programming Interface) to generate data.

Following these guidelines lays the foundation for a well-structured plan and a successful adoption of OpenTelemetry in the enterprise environment.

Other Use cases

By effectively utilizing OpenTelemetry in SRE testing operations, we could revolutionize observability in certain industries. A few of the use cases are discussed below:

Use cases

Conclusion

Site Reliability Engineering Testing and OpenTelemetry are powerful tools for building reliable and performant systems. By using the principles of SRE testing and the capabilities of OpenTelemetry, engineers can create systems that are resilient, scalable, and efficient. Implementing SRE testing and OpenTelemetry requires a significant investment in time and resources, a cultural shift towards prioritizing reliability over new features, and a willingness to adopt new tools and processes. For organizations that are willing to make this investment, the benefits can be substantial.

The significance of OpenTelemetry lies in its ability to standardize the collection and transmission of telemetry data to backend platforms. By providing a unified format of instrumentation across all services, it bridges visibility gaps. Engineers are relieved of the burden of re-instrumenting code or installing different proprietary agents whenever a backend platform is changed. Unlike commercial solutions, OpenTelemetry remains effective even as modern technologies emerge, eliminating the need for vendors to constantly build new integrations to ensure interoperability with their products. It helps facilitate the healthy performance of enterprise applications and vastly improves business outcomes.

Authors

Tak Amit Ashok

Delivery Manager

Madhav Chaudhari

Project Manager

Vijay Narayan Rao

Group Project Manager

Ravikumar Sadashiv Chamorshikar

Project Manager