Observability

What is Observability?

In IT and cloud computing, observability is the ability to measure a system’s current state based on the data it generates, such as logs, metrics, and traces.

Observability relies on telemetry derived from instrumentation that comes from the endpoints and services in your multi-cloud computing environments. In these modern environments, every hardware, software, and cloud infrastructure component and every container, open-source tool, and microservice generates records of every activity. The goal of observability is to understand what’s happening across all these environments and among the technologies, so you can detect and resolve issues to keep your systems efficient and reliable and your customers happy. Organizations usually implement observability using a combination of instrumentation methods including open-source instrumentation tools, such as OpenTelemetry.

Many organizations also adopt an observability solution to help them detect and analyze the significance of events to their operations, software development life cycles, application security, and end-user experiences. Observability has become more critical in recent years, as cloud-native environments have gotten more complex and the potential root causes for a failure or anomaly have become more difficult to pinpoint. As teams begin collecting and working with observability data, they are also realizing its benefits to the business, not just IT.

Because cloud services rely on a uniquely distributed and dynamic architecture, observability may also sometimes refer to the specific software tools and practices businesses use to interpret cloud performance data. Although some people may think of observability as a buzzword for sophisticated application performance monitoring (APM), there are a few key distinctions to keep in mind when comparing observability and monitoring.

What is the difference between monitoring and observability?

Is observability really monitoring by another name? In short, no. While observability and monitoring are related — and can complement one another — they are actually different concepts. In a monitoring scenario, you typically preconfigure dashboards that are meant to alert you to performance issues you expect to see later. However, these dashboards rely on the key assumption that you’re able to predict what kinds of problems you’ll encounter before they occur. Cloud-native environments don’t lend themselves well to this type of monitoring because they are dynamic and complex, which means you have no way of knowing in advance what kinds of problems might arise.

In an observability scenario, where an environment has been fully instrumented to provide complete observability data, you can flexibly explore what’s going on and quickly figure out the root cause of issues you may not have been able to anticipate.

Benefits of observability

Observability delivers powerful benefits to IT teams, organizations, and end-users alike. Here are some of the use cases observability facilitates:

  1. Application performance monitoring: Full end-to-end observability enables organizations to get to the bottom of application performance issues much faster, including issues that arise from cloud-native and microservices environments. An advanced observability solution can also be used to automate more processes, increasing efficiency and innovation among Ops and Apps teams.
  2. DevSecOps and Site Reliability Engineering: Observability is not just the result of implementing advanced tools, but a foundational property of an application and its supporting infrastructure. The architects and developers who create the software must design it to be observed. Then DevSecOps and SRE teams can leverage and interpret the observable data during the software delivery life cycle to build better, more secure, more resilient applications. For the uninitiated, DevSecOps stands for development, security, and operations. It’s an approach to culture, automation, and platform design that integrates security as a shared responsibility throughout the entire IT lifecycle. Site reliability engineering (SRE) is a software engineering approach to IT operations.
  3. Infrastructure, cloud, and Kubernetes monitoring: Infrastructure and operations (I&O) teams can leverage the enhanced context an observability solution offers to improve application uptime and performance, cut down the time required to pinpoint and resolve issues, detect cloud latency issues and optimize cloud resource utilization, and improve administration of their Kubernetes environments and modern cloud architectures.
  4. End-user experience: A good user experience can enhance a company’s reputation and increase revenue, delivering an enviable edge over the competition. By spotting and resolving issues well before the end-user notices and making an improvement before it’s even requested, an organization can boost customer satisfaction and retention. It’s also possible to optimize the user experience through real-time playback, gaining a window directly into the end-user’s experience exactly as they see it, so everyone can quickly agree on where to make improvements.
  5. Business analytics: Organizations can combine business context with full stack application analytics and performance to understand real-time business impact, improve conversion optimization, ensure that software releases meet expected business goals, and confirm that the organization is adhering to internal and external SLAs.

DevSecOps teams can tap observability to get more insights into the apps they develop, and automate testing and CI/CD processes so they can release better quality code faster. This means organizations waste less time on war rooms and finger-pointing. Not only is this a benefit from a productivity standpoint, but it also strengthens the positive working relationships that are essential for effective collaboration.

What are the challenges of observability?

Observability has always been a challenge, but cloud complexity and the rapid pace of change have made it an urgent issue for organizations to address. Cloud environments generate a far greater volume of telemetry data, particularly when microservices and containerized applications are involved. They also create a far greater variety of telemetry data than teams have ever had to interpret in the past. Lastly, the velocity with which all this data arrives makes it that much harder to keep up with the flow of information, let alone accurately interpret it in time to troubleshoot a performance issue.

Organizations also frequently run into the following challenges with observability:

  • Data silos: Multiple agents, disparate data sources, and siloed monitoring tools make it hard to understand interdependencies across applications, multiple clouds, and digital channels, such as web, mobile, and IoT.

  • Volume, velocity, variety, and complexity: It’s nearly impossible to get answers from the sheer amount of raw data collected from every component in ever-changing modern cloud environments, such as AWS, Azure, and Google Cloud Platform (GCP). This is also true for Kubernetes and containers that can spin up and down in seconds.

  • Manual instrumentation and configuration: When IT resources are forced to manually instrument and change code for every new type of component or agent, they spend most of their time trying to set up observability rather than innovating based on insights from observability data.

  • Lack of pre-production observability: Even with load testing in pre-production, developers still don’t have a way to observe or understand how real users will impact applications and infrastructure before they push code into production.

  • Wasting time troubleshooting: Application, operations, infrastructure, development, and digital experience teams are pulled in to troubleshoot and try to identify the root cause of problems, wasting valuable time guessing and trying to make sense of telemetry and come up with answers.

Also, not all types of telemetry data are equally useful for determining the root cause of a problem or understanding its impact on the user experience. As a result, teams are still left with the time-consuming task of digging for answers across multiple solutions and painstakingly interpreting the telemetry data, when they could be applying their expertise toward fixing the problem right away. However, with a single source of truth, teams can get answers and troubleshoot issues much faster.

1 - OpenMetrics

The de-facto standard for transmitting cloud-native metrics at scale.

Creating OpenMetrics within CNCF was a given.

- Richard “RichiH” Hartmann, director of community at Grafana Labs and OpenMetrics founder.

What is OpenMetrics?

OpenMetrics is a Cloud Native Computing Foundation sandbox project that specifies the de-facto standard for transmitting cloud-native metrics at scale, with support for both a text representation and Protocol Buffers. It acts as an open standard for Prometheus and is the officially supported exposition format for the project and compatible solutions.
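As a concrete illustration, here is a minimal sketch of exposing a counter in the OpenMetrics text representation, assuming the prometheus_client Python package is available; the metric name, labels, and the sample output shown in the comments are illustrative.

```python
# Minimal sketch: expose a counter in the OpenMetrics text format using the
# prometheus_client library's OpenMetrics exposition module.
from prometheus_client import CollectorRegistry, Counter
from prometheus_client.openmetrics.exposition import generate_latest

registry = CollectorRegistry()
http_requests = Counter(
    "http_requests",                      # illustrative metric name
    "Total number of HTTP requests.",
    ["method", "code"],
    registry=registry,
)
http_requests.labels(method="get", code="200").inc()

# generate_latest() returns the registry contents in the OpenMetrics text
# representation, roughly:
#   # HELP http_requests Total number of HTTP requests.
#   # TYPE http_requests counter
#   http_requests_total{method="get",code="200"} 1.0
#   ...
#   # EOF
print(generate_latest(registry).decode())
```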

Metrics are a specific kind of telemetry data, and when combined with logs and traces, provide a comprehensive view of the performance of cloud native applications.

OpenMetrics was spun out of Prometheus to provide a specification and de-facto standard format for metrics.

It is used or supported by most CNCF projects and many wider cloud native ecosystem projects. Furthermore, any changes are coordinated closely with Cortex, Prometheus, Kubernetes, and Thanos.

OpenMetrics is used in production by many large enterprises, including GitLab, DoorDash, Grafana Labs, Chronosphere, Everquote, and SoundCloud. 

OpenMetrics stems from the stats formats used inside of Prometheus and Google’s Monarch time-series infrastructure, which underpins both Stackdriver and internal monitoring applications.

Learn

Learn more about OpenMetrics from the official documentation

Learn more about Prometheus from the official documentation

2 - OpenTelemetry

An open-source standard for logs, metrics, and traces.

What is OpenTelemetry?

OpenTelemetry (also referred to as OTel) is an open-source observability framework made up of a collection of tools, APIs, and SDKs. OTel enables IT teams to instrument, generate, collect, and export telemetry data for analysis and to understand software performance and behavior.

Having a common format for how observability data is collected and sent is where OpenTelemetry comes into play. As a Cloud Native Computing Foundation (CNCF) incubating project, OTel aims to provide unified sets of vendor-agnostic libraries and APIs — mainly for collecting data and transferring it somewhere. Since the project’s start, many vendors have come on board to help make rich data collection easier and more consumable.

What is telemetry data?

Capturing data is critical to understanding how your applications and infrastructure are performing at any given time. This information is gathered from remote, often inaccessible points within your ecosystem and processed by some sort of tool or equipment. Monitoring begins here. The data is incredibly plentiful and difficult to store over long periods due to capacity limitations — a reason why private and public cloud storage services have been a boon to DevOps teams.

Logs, metrics, and traces make up the bulk of all telemetry data. The sketch after the following list illustrates, in simplified form, what each of these signals can look like.

  • Logs are important because you’ll naturally want an event-based record of any notable anomalies across the system. Structured, unstructured, or in plain text, these readable files can tell you the results of any transaction involving an endpoint within your multicloud environment. However, not all logs are inherently reviewable — a problem that’s given rise to external log analysis tools.

  • Metrics are numerical data points represented as counts or measures that are often calculated or aggregated over a period of time. Metrics originate from several sources including infrastructure, hosts, and third-party sources. While logs aren’t always accessible, most metrics tend to be reachable via query. Timestamps, values, and even event names can preemptively uncover a growing problem that needs remediation.

  • Traces follow a process (for example, an API request or other system activity) from start to finish, showing how services connect. Keeping a watch over this pathway is critical to understanding how your ecosystem works, if it’s working effectively, and if any troubleshooting is necessary. Span data, which includes information such as unique identifiers, operation names, timestamps, logs, events, and indexes, is a hallmark of tracing.
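To make the distinction concrete, the following sketch shows, as plain Python data structures, roughly what a log record, a metric sample, and a span might contain; the field names are illustrative rather than any particular vendor’s schema.

```python
# Illustrative shapes of the three telemetry signals (field names are examples).
import time
import uuid

now = time.time()

# Log: an event-based record of something notable that happened.
log_record = {
    "timestamp": now,
    "level": "ERROR",
    "message": "payment gateway timed out",
    "service": "checkout",
}

# Metric: a numeric value, often aggregated over time and broken down by labels.
metric_sample = {
    "name": "http_requests_total",
    "labels": {"method": "POST", "code": "500"},
    "value": 17,
    "timestamp": now,
}

# Span: one unit of work within a request, tied to the rest of the request by a
# shared trace ID and to its caller by a parent span ID.
span = {
    "trace_id": uuid.uuid4().hex,
    "span_id": uuid.uuid4().hex[:16],
    "parent_span_id": None,        # a root span has no parent
    "operation": "POST /checkout",
    "start_time": now,
    "end_time": now + 0.25,
}
```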

How does OpenTelemetry work?

OTel provides a standardized way to collect telemetry data and export it to a target system. Since the CNCF project itself is open source, the end goal is making data collection more system-agnostic than it currently is. But how is that data generated?

The data life cycle has multiple steps from start to finish. Here are the steps the solution takes and the data it generates along the way; a minimal instrumentation sketch follows the list:

  • Instruments your code with APIs, telling system components what metrics to gather and how to gather them
  • Pools the data using SDKs, and transports it for processing and exporting
  • Breaks down the data, samples it, filters it to reduce noise or errors, and enriches it using multi-source contextualization
  • Converts and exports the data
  • Conducts more filtering in time-based batches, then moves the data onward to a predetermined backend.
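To make these steps concrete, here is a minimal sketch assuming the opentelemetry-api and opentelemetry-sdk Python packages; the service and span names are illustrative, and a real deployment would typically replace the console exporter with an OTLP exporter pointing at a collector or backend.

```python
# Minimal sketch of the API -> SDK -> exporter flow described above.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# SDK: configure a provider and an exporter (batching, sampling, and filtering
# happen here, decoupled from the instrumented code).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# API: instrument application code.
tracer = trace.get_tracer("checkout-service")  # illustrative name

with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("order.items", 3)  # illustrative attribute
    # ... application logic runs inside the span ...
```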

OpenTelemetry components

OTel consists of a few different components as depicted in the following figure. Let’s take a high-level look at each one from left to right:

OpenTelemetry Components


APIs

These are the core, language-specific components (such as Java, Python, .NET, and so on). APIs provide the basic “plumbing” for your application.

SDK

This is also a language-specific component and acts as the bridge between the APIs and the exporter. The SDK allows for additional configuration, such as request filtering and transaction sampling.

In-process exporter

This allows you to configure which backend(s) the telemetry data is sent to. The exporter decouples the instrumentation from the backend configuration, which makes it easy to switch backends without the pain of re-instrumenting your code.

Collector

The collector receives, processes, and exports telemetry data. While not technically required, it is an extremely useful component to the OpenTelemetry architecture because it allows greater flexibility for receiving and sending the application telemetry to the backend(s). The collector has two deployment models:

  1. An agent that resides on the same host as the application (for example, binary, DaemonSet, sidecar, and so on)
  2. A standalone process completely separate from the application

Since the collector is just a specification for collecting and sending telemetry, it still requires a backend to receive and store the data.

Benefits of OpenTelemetry

OTel provides a de facto standard for adding observable instrumentation to cloud-native applications. This means companies don’t need to spend valuable time developing a mechanism for collecting critical application data and can spend more time delivering new features instead. It’s akin to how Kubernetes became the standard for container orchestration: that broad adoption made it easier for organizations to implement container deployments because they didn’t need to build their own enterprise-grade orchestration platform. Using Kubernetes as the analog for what OpenTelemetry can become, it’s easy to see the benefits OTel can provide to the entire industry.

Learn

Learn more about OpenTelemetry from the official documentation

3 - OpenTracing

An initiative to enable reusable, open source, vendor neutral instrumentation for distributed tracing.

Ideas about distributed tracing and monitoring across multiple systems have certainly generated quite a buzz. It’s becoming more important than ever to be able to see what’s going on inside our requests as they span multiple software services. To meet this need, the OpenTracing initiative sprang up to help developers avoid vendor lock-in.

What Is Distributed Tracing?

Distributed tracing is a mechanism you can use to profile and monitor applications. Unlike regular tracing, distributed tracing is more suited to applications built using a microservice architecture, hence the name.

Distributed tracing tracks a single request through its entire journey, from its source to its destination, unlike traditional forms of tracing, which just follow a request through a single application domain. In other words, we can say that distributed tracing is the stitching together of requests across multiple systems. The stitching is often done by one or more correlation IDs, and the tracing is often a set of recorded, structured log events across all the systems, stored in a central place.
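As a minimal sketch of the correlation-ID idea, and not tied to any particular tracing library, the snippet below shows hypothetical services logging structured events that share the same ID so a central log store can stitch the request back together.

```python
# Each service emits structured, timestamped events carrying the same
# correlation ID, so they can be stitched back together centrally.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(correlation_id: str, service: str, event: str) -> None:
    logging.info(json.dumps({
        "timestamp": time.time(),
        "correlation_id": correlation_id,
        "service": service,
        "event": event,
    }))

# The frontend receives the user request and creates the correlation ID...
correlation_id = str(uuid.uuid4())
log_event(correlation_id, "frontend", "order received")

# ...then propagates it downstream (for example, in an HTTP header such as
# X-Correlation-ID) so each downstream service logs against the same ID.
log_event(correlation_id, "payment", "payment charged")
log_event(correlation_id, "inventory", "stock reserved")
```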

What is OpenTracing?

OpenTracing is a vendor-agnostic API that helps developers easily instrument tracing into their code base. It’s open because no one company owns it.

OpenTracing wants to form a common language around what a trace is and how to instrument traces in our applications. In OpenTracing, a trace is a directed acyclic graph of Spans connected by References.


This allows us to model how our application calls out to other applications, internal functions, asynchronous jobs, etc. All of these can be modeled as Spans, as we’ll see below.

For example, if I have a consumer website where a customer places orders, I make a call to my payment system and my inventory system before asynchronously acknowledging the order. With an OpenTracing library, I can trace the entire order process through every system and render it as a timeline of nested blocks.

Each one of these blocks is a Span representing a separate software system communicating over messaging or HTTP.

Terminology

Let’s talk a bit about the components of the OpenTracing API.

  • Tracer
    The Tracer is the entry point into the tracing API. It gives us the ability to create Spans. It also lets us extract tracing information from external sources and inject information into external destinations.

  • Span
    This represents a unit of work in the Trace. For example, a web request that initiates a new Trace is called the root Span. If it calls out to another web service, that HTTP request would be wrapped within a new child Span. Spans carry a set of tags with information pertinent to the request being carried out. You can also log events within the context of a Span. Spans can support more complex workflows than web requests, such as asynchronous messaging. They have timestamps attached to them so we can easily construct a timeline of events for the Trace.

  • SpanContext
    The SpanContext is the serializable form of a Span. It lets Span information transfer easily across the wire to other systems.

  • References
    Spans can connect to each other via two types of relationship: ChildOf and FollowsFrom. ChildOf Spans are spans like those in our previous example, where our ordering website sent child requests to both our payment system and inventory system. FollowsFrom Spans are just a chain of sequential Spans, so a FollowsFrom Span is simply saying, “I started after this other Span.” Both kinds of References appear in the sketch after this list.
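Putting these terms together, here is a minimal sketch of the order example above, assuming the opentracing Python package; with the default no-op tracer nothing is actually recorded, and a concrete Tracer implementation would be plugged in instead. The operation names and tag are illustrative.

```python
# Minimal sketch: model the order example as a Trace of Spans with References.
import opentracing
from opentracing import follows_from

tracer = opentracing.tracer  # no-op by default; swap in a real Tracer implementation

with tracer.start_active_span("place-order") as scope:        # root Span
    root = scope.span

    # ChildOf relationships: the ordering site calls payment and inventory.
    with tracer.start_active_span("charge-payment") as payment:
        payment.span.set_tag("payment.amount", 42.0)          # illustrative tag
    with tracer.start_active_span("reserve-inventory"):
        pass

    # FollowsFrom relationship: the asynchronous acknowledgement simply starts
    # after the root Span; the root does not wait on its result.
    ack = tracer.start_span(
        "acknowledge-order",
        references=[follows_from(root.context)],
        ignore_active_span=True,
    )
    ack.finish()
```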

Is OpenTracing still in use?

OpenTracing is an open-source CNCF (Cloud Native Computing Foundation) project which provides vendor-neutral APIs and instrumentation for distributed tracing. Although OpenTracing and OpenCensus merged to form OpenTelemetry in early 2019, third-party libraries and frameworks like Hazelcast IMDG still come equipped with OpenTracing pre-instrumentation.

OpenTracing became a CNCF project back in 2016, with the goal of providing a vendor-agnostic specification for distributed tracing, offering developers the ability to trace a request from start to finish by instrumenting their code. Then, Google made the OpenCensus project open source in 2018. This was based on Google’s Census library that was used internally for gathering traces and metrics from their distributed systems. Like the OpenTracing project, the goal of OpenCensus was to give developers a vendor-agnostic library for collecting traces and metrics. This left the community with two competing tracing frameworks, a situation informally referred to as “the Tracing Wars.” Usually, competition is a good thing for end-users since it breeds innovation. However, in the open-source specification world, competition can lead to poor adoption, contribution, and support. Going back to the Kubernetes example, imagine how much more disjointed and slow-moving container adoption would be if everybody was using a different orchestration solution. To avoid this, it was announced at KubeCon 2019 in Barcelona that the OpenTracing and OpenCensus projects would converge into one project called OpenTelemetry and join the CNCF.

Learn

Learn more about OpenTracing from the official documentation