
Monitoring HTTP Service Health

Overview

HTTP is the backbone of modern cloud applications. Yet very little is done to understand the health of HTTP communications. Outside of services attached to a load balancer, it has been difficult to measure the key performance indicators (KPIs) of latency, throughput, and error rate for HTTP calls.

The Epoch Application Operations Center (AOC) captures and analyzes service interactions to deliver a complete picture of HTTP service health. The AOC performs deep analysis of application-level protocols such as HTTP and gathers all the KPIs along with HTTP attributes. In this tutorial, we provide a step-by-step guide to using the various HTTP datasources, and to grouping and filtering the HTTP data based on HTTP attributes.

Topics Covered

  1. Defining HTTP Latency, Throughput and Error Rates
  2. Comparing Latency of HTTP Success and Errors

Setup

We will be using the sock-shop app running on a Kubernetes cluster as our target application for mapping and monitoring. The AOC is installed as a pod, and the collectors are installed as DaemonSet pods on each of the Kubernetes worker nodes (see figure below). You can easily get this setup going in your Kubernetes cluster using our installer.

k8s-setup

What HTTP Service to Monitor?

Your application probably has a lot of HTTP services. The Epoch maps help you understand the dependencies among services and pick HTTP calls that you should monitor. From the Maps Tutorial, we have the following picture of HTTP interactions in the sock-shop app. We will pick the HTTP communication between front-end and catalogue for this tutorial (see figure below).

pick

Getting List of the HTTP Interactions

There might be multiple HTTP calls going on between the front-end and catalogue pods. We can understand these calls using the AOC Analytics Sandbox. All we need to do is select the client and server pod names and group by http.uri. Easy!

  1. From the left navigation box, select Analytics Sandbox
  2. Select http.request_response.count as the Datasource
  3. Select count as the Aggregation function to apply
  4. Set http.uri as the GroupBy
  5. Now, let's set the Filters so that we restrict the client and server to specific pods
    • pod_name(client) : sock-shop/front-end...
    • pod_name(server) : sock-shop/catalogue...
  6. Change the chart type to Bar

uri

We can see the HTTP URIs associated with the communication between front-end and catalogue. As expected, the calls are for URIs of the form /catalogue/<catalogue_id>.
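Conceptually, the groupby/count query above behaves like the following sketch. The request records and catalogue IDs here are hypothetical sample data, not the AOC's internal representation:

```python
from collections import Counter

# Hypothetical captured HTTP requests between front-end and catalogue pods.
requests = [
    {"http.uri": "/catalogue/3395a43e", "http.request_method": "GET"},
    {"http.uri": "/catalogue/510a0d7e", "http.request_method": "GET"},
    {"http.uri": "/catalogue/3395a43e", "http.request_method": "GET"},
]

# GroupBy http.uri with a count aggregation.
counts = Counter(r["http.uri"] for r in requests)
for uri, n in counts.items():
    print(uri, n)
```

Each distinct URI becomes one group (one bar in the Bar chart), with the count of requests as its value.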

Defining HTTP Avg Latency

We will define the HTTP Avg Latency for the calls to URI /catalogue.*. Additionally, we will restrict the measurement to GET requests going from front-end to catalogue.

  1. From the left navigation box, select Analytics Sandbox
  2. Select http.request_response.latency as the Datasource
  3. Select avg as the Aggregation function to apply
  4. Now, let's set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end... [set the client using pod_name]
    • pod_name(server) : sock-shop/catalogue... [set the server using pod_name]
    • http.uri : /catalogue.* (regex) [HTTP URIs matching the regex /catalogue.*]
    • http.request_method : GET [This is the method we care about]

And we have the chart measuring the latency of the front-end to catalogue HTTP interaction! We selected the HTTP latency datasource, applied the client/server filters, and restricted the metrics to the specific URI (/catalogue.*) and the GET method. All this was made easy because Epoch automatically gathers the HTTP metrics along with all the key attributes, such as URI and request method, by analyzing service interactions.
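The filter-then-aggregate logic of this query can be illustrated with a small sketch. The records and latency values are hypothetical, and the field names are simplified stand-ins for the AOC attributes:

```python
import re
from statistics import mean

# Hypothetical request records; fields mirror the filter attributes above.
records = [
    {"uri": "/catalogue/size=3", "method": "GET", "latency_ms": 12.0},
    {"uri": "/catalogue/size=3", "method": "GET", "latency_ms": 18.0},
    {"uri": "/basket", "method": "GET", "latency_ms": 40.0},            # dropped: URI filter
    {"uri": "/catalogue/size=3", "method": "POST", "latency_ms": 9.0},  # dropped: method filter
]

uri_filter = re.compile(r"/catalogue.*")
matching = [r["latency_ms"] for r in records
            if uri_filter.match(r["uri"]) and r["method"] == "GET"]
print(mean(matching))  # avg aggregation over the filtered requests
```

Only the requests that pass every filter contribute to the avg aggregation, which is exactly what the Filters panel restricts.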

Defining HTTP Throughput

This is very similar to defining the latency. All we need to do is change the datasource from http.request_response.latency to http.request_response.throughput. We repeat the steps below, along with the resulting chart.

  1. From the left navigation box, select Analytics Sandbox
  2. Select http.request_response.throughput as the Datasource
  3. Select throughput as the Aggregation function to apply
  4. Now, let's set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end...
    • pod_name(server) : sock-shop/catalogue...
    • http.uri : /catalogue.* (regex)
    • http.request_method : GET

Defining HTTP Error Rates

For simplicity, let's focus on the HTTP 5xx and 4xx errors (e.g., status codes 500, 404). The error rate is then defined as:

(Throughput of HTTP 5xx or 4xx requests) / (Total Throughput) * 100
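As a quick sketch of the arithmetic, with hypothetical throughput numbers standing in for the two query results:

```python
# Hypothetical throughput values (requests/sec) for the two query statements
# defined below: A is all matching requests, B is only 4xx/5xx responses.
total_throughput = 250.0   # statement A
error_throughput = 5.0     # statement B

error_rate = error_throughput / total_throughput * 100
print(f"{error_rate:.1f}%")  # prints 2.0%
```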

Continuing from the previous section, we have already defined the overall throughput. Below is the screenshot of that query. Note the query statement name A; A represents the total throughput. We will see how to use this name and combine queries to generate the error rate. We will create another query statement and use filters to restrict the throughput metrics to HTTP 5xx and 4xx status codes.

add-stmt

  1. Create another query statement by clicking the + METRIC button. Note that this creates a new statement named B.
  2. Select http.request_response.throughput as the Datasource
  3. Select throughput as the Aggregation function to apply
  4. Now, let's set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end...
    • pod_name(server) : sock-shop/catalogue...
    • http.uri : /catalogue.* (regex)
    • http.request_method : GET
    • http.status_code : (4\d\d|5\d\d) (regex) [We filter on status code and select only those requests that return 4xx or 5xx errors]

Query statement B has the throughput of the 4xx and 5xx errors. Next we will use the EXPRESSION feature to combine the statements and obtain the error rate, i.e., B/A*100.

eval

  1. Create an expression statement by clicking the +EXPRESSION button
  2. Select Eval as the operator to combine queries using arithmetic
  3. Now simply use $ followed by the query statement name to reference the results of the query statements, and write the appropriate mathematical formula. In this case, ($B/$A)*100.

And we have the error rates! We created two query statements and combined them with an expression to obtain the error rate.

Comparing Latency of HTTP Errors and Success

If an HTTP service is failing, it had better fail fast. Otherwise the end users not only end up waiting longer but are ultimately frustrated to receive HTTP errors. A good way to measure this is the ratio of Avg Latency of HTTP Errors / Avg Latency of HTTP Success. Let's learn how to define this metric in Epoch.

  1. From the left navigation box, select Analytics Sandbox
  2. Select http.request_response.latency as the Datasource
  3. Select avg as the Aggregation function to apply
  4. Now, let's set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end... [set the client using pod_name]
    • pod_name(server) : sock-shop/catalogue... [set the server using pod_name]
    • http.uri : /catalogue.* (regex) [HTTP URIs matching the regex /catalogue.*]
    • http.request_method : GET [This is the method we care about]
    • http.status_code : (4\d\d|5\d\d) (regex) [regex matching 4xx and 5xx errors]


Note the query statement name A. This query statement returns the average latency of HTTP requests resulting in 4xx and 5xx errors.

latency-errors

  1. Create another query statement by clicking the + METRIC button. Note that this creates a new statement named B.
  2. Select http.request_response.latency as the Datasource
  3. Select avg as the Aggregation function to apply
  4. Now, let's set the Filters so that we restrict the metrics to the specific http interaction of interest.
    • pod_name(client) : sock-shop/front-end... [set the client using pod_name]
    • pod_name(server) : sock-shop/catalogue... [set the server using pod_name]
    • http.uri : /catalogue.* (regex) [HTTP URIs matching the regex /catalogue.*]
    • http.request_method : GET [This is the method we care about]
    • http.status_code : 200 [This is the latency of successful requests]


Note the query statement name B. This query statement returns the average latency of HTTP requests resulting in success. Now we just need to calculate A/B to get the ratio comparing the latency of errors to that of successes.

  1. Create an expression statement by clicking the +EXPRESSION button
  2. Select Eval as the operator to combine queries using arithmetic
  3. Now simply use $ followed by the query statement name to reference the results of the query statements, and write the appropriate mathematical formula. In this case, ($A/$B).

compare-latency

Note that the Eval statement name is C. The plot of C reveals that the latency of error requests is a small fraction of that of successful requests. This is how it should be! As mentioned earlier, this is a good metric to track and alert on, as it greatly impacts end-user experience.
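The whole A/B pipeline can be sketched end to end with hypothetical per-request samples; the status codes and latencies below are illustrative, not measured values:

```python
import re

# Hypothetical per-request samples: (status_code, latency_ms).
samples = [(200, 20.0), (200, 24.0), (404, 5.0), (500, 6.0), (200, 22.0)]

error_re = re.compile(r"4\d\d|5\d\d")
errors = [lat for code, lat in samples if error_re.fullmatch(str(code))]
ok = [lat for code, lat in samples if code == 200]

a = sum(errors) / len(errors)  # statement A: avg latency of 4xx/5xx requests
b = sum(ok) / len(ok)          # statement B: avg latency of successful requests
print(a / b)                   # statement C: ($A/$B)
```

A ratio well below 1, as in this sample, means failures are fast; a ratio near or above 1 would indicate that failing requests are also making users wait.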