# Monitoring HTTP Service Health
## Overview
HTTP is the backbone of modern cloud applications, yet very little is done to understand the health of HTTP communications. Outside of services connected to a load balancer, it has been very difficult to measure the key performance indicators (KPIs) of latency, throughput, and error rates for HTTP calls.
The Epoch Application Operations Center (AOC) captures and analyzes service interactions to deliver a complete picture of HTTP service health. The AOC performs deep analysis of application-level protocols such as HTTP and gathers all the KPIs along with HTTP attributes. In this tutorial, we provide a step-by-step guide for using the various HTTP datasources and for grouping and filtering the HTTP data based on HTTP attributes.
## Topics Covered
- Defining HTTP Latency, Throughput and Error Rates
- Comparing Latency of HTTP Success and Errors
## Setup
We will be using the `sock-shop` app running on a Kubernetes cluster as our target application for mapping and monitoring. The AOC is installed as a pod, and the collectors are installed as DaemonSet pods on each of the Kubernetes worker nodes (see figure below). You can easily get this setup going in your Kubernetes cluster using our installer.
## What HTTP Service to Monitor?
Your application probably has a lot of HTTP services. The Epoch maps help you understand the dependencies among services and pick the HTTP calls that you should monitor. From the Maps Tutorial, we have the following picture of the HTTP interactions in the `sock-shop` app. We will pick the HTTP communication between `front-end` and `catalogue` for this tutorial (see figure below).
## Getting a List of the HTTP Interactions
There might be multiple HTTP calls going on between the `front-end` and `catalogue` pods. We can understand these calls by using the AOC Analytics Sandbox. All we need to do is select the client and server pod names and group by `http.uri`. Easy!
- From the left navigation box, select Analytics Sandbox
- Select `http.request_response.count` as the Datasource
- Select `count` as the Aggregation function to apply
- Set `http.uri` as the GroupBy
- Now, let's set the Filters so that we restrict the client and server to specific pods:
    - `pod_name(client) : sock-shop/front-end...`
    - `pod_name(server) : sock-shop/catalogue...`
- Change the chart type to `Bar`
We can see the HTTP URIs associated with the communication between `front-end` and `catalogue`. As expected, the calls are for URIs of the form `/catalogue/<catalogue_id>`.
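If it helps to see what this query does conceptually, here is a minimal Python sketch of "count grouped by `http.uri`" over a handful of hypothetical request records. The record fields and values are made up for illustration; this is not the Epoch API, which does all of this for you from captured traffic.

```python
# Illustrative only: a toy model of "count grouped by http.uri", not the Epoch API.
from collections import Counter

# Hypothetical request records, roughly as the collectors might observe them.
records = [
    {"client": "sock-shop/front-end", "server": "sock-shop/catalogue", "http.uri": "/catalogue/id-1"},
    {"client": "sock-shop/front-end", "server": "sock-shop/catalogue", "http.uri": "/catalogue/id-2"},
    {"client": "sock-shop/front-end", "server": "sock-shop/catalogue", "http.uri": "/catalogue/id-1"},
    {"client": "sock-shop/front-end", "server": "sock-shop/orders",    "http.uri": "/orders"},
]

# Filter on the client/server pods, then group by http.uri and count.
front_end_to_catalogue = [
    r for r in records
    if r["client"] == "sock-shop/front-end" and r["server"] == "sock-shop/catalogue"
]
counts = Counter(r["http.uri"] for r in front_end_to_catalogue)
print(counts)  # Counter({'/catalogue/id-1': 2, '/catalogue/id-2': 1})
```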
## Defining HTTP Avg Latency
We will define the HTTP Avg Latency for calls to URIs matching `/catalogue.*`. Additionally, we will restrict the measurement to `GET` requests going from `front-end` to `catalogue`.
- From the left navigation box, select Analytics Sandbox
- Select `http.request_response.latency` as the Datasource
- Select `avg` as the Aggregation function to apply
- Now, let's set the Filters so that we restrict the metrics to the specific HTTP interaction of interest:
    - `pod_name(client) : sock-shop/front-end...` [set the client using `pod_name`]
    - `pod_name(server) : sock-shop/catalogue...` [set the server using `pod_name`]
    - `http.uri : /catalogue.* (regex)` [HTTP URIs matching the regex `/catalogue.*`]
    - `http.request_method : GET` [this is the method we care about]
And we have the chart measuring the latency of the `front-end` to `catalogue` HTTP interaction! We selected the HTTP latency datasource, then applied the client/server filters and restricted the metrics to a specific URI (`/catalogue.*`) and a specific method (`GET`). All this was made easy because Epoch automatically gathers the HTTP metrics along with key attributes such as the URI and request method by analyzing service interactions.
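Conceptually, this query filters the observed requests by URI regex and method and averages their latencies. The Python sketch below illustrates that logic over hypothetical records; the field names and latency values are invented for illustration and are not Epoch's internal representation.

```python
# Illustrative only: what the avg-latency query does conceptually, not the Epoch API.
import re
from statistics import mean

# Hypothetical records: latency in milliseconds per observed HTTP request/response.
records = [
    {"http.uri": "/catalogue/id-1", "http.request_method": "GET",  "latency_ms": 12.0},
    {"http.uri": "/catalogue/id-2", "http.request_method": "GET",  "latency_ms": 18.5},
    {"http.uri": "/catalogue/id-1", "http.request_method": "POST", "latency_ms": 40.0},
    {"http.uri": "/health",         "http.request_method": "GET",  "latency_ms": 1.2},
]

uri_filter = re.compile(r"/catalogue.*")  # same regex as the http.uri filter
selected = [
    r["latency_ms"] for r in records
    if uri_filter.match(r["http.uri"]) and r["http.request_method"] == "GET"
]
print(mean(selected))  # avg latency over the filtered GET /catalogue.* requests -> 15.25
```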
## Defining HTTP Throughput
This is very similar to defining the latency. All we need to do is change the datasource from `http.request_response.latency` to `http.request_response.throughput`. Below we have repeated the steps and highlighted the resulting chart; a short conceptual sketch of the throughput calculation follows the steps.
- From the left navigation box, select Analytics Sandbox
- Select `http.request_response.throughput` as the Datasource
- Select `throughput` as the Aggregation function to apply
- Now, let's set the Filters so that we restrict the metrics to the specific HTTP interaction of interest:
    - `pod_name(client) : sock-shop/front-end...`
    - `pod_name(server) : sock-shop/catalogue...`
    - `http.uri : /catalogue.* (regex)`
    - `http.request_method : GET`
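As a rough mental model, the throughput aggregation boils down to counting matching requests per unit time. The tiny Python sketch below uses hypothetical timestamps and a hypothetical window size; it is not Epoch's internal calculation.

```python
# Illustrative only: throughput treated as requests per second over a time window,
# computed from hypothetical timestamps of matching GET /catalogue.* calls.
request_timestamps = [0.1, 0.4, 0.9, 1.2, 2.7, 3.3, 3.9, 4.5]  # seconds (made up)
window_seconds = 5.0  # hypothetical aggregation window

throughput_rps = len(request_timestamps) / window_seconds
print(f"{throughput_rps:.1f} requests/sec")  # -> 1.6 requests/sec
```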
## Defining HTTP Error Rates
For simplicity, let's focus on the HTTP `5xx` and `4xx` errors (e.g., status codes 500, 404, etc.). The error rate is then defined as:
(Throughput of HTTP 5xx or 4xx requests) / (Total Throughput) * 100
Continuing from the previous section, we have already defined the overall throughput. Below is a screenshot of that query. Note the query statement name `A`; `A` represents the total throughput. We will see how to use this name and combine queries to generate the error rate. We will create another query statement and use filters to restrict the throughput metric to HTTP `5xx` and `4xx` status codes.
- Create another query statement by clicking the `+ METRIC` button. Note this creates a new statement named `B`.
- Select `http.request_response.throughput` as the Datasource
- Select `throughput` as the Aggregation function to apply
- Now, let's set the Filters so that we restrict the metrics to the specific HTTP interaction of interest:
    - `pod_name(client) : sock-shop/front-end...`
    - `pod_name(server) : sock-shop/catalogue...`
    - `http.uri : /catalogue.* (regex)`
    - `http.request_method : GET`
    - `http.status_code : (4\d\d|5\d\d) (regex)` [we filter on status code and select only those requests that are getting 4xx or 5xx errors]
Query statement `B` now has the throughput of the `4xx` and `5xx` errors. Next, we will use the `EXPRESSION` feature to combine the two and obtain the error rate, i.e. `B/A*100`.
- Create an expression statement by clicking the `+EXPRESSION` button
- Select `Eval` as the operator to combine queries using arithmetic
- Now simply use `$` followed by the query statement name to reference the results of the query statements, and write the appropriate mathematical formula. In this case: `($B/$A)*100`
And we have the error rates! To obtain them, we created two query statements and combined them with an expression.
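To make the arithmetic concrete, here is a small Python sketch of the `($B/$A)*100` calculation over a hypothetical sample of status codes, reusing the same `(4\d\d|5\d\d)` regex we applied in the filter. The numbers are invented; Epoch computes both throughputs for you from captured traffic.

```python
# Illustrative only: the ($B/$A)*100 expression with hypothetical status codes.
import re

status_codes = [200, 200, 404, 200, 500, 200, 200, 200, 503, 200]  # made-up sample
error_pattern = re.compile(r"(4\d\d|5\d\d)")  # same regex as the http.status_code filter

total_throughput = len(status_codes)  # plays the role of query A (total)
error_throughput = sum(1 for c in status_codes if error_pattern.fullmatch(str(c)))  # query B (4xx/5xx)

error_rate = (error_throughput / total_throughput) * 100  # ($B/$A)*100
print(f"{error_rate:.1f}% errors")  # -> 30.0% errors
```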
## Comparing Latency of HTTP Errors and Success
If an HTTP service is failing, it had better fail fast. Otherwise, end users not only wait longer but are ultimately frustrated to receive HTTP errors. A good way to measure this is the ratio `Avg Latency of HTTP Errors / Avg Latency of HTTP Success`. Let's learn how to define this metric in Epoch.
- From the left navigation box, select Analytics Sandbox
- Select `http.request_response.latency` as the Datasource
- Select `avg` as the Aggregation function to apply
- Now, let's set the Filters so that we restrict the metrics to the specific HTTP interaction of interest:
    - `pod_name(client) : sock-shop/front-end...` [set the client using `pod_name`]
    - `pod_name(server) : sock-shop/catalogue...` [set the server using `pod_name`]
    - `http.uri : /catalogue.* (regex)` [HTTP URIs matching the regex `/catalogue.*`]
    - `http.request_method : GET` [this is the method we care about]
    - `http.status_code : (4\d\d|5\d\d) (regex)` [regex matching 4xx and 5xx errors]
Note the query statement name `A`. This query statement returns the average latency of HTTP requests resulting in 4xx and 5xx errors.
- Create another query statement by clicking the `+ METRIC` button. Note this creates a new statement named `B`.
- Select `http.request_response.latency` as the Datasource
- Select `avg` as the Aggregation function to apply
- Now, let's set the Filters so that we restrict the metrics to the specific HTTP interaction of interest:
    - `pod_name(client) : sock-shop/front-end...` [set the client using `pod_name`]
    - `pod_name(server) : sock-shop/catalogue...` [set the server using `pod_name`]
    - `http.uri : /catalogue.* (regex)` [HTTP URIs matching the regex `/catalogue.*`]
    - `http.request_method : GET` [this is the method we care about]
    - `http.status_code : 200` [this is the latency of success]
Note the query statement name `B`. This query statement returns the average latency of HTTP requests resulting in success. Now we just need to calculate `A/B` to get the ratio comparing the latency of errors to the latency of successes.
- Create an expression statement by clicking the `+EXPRESSION` button
- Select `Eval` as the operator to combine queries using arithmetic
- Now simply use `$` followed by the query statement name to reference the results of the query statements, and write the appropriate mathematical formula. In this case: `($A/$B)`
Note that the `Eval` statement is named `C`. The plot of `C` reveals that the latency of error requests is a small fraction of that of successful requests. This is how it should be! As mentioned earlier, this is a good metric to track and alert on, as it greatly impacts end-user experience.
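For reference, the `($A/$B)` ratio is just a simple division. The Python sketch below uses hypothetical average latencies (not real measurements) to show the kind of value you would expect when errors fail fast.

```python
# Illustrative only: the ($A/$B) ratio with hypothetical average latencies.
avg_latency_errors_ms = 3.0    # plays the role of query A: avg latency of 4xx/5xx responses
avg_latency_success_ms = 25.0  # plays the role of query B: avg latency of 200 responses

ratio = avg_latency_errors_ms / avg_latency_success_ms  # ($A/$B)
print(f"errors take {ratio:.2f}x the time of successful requests")  # -> 0.12x, i.e. failing fast
```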