Labor Day was approaching, and Mary decided to run a discount sale for the occasion. Suddenly, around midday, her online store stopped working and she was getting a 404 message on the screen. Refreshing didn’t help. Mary was angry and confused. So she called her website admin…
The server is your application’s brain. It manages the most critical aspects of your app’s (or website’s) activities. With this in mind, there’s no doubt you should watch your server closely and react to anomalies as they appear, in order to avoid stories such as the one described above. We’ve drawn up a list of essential performance metrics for your server, so you can get a clearer picture of how it performs. There are many measurements you can use for server monitoring, and naturally there are many publications on the topic. But there are no concrete recommendations covering a minimal set of metrics that give accurate information about the state of a web application. In this article, we present our point of view and propose such a minimal set. These metrics, often termed “key performance indicators”, let you evaluate the health of your server reasonably well. Each one contributes an essential piece of information needed to understand the state of your web application.
Requests per Second (RPS) is a measure of how many requests per second are being sent to a target server. This metric is also called Average Load, and it tells you what load your web application is currently working under. Usually, it is calculated as the count of requests received during a measurement period divided by the length of that period in seconds. Generally, the measurement period (often called the monitoring period) is in the range of 1 to 5 minutes. Lengthening the monitoring period leads to an undesirable “smearing” of the load indicator, because short spikes get averaged away.
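As a minimal sketch of the calculation, the snippet below computes RPS for several 1-minute periods and for the combined 5-minute period. The function name and the per-minute request counts are made up for illustration; they only show how a longer period smears a spike.

```python
def requests_per_second(request_count: int, period_seconds: float) -> float:
    """Average load: requests received divided by the period length in seconds."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return request_count / period_seconds


# Hypothetical per-minute request counts with a one-minute spike.
per_minute_counts = [600, 600, 6000, 600, 600]

# RPS for each 1-minute period: the spike is clearly visible.
one_minute_rps = [requests_per_second(c, 60) for c in per_minute_counts]

# RPS over the whole 5-minute period: the spike is smeared into the average.
five_minute_rps = requests_per_second(sum(per_minute_counts), 5 * 60)

print(one_minute_rps)   # [10.0, 10.0, 100.0, 10.0, 10.0]
print(five_minute_rps)  # 28.0
```

With 1-minute periods the burst of 100 requests per second stands out; averaged over 5 minutes it shows up only as a modest 28.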
Naturally, some errors may occur when processing requests, especially under a heavy load. The Error Rate is usually calculated as the percentage of problem requests relative to all requests. It reflects how many responses carry an HTTP status code that indicates an error, and it also counts any requests that never get a response (timed out). Web servers return an HTTP status code with every response. Normal codes are usually 200 (OK) or something in the 3xx range, indicating a redirect. Common error codes are in the 4xx range (client errors) and the 5xx range (server errors), which mean the web server knows the request could not be fulfilled. Error Rate is a significant metric because it measures “performance failure” in the application and tells you how many failed requests have occurred at a particular point in time. There is no universal tolerance for Error Rate in a web application. Some consider an Error Rate of less than 1% acceptable. In any case, you should try to minimize errors in order to avoid performance problems, and work constantly to eliminate them.
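The calculation can be sketched as follows. This is an illustration under the assumptions stated in the text (4xx and 5xx count as errors, and timed-out requests count as errors too); the function name and the sample data are hypothetical.

```python
def error_rate_percent(status_codes, timed_out=0):
    """Percentage of failed requests among all requests in a monitoring period.

    status_codes: HTTP status codes of the requests that received a response.
    timed_out:    number of requests that never received a response.
    """
    total = len(status_codes) + timed_out
    if total == 0:
        return 0.0  # no traffic, so nothing failed
    # 4xx and 5xx responses are treated as errors, as are timeouts.
    errors = sum(1 for code in status_codes if code >= 400) + timed_out
    return 100.0 * errors / total


# Hypothetical period: 95 OK responses, 2 "not found", 2 server errors, 1 timeout.
codes = [200] * 95 + [404] * 2 + [500] * 2
print(error_rate_percent(codes, timed_out=1))  # 5.0
```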
By measuring the duration of every request/response cycle, you can evaluate how long it takes the target web application to generate a response. The Average Response Time (ART) takes into consideration every round-trip request/response cycle during a monitoring period and calculates the arithmetic mean of all the response times. The resulting metric reflects the speed of the web application, and it is perhaps the best indicator of how the target site is performing from the users’ perspective. Keep in mind that the ART includes the time taken by any resource used while the response is being prepared, so the average will be significantly affected by any slow component. The recommended unit of measurement for ART is milliseconds.
Similar to the ART, the Peak Response Time (PRT) also measures the round trip of request/response cycles, but the peak tells us the longest cycle within the monitoring period. For instance, if we are looking at a graph that shows a 5-minute monitoring period and the PRT is 13 seconds, then we know that one of our requests took that long. If the average is still sub-second (because the other resources responded quickly), you may decide that there is no real problem yet. But when the ART and PRT start becoming comparable, it is a strong sign that slow responses are no longer isolated and that your server has a problem. Generally, a high PRT shows that at least one of the resources is potentially problematic. It can reflect an anomaly in the application, or it can be due to “expensive” database queries, etc. The recommended unit of measurement for PRT is milliseconds.
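To make the relation between the average (ART) and the peak (PRT) concrete, here is a small sketch that computes both from the response times recorded in one monitoring period. The function name and the sample durations are hypothetical.

```python
from statistics import mean


def response_time_summary(durations_ms):
    """Return ART (mean) and PRT (maximum) for one monitoring period, in ms."""
    if not durations_ms:
        raise ValueError("no requests in this monitoring period")
    return {"art_ms": mean(durations_ms), "prt_ms": max(durations_ms)}


# Hypothetical period: nine fast requests and one 13-second outlier.
durations = [100.0] * 9 + [13000.0]
summary = response_time_summary(durations)
print(summary)  # {'art_ms': 1390.0, 'prt_ms': 13000.0}
```

A single slow request is enough to pull the average from 100 ms up to 1390 ms, which is why it helps to look at both values side by side.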
Uptime is the amount of time that a server has stayed up and running properly. It reflects the reliability and availability of the server and, obviously, this value should be as large as possible. It can be expressed as an absolute value or as a percentage of actual server uptime relative to ideal server uptime. For example, if you start the server on Aug 1, 2012 and check the uptime exactly 31 days later, on Sep 1, 2012, then the whole duration is 31 days or 2,678,400 seconds. If your server was stopped for 1,000 seconds during that period, then the uptime percentage (availability) is 100 * (1 – (1000 / 2678400)) = 99.963%. Usually, if your server is in production, a value below 99% deserves attention and a value below 95% is cause for concern.
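The availability arithmetic above can be reproduced with a few lines of code. This is only a sketch; the function name is an assumption made for the example.

```python
from datetime import timedelta


def availability_percent(total_seconds, downtime_seconds):
    """Uptime as a percentage of the ideal (uninterrupted) uptime."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime must be between 0 and total_seconds")
    return 100.0 * (1 - downtime_seconds / total_seconds)


period = timedelta(days=31).total_seconds()  # 2,678,400 seconds
print(round(availability_percent(period, 1000), 3))  # 99.963
```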
CPU Utilization is the amount of CPU time used by the web application while processing requests. Usually, it is expressed as a percentage of CPU usage, which indicates how much of the processor’s capacity is currently in use by your application. When CPU usage stays close to 100%, additional action may be needed, because that points either to a problem in your application or to insufficient capacity on the host machine.
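One common way to obtain such a percentage is to compare the CPU time a process consumed with the wall-clock time that elapsed. The sketch below assumes that approach; the function names are hypothetical, and it measures the current Python process purely as an illustration.

```python
import time


def cpu_utilization_percent(cpu_seconds_used, wall_seconds, cpu_count=1):
    """CPU time consumed as a percentage of the CPU time that was available."""
    if wall_seconds <= 0 or cpu_count <= 0:
        raise ValueError("wall_seconds and cpu_count must be positive")
    return 100.0 * cpu_seconds_used / (wall_seconds * cpu_count)


def measure_own_cpu_percent(interval_seconds=0.2):
    """Sample this process's CPU time over a short interval."""
    cpu_start, wall_start = time.process_time(), time.perf_counter()
    time.sleep(interval_seconds)  # stand-in for the application's real work
    cpu_used = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return cpu_utilization_percent(cpu_used, wall_elapsed)


# 1.5 CPU-seconds used during 3 wall-clock seconds on one core.
print(cpu_utilization_percent(1.5, 3.0))  # 50.0
```

Passing the number of cores as `cpu_count` normalizes the result to the capacity of the whole machine rather than a single core.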
Memory Utilization refers to the amount of memory used by a web application while processing requests. Usually, it is calculated as the process’s percentage of memory utilization, which is the ratio of the Resident Set Size to the physical memory of the host. Note that the Resident Set Size (the space for text, data and stack that is held in RAM) is the memory the process actually occupies.
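The ratio itself is simple to compute once the two values are known. In this sketch the Resident Set Size and the physical memory are assumed to be supplied by whatever monitoring agent you use; the function name and the sample figures are hypothetical.

```python
def memory_utilization_percent(rss_bytes, physical_memory_bytes):
    """Resident Set Size as a percentage of the host's physical memory."""
    if physical_memory_bytes <= 0:
        raise ValueError("physical_memory_bytes must be positive")
    return 100.0 * rss_bytes / physical_memory_bytes


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical process with a 512 MiB resident set on an 8 GiB host.
print(memory_utilization_percent(512 * MIB, 8 * GIB))  # 6.25
```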
Typically, a web application creates many threads to process requests. The number of threads is an important metric because the number of threads per process is normally limited by the system. So if your application creates too many threads, it can be an indicator that you have a problem in the application. The count of existing threads grows with both the load and the processing time of the requests: the more requests arrive and the longer each one takes, the more threads are busy at the same time.
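As an illustration, the sketch below counts the live threads of a Python process and relates that count to a limit. The function name and the limit of 200 are assumptions made for the example; the real limit depends on your operating system and server configuration.

```python
import threading


def thread_usage_percent(active_threads, max_threads):
    """Share of the allowed thread count that is currently in use."""
    if max_threads <= 0:
        raise ValueError("max_threads must be positive")
    return 100.0 * active_threads / max_threads


if __name__ == "__main__":
    # Start a few idle worker threads, then count the threads that are alive.
    stop = threading.Event()
    workers = [threading.Thread(target=stop.wait) for _ in range(3)]
    for worker in workers:
        worker.start()

    active = threading.active_count()  # includes the main thread
    print(active, thread_usage_percent(active, max_threads=200))

    stop.set()
    for worker in workers:
        worker.join()
```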
A file descriptor is an object that a process uses to read from or write to an open file, and to work with open network sockets. The operating system places limits on the number of file descriptors that a process may open. A lack of available file descriptors can cause a wide variety of symptoms that are not always easily traced back to their cause. The Open File Descriptors (OFD) metric provides a count of the total number of file descriptors that are currently allocated and open for the process. The percentage of open file descriptors relative to the maximum number allowed for the process is a good metric for evaluating the health of a web application.
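A rough sketch of this percentage is shown below. The helper that inspects the current process assumes a Linux host, where the open descriptors are listed under `/proc/self/fd` and the per-process limit is available through the standard `resource` module; the function names are hypothetical.

```python
import os


def fd_usage_percent(open_fds, max_fds):
    """Open file descriptors as a percentage of the allowed maximum."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return 100.0 * open_fds / max_fds


def current_process_fd_usage():
    """Linux-only illustration: descriptor usage of the current process."""
    import resource  # Unix-only module, imported here on purpose

    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return 0.0  # no limit configured, so usage cannot be expressed as a share
    open_fds = len(os.listdir("/proc/self/fd"))
    return fd_usage_percent(open_fds, soft_limit)


# 256 descriptors open out of a limit of 1024.
print(fd_usage_percent(256, 1024))  # 25.0
```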
Server performance monitoring surely goes beyond these essential measurements. We have covered this topic extensively here as well as in our sister publication, where you can find a plethora of articles about different aspects of monitoring all kinds of servers. Also, if you are looking for a slow-motion view of what happens on a website when a user clicks on a link, here’s Warren’s series of blog posts that will help you understand the specific behaviors of web applications.