Welcome to Part V of Chronology of a Click. This series chronologically details what happens after a user clicks on a link in a web page. It is interspersed with website performance tips that relate to the topic being discussed. It is presented in multiple parts. Here’s what we’ve covered so far:
- The user clicks on a link.
- The browser issues an HTTP request.
- The HTTP request is packaged into TCP segments, then IP datagrams, then frames as it is passed down the client machine’s protocol stack.
- The frames hop from machine to machine as they travel through the Internet.
- They eventually arrive at the web server and are passed up the server’s protocol stack.
- IP datagrams are unpackaged from the frames.
- TCP segments are unpackaged from the IP datagrams.
- The HTTP request is unpackaged from the TCP segments.
- The HTTP request is decrypted and decompressed as needed.
And now the HTTP request comes knocking on the web server’s front door. The term web server sometimes refers to the machine, but herein it refers to the software.
As with pretty well every other stage in this process, the web server performs better if its environment is configured to make optimized use of its resources, and if those resources are plentiful. Example: Swapping and paging are never appropriate on a machine that hosts a web server.
All other software installed on the web server’s machine will compete with it for available resources, even if the software is not running. Install the web server on its own dedicated machine.
If you buy the cheapest possible equipment, you’ll get what you paid for.
A web server in its simplest form may be thought of as a file server, but it can do so much more than merely serve files upon request. The sections below discuss what a web server can do. They are presented in more or less chronological order, but the exact order on your web server is likely different.
Buy an Internet service with a high outgoing:incoming bandwidth ratio. Web servers send much more data than they receive, so we need much more outgoing bandwidth than incoming. Note that typical home services are not appropriate because they have a very low outgoing:incoming bandwidth ratio (because typical home users download much more than they upload).
Apache is used in the examples below to illustrate concepts that apply to most web servers. The syntax on other servers will be different, but the concepts in most cases will be similar.
Specify a complete list of options instead of using a wildcard specification with DirectoryIndex. The list should be in order, with the most commonly-used filenames first.
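As a rough illustration of the tip above, here is a minimal sketch of an explicit DirectoryIndex list. The filenames and their order are assumptions; use whatever names your site actually serves, most common first.

```apache
# Avoid the wildcard form, which forces content negotiation on every
# directory request:
#   DirectoryIndex index

# Explicit list, tried left to right; put the most common name first.
DirectoryIndex index.html index.php index.htm
```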
RAID (Redundant Array of Inexpensive Disks) allows us to stripe data across multiple drives, which not only gives a performance boost, but can also be used to improve data integrity.
Use separate servers for static content and dynamic content. This lets us have separate configurations for the two different types of content.
Modularization & Dynamic Configuration
Apache is highly modularized, which lets us easily include or exclude modules and extensions. This is done via command line switches when Apache is compiled or via httpd.conf at startup.
Performance Consideration: Loading features statically at compile time will improve performance, but the executable will occupy more memory.
Only the core and mod_so modules are strictly required, but you would be hard pressed to find a website that uses no others.
Performance Consideration: Do not load anything that your website does not use. This will decrease both RAM usage and execution time. [It is common to see web servers enabled for all sorts of wonderful things that are never used on any of their websites.]
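For dynamically loaded modules, trimming usually comes down to commenting out LoadModule lines in httpd.conf. The sketch below assumes a site that needs only directory indexes, MIME types, and access logging; the selection is illustrative, not a recommendation for your site.

```apache
# Keep only what the site actually uses.
LoadModule dir_module modules/mod_dir.so
LoadModule mime_module modules/mod_mime.so
LoadModule log_config_module modules/mod_log_config.so

# Unused features stay commented out so they are never loaded.
#LoadModule status_module modules/mod_status.so
#LoadModule autoindex_module modules/mod_autoindex.so
#LoadModule userdir_module modules/mod_userdir.so
```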
Apache is configured by placing plain-text directives in httpd.conf. Other configuration files can also be used in conjunction with httpd.conf by using the Include directive. Directives are read and executed only when Apache starts or restarts. Configuration can be controlled by setting/resetting command-line switches, editing the file and restarting, or setting/resetting environment variables. Global-level directives apply to the entire server, but directives can also be scoped to individual directories, files, URLs, or virtual hosts.
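The different scopes might look something like this in httpd.conf. All paths and values here are assumptions chosen for illustration, and the last block presumes mod_status is loaded.

```apache
# Global scope: applies to the entire server.
Timeout 60

# Pull in a second configuration file (hypothetical path).
Include conf/extra/httpd-vhosts.conf

# Directory scope: this directory and everything beneath it.
<Directory "/var/www/html/downloads">
    Options -Indexes
</Directory>

# File scope: any file with a matching name.
<Files "*.log">
    ForceType text/plain
</Files>

# URL scope: matches the request path, not the filesystem.
<Location "/server-status">
    SetHandler server-status
</Location>
```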
Apache also allows dynamic configuration through .htaccess files. Each .htaccess file relates to the directory in which it is found and to all subdirectories. Example: If the web server is serving a resource from a directory named x, which is a subdirectory of y, which is a subdirectory of z, which is a subdirectory of docroot, it must look for and parse .htaccess files in all four of those directories. If it finds a BrowserMatch directive in both y and docroot, it uses the directive in y and ignores the directive in docroot.
Note the number of disk accesses required to check all those ancestor directories. Avoid .htaccess whenever possible; use the Directory directive in httpd.conf instead. [This may not be possible if the web server is hosted by a third party.]
The httpd.conf file may use AllowOverride and/or AllowOverrideList to allow or disallow .htaccess files or specific directives within the .htaccess files. This can be done on a directory by directory basis.
If you must use .htaccess, use it as close to docroot as possible. Every additional level is another disk access.
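A sketch of how this could be arranged in httpd.conf follows. The directory paths are hypothetical, and the AuthConfig override is just one example of permitting a narrow class of directives.

```apache
# Turn off .htaccess lookups everywhere by default.
<Directory "/">
    AllowOverride None
</Directory>

# Settings that would otherwise live in a .htaccess file go here instead.
<Directory "/var/www/html/z/y">
    Options -Indexes
</Directory>

# Where .htaccess is unavoidable, permit only the directives needed.
<Directory "/var/www/html/shared">
    AllowOverride AuthConfig
</Directory>
```

With AllowOverride None in effect, Apache does not look for .htaccess files in those directories at all, which is where the savings in disk accesses come from.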
Virtual and Per-User Hosting
Web server software can run more than one website on one computer.
Virtual hosting creates multiple websites for multiple domains. A domain may share its IP address and/or docroot with another domain or have exclusive access to its own IP address and/or docroot. Example: Web hosting companies use virtual hosting for most of their clients. Their clients usually share IP addresses, but do not share docroots.
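The web hosting arrangement described above (shared IP address, separate docroots) can be sketched with name-based virtual hosts. The domain names and paths are placeholders.

```apache
# Apache 2.2 needs the next line to enable name-based hosting on the
# address; Apache 2.4 does not, so it is left commented out here.
#NameVirtualHost *:80

<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot "/var/www/example.com"
</VirtualHost>

<VirtualHost *:80>
    ServerName www.example.org
    DocumentRoot "/var/www/example.org"
</VirtualHost>
```

Apache picks the virtual host by comparing the request’s Host header to each ServerName, which is why a request made by bare IP address cannot select one of these sites.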
It is also possible to create websites for IP addresses that don’t have associated domain names. In fact, we can also create a website for a domain and create a completely different website for that domain’s IP address. [Now we're just getting weird.]
If websites share an IP address, those websites will not be able to serve resources by IP address; they can serve resources by domain name only. So what? Resources served by IP address avoid DNS lookups and cookie processing. IP addresses should be used instead of domain names whenever possible.
In addition to virtual hosting, per-user hosting allows each user account to host his own website. For example, if user ted on a machine named yellow.example.com wants to host his own website, he can put his files in his public_html directory and access them at http://yellow.example.com/~ted/. This technique is fading away as we trend toward dedicated servers that have only root and webadmin users.
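Per-user hosting of this kind is typically enabled with mod_userdir. A minimal sketch, assuming the module is loaded:

```apache
# Map http://yellow.example.com/~ted/ to ted's public_html directory.
UserDir public_html

# Never serve anything from root's home directory.
UserDir disabled root
```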
Concurrency, Threads, & Processes
If a web server processed all requests serially, it would become overwhelmed almost immediately. A web server must be able to service multiple requests simultaneously.
Apache manages concurrency with pluggable concurrency modules (also called multiprocessing modules or MPMs).
MPMs are specific to an environment, so we can’t choose one that’s not built for our web server’s platform. Of the remaining choices, though, we must realize that our choice can affect the speed and scalability of the web server. If scalability or performance is important, we can choose something like worker or event, which have multiple threads per process. If compatibility with older software is important, prefork may be more suitable, but it has only one thread per process. [Other MPMs are available from Apache and from third parties.]
Limit the resources given to each connection so no connection can starve the other connections.
Use MaxClients to set the maximum number of simultaneous connections.
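For the worker MPM, the connection ceiling is tied to the process and thread counts. The numbers below are illustrative assumptions, not tuned values. Note that Apache 2.4 renamed MaxClients to MaxRequestWorkers, although the old name is still accepted.

```apache
<IfModule mpm_worker_module>
    StartServers          2
    ServerLimit          16
    ThreadsPerChild      25
    MinSpareThreads      25
    MaxSpareThreads      75
    # 16 processes x 25 threads = at most 400 simultaneous connections.
    MaxClients          400
</IfModule>
```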
Limiting Resource Usage
It is conceivable that bandwidth, number of connections, and other resources may be limited on a per-client, per-virtual-host, or per-directory basis. Some of this functionality comes with recent versions of Apache, some is available in modules written by third parties, and some is not available yet.
A server administrator might want to use these techniques to guard against denial of service attacks, spam, or users who hog the server with multiple downloads.
Reverse IP Lookups
Web servers may do a reverse IP lookup upon receipt of every request. This ensures that the client really is who he says he is. If your data is secured by IP address or domain name, this prevents IP spoofing.
DNS lookups journey through the Internet, so they are slow (and can be very slow). Avoid this traffic by always turning HostnameLookups off. However, note that double-lookups will be performed if we use mod_authz_host, even if we turn HostnameLookups off.
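The setting itself is a single line. If hostnames are needed in reports, one option is to resolve them offline afterward, for example with Apache’s logresolve utility.

```apache
# Log client IP addresses as-is; do not resolve them to hostnames
# while the client is waiting.
HostnameLookups Off
```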
Authentication, Authorization, & Access Control
Authentication is the verification that someone is who they say they are. Authorization is the permission a user needs in order to be somewhere or get something.
Apache provides several techniques for protecting directories and files:
- Place the directories and files outside the docroot hierarchy if no one should ever access them from a web browser.
- Use Order, Allow, and Deny directives within the scope of a Files or Directory directive to allow/deny access by domain, IP address, or any HTTP header (see SetEnvIf).
Avoid DNS lookups by using IP addresses instead of domain names whenever possible in Allow and Deny directives.
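Here is a hedged sketch of the Order, Allow, and Deny technique using IP addresses only. The directory, the address range, and the environment variable name are assumptions. These directives are the Apache 2.2 syntax; Apache 2.4 replaces them with Require, shown in comments.

```apache
<Directory "/var/www/html/admin">
    # Evaluate Deny first, then let Allow carve out exceptions.
    Order deny,allow
    Deny from all
    Allow from 192.0.2.0/24

    # Access by HTTP header: set a variable, then allow on it.
    SetEnvIf User-Agent "^KnownClient" let_me_in
    Allow from env=let_me_in
</Directory>

# Rough Apache 2.4 equivalent of the address rule:
#   <Directory "/var/www/html/admin">
#       Require ip 192.0.2.0/24
#   </Directory>
```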
Performance Consideration: Large password files are a performance problem. If you have more than 100 users, you’re asking for trouble.
If we are protecting data, we should remember to use SSL during transmission. Without it, the data will be visible to every machine at every hop along the way.
Redirection & Rewriting
Now that the web server has a URL, it must figure out what to do with it. If the requested resource isn’t available for any reason (missing, moved, security restriction, etc.), the web server sends the appropriate HTTP response code to the client. Once again, down the protocol stack, through the Internet, and up the client’s protocol stack. The entire process now starts again from the very beginning (i.e., Part II of this article).
If this process results from a moved file (either permanent or temporary), the file may have moved to a different web server or to a different directory on the same web server. In either case, a redirect is sent to the client.
Avoid redirection and missing resources. They result from a bad request coming from one of our web pages, so they can be avoided by putting the correct link into the web page.
If the web server gets past the above process, it then follows a set of rewrite rules supplied by the website developers. These rules map the URL to a file on the local machine. When the web server is finished with the rewrite rules, it has a specific file name within the docroot hierarchy. The path and filename may look nothing like the original URL.
Redirection and rewriting are two different things, but there is a rewrite rule that initiates a redirect.
Avoid rewriting. If the URL hierarchy is the same as the file hierarchy on the hard disk, rewriting can be bypassed.
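To make the distinction concrete, the sketch below shows a redirect, an internal rewrite, and a rewrite rule that initiates a redirect. The URLs and file names are invented for illustration, and the patterns assume the rules sit in the server or virtual host configuration (where the matched path keeps its leading slash) rather than in .htaccess.

```apache
# Redirect (mod_alias): the client is told to ask again at a new URL.
Redirect permanent "/old-page.html" "http://www.example.com/new-page.html"

# Rewrite (mod_rewrite): the URL is mapped to a different file
# internally; the client never sees the new path.
RewriteEngine On
RewriteRule "^/products/([0-9]+)$" "/catalog/item.php?id=$1" [L]

# A rewrite rule that initiates a redirect, via the R flag.
RewriteRule "^/blog/(.*)$" "http://blog.example.com/$1" [R=301,L]
```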
Server-Side Includes (SSI)
HTML files can include dynamic content by using SSI directives. After configuring the web server, the programmers can merely include a directive to insert a date, a standard header/footer/navbar/component, or more complex content that is generated by a CGI script or some server-side scripting language (e.g., PHP). It is common to use a .shtml file extension if the file uses SSI, but that is not always the case. There are a few different ways to configure SSI.
SSI can be slow if configured poorly, which is easily done. We might be better off with a server-side scripting language (e.g., PHP) or XMLHttpRequest from the client side.
If you decide to use SSI, use XBitHack instead of .shtml file extensions.
SSI directives also allow variables to be set and conditions to be tested. However, it is far from a full 3GL, structured programming language.
Never configure SSI to parse all .html files. The performance hit is just too great. Use .shtml file extensions or XBitHack instead. Even better, use PHP or some other scripting language.
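The SSI configurations discussed above might be sketched as follows, assuming mod_include is loaded and the docroot path shown.

```apache
<Directory "/var/www/html">
    # Permit server-side includes in this tree.
    Options +Includes

    # Parse only the files whose execute bit has been set.
    XBitHack on
</Directory>

# Alternative: parse by file extension instead.
#   AddType text/html .shtml
#   AddOutputFilter INCLUDES .shtml

# What to avoid: this would send every .html file through the parser.
#   AddOutputFilter INCLUDES .html
```

With XBitHack, a page is opted in to parsing by setting its execute bit (for example, with chmod +x), so ordinary .html files are left alone.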
Content Negotiation
Browsers can specify the media types, languages, character sets, and encodings they can handle by setting the Accept, Accept-Language, Accept-Charset, and Accept-Encoding HTTP headers. When the web server sees these headers, it compares them to the media types, languages, character sets, and encodings available on the server. Deciding what to use in the response is called “content negotiation.”
The browser can prioritize its list, which makes the server’s choice a little easier, but content negotiation is not as simple as comparing the browser’s list to the server’s list and selecting the best choice from the intersection set.
- What if there is no intersection set between the browser’s list and the server’s list?
- What if the browser does not provide a list?
Type maps and multiviews make it possible to specify different files for different media types, languages, and encodings. The URL need only name the resource in a generic way; the server will choose the right file based on the HTTP accept-… headers.
Example: If the URL is http://example.com/navbar, the server may have files named navbar.en.php, navbar.es.php, and navbar.fr.php. It knows which one to use by inspecting the accept-language header from the request.
Type maps (.var files) list resources; the files that contain the resources; and the media types, languages, and encodings of those files. They were not originally intended to be a handy, easy-to-use documentation tool; they just turned out that way. The type maps can help with both debugging and ongoing maintenance.
Content can be in its own file or included within its type map file.
Placing the content into the type map file instead of in its own, separate file will save the web server one file-open operation. [It also groups all the files for a single resource together in one place.]
Content negotiation negatively impacts performance. Turn it off. However, if you decide that the benefits of content negotiation outweigh the performance hit, use type maps instead of multiviews. This will eliminate extra disk accesses.
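If you do keep content negotiation, a type map setup could look roughly like this. The docroot path is an assumption, and the type map contents (shown as comments, reusing the navbar example) are only a sketch of the file format.

```apache
# Treat .var files as type maps (mod_negotiation).
AddHandler type-map .var

# Leave MultiViews off so Apache does not scan the directory for
# candidate files on every request.
<Directory "/var/www/html">
    Options -MultiViews
</Directory>

# A hypothetical navbar.var might contain:
#
#   URI: navbar
#
#   URI: navbar.en.php
#   Content-type: text/html
#   Content-language: en
#
#   URI: navbar.es.php
#   Content-type: text/html
#   Content-language: es
```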
Retrieve the Content
This is starting to sound like the simplest part of a web server, but not so. Yes, now that it knows the specific file to serve, it can be as simple as copying the file to output. However, if the file is CGI or a script, the web server must turn control over to that program and wait for its output. Scripting is covered in the next part (Part VI) of this article.
Apache offers great flexibility in content manipulation before and after the content reaches the content generator. Manipulation before it reaches the content generation is called input filtering. Manipulation after it leaves the content generator is called output filtering.
Fix programs that spawn a large number of processes, don’t release connections, or have memory leaks. Add more memory if you must, but it’s better to locate and fix the code that causes the problem.
Beware potential performance problems with dynamic output filtering
. “Dynamic” means it happens while the user is waiting for a response – and that’s not a good time to be doing anything.
Create the HTTP Response
Now that the content is ready to go, the web server prepares the HTTP response, including the headers.
Make sure the web server always includes the Last-Modified, Expires, Cache-Control, Content-Length, and Content-Type headers. Caching can be affected if these headers are missing.
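For static files, Apache normally supplies Last-Modified, Content-Length, and Content-Type without extra configuration; the expiry headers usually need to be switched on. A minimal sketch using mod_expires and mod_mime, with lifetimes that are assumptions rather than recommendations:

```apache
# mod_expires sets both Expires and Cache-Control: max-age.
ExpiresActive On
ExpiresDefault "access plus 1 hour"
ExpiresByType image/png "access plus 1 month"
ExpiresByType text/css "access plus 1 week"

# Content-Type is derived from the file extension (mod_mime).
AddType image/svg+xml .svg
```

Dynamic content is a different story: scripts generally have to emit these headers themselves.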
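If required, the web server will now encode the body of the HTTP response (not the headers), typically by compressing it. It will take one last look at the HTTP request’s Accept-Encoding header to find out which encodings the client understands, then it will encode in one of those forms. This is similar to the content negotiation process that was discussed above, but it would more properly be called encoding negotiation. Encryption is a separate matter: SSL is negotiated during the connection handshake rather than through HTTP headers, and it encrypts the headers along with the body.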
Only encrypt when you need to. Avoid the extra processing time and SSL handshaking.
Web Server Caching
Now that all the hard work is done, the web server may decide to keep the content in memory for a while, just in case someone else should request it. After all, memory access is much, much faster than disk access.
To keep or not to keep is decided on the basis of configuration options and available memory. How long to keep it for is decided the same way. Which cached content to dispose of first when memory fills up is decided by a simple LRU (least recently used) algorithm.
Disk-based caching in Apache is not quite what you’d expect. If you use disk caching, remember that you are responsible for running htcacheclean at the right intervals.
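A hedged sketch of a disk cache setup follows. The cache path and size limit are assumptions. The module is named mod_disk_cache in Apache 2.2 and mod_cache_disk in Apache 2.4.

```apache
# Requires mod_cache plus the disk cache module.
CacheRoot "/var/cache/apache2/disk_cache"
CacheEnable disk "/"

# Apache does not prune this cache on its own. One approach is to run
# htcacheclean as a daemon that checks every 30 minutes and keeps the
# cache under 100 MB:
#
#   htcacheclean -d30 -n -t -p /var/cache/apache2/disk_cache -l 100M
```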
In-memory caching uses similar algorithms and protocols to the proxy caching that we discussed in part 3. One difference to keep in mind, though, is that UseCanonicalName should be turned on if you are using virtual hosting.
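In configuration terms, that might look like the following, with a placeholder host name and path.

```apache
# Build self-referential URLs from ServerName rather than from
# whatever host name the client happened to send.
UseCanonicalName On

<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot "/var/www/example.com"
</VirtualHost>
```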
Although the web server’s caching abilities are good, the operating system’s file/disk caching abilities are probably better. Make sure the operating system is configured to make the best possible use of file/disk caching.
Logging
Apache logging is highly configurable. It can include the initial request, the URL mapping process, the final resolution of the connection, errors, and anything sent from third-party modules or server-side scripts. Severity level can be set on a per-module basis to increase or decrease the amount of information sent to the log. The log entries’ format can be set to the common log format (CLF), the combined log format, or to a custom format of our design.
Stick to CLF. There are a good number of log-analysis tools out there, but most of them assume we are using CLF.
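A minimal CLF setup might look like this. The log file paths are assumptions, and the "common" nickname is just a label that ties the two directives together.

```apache
# Only warnings and worse go to the error log.
LogLevel warn
ErrorLog "logs/error_log"

# Common log format: host, identity, user, time, request line,
# status, and response size.
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "logs/access_log" common
```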
Performance Consideration: Log files need to be rotated because they can grow very big very fast. However, log files cannot be rotated while the server is running. Use apachectl graceful to restart the server without losing any connections, but note that this may cause a performance blip. It’s best to do this at the time of lowest usage. This comment does not apply to piped logging, offline post-processing, or cronolog.
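Piped logging, mentioned above as the exception, can be sketched with Apache’s rotatelogs utility. The paths and the 24-hour interval are assumptions.

```apache
LogFormat "%h %l %u %t \"%r\" %>s %b" common

# Pipe entries to rotatelogs, which starts a new file every
# 86400 seconds (24 hours) without a server restart.
CustomLog "|bin/rotatelogs /var/log/apache2/access_log 86400" common
```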
Forensic logging is very strict and has no customizations, but can aid in both debugging and security.
Input to and output from CGI scripts can also be logged.
Logging input to and output from CGI scripts can be handy in a test environment, but never use it in production.
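In a test environment, CGI logging could be switched on with mod_cgi’s ScriptLog directives. The file name and size cap below are assumptions.

```apache
# Test environments only: record details of CGI requests that fail,
# including the input sent to the script and the output it produced.
ScriptLog "logs/cgi_log"

# Stop logging once the file reaches about 1 MB.
ScriptLogLength 1048576
```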
Much more will be logged in the test environment because we need it for debugging and testing. Logging should never be the same in the test and production environments.
All logging impairs performance. If performance were our only concern, we would turn it off completely. However, because we must also consider the need for information during problem determination, we leave minimal logging turned on in production environments. The tradeoff is very real, so we need to give it some thought rather than blindly accept the default settings.
Close the Connection
After serving some content, Apache will wait for another request without closing the connection. However, it will not wait forever. The KeepAliveTimeout configuration directive specifies how long to wait.
Performance Consideration: Setting KeepAliveTimeout bigger or smaller may improve performance or it may degrade performance. The goal is not to maximize nor minimize this setting, but to optimize it.
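The related directives are shown below with starting values that are assumptions, not recommendations; the right numbers depend on your traffic and are best found by measuring.

```apache
# Let a client reuse one connection for several requests.
KeepAlive On

# Cap the number of requests served per connection.
MaxKeepAliveRequests 100

# Seconds to wait for the next request before closing the connection.
KeepAliveTimeout 5
```

A longer timeout saves clients from reconnecting but ties up a worker while the connection sits idle; a shorter one frees workers sooner at the cost of more new connections.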
A modern web server is far different from a simple file server. The selection and configuration of its features can dramatically impact performance. If a third party hosts your web server, you may not have enough control over its configuration to ensure acceptable performance.
The whole series is comprised of 15 parts: