

Website Performance: Building Tables and Indexes

This is the second of four articles about Database Management System (DBMS) performance. The first presented an overview and some installation tips. The article you are now reading covers building the database's tables and indexes. Parts three and four get into the meat of accessing the database, with non-SQL tips in the third article and SQL tips in the fourth. Website Performance: Taxonomy of Tips introduced a classification scheme to help us organize the many performance tips found on the Internet. Database Management Systems fall into category #3.2 (the-server-side-script-accesses-a-service).

Although this article is based on MySQL experience, the concepts apply to other DBMSs as well.

Don’t Take My Word For It

The following tips may improve a database's performance, but they may not. Since every database's data and usage patterns are unique, the only way to know whether a tip helps is to test it before committing to it. When testing, use datasets and usage patterns that resemble the production environment. Measuring the results with a custom monitor saves time later, because no additional effort is required to implement monitoring after the webapp goes live.

Creating Tables

Physical data storage impacts performance. Where are the tables physically located? On which bus? On which computer? On which partition? If tables are split across multiple disks, which algorithm is used to map rows to disks? Does a row need to be moved from one disk to another when its data changes? Is data striped across multiple drives? We need to not only answer these questions, but to also consider whether our choices yield the best performance.

Compressing textual data is almost always a good idea. Other data types may have good compression ratios, too. The CPU cost of compressing and uncompressing is usually less than the cost of moving large amounts of data between the disk and RAM.
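As a sketch of what this looks like in MySQL's InnoDB engine (the table and column names are hypothetical): 5.6-era InnoDB compression requires innodb_file_per_table to be on and the Barracuda file format to be available.

```sql
-- Hypothetical table holding bulky text; compression is enabled
-- per table. Assumes innodb_file_per_table=ON and the Barracuda
-- file format, as 5.6-era InnoDB compression requires.
CREATE TABLE article_body (
    article_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    body        MEDIUMTEXT NOT NULL,
    PRIMARY KEY (article_id)
) ENGINE=InnoDB
  ROW_FORMAT=COMPRESSED
  KEY_BLOCK_SIZE=8;  -- compressed page size in KB; test 1/2/4/8/16
```

As with every tip here, test the KEY_BLOCK_SIZE values against realistic data; the best setting depends on how well the data compresses.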

When we design tables for performance, we must keep in mind that data integrity (including relational integrity) is more important than performance in almost all cases, and that there are sometimes tradeoffs between the two. For example, turning off MySQL's checksum feature may improve performance, but is the slightly elevated risk of undetected data corruption worth it?

Selecting a Storage Engine

DBMS’s may offer several storage engines. Which one we choose affects performance, either for the good or for the bad. Not all storage engines compress data automatically, allow indexing or random access, or support relational integrity. We must know the characteristics of the available storage engines before deciding which one works best with our tables and usage patterns.

Table Splitting

In many cases, especially for larger datasets, splitting tables horizontally or vertically improves performance, but at a cost of additional processing logic. A horizontal split puts some rows in one table and some in another. A vertical split puts some columns in one table and some in another.

Horizontal Splitting: In some designs, access to live data is not a requirement. A delay of a few minutes, hours, or days before data is available is perfectly acceptable. In these cases, a table can be split horizontally, with new rows in one table (the transactional table) and older rows in a second, identically-structured table (the non-transactional table). New rows are added to the transactional table and queries are performed on the non-transactional table. Rows are moved from the transactional table to the non-transactional table when system activity troughs or according to a predetermined schedule. If the DBMS implements prioritization of queries, the rows can also be moved over on an ongoing basis, but at a lower priority than other queries. Indexing is not required on the transactional table because of its write-only nature.
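A minimal sketch of the scheduled move, with hypothetical table names and a cut-off that is purely an assumption:

```sql
-- Move settled rows from the write-only transactional table to the
-- indexed, query-side table during a quiet period.
START TRANSACTION;

INSERT INTO orders_query
SELECT * FROM orders_txn
WHERE  created_at < CURDATE();  -- hypothetical cut-off

DELETE FROM orders_txn
WHERE  created_at < CURDATE();

COMMIT;
```

With a transactional engine such as InnoDB, wrapping both statements in one transaction ensures a failure cannot lose or duplicate rows mid-move.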

Horizontal splitting can improve performance even if our design requires access to live data. Each row is stored in one of several identically-structured tables. Given some immutable value in the row (e.g., the primary key), a mapping algorithm tells us which table a row is in (or should be put into). Placing each table on a separate drive is not strictly required, but it does further improve performance.
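MySQL offers a built-in form of this mapping: table partitioning. As a sketch (table name hypothetical), hash partitioning routes each row to a partition based on an immutable key, so the mapping algorithm lives inside the DBMS rather than in application code:

```sql
-- Built-in horizontal splitting: MySQL routes each row to one of
-- four partitions by hashing the primary key.
CREATE TABLE customer (
    customer_id INT UNSIGNED NOT NULL,
    name        VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_id)
) ENGINE=InnoDB
PARTITION BY HASH (customer_id)
PARTITIONS 4;
```

Where the storage engine supports it, each partition can also be given its own DATA DIRECTORY, which is how the separate-drive placement described above is achieved.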

Vertical Splitting: If a table has many columns, split it by putting the less-frequently-used columns into a second table. If a query requires data from both tables, a simple join on the primary key is used. In many cases, though, only the first table is used, which means the DBMS does not have to work its way through all the less-frequently-used data.
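A sketch of a vertical split, with hypothetical tables and columns:

```sql
-- Frequently-used columns stay in the main table...
CREATE TABLE product (
    product_id  INT UNSIGNED NOT NULL,
    name        VARCHAR(100) NOT NULL,
    price       DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (product_id)
);

-- ...while bulky, rarely-used columns move to a second table
-- sharing the same primary key.
CREATE TABLE product_detail (
    product_id  INT UNSIGNED NOT NULL,
    long_descr  TEXT,
    spec_sheet  TEXT,
    PRIMARY KEY (product_id)
);

-- The occasional query that needs both is a simple join on the key.
SELECT p.name, d.long_descr
FROM   product p
JOIN   product_detail d USING (product_id)
WHERE  p.product_id = 42;
```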

This concept can be extended in many ways. For example, if our analysis of usage patterns identifies a few frequently-used, but independent, queries, we could place the columns for each of those queries into separate tables. If two queries have a column in common (i.e., they are not completely independent), that column can be placed in both tables to eliminate the need for a join. This is an example of acceptable redundancy in physical database design, but processing logic must make sure the redundant data is always in sync.
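One way (an assumption on my part, not a prescription from the series) to keep such a redundant column in sync is a trigger that copies the shared column into the second table whenever the first changes. All names here are hypothetical:

```sql
-- Keep the redundant "status" column in order_summary synchronized
-- with the authoritative copy in order_header.
CREATE TRIGGER sync_status_copy
AFTER UPDATE ON order_header
FOR EACH ROW
    UPDATE order_summary
    SET    status   = NEW.status
    WHERE  order_id = NEW.order_id;
```

Triggers add write overhead of their own, so this, too, should be measured against simply performing the join.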

Creating Indexes

Indexes often improve performance, but not always. Queries that write to a table must also update the table's indexes: the more indexes we have, the more indexes we must update. Writing to a table takes longer if there are too many indexes, just as reading from a table takes longer if there are too few. When indexes are used, the following tips may help.

Columns that are used in

  • WHERE clauses,
  • ORDER BY clauses,
  • GROUP BY clauses, or
  • JOIN expressions (e.g., primary keys, foreign keys)

are usually indexed. Other columns are usually not indexed. This is probably a good point from which we can start our testing and measuring.
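As a starting sketch, a query's WHERE, ORDER BY, and JOIN columns suggest its candidate indexes (table, column, and index names below are hypothetical):

```sql
-- The query whose columns we want to cover:
SELECT   c.name, o.total
FROM     orders o
JOIN     customer c ON c.customer_id = o.customer_id
WHERE    o.status = 'open'
ORDER BY o.placed_at;

-- Candidate indexes to test against that query:
CREATE INDEX idx_orders_status    ON orders (status);
CREATE INDEX idx_orders_placed_at ON orders (placed_at);
-- customer_id on orders (the foreign key) and the primary key on
-- customer would normally already be indexed.
```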

Keep indexed columns short and indexes small. Short, auto-increment INT columns make the best primary keys.

Choose between composite (multi-column) indexes and multiple single-column indexes based on the queries. When queries filter on the same group of columns together, a single composite index usually serves them better, and MySQL can also use any leftmost prefix of it; separate single-column indexes suit columns that are queried independently. Test both layouts against the real workload.
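When weighing index layouts, MySQL's leftmost-prefix behavior is worth knowing: a composite index can serve queries on its leading column(s) but not on a trailing column alone. A sketch with hypothetical names:

```sql
-- A composite index on (last_name, first_name):
CREATE INDEX idx_name ON customer (last_name, first_name);

-- Can use idx_name (leading column):
SELECT * FROM customer WHERE last_name = 'Gaebel';

-- Can use idx_name (full leftmost prefix):
SELECT * FROM customer
WHERE  last_name = 'Gaebel' AND first_name = 'Warren';

-- Cannot use idx_name (trailing column alone):
SELECT * FROM customer WHERE first_name = 'Warren';
```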

Like tables, indexes are stored on disk, so the physical storage of indexes also impacts performance. The discussion on physical storage above applies to indexes, too. Also consider the pointer size used by indexes (in MySQL, for example, the myisam_data_pointer_size option).

If a query affects most or all of the rows in a table, sequential access is faster than using an index because it minimizes disk seeks (assuming low fragmentation on the HDD). In most cases, the DBMS knows this and automatically bypasses the index. If all the queries on a table are like this, indexing won’t be used, so building indexes can actually degrade overall performance.
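To see which choice the optimizer makes, MySQL's EXPLAIN statement shows the planned access path (a small sketch with a hypothetical table):

```sql
-- A "type" of ALL in the output means a sequential (full-table)
-- scan; "ref" or "range" means an index will be used.
EXPLAIN SELECT * FROM orders WHERE status = 'open';
```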

Indexes are less important for queries on small tables.

If the DBMS gives us a choice of indexing algorithms, read the documentation to see which one is best for a given situation.

References

Custom Monitors in Paid Monitor with Python by Hovhannes Avoyan.  Published 2011.07.07 by Paid Monitor at blog.monitor.us/index.php/2011/07/07/custom-monitors-in-paid-monitor-with-python.  Accessed 2012.01.13.

Paid Monitor Open API.  Published by Paid Monitor at portal.monitor.us/api.  Accessed 2012.01.13.

MySQL 5.6 Reference Manual: 7. Optimization. Published at dev.mysql.com/doc/refman/5.6/en/optimization.html. Accessed 2012.02.08.

MySQL 5.6 Reference Manual: 7.2.1.4. How to Avoid Full Table Scans. Published at dev.mysql.com/doc/refman/5.6/en/how-to-avoid-table-scan.html. Accessed 2012.02.08.

Using M3 to Take System Monitors to the Next Level by Josh Mattson.  Published 2011.09.06 by Paid Monitor at blog.monitor.us/index.php/2011/09/06/using-m3-to-take-custom-system-monitors-to-the-next-level.  Accessed 2012.01.13.

Website Performance: Taxonomy of Tips by Warren Gaebel, B.A., B.C.S.  Published 2011.12.29 by Paid Monitor at blog.monitor.us/2011/12/website-performance-taxonomy-of-tips.  Accessed 2012.01.31.


The Paid Monitor Exchange at GitHub.  This is the official repository for scripts, plugins, and SDKs that make it a breeze to use the Paid Monitor system to its full potential.  Accessed 2012.01.13.


About Warren Gaebel

Warren wrote his first computer program in 1970 (yes, it was Fortran).  He earned his Bachelor of Arts degree from the University of Waterloo and his Bachelor of Computer Science degree at the University of Windsor.  After a few years at IBM, he worked on a Master of Mathematics (Computer Science) degree at the University of Waterloo.  He decided to stay home to take care of his newborn son rather than complete that degree.  That decision cost him his career, but he would gladly make the same decision again. Warren is now retired, but he finds it hard to do nothing, so he writes web performance articles for the Monitor.Us blog.  Life is good!