Database management systems are pervasive in the modern world. The notion of a persistent, redundant, and highly distributable library of information has become the single most important concept in our information technology repertoire. In fact, virtually every human being in the Western world interacts with a database management system of some kind on a daily basis—often without using a personal computer at any time throughout the day.
With millions of data transactions taking place every second, it comes as little surprise that database optimization is an area of key research for academic institutions and corporate research and development departments. From a software company’s perspective, the relational database most often serves as the core of data-driven software applications, and lack of database optimization in such a key area can incur significant costs to both the client and the company.
With millions of data transactions taking place every second, it comes as little surprise that database optimization is an area of key research.
The purpose of this paper is to discuss basic database optimization using mathematical cost estimation for different types of queries, a review of join performance, and the effects of various physical access structures on specific query examples. The intended audience should be familiar with SQL and basic relational database concepts – typically an experienced database developer. Specific examples will be given in the context of MS SQL Server 2005, but the concepts they illustrate will be general enough to apply to any SQL-supporting relational DBMS (Database Management System).
After reviewing the paper, the reader will hopefully have a better understanding of how RDBMSs formulate execution strategies for complex queries and be able to use this knowledge to retrieve information at a lower cost.
A database index is a physical access structure for a database table that functions much as the name would suggest: it is a sorted file that informs the database of where records are physically located on the disk. To get the idea of what an index does, consider a textbook. In order to find a particular section, the reader can either start reading the book and keep reading until he finds what he is seeking, or, alternatively, he can consult the table of contents and go directly to the desired section. A database index functions much like a textbook index does.
Adding appropriate indexes to large tables is the single most important part of database optimization, as we will see when we discuss some examples of cost estimation. Creating a single index for a large table with no existing indexes can reduce a query’s execution time by an order of magnitude. As an example, consider the following scenario. Say we have a database table called EMPLOYEE with 100,000 records. Assume that we wish to perform the following simple query on the table and that no indexes exist on the table:
SELECT FirstName, LastName FROM EMPLOYEE WHERE EmpID = 12345;
In order to find the employee record with the appropriate EmpID, the database must potentially scan through all 100,000 employee records to return the correct result. This type of scan is referred to as a full table scan.
Luckily, a database developer can create an index on the EmpID column to prevent such scans from occurring. Additionally, if this field has a UNIQUE constraint, the DBMS will internally compile the index as a hash table, with each employee ID hashing to the desired record’s physical disk address. Scanning thus becomes completely unnecessary, and record location is performed in constant time. After the database developer adds this index, the DBMS can immediately locate the employee record with EmpID 12345—a potential reduction of 100,000 operations.
Indexes fall into one of two categories: clustered and unclustered. The primary distinction between the two categories is that an unclustered index does not affect that actual ordering of the records on disk and clustered indexing does. Because clustered indexing affects the physical ordering of the records, there can be at most one clustered index for each table. The same restriction does not apply to unclustered indexes, so we can create as many as disk space allows (although this is not necessarily a good idea, as we’ll see in a moment).
SQL Server 2005 and SQL Server 2008 come with a utility called the Database Engine Tuning Advisor. In SQL Server Management Studio, the database developer can highlight the query to optimize and then click the button labeled “Database Engine Tuning Advisor.” A window should appear similar to the following:
The Tuning Options tab allows the database developer to configure whether or not they want the advisor to replace existing indexes or only consider adding new ones. The database developer should evaluate existing indexes to determine whether or not existing indexes should be dropped.
Once the configuration options are set, clicking the “Start Analysis” button will begin the tuning. Once analysis is complete, a window with recommendations will be displayed along with the estimated performance improvement for the query being optimized. These recommendations include:
Recommendations may be applied automatically via the Actions > Apply Recommendations menu item.
Cost estimation is the process of applying a meaningful and consistent measure of execution cost to a particular query. Various metrics can be used for this purpose, but the most common and most relevant metric is the number of block accesses required by the query. Since disk I/O is such an expensive operation in terms of time consumption, our goal is to minimize the number of block accesses as much as possible while not sacrificing functionality.
After reading this paper, the database developer will hopefully have a better understanding of basic database optimization techniques and how the DBMS formulates execution strategies for different types of queries, even though the provided examples are very limited in scope. The database developer should also understand the importance of creating well-tuned indexes and what criteria go into selecting the columns for indexing .
The cost estimation examples provided in this document were modified from examples provided in Fundamentals of Database Systems 5th edition by Doctors Ramez Elmasri of the University of Texas at Arlington and Shamkant Navathe of the Georgia Institute of Technology.