For a given SELECT query, the DBMS has a number of possible execution strategies. Here are a few that we will discuss (the list is not complete):
|Query||SELECT FROM EMPLOYEE WHERE EmpID=125|
|Number of EMPLOYEE records (r)||100,000|
|Number of disk blocks (b)||10,000|
|Blocking factor (bfr) (records per block)||10|
The DBMS must use a linear search when no database index exists on the selection condition (e.g., the EmpID). This is precisely the type of operation that we want to prevent with proper indexing.
Given our assumptions, the cost of a linear search for this query would be:
So, for the query above, C = b/2 = 5,000 block accesses.
The DBMS might employ a binary search on an index with non-unique entries. Say, for instance, that the database developer has a nonclustered index that groups by city, then first name, then last name, and they need to retrieve all customers who live in a certain city. The DBMS can use a binary search to find the first matching city record and then retrieve all subsequent city records by traversing down the database index row by row.
The cost of a performing a binary search is exactly the same as performing a binary search anywhere else, namely, C = log2b.
For the query above, the database developer has C = log2b = log210,000 = 14 block accesses.
If the column is unique (like a primary key, for example) then the database index can be implemented as a hash table. Such an index on the EmpID column would allow the database developer to hash directly to the correct employee record in constant time.
In general, static and linear hashes have a cost of C = 1.
For SELECT queries with composite selection conditions, the cost must be evaluated based on the database index structure which exists for each column involved in the join condition—which goes into more detail than this paper is designed to cover. Nevertheless, even for our very simple example query, we can mathematically verify an enormous performance improvement by using a good database index.