Database performance optimization deals with reducing the time (and other resources) required to complete database queries by resolving various inefficiencies. There are many kinds of performance issues, and they can be categorized in many different ways. One important way is by scope: statement-level or instance-level. Instance-level performance tuning (which we are not covering in this mini-series) deals with such problems as excessive parsing, cache efficiency, concurrency, etc. Statement-level performance tuning deals mostly with inefficient data access.
In order to extract X bytes from the database for the user, it is generally necessary to read a (much) larger amount of information. This is because, as a rule, the required data is scattered across a much larger volume of information, and finding it is like finding the proverbial needle in a haystack.
Quantitative analysis of a query’s performance involves the following metrics:
- the actual volume of data required for answering the query
- data transfer rate
- data access efficiency, i.e. how much extra data we need to read in order to access the data we need.
Data access inefficiencies are of very different natures, are addressed by very different methods, and to varying degrees. Some have to do with the fundamental properties of relational databases (and thus cannot be eliminated completely). Some can be introduced (or amplified) by flaws in the logical data design (and thus will be very difficult to resolve after the application goes live). Some have to do with how the data design is implemented on the physical level (such problems are generally fixable, but may require application downtime, e.g. to make changes to indexes or partitioning). Finally, a large class of SQL performance problems results from poor decisions made by the optimizer due to inaccurate information about the data. Such problems can often be “hot-fixed” by improving optimizer statistics or by using various techniques (e.g. SPM baselines) to lock in the desired plan.
Data access can be compared to travel: the least possible travel time at a given speed is achieved when we travel in a straight line. The more curvy and twisty the path is, the longer the trip will take. And the worst-case scenario is when there are many possible paths and we don’t know which one to take – then we can get lost and keep wandering for a very long time.
Considering that the most common type of storage for databases is the hard disk drive (HDD), the analogy with travel becomes more than just an analogy. Reading information from an HDD involves mechanical motion of the actuator arm with respect to the disk itself. Depending on how the requested information is geometrically distributed over the surface of the disk (in a few dense “clusters” of tracks, or randomly scattered over the entire disk), the actuator’s path will be more or less “curvy and twisty”, which has a major impact on performance. Equally important is our advance knowledge of data distribution patterns. If it is not accurate, then we can easily “get lost” and keep “wandering” around the disk surface for a long time.
Depending on the way data is accessed (sequentially or randomly), different metrics are more suitable for describing the amount of data transferred. Large sequential I/O operations can be measured in megabytes, because the sequential transfer rate is a well-defined quantity for a given storage device (i.e. if you know what kind of storage you have, you know how many megabytes you can read sequentially per second). Random reads, on the other hand, are more conveniently measured in I/O operations rather than in megabytes, because the time a random I/O takes depends little on its size. A common abbreviation for this metric is LIO (“logical I/O”). So the amount of work actually performed by a database query can be characterized by LIO. On the other hand, the amount of data the user needs (or the amount of data required to answer the user’s question – those two can differ because of aggregation; we’ll get to that later) can often be more conveniently measured in rows.
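To make the sequential-vs-random distinction concrete, here is a back-of-the-envelope comparison in Python. The throughput and latency figures are assumed ballpark HDD numbers, not measurements:

```python
# Illustrative comparison: reading the same 100 MB sequentially
# vs. via random single-block (8 KB) reads on a hypothetical HDD.
data_mb = 100
seq_rate_mb_s = 100     # assumed sequential transfer rate (MB/s)
block_kb = 8            # typical database block size
random_read_ms = 6      # assumed single random read latency (ms)

seq_time_s = data_mb / seq_rate_mb_s                  # 1.0 s
n_random_reads = data_mb * 1024 // block_kb           # 12,800 reads
random_time_s = n_random_reads * random_read_ms / 1000

print(f"sequential: {seq_time_s:.0f} s")   # sequential: 1 s
print(f"random:     {random_time_s:.0f} s")  # random:     77 s
```

With these assumptions, the same volume of data takes almost two orders of magnitude longer to read randomly than sequentially, which is why the two access patterns are measured in different units.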
So if we measure the amount of useful data delivered to the user in rows, and the amount of resources spent in logical I/O operations, then the LIO-per-row (LIOPR) ratio characterizes data access efficiency.
LIOPR, however, shouldn’t be used blindly. The number of rows doesn’t have much meaning in the presence of aggregation. If data is sorted and then filtered, then LIOPR also cannot be used directly. But in general it is a good measure of a query’s efficiency (and when it can’t be used at the statement level, it can always be used to describe the efficiency of individual operations in the SQL plan). Some general guidelines for interpreting LIOPR values are summarized in the table below:
| LIOPR | efficiency |
|---|---|
| 10 – 100 | questionable |
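As a sketch of how LIOPR would be computed in practice (the numbers below are made up for illustration; the LIO count would come from query statistics such as buffer gets):

```python
def liopr(logical_ios: int, rows: int) -> float:
    """LIO per row: a rough measure of data access efficiency (lower is better)."""
    return logical_ios / rows

# A hypothetical query that did 50,000 buffer gets to return 1,000 rows:
ratio = liopr(50_000, 1_000)
print(ratio)  # 50.0 – i.e. in the "questionable" range
```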
Putting all this together, for I/O-bound queries the formula for performance looks like this:

(query time) = (LIOPR) × (rows) × (1 – BCHR) × sreadtim,

where sreadtim is the single-block read time and BCHR is the buffer cache hit ratio. Because some of the components cannot be known exactly, this formula is mostly suitable for qualitative analysis (i.e. to show which factors contribute to a query’s performance and to what extent), but some rough quantitative estimates can also be made.
For example, a query that retrieves 10,000 rows (no aggregation or sorting/filtering) and accesses data mostly via indexes (i.e. LIOPR ≈ 1) on an HDD with a 6-millisecond random access (i.e. single-block read) time can be expected to complete in about 1 minute, assuming no significant help from the buffer cache (i.e. BCHR = 0): 10,000 × 1 × 0.006 s = 60 s.
In the next part of this mini-series, we will look at components of this formula in more detail.