Tuning Analytic Functions

2 Apr

In general, tuning analytic functions (and more generally, all sort operations) is rather difficult. While for most poorly performing queries it’s relatively straightforward to gain some improvements  by applying “eliminate early” principle one way or another, for slow sort operations it’s rarely applicable. Usually options are limiting to rewriting a query without analytics (e.g. using self-joins or correlated subqueries to achieve the same goal) or manually resizing the workarea to reduce/eliminate the use of disk. Recently, however, I had a case where I managed to obtain an excellent performance gain using a different technique that I would like to share in this post.

The original query was selecting about 100 columns using the LAG function on one of the columns in the WHERE clause, but in my test case I’ll both simplify and generalize the situation. Let’s create a table with a sequential id, three filtering columns x, y and z, and 20 sufficiently lengthy columns.


drop table t;

create table t
as
with gen as (select level id from dual connect by level <= 1e3)
select rownum id,
 round(dbms_random.value(0, 10)) x,
 round(dbms_random.value(0, 10)) y,
 round(dbms_random.value(0, 10)) z,
 rpad('x', 999, 'x') padding1,
rpad('x', 999, 'x') padding2,
rpad('x', 999, 'x') padding3,
rpad('x', 999, 'x') padding4,
rpad('x', 999, 'x') padding5,
rpad('x', 999, 'x') padding6,
rpad('x', 999, 'x') padding7,
rpad('x', 999, 'x') padding8,
rpad('x', 999, 'x') padding9,
rpad('x', 999, 'x') padding10,
rpad('x', 999, 'x') padding11,
rpad('x', 999, 'x') padding12,
rpad('x', 999, 'x') padding13,
rpad('x', 999, 'x') padding14,
rpad('x', 999, 'x') padding15,
rpad('x', 999, 'x') padding16,
rpad('x', 999, 'x') padding17,
rpad('x', 999, 'x') padding18,
rpad('x', 999, 'x') padding19,
rpad('x', 999, 'x') padding20
from gen g1, gen g2
where rownum<=1e5;

In the query we select all columns from the table, but only a small fraction of rows (0.1%), and calculate LAG on one of the columns.

select id,
 prev_x,
 padding1,
 padding2,
 padding3,
 padding4,
 padding5,
 padding6,
 padding7,
 padding8,
 padding9,
 padding10,
 padding11,
 padding12,
 padding13,
 padding14,
 padding15,
 padding16,
 padding17,
 padding18,
 padding19,
 padding20
from
(
 select lag(x) over(order by id) prev_x,
 t.*
 from t
)
 where x = 1
 and y = 1
 and z = 1;

It would be nice if we could push the predicate into the subquery (i.e. filter first, and then calculate the LAG), but obviously that would change the meaning of the query and lead to different results. However, if we cannot reduce the dataset by restricting rows, we can reduce it by restricting selected columns. Since the query needs all columns, we would need to do a self-join to pick them up, but since we can do that after the filtering, it won’t cost us much. To further reduce the cost, let’s use the rowid:

select
 id,
 prev_x,
 padding1,
 padding2,
 padding3,
 padding4,
 padding5,
 padding6,
 padding7,
 padding8,
 padding9,
 padding10,
 padding11,
 padding12,
 padding13,
 padding14,
 padding15,
 padding16,
 padding17,
 padding18,
 padding19,
 padding20
from t,
 (
 select rowid, prev_x
 from
 (
 select id,
 lag(x) over(order by id) prev_x,
 x,y,z
 from t
 )
 where x = 1
 and y = 1
 and z = 1
 ) prev
where t.rowid = prev.rowid;

On my 11.2.0.1 test database rewritten query executed in about 30s, as opposed to over 2min for the original query (in the original case the difference was even more dramatic — 1 minute versus several hours, which could have been due to the fact that the optimizer underestimated the number of rows to be processed by the analytic function, which may have resulted in the workarea being undersized).

By the way this example also illustrate the importance of not abusing “*” and only selecting rows that are really needed to the user. Imagine a situation when the user is only interested in 5 columns out of 100+, but the developer puts “select *” in all subqueries (because it’s easier to type than “select col1, col2, col3, col4, col5″) and only filters the columns in the main query itself. In principle Oracle could be able to figure out that most of the columns aren’t needed in inner query blocks, but there’s no guarantee for that, and it could easily end up keeping all columns until the very last stage of query processing, which can have serious negative implications for query’s performance (especially if there are sorts or analytics involved).

About these ads

4 Responses to “Tuning Analytic Functions”

    • savvinov April 2, 2013 at 7:37 am #

      Hi Sayan,

      thanks for sharing this, I’ve added this to my “to-read” list.

  1. Sayan Malakshinov April 2, 2013 at 7:26 am #

    BTW, about analytic functions: I’ve shown once on sql.ru, that sometimes it’s better to move some calculating of “lag”s to pl/sql: http://www.sql.ru/forum/actualutils.aspx?action=gotomsg&tid=983774&msg=13483198

    • savvinov April 2, 2013 at 7:39 am #

      Indeed — for the query in my today’s post, I was also considering moving to PL/SQL as an option (that way one is able to do the filtering first, and then use indexes to locate consequtive values for remaining few records).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 344 other followers

%d bloggers like this: