Wednesday, November 28, 2012

Use ASH for troubleshooting a long running query

You wont believe it how useful ASH (Active Session History) can be.

Step 1: Find the sql_id of the long_running query (the following lists the top 100 most -busy queries. I.e., queries that are in the most ASH samples either on cpu or actively waiting on something) :

[Update]: In this newer post we describe how one can use ASH and SQL Monitoring to tune a complex query that does not even finish!

nkarag@DWHPRD> host cat ash_top100_sql
cat: ash_top100_sql: No such file or directory

nkarag@DWHPRD> host cat ash_top100_sql.sql
set pagesize 999
set lines 999
col username format a13
col program  format a10 trunc
col modules  format a10 trunc
col session_id format 9999
col session_serial# format 99999
col inst_id format 9

select *
from (
select
 username, sql_id, SQL_CHILD_NUMBER,
    sum(decode(ash.session_state,'ON CPU',1,0)) CPU,
    sum(decode(ash.session_state,'WAITING',1,0)) - sum(decode(ash.session_state,'WAITING',decode(wait_class,'User I/O',1, 0 ), 0)) WAITING,
    sum(decode(ash.session_state,'WAITING',decode(wait_class,'User I/O',1, 0 ), 0))  IO,
    sum(decode(session_state,'ON CPU',1,1))     TOTAL
from (
select sample_time , c.username, SESSION_ID, SESSION_SERIAL#, INST_ID, PROGRAM,module, IS_AWR_SAMPLE, a.sql_id, SQL_CHILD_NUMBER, IS_SQLID_CURRENT,
    SESSION_STATE, WAIT_TIME, WAIT_CLASS, EVENT, P1TEXT, P2TEXT, P3TEXT,  TIME_WAITED/100 time_waited_secs,
    BLOCKING_SESSION, BLOCKING_SESSION_SERIAL#, BLOCKING_INST_ID
from gv$active_session_history a,
 dba_users c
where
a.user_id = c.user_id
) ash
where
        SAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))
group by username, sql_id, SQL_CHILD_NUMBER
order by sum(decode(session_state,'ON CPU',1,1)) desc
)
where rownum < 101
/

nkarag@DWHPRD> @ash_top100_sql
Enter value for minutes_from_now: 20
old  19:        SAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))
new  19:        SAMPLE_TIME > sysdate - (20/(24*60))

USERNAME      SQL_ID        SQL_CHILD_NUMBER        CPU    WAITING         IO      TOTAL
------------- ------------- ---------------- ---------- ---------- ---------- ----------
SYS           a55a2732q80m7           0       1160          1          6       1167
SYS                                       -1         76        780          0        856
LSINOS        11dzaj21zthzr                0        345          0         30        375
SYS                                        0        108         46          0        154
LSINOS        5ufj8g6fw5dht                0          9          0         24         33
LSINOS        d1suttxfhnha0                0         22          0          0         22



Step 2: Analyze the top wait events of this query for a period of time (e.g., for the last 20 minutes):

nkarag@DWHPRD> host cat ash_events.sql
set pagesize 999
set lines 999


select  decode(grouping(event),1,'TOTAL',event) event,
                decode(grouping(wait_class),1,'TOTAL', wait_class) wait_class,
                sum(PCNT), sum(nosamples), sum(nosamplestot),
                decode(grouping(owner),1,'TOTAL', owner)owner, decode(grouping(object_name),1,'TOTAL', object_name) object_name, decode(grouping(object_type),1,'TOTAL', object_type) object_type,
                decode(grouping(P1TEXT),1,'TOTAL', P1TEXT) P1TEXT, decode(grouping(P2TEXT),1,'TOTAL', P2TEXT) P2TEXT, decode(grouping(P3TEXT),1,'TOTAL', P3TEXT) P3TEXT
from (
        select event, wait_class,round((count(*)/cnttot)*100) PCNT, count(*) nosamples, cnttot nosamplestot,P1TEXT, P2TEXT, P3TEXT,
                owner, object_name, object_type
        from (
        select  event, wait_class, P1TEXT, P2TEXT, P3TEXT, owner, object_name, object_type,  count(*) over() cnttot
        from gv$active_session_history a  join dba_objects b on(a.CURRENT_OBJ# = b.object_id)
        where
        SAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))
        and session_state= 'WAITING' and WAIT_TIME = 0
        and sql_id = nvl('&sql_id',sql_id)
        and SQL_CHILD_NUMBER = nvl('&SQL_CHILD_NUMBER',0)
        )t
        group by event, wait_class,  P1TEXT, P2TEXT, P3TEXT, owner, object_name, object_type, cnttot
        --order by count(*) desc
)
group by rollup(event, (wait_class,  P1TEXT, P2TEXT, P3TEXT, owner, object_name, object_type))
order by sum(PCNT) desc
/


nkarag@DWHPRD> @ash_events
Enter value for minutes_from_now: 20
old  13:        SAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))
new  13:        SAMPLE_TIME > sysdate - (20/(24*60))
Enter value for sql_id: a55a2732q80m7
old  15:        and sql_id = nvl('&sql_id',sql_id)
new  15:        and sql_id = nvl('a55a2732q80m7',sql_id)
Enter value for sql_child_number: 0
old  16:        and SQL_CHILD_NUMBER = nvl('&SQL_CHILD_NUMBER',0)
new  16:        and SQL_CHILD_NUMBER = nvl('0',0)



Formated with excel:


EVENT WAIT_CLASS SUM(PCNT) SUM(NOSAMPLES) SUM(NOSAMPLESTOT)  OWNER  OBJECT_NAME OBJECT_TYPE
cell single block physical read User I/O 100 99         ICT_DW ICT_USAGE_FCT_RAT_COMP_BIDX INDEX SUBPARTITION
TOTAL TOTAL 100 9 9  TOTAL TOTAL TOTAL
cell single block physical read TOTAL 100 9 9  TOTAL TOTAL TOTAL




It is obvious the query is doing a lot of single block reads on an index.
We can go a step further (always with the help of ASH), by trying to locate the execution plan operation that has taken the most DB time to execute.


nkarag@DWHPRD> host cat ash_ops.sql
set pagesize 999
set lines 999

select SQL_PLAN_OPERATION, owner, object_name, object_type, round((count(*)/cnttot)*100) PCNT, count(*) nosamples, cnttot nosamplestot
from (
        select     A.SQL_PLAN_OPERATION, owner, object_name, object_type, count(*) over() cnttot
        from gv$active_session_history a  join dba_objects b on(a.CURRENT_OBJ# = b.object_id)
        where
                SAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))
                and sql_id = nvl('&sql_id',sql_id)
)
group by SQL_PLAN_OPERATION, owner, object_name, object_type, cnttot
order by count(*) desc
/

nkarag@DWHPRD> @ash_ops
Enter value for minutes_from_now: 20
old   6:                SAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))
new   6:                SAMPLE_TIME > sysdate - (20/(24*60))
Enter value for sql_id: a55a2732q80m7
old   7:                and sql_id = nvl('&sql_id',sql_id)
new   7:                and sql_id = nvl('a55a2732q80m7',sql_id)


Formated with excel:


SQL_PLAN_OPERATION OWNER OBJECT_NAME OBJECT_TYPE PCNT NOSAMPLES NOSAMPLESTOT
SORT ICT_DW ICT_USAGE_FCT_RAT_COMP_BIDX INDEX SUBPARTITION 98 910 926
BITMAP INDEX ICT_DW ICT_USAGE_FCT_RAT_COMP_BIDX INDEX SUBPARTITION 2 14 926
PARTITION COMBINED ICT_DW ICT_USAGE_FCT_RAT_COMP_BIDX INDEX SUBPARTITION 0 1 926
BITMAP CONVERSION ICT_DW ICT_USAGE_FCT_RAT_COMP_BIDX INDEX SUBPARTITION 0 1 926


We can deduce that the most DB-time consuming operation is a SORT that used a specific bitmap index.

Now, that we have found the point in our sql that consumes the most time we can focus our efforts on tuning this, instead of doing irrelevant tunings. All this has been accomplished due to the detailed session data provided by ASH (Active Session History).

Cheers,
OL.

Update (2/1/2013):
In the above query apart from the column SQL_PLAN_OPERATION you must add also column SQL_PLAN_OPTIONS so as to have a clearer picture of the executed operation. For example:

Lets run the following query:

nikos@NIKOSDB> select count(amount_sold) from sh.sales
  2  /

COUNT(AMOUNT_SOLD)
------------------
   918843

Now lets get the execution plan operation that takes the most DB time in this query (we will utilize the column SQL_PLAN_OPTIONS to get the exact operation):

nikos@NIKOSDB> host cat ash_ops.sql
set pagesize 999
set lines 999


select SQL_PLAN_OPERATION, SQL_PLAN_OPTIONS, owner, object_name, object_type, round((count(*)/cnttot)*100) PCNT, count(*) nosamples, cnttot nosamplestot
from (
select     A.SQL_PLAN_OPERATION, SQL_PLAN_OPTIONS, owner, object_name, object_type, count(*) over() cnttot
from gv$active_session_history a  join dba_objects b on(a.CURRENT_OBJ# = b.object_id)
where        
SAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))
and sql_id = nvl('&sql_id',sql_id)
)            
group by SQL_PLAN_OPERATION, SQL_PLAN_OPTIONS, owner, object_name, object_type, cnttot
order by count(*) desc
/
nikos@NIKOSDB> @ash_ops
Enter value for minutes_from_now: 5
old   6: SAMPLE_TIME > sysdate - (&minutes_from_now/(24*60))
new   6: SAMPLE_TIME > sysdate - (5/(24*60))
Enter value for sql_id: 3dp4d69xdfh1u
old   7: and sql_id = nvl('&sql_id',sql_id)
new   7: and sql_id = nvl('3dp4d69xdfh1u',sql_id)

SQL_PLAN_OPERATION       SQL_PLAN_OPTIONS      OWNER     OBJECT_NAME    OBJECT_TYPE      PCNT  NOSAMPLES NOSAMPLESTOT
------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------- ---------- ---------- ------------
BITMAP INDEX       FAST FULL SCAN      SH     SALES_PROMO_BIX    INDEX PARTITION       100    1 1



So, the most time-consuming operation for this count was a fast full index scan on SALES_PROMO_BIX bitmap index.

OL


No comments:

Post a Comment