某核心系统从11.2.0.3升级到11.2.0.4,升级后CPU市盈率下降幅度很大,系统平稳,然而在第二天的业务高峰期某核心语句执行时间从原先的几毫秒变成几百秒,严重影响业务。很显然,在数据量变化不大的情况下,SQL语句执行效率下降,那大概率是执行计划发生了变化,至于执行计划变化的原因,因为系统版本变化,考虑如下:
1)统计信息变化
2)CBO优化器变化导致问题
3)优化器BUG
  语句其实很简单,语句模拟如下:
create table t as select * from dba_objects;
--表t的object_id列有索引,其实这里的last_ddl_time也是有索引的,而且可以走索引,为了简化,不进行模拟

create index idx_t on t(object_id);
exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'t',no_invalidate=>false);
SQL> select count(*) from t;
  COUNT(*)
----------
    261898
原始语句:
select *
from(
select rowid,t.*
from t where t.object_id in
(
select object_id
from(
select object_id
from t
where mod(object_id,10)=0 
and status='VALID'
and last_ddl_time > trunc(sysdate-200)
order by timestamp,last_ddl_time
    ) where rownum
)   and t.status='VALID'
and t.last_ddl_time > trunc(sysdate-200)
order by last_ddl_time
) where rownum

  本来在11.2.0.3上平稳运行,执行计划走NESTED LOOPS,子查询结果作为驱动,然后驱动外层表,从而走object_id索引。但是升级后的执行计划却是这样的:
--执行12分钟还没有出现结果
SQL> set autotrace traceonly
SQL> select *
  2  from(
  3  select rowid,t.*
  4  from t where t.object_id in
  5  (
  6     select object_id
  7     from(
  8             select object_id
  9             from t
 10             where mod(object_id,10)=0 
 11             and status='VALID'
 12             and last_ddl_time > trunc(sysdate-200)
                order by last_ddl_time
 13   14      ) where rownum 15  )   and t.status='VALID'
 16             and t.last_ddl_time > trunc(sysdate-200)
 17             order by last_ddl_time
 18  ) where rownum^Cselect *
*
ERROR at line 1:
ORA-01013: user requested cancel of current operation
Elapsed: 00:12:05.73

--问题执行计划如下:
Execution Plan
----------------------------------------------------------
Plan hash value: 3028954274
-------------------------------------------------------------------------------------
| Id  | Operation                    | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |      |     1 |   219 |  2100K  (2)| 07:00:08 |
|*  1 |  COUNT STOPKEY               |      |       |       |            |          |
|   2 |   VIEW                       |      |     1 |   219 |  2100K  (2)| 07:00:08 |
|*  3 |    SORT ORDER BY STOPKEY     |      |     1 |   100 |  2100K  (2)| 07:00:08 |
|*  4 |     FILTER                   |      |       |       |            |          |
|*  5 |      TABLE ACCESS FULL       | T    |  4936 |   482K|   855   (3)| 00:00:11 |
|*  6 |      FILTER                  |      |       |       |            |          |
|*  7 |       COUNT STOPKEY          |      |       |       |            |          |
|   8 |        VIEW                  |      |    49 |   637 |   851   (2)| 00:00:11 |
|*  9 |         SORT ORDER BY STOPKEY|      |    49 |  1960 |   851   (2)| 00:00:11 |
|* 10 |          TABLE ACCESS FULL   | T    |    49 |  1960 |   850   (2)| 00:00:11 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM   3 - filter(ROWNUM   4 - filter( EXISTS ()
   5 - filter("T"."STATUS"='VALID' AND
              "T"."LAST_DDL_TIME">TRUNC(SYSDATE@!-200))
   6 - filter("OBJECT_ID"=:B1)
   7 - filter(ROWNUM   9 - filter(ROWNUM  10 - filter("STATUS"='VALID' AND MOD("OBJECT_ID",10)=0 AND
              "LAST_DDL_TIME">TRUNC(SYSDATE@!-200))

  和原先预计的走NESTED LOOPS不一样,走了FILTER操作,FILTER这种有2个子节点的,说明子查询未展开,也就是查询转换失败。一般遇到这种情况,首先为了快速解决问题,肯定是用SQL PROFILE之类的工具先绑定正确的执行计划,然而使用SQL PROFILE无效。那只能进一步分析解决问题了。查看10053:
 
*****************************
Cost-Based Subquery Unnesting
*****************************
SU: Unnesting query blocks in query block SEL$1 (#1) that are valid to unnest.
Subquery removal for query block SEL$3 (#3)
RSW: Not valid for subquery removal SEL$3 (#3)
Subquery unchanged.
Subquery Unnesting on query block SEL$2 (#2)SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block SEL$2 (#2).
SU:   Checking validity of unnesting subquery SEL$3 (#3)
SU:     SU bypassed: Subquery in a view with rowid reference.                   
--含有ROWID的subquery unnest失败
SU:   Validity checks failed.

  10053显示因为子查询的视图含有rowid导致subquery unnest失败,遇到这种情况要么去MOS上看看是不是BUG,要么就是改写语句。通过查询MOS,发现含有rowid的查询有不少问题:
  通过搜索"subquery rowid"找到个比较相近的:
Query Referencing ROWID of Subquery With Join Fails With ORA-01445 (Doc ID 1929880.1)
上面的内容如下:

SYMPTOMS

A query referencing a rowid from a subquery with a join fails with the following error:

SQL> select rowid
from
  (
    select e.empno
    from emp e left outer join (select deptno from dept) d
      on (e.deptno = d.deptno)
  );

ORA-01445: cannot select ROWID from, or sample, a join view without a key-preserved table


CAUSE

The error is observed in query with ANSI joins because the way the query is written.

A rowid is only defined for individual rows in a table, so it is not legal to select a rowid from a subquery unless each row from that subquery can be guaranteed to be uniquely associated with exactly one row in one table.  Therefore, the subquery may have only one item in its FROM list; that item must also have a rowid; and the subquery must not use DISTINCT, GROUP BY, or anything else that might gather multiple rows into one.

SOLUTION

 Reference the rowid when it is valid as an explicit select list item:

SQL> select rid as "ROWID"
from
  ( select e.empno, e.rowid as rid
    from emp e left outer join (select deptno from dept) d
      on (e.deptno = d.deptno)
  );
 
  可以按照上面的思路将rowid改写成别名,再在最外层将别名改回来,以保证SQL语句的查询列名一致,改写如下:
--改写方案       
select rd as "ROWID",object_id,object_name,last_ddl_time
from(
select rowid rd,t.*
from t where t.object_id in
(
select object_id
from(
select object_id
from t
where mod(object_id,10)=0 
and status='VALID'
and last_ddl_time > trunc(sysdate-200)
order by timestamp,last_ddl_time
    ) where rownum)   and t.status='VALID'
and t.last_ddl_time > trunc(sysdate-200)
order by last_ddl_time
) where rownum
  改写后的执行计划正确,如下所示:
--改写方案       
selectExecution Plan
----------------------------------------------------------
Plan hash value: 16082276
-------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |    49 |  4900 |   931   (2)| 00:00:12 |
|*  1 |  COUNT STOPKEY                 |          |       |       |            |          |
|   2 |   VIEW                         |          |    49 |  4900 |   931   (2)| 00:00:12 |
|*  3 |    SORT ORDER BY STOPKEY       |          |    49 |  5145 |   931   (2)| 00:00:12 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| T        |     1 |   100 |     2   (0)| 00:00:01 |
|   5 |      NESTED LOOPS              |          |    49 |  5145 |   930   (2)| 00:00:12 |
|   6 |       VIEW                     | VW_NSO_1 |    49 |   245 |   851   (2)| 00:00:11 |
|   7 |        HASH UNIQUE             |          |    49 |   245 |            |          |
|*  8 |         COUNT STOPKEY          |          |       |       |            |          |
|   9 |          VIEW                  |          |    49 |   245 |   851   (2)| 00:00:11 |
|* 10 |           SORT ORDER BY STOPKEY|          |    49 |  1960 |   851   (2)| 00:00:11 |
|* 11 |            TABLE ACCESS FULL   | T        |    49 |  1960 |   850   (2)| 00:00:11 |
|* 12 |       INDEX RANGE SCAN         | IDX_T    |     1 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM   3 - filter(ROWNUM   4 - filter("T"."STATUS"='VALID' AND "T"."LAST_DDL_TIME">TRUNC(SYSDATE@!-200))
   8 - filter(ROWNUM  10 - filter(ROWNUM  11 - filter("STATUS"='VALID' AND MOD("OBJECT_ID",10)=0 AND
              "LAST_DDL_TIME">TRUNC(SYSDATE@!-200))
  12 - access("T"."OBJECT_ID"="OBJECT_ID")
Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       4026  consistent gets
          0  physical reads
          0  redo size
       7402  bytes sent via SQL*Net to client
        586  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
        100  rows processed
) where rownum
  现在执行计划正确走NESTED LOOPS,并且被驱动表走OBJECT_ID索引。再次思考一个问题,从SQL语句的语义和业务上分析,这个语句其实就是按照条件查询然后排序,查询出前100行的rowid和指定列,也就是没有必要用子查询或关联查询,可以将语句进一步简化:
--其实原来的业务就是这么简单
select rowid,object_id,object_name,last_ddl_time
from (
select object_id
from t
where mod(object_id,10)=0 
and status='VALID'
and last_ddl_time > trunc(sysdate-200)
order by timestamp,last_ddl_time
)
) where rownum
  通过业务分析后改写的SQL很简单,其实本质就是查询之后排序,然后找前100行。由于原始语句写成子查询并且又包含了rowid,导致触发优化器的限制或不足,导致子查询无法unnest,从而走了FILTER影响效率。
  通过这个案例,可以知道,一些oracle的key words,在做别名的时候还是得注意,尽量避免使用key words作为别名,以防在不同版本中触发oracle的bug。
01-11 18:56