Tracing Parallel Execution with _px_trace. Part I [ID 444164.1] - Chinaunix

改时间 10-OCT-2011 类型 BULLETIN 状态 PUBLISHED

***Checked for relevance on 29-April-2010***
***Checked for relevance on 10-Oct-2011***
Tracing Parallel Execution with _px_trace. Part I

Introduction

Since 9.0.1 a new mechanism has been introduced to obtain Parallel Execution (PX) trace information. This trace mechanism is keyword and parameter based. The existing event-based tracing is still available, but for some areas in _px_trace like compilation there is no event. Because each component of PX controls its own tracing, getting an event-based trace usually means setting one or more events and then deciding at what level they should be set; this can be cumbersome and there may not be any consistency of trace format across the different events. Keyword-based tracing is simpler to set than event-based tracing and is consistent across PX components.

This note gives an overview of some parts of the new trace facility and can also be used to help diagnose PX problems. This note is written for people with some background and experience in PX.

This first part gives some examples of how to trace the compilation of a PX query. This helps for question "Why does my query (not) run in parallel?". It also gives some examples of how to trace the allocation of PX server ( slaves ) and some idea why they are maybe not can be allocated.

There are three sections to this note :

Important PX phrases Explain the main phrases/terms that are used in document.
Usage - gives some examples with annotated trace illustrating various scenarios :
- Example#1 select statement with one slave set, degree of parallelism from the dictionary.
- Example#2 select statement with two slave sets, degree of parallelism from the dictionary.
- Example#3 select statement, too few slaves available to run query in parallel.
- Example#4 requested Degree Of Parallelism (DOP) rounded down dynamically due to CPU load
- Example#5 Degree Of Parallelism taken from a hint.
- Example#6 Join of two tables each with different DOP setting, which DOP do we choose ?
- Example#7 Parallel execution in RAC.
Current SQL cursor How to find the current SQL command for what the trace information is written.
Reference - gives a full list of the options.

Important PX phrases

QC The user background process that coordinates the slaves and that does all the work that runs in serial.
Instance All the active processes and the memory of the database belong to one Oracle SID. A RAC system is made of more than one instance.
set/threadsParallel query use a divide and conquer strategy. This means if for example a large table needs to be scan, multiple processes are used to scan parts of the table. This processes is a single step are called thread and the number is the DOP. Oracle use for complex query not only 1 set of process it use 2 or more. This is necessary to avoid skewed data.
DOP This is the number of threads in 1 set.
slave This is a synonym for a thread ( single process in a set).
height This is number of thread in a set ( synonym for DOP).
server/Svrs This is a synonym for thread or slave.

Usage

All of these examples were generated with the following trace setting:

SQL> alter session set "_px_trace"="compilation","execution","messaging";

Trace files are created in the user_dump_dest and for the slaves in the background_dump_dest. On some platforms they are written in the user_dump_dest. On RAC systemes trace files are written on every node, where a slave or the qc is started. Interesting is here the QC trace files.

Example #1

This example should give an overview over the most important functions in the area of compilation and allocation of parallel execution servers.
Here's a simple query which will use one slave set with two slaves:

SQL> alter session set "_px_trace"="compilation","execution","messaging";
Create a large table for the next examples with: CREATE TABLE bigemp AS SELECT * FROM all_objects;
SQL> alter table bigemp parallel 2;
SQL> select count(*) from bigemp;

------------ start of trace --------------
--- cut some lines--- kkfdapdml
      pgadep:2 pdml mode:0 PQ allowed DML allowed not autonomous => not allowed
-->This are checks if PDML is enabled. This lines can be ignored.
--- cut some lines---
The table : 1166 ---> This is Table object id. from all_objects.
kkfdPaPrm ---> This shows the functions from the source code.
      DOP = 2 (computed from hint/dictionary)
      use dictionary DOP on table ---> Use Degree Of Parallelism from the dictionary
kkfdtsc
      DOP = 2 (from kkfdPaPrm())
kkfdPiAlo
      DOP = 2 (kkfdpideg)
      Invalid Part. No. ---> Not a partitioned table
      No Pruning Info ---> Ditto
      Pararllel Operation Type = 2 (kkfdpiOpType)
      Flags = 41 (kkfdpiflg)
      min. tranx free list = 0 (kkfdpimgi)
kkfdPaPrm
      DOP = 1 (computed from hint/dictionary)
      use dictionary DOP on table
kkfdtsc
      serial - DOP = 1 (from kkfdPaPrm()) or 10383 level 1 not set
kxfrSysInfo
      DOP trace -- compute default DOP from system info
      # instance alive = 1 (kxfrsnins)
kxfrDefaultDOP
      DOP Trace -- compute default DOP
      # CPU = 1
      Threads/CPU = 2 ("parallel_threads_per_cpu")
      default DOP = 2 (# CPU * Threads/CPU)
      default DOP = 2 (DOP * # instance)
kxfrSysInfo
      system default DOP = 2 (from kxfrDefaultDOP()) ---> This is the computed System default DOP.
kxfralo
      DOP trace -- requested thread from best ref obj = 2 (from kxfrIsBestRef()) ---> This is the DOP after all objects are checked. The DOP can be adjusted in the next functions, because there are restrictions like adaptive load algorithm.
kxfralo
      threads requested = 2 (from kxfrComputeThread())
kxfralo
      adjusted no. threads = 2 (from kxfrAdjustDOP())
kxfralo
      about to allocate 2 slaves
kxfrAllocSlaves
      DOP trace -- call kxfpgsg to get 2 slaves
kxfpgsg
      num server requested = 2
kxfpiinfo
      inst[cpus:mxslv]
      1[1:10]m ---> On this system we have: 1 instance[1 CPU : parallel_max_servers=10]
kxfpgsg
      getting 1 sets of 2 threads, client parallel query execution flg=0x30 ---> Get 1 slave set of 4 slaves
      Height=2, Affinity List Size=0, inst_total=1, coord=1
      Insts 1
      Threads 2
kxfpg1srv
      trying to get slave P000 on instance 1 ---> Let's go and grab the slaves ( repeated for each slave). We see also the name of the slave process.
kxfpg1sg
      Got It. 1 so far.
kxfpg1srv
      trying to get slave P001 on instance 1
kxfpg1sg
      Got It. 2 so far.
kxfrAllocSlaves
      actual num slaves alloc'd = 2 (kxfpqcthr) ---> Then display the number of slaves allocated
kxfxcPutSession ---> This function is only visible in version 10g or higher. This shows the CBO parameter that are used to compile the PX query.
Send Compilation environment:
      optimizer_mode_hinted = false
      optimizer_features_hinted = 0.0.0
      parallel_execution_enabled = true
      parallel_query_forced_dop = 0
      parallel_dml_forced_dop = 0
      parallel_ddl_forced_degree = 0
      parallel_ddl_forced_instances = 0
      _query_rewrite_fudge = 90
      optimizer_features_enable = 10.2.0.3
......
------------end of trace --------------
For questions like "Why does my query run in parallel?", the DOP setting in function kkfdPaPrm should be checked. Is there one with a DOP > 1? If so, this part of the trace will also show the object id from which we got the DOP setting and this id can be used to find the object name. Then you can check if there is are parallel setting in the data dictonary or if there is a hint.

Example #2

A slightly more complex query which will use 2 slaves sets each of 2 slaves and therefore require 4 slaves in total

SQL> alter table bigemp parallel 2;
SQL> select ename, count(*) from bigemp group by ename;

To make it easier to read, only extracts from the trace file will be shown:

------------ start trace --------------

kkfdPaPrm
DOP = 2 (computed from hint/dictionary)
use dictionary DOP on table ---> From ALTER TABLE BIGEMP PARALLEL 2

--- cut some trace ---

kxfpiinfo
inst[cpus:mxslv]
1[2:2] ---> Instance 1 [2 CPUS: parallel_max_servers=2]

--- cut some trace ---

kxfpgsg
      getting 2 sets of 2 threads, client parallel query execution flg=0x30
---> Try to get 2 slave sets, each of 2 slaves, i.e. 4 PQ slaves.
      Height=2, Affinity List Size=0, inst_total=1, coord=1
      Insts 1
      Threads 2

--- cut some trace ---

kxfpgsg
      got 2 threads on 1 instance(s), avg height=2
      Insts 1
      Svrs 4 ---> We got the 4 slaves.

----- end trace ---------

Example #3

SQL> alter table bigemp parallel 8;
SQL> select count(*) from bigemp;

In this example the <> was set to TRUE, which in turns forces <> to be TRUE. So the requested degree of parallelism can be reduced dynamically based on the current machine load.

---- start ----

kxfrAllocSlaves
DOP trace -- call kxfpgsg to get 8 slaves ---> Try for 8 slaves

kxfpAdaptDOP
Requested=8 Granted=4 Target=4 Load=1 Default=2 users=0 sets=1 ---> Adaptive load algorithm rounds down to 4
kxfpgsg
load adapt num servers requested to = 4 (from kxfpAdaptDOP())

---- End ----

Example #4

For this example:

- parallel_max_servers = 10
- parallel_min_percent = 0 ( we can still parallelize even if we can't grab 100% of requested slaves)
- Query requires 12 slaves

SQL> alter table bigemp parallel 12;
SQL> select count(*) from bigemp;

------ start --------

kxfpgsg
      getting 1 sets of 12 threads, client parallel query execution flg=0x30
      Height=12, Affinity List Size=0, inst_total=1, coord=1
      Insts 1
      Threads 12 ---> We try to get 12 slaves.

--- cut some lines---

kxfpgsg
      got 10 threads on 1 instance(s), avg height=10
      Insts 1
      Svrs 10 ---> We actually allocate 10 slaves.

----- end -------

If parallel_min_percent had been = 100, this would indicate we must grab all requested slaves for the query to be able to run, otherwise the user sees:

- ORA-12827: insufficient parallel query slaves available

and in the trace file:

kxfrAllocSlaves
actual num slaves alloc'd = 10 (kxfpqcthr)
DOP trace -- num slaves alloc'd < min required(min=%d) ---> We didn't find 100% of requested slaves, return error.

Example#5

In this example the trace illustrates how the degree of parallelism was derived from a query hint:

SQL> select /*+PARALLEL(BIGEMP,3) */ count(*) from bigemp;

------ start ------
kkfdPaPrm
DOP = 3 (computed from hint/dictionary)
use hint DOP on table ---> We used a hint to derive DOP
---- end -----

Example#6

In this example, two tables are joined, each has a different DOP, which DOP do we use :

SQL>alter table bigemp parallel 4;
SQL>alter table bigdept parallel 2;
SQL>select dname, avg(sal) from bigemp e, bigdept d where e.deptno=d.deptno group by dname;

---- start ----

kkfdPaPrm
      DOP = 4 (computed from hint/dictionary) ---> DOP for BIGEMP
      use dictionary DOP on table
--- cut some lines---
kkfdPaPrm
      DOP = 2 (computed from hint/dictionary) ---> DOP for BIGDEPT
      use dictionary DOP on table
--- cut some lines---
kxfralo
      DOP trace -- requested thread from best ref obj = 4 (from kxfrIsBestRef ---> Use BIGEMP's DOP

---- end ----

Example#7

The following query was run on a 2 instance RAC database.

(note Parallel_automatic_tuning = TRUE ).

SQL>alter table bigemp parallel 4;
SQL>select count(*) from bigemp;

---- start ----

kxfpiinfo
inst[cpus:mxslv]
1[2:20] 2[2:20] ---> Instance# [ CPU_COUNT : PARALLEL_MAX_SERVERS ]

kxfpclinfo
      inst(load:user:pct:fact)aff
      1(4 :0 :100:50 ) 2(0:0:100:200) ---> Load information for each instance, so instance #1 :
---> load=4
---> 0 PX users
---> User has no resource restriction, can use 100% of system
---> Load factor represents how much more load the machine can accept
the bigger the number, the better.
kxfpAdaptDOP
      Requested=4 Granted=4 Target=16 Load=4 Default=8 users=0 sets=1
---> Target of 16 is estimate of best load for the cluster.
---> DOP of queries may be reduced to maintain this target as best we can.
kxfpsori
      Sorted: 2(0:200) 1(4:50) ---> Sort instances by load and factor.
---> Put "unit" more slaves on instance that can best afford it.
---> More slaves will go to Instance 2 as it has lower load.

kxfpunit
var=4 limit=8 use_aff=0 aff_num=0 unit=2 ---> Unit is a function of CPU_COUNT on the QC node.

kxfpgsg
      getting 1 sets of 4 threads, client parallel query execution flg=0x30 ---> try for 1 slave set of 4 slaves
      Height=0, Affinity List Size=0, inst_total=2, coord=1 ---> We have two instances, QC is on instance #1.
      Insts 2 1 ---> Instances are 2 and 1
      Threads 4 0 ---> We will try to get 4 slaves on instance 2. The numbers are # of slaves corresponding to the Instances in previous line in the order they are printed.
kxfpg1srv
      trying to get slave P000 on instance 2 ---> get slaves from instance #2
kxfpg1sg
      Got It. 1 so far.
kxfpg1srv
      trying to get slave P001 on instance 2
kxfpg1sg
      Got It. 2 so far.

--- cut ----

      got 4 threads on 1 instance(s), avg height=4 ---> We managed to get 4 slaves on instance #2. Thread means here slaves in 1 slaves set
      Insts 2 1
      Svrs 4 0

---- end ----

Current SQL cursor
How to find the SQL sent to Slaves: In 9i this is not so easy, because every slave set get a new SQL command that is part of the original query. AS example:

kxfxcp1
      parsing at slave set 1:
      SELECT /*+ CIV_GB */ A1.C0,COUNT(SYS_OP_CSR(A1.C1,0)) FROM :Q2753000 A1 GROUP BY A1.C0
      TQ: 1 nbgra: 0 flg: 0x1

kxfxcp1
      parsing at slave set 2:
       SELECT /*+ PIV_GB */ A1.C0 C0,SYS_OP_MSR(COUNT(*)) C1 FROM (SELECT /*+
       NO_EXPAND ROWID(A2) */ A2."DEPTNO" C0 FROM "SCOTT"."EMP" PX_GRANULE(0
       , BLOCK_RANGE, DYNAMIC) A2) A1 GROUP BY A1.C0
      TQ: 0 nbgra: 1 flg: 0x2002

In 10g you have to search for the function kxfxcp1.

kxfxcp1
      Sending parse to slave set 1:
       User sqllen sent from QC = 74
        ; select /*+ parallel(t,5) */ owner# , count(*) from test t group by owner#
In 10g all slaves in all sets work on the same SQL command, so in every function kxfxcp1 you wil the same SQL command. It is not longer possible to see o what part of the SQL teh current set work.

Reference

This section documents describes teh syntax of _px_trace. In practice only a subset as documented above will be required.

There are up to three components that can be specified when starting the trace

- Area in which tracing is required.
- Verbosity.
- Do we want timing information.

alter session set "_px_trace"=[[Verbosity,]area],[[Verbosity,]area],..,[time];
as example
alter session set "_px_trace"=high,execution,medium,execution,time;

The possible values for each of these is

Verbosity:

high
medium
low

Area:

scheduling - ( equivalent to some of event 10384 and some of 10390)
execution - (equivalent to some of event 10390)
granule - (equivalent to some of event 10390 and some of 10391)
messaging - (equivalent to event 10392 and event 10393)
buffer - (equivalent to event 10399)
compilation - ( no equivalent event)
all - all of the above
none - none of the above.

Timing

time

Here are some examples :

In the init.ora file:

_px_trace="high","execution","medium","buffer","time"

Or, more usually, at the session level:

SQL> alter session set "_px_trace"="high","execution","medium","buffer","time";

To get the full trace:

SQL>alter session set "_px_trace"="all";

Note : Put the verbosity level followed by the areas to be traced at that level of verbosity.

To switch off the trace :

SQL> alter session set "_px_trace"="none";