- 论坛徽章:
- 0
|
诊断性能问题
其实oracle的性能诊断和解决,到后来,发现就那么回事
原文:
Technology PERFORMANCE
Diagnosing Performance Problems
By Cary Millsap
Use extended SQL trace data to see what's taking so long.
Suppose that one day you are driving to work and end up arriving late for an important meeting. You aren't able to present your revolutionary idea, so your clients aren't going to use it. You're frustrated by your tardiness and vow to never make the same mistake again. So how do you diagnose the cause in order to avoid a replay? How about this checklist?
Check the car's surface for imperfections, because surface imperfections can account for a difference of 1 percent or even greater in the car's top speed.
Check the wheel alignment, because an incorrect camber, caster, or toe angle can cause the car to handle poorly, costing time.
Test the engine to ensure that it is producing 99 percent or more of its rated horsepower. If it is not, consider rebuilding or replacing the engine.
No, you wouldn't use this checklist; that would be ridiculous. You'd probably diagnose the problem in a completely different way, by asking yourself just one simple question: What took me so long?
From there, the problem almost works itself out. If it's a 40-minute drive and you left 20 minutes before the meeting time, you'll leave an extra 30 minutes earlier next time. If you wasted 20 minutes in a traffic jam, you'll either leave even earlier the next time, take another route, or check the 7:00 a.m. traffic report more carefully. If you got lost and wasted 20 minutes driving around, then next time you'll probably study a map beforehand. And so on.
The curious part, for me, is that a database professional who is really good at solving everyday performance optimization problems uses a totally different method at work to solve database performance problems. Many database "tuners" never ask, "What took this program so long?" Instead, they consult a checklist and try to prevent everything that might be wrong:
Check that all requests for Oracle blocks are being serviced by the database buffer cache
Check that there are no full table scans
Check that all sorting is performed in memory
Check that redo logs are properly isolated from all other database files
And so on.
For some jobs, checklists can be good things. But for the task of diagnosing a performance problem, it is extremely inefficient to approach the problem by trying to fix everything that might theoretically go awry. A far more efficient approach is to find the answer to the simple question "What took so long?"
What Took So Long?
A good strategy to use for optimizing Oracle programs is just like the one you use in everyday life. It goes like this:
Watch the slow program in action, by using special instrumentation to measure the program's performance.
Create a resource profile for the slow program that breaks the program's response time down into useful categories.
Reduce the program's response time by dealing with the biggest response-time components first.
It's an easy method, after you figure out a few technical details. When you do, you'll have a method that works every time and that gives you (at long last) the ability to predict the outcome of a proposed performance improvement before you implement it.
Tracing
If you have sophisticated tools that collect timing statistics for individual execution steps in a program, use them. But any tool that collects only aggregated data (such as the data obtained by the sampling of either the system global area [SGA] or its underlying shared memory segments) is inadequate for certain problem types.
The most common aggregation mistake expensive monitoring tools make is that they aggregate resource usage across an entire Oracle Database instance for a given time interval. But a slow program can be virtually unaffected by a contention problem that completely dominates the performance of several other less important programs on the system.
Even tools that aggregate at the Oracle Database session level suffer from diagnostic shortcomings for some important problem types. For example, imagine that a program that runs for 10 minutes makes 10,000 calls to the Oracle SQL*Net message from client "wait event" and that the total time the session spends waiting for this event is 8.3 minutes. This means that the average SQL*Net message from client event latency for the session is approximately 3 seconds. But from the aggregated data alone, you have no way of knowing whether the 10,000 calls consumed 3 seconds apiece or if perhaps one of the calls consumed 5 minutes and the other 9,999 consumed 0.02 seconds each. These two scenarios require vastly different treatments.
The diagnostic data that will serve you best in this kind of situation is Oracle's extended SQL trace data. An extended SQL trace file shows a step-by-step chronological record of what the Oracle Database kernel is doing with your time. Collecting extended SQL trace data is practically free. The biggest cost is the price of the disk space you'll need for storing each interesting trace file (seldom more than a few megabytes).
Tracing Your Own Code. Turning on extended SQL trace for a program is easy if you have access to that program's source code. The first thing you have to do is ensure that your session is using appropriate settings for its TIMED_STATISTICS and MAX_DUMP_ FILE_SIZE parameters:
alter session
set timed_statistics=true
alter session
set max_dump_file_size=unlimited
If you fail to set TIMED_STATISTICS=TRUE, your database kernel will emit only zero values instead of real durations into your trace file. If your setting of MAX_DUMP_ FILE_SIZE is too restrictive, you'll suffer the chagrin of generating a message like the following in your trace file instead of the timing data you wanted:
*** DUMP FILE SIZE IS LIMITED TO 1048576 BYTES ***
Next comes activating the trace itself. There are several ways to do this. The old-fashioned way is to use the ALTER SESSION command as follows:
alter session set events
'10046 trace name context forever, level 12'
/* code to be traced goes here */
alter session set events
'10046 trace name context off'
A more elegant way to accomplish the extended SQL trace activation is to use the DBMS_SUPPORT package:
dbms_support.start_trace(waits=>;true, binds=>;true)
/* code to be traced goes here */
dbms_support.stop_trace()
Note that DBMS_SUPPORT is undocumented and may not be part of your default database installation. Refer to MetaLink ( metalink.oracle.com) for information on DBMS_SUPPORT.
Tracing Someone Else's Code. Activating extended SQL trace becomes a little trickier when you're trying to trace code to which you don't have read/write access. But it's not much more difficult. The first thing you need to do is acquire the V$SESSION.SID and V$SESSION.SERIAL# values of the session you want to trace. Then you can set the TIMED_STATISTICS and MAX_DUMP_FILE_SIZE parameters for the chosen session by using the following procedure calls:
dbms_system.set_bool_param_in_session(
sid =>; 42,
serial# =>; 1215,
parnam =>; 'timed_statistics',
bval =>; true)
dbms_system.set_int_param_in_session(
sid =>; 42,
serial# =>; 1215,
parnam =>; 'max_dump_file_size',
intval =>; 2147483647)
(For releases before Oracle8 Release 8.1.6, you can manipulate these parameters with ALTER SYSTEM commands.)
Next, you activate the trace. There are several ways to do this, including the following two.
One option is to use DBMS_SUPPORT:
dbms_support.start_trace_in_session(
sid =>; 42,
serial# =>; 1215,
waits =>; true,
binds =>; true)
/* code to be traced executes during this time window */
dbms_support.stop_trace_in_session(
sid =>; 42,
serial =>; 1215)
Do not use the DBMS_SUPPORT procedure called SET_SQL_TRACE_IN_SESSION if you want to activate extended SQL tracing. This procedure does not allow you to specify that you want waits and binds data in your trace file.
The second option is more elegant, but it won't be available until the release of Oracle Database 10g. The introduction of the DBMS_MONITOR package solves many of the complicated diagnostic data collection problems resulting from connection pooling and multithreading. Instead of specifying an Oracle Database session to be traced, you will be able to specify a service, module, or action to trace in Oracle Database 10g:
dbms_monitor.serv_mod_act_trace_enable(
service_name =>; 'APPS1',
module_name =>; 'PAYROLL',
action_name =>; 'PYUGEN',
waits =>; true,
binds =>; true,
instance_name =>; null)
/* code to be traced executes during this time window */
dbms_monitor.serv_mod_act_trace_disable(
service_name =>; 'APPS1',
module_name =>; 'PAYROLL',
action_name =>; 'PYUGEN')
With the DBMS_MONITOR package, Oracle offers a fully supported means of activating and deactivating diagnostic data collection for a specific business action (or actions) you want to trace.
Testing Extended SQL Tracing. Try it out. All it takes to see your first trace file is a simple SQL*Plus session like this one:
alter session
set timed_statistics=true;
alter session
set max_dump_file_size=unlimited;
alter session
set tracefile_identifier='Hello';
/* only in Oracle Database 8.1.7
and later */
alter session
set events '10046 trace name context forever, level 12';
select 'Howdy, it is '||sysdate from dual;
exit;
Then poke around the directory named by the value of your USER_ DUMP_DEST instance parameter for a recently written .trc file with the string Hello in its name. Open it with your favorite text editor. Read Oracle MetaLink note 39817.1 or the book Optimizing Oracle Performance to get an idea of what's in your raw trace file. By all means, run tkprof on your trace files and study its output, but don't let the existence of tkprof keep you from looking at your raw trace data as well. There are lots of things in there that tkprof doesn't tell you about.
If you want a more interesting trace file than one generated by a simple SELECT from DUAL, trace the following SQL statement:
select object_type, owner, object_name from dba_objects;
The resulting trace data will give you a good feel for the incredible amount of work the Oracle Database kernel does for you.
Create a Resource Profile
Once you have the right detailed diagnostic data in hand, you'll need to view it in a summary format that helps you most quickly formulate your response. The summary format computer programmers have used since at least the 1970s is the resource profile. A resource profile is simply a table that decomposes a duration into useful subsets listed in descending order of duration. Here is a resource profile example:
Response Time Component Duration
-------------------------- ----------
Freeway at <50% speed limit 28.3m 59%
Finding a parking spot 7.2m 15%
Waiting at traffic lights 5.2m 11%
Freeway at ≥50% speed limit 4.0m 8%
Other 3.1m 6%
-------------------------- ----------
Total 47.8m 100%
This resource profile tells you that buying a faster car isn't going to get you to work much faster.
There are two ways to create a resource profile from an Oracle trace file.
Do it yourself. The book Optimizing Oracle Performance explains how.
Use someone else's tool to do it. The Oracle tkprof and trcanalyzer tools do part of the work for you, but not all of it.
Responding to the Data
Once you have your detailed diagnostic data and a summarized perspective in hand, you have to figure out how to respond to what you see. The rule of thumb for responding to a resource profile is highly reliable and refreshingly simple: Reduce the duration of the biggest component first, by reducing the number of calls to that component. Next Steps
READ-more about Oracle performance analysis
Optimizing Oracle Performance
www.oreilly.com/catalog/optoraclep
more Oracle performance information
www.hotsos.com
asktom.oracle.com
otn.oracle.com/documentation
This is almost always the right approach. Understanding how to reduce the number of calls to a given component requires some intimacy with what the different wait event names mean. For example, when a traced Oracle session waits for a buffer busy waits wait event, it emits enough information into the trace file to reveal to you which buffer it's waiting for and why it's waiting. When a session waits for a SQL*Net message from client event, the position of the data generated in the trace file reveals to you which database call may have been executed redundantly.
In Oracle9i Release 2, there are more than 350 different wait events. In Oracle Database 10g, there are almost 700. But don't worry: You'll never have to know what they all mean. You only have to know the meaning of the ones for which your important programs are spending most of their time waiting.
Look at What You Can Do
With the right diagnostic data, you'll either solve the right problem quickly or prove that solving it is not worth the effort.
Here's a partial laundry list of problems diagnostic data can solve:
Systemwide problems and specific problems with individual user (business) actions
Query mistakes, including bad SQL, faulty indexing, and data density issues
Application mistakes, including applications that parse excessively, don't use array operations, and so on
Serialization mistakes, including unnecessarily frequent or time- consuming locking, latching, or memory buffer activity
Network mistakes, such as poor protocol selection and faulty network devices
Disk I/O mistakes, such as poorly sized caches, load imbalances, and poor configurations
Capacity shortages, such as swapping, paging, and excess CPU consumption
The best thing about using Oracle's extended SQL trace data with the "What took so long?" method is that you will never again have to guess what your performance problem might be before you can begin to diagnose and solve it.
Cary Millsap (cary.millsap@hotsos.com) is the author (with Jeff Holt) of Optimizing Oracle Performance (O'Reilly & Associates, 2003). Cary cofounded Hotsos (hotsos.com), a company that sells products, education, and consulting services dedicated to improving Oracle performance. |
|