All about Statistics

All about Statistics

Intruduction on Statistics
The CBO makes its explain-plan decisions based on statistics.
Statistics provide critical input in order for CBO to work properly; these includes information such as the number of rows in a table, distinct values in a column, number of leaf blocks in an index, etc.
The more accurate the statistics, the more efficient the results provided by Optimizer.

Statistics may be exact or estimated:

Statistics generated with a COMPUTE clause analyzes all of the data in the object. This gives the optimizer accurate information to work on and arrive at a good execution plan.
Statistics generated with an ESTIMATE clause analyzes data in the object to the extent of sample size mentioned. Sample size may be specified as number of rows or percentage of rows that should be randomly analyzed to generate the statistics. Optionally block sampling may also be specified.

Statistics are stored in a data dictionary tables owned by SYS user. The following views display the statistics collected for tables, columns and indexes.

For Tables: DBA_TABLES
NUM_ROWS - Number of rows
BLOCKS - Number of blocks below HW (used or not)
EMPTY_BLOCKS - Number of empty blocks
AVG_SPACE - Average free space (in bytes) in blocks allocated to the table. All empty and free blocks are considered for this
CHAIN_CNT - Number of chained rows (used by ANALYZE command only)
AVG_ROW_LEN - Average row length in bytes
LAST_ANALYZED - Date when the table was last analyzed
SAMPLE_SIZE - Sample size provided for ESTIMATE statistics. Equal to NUM_ROWS if COMPUTE
GLOBAL_STATS - For partitioned tables, YES - statistics collected as a whole, NO - statistics are estimated from statistics
USER_STATS - Set to YES if user has explicitly set the statistics for the table
Statistics for individual partitions of a table can be seen from DBA_TAB_PARTITIONS. Cluster statistics is available from DBA_CLUSTERS.

For Indexes DBA_INDEXES
BLEVEL - Depth of the index, from root to leaf
LEAF_BLOCKS - Number of leaf blocks
DISTINCT KEYS - Number of distinct index values
AVG_LEAF_BLOCKS_PER_KEY - Average number of leaf blocks in which each distinct key appears, should be 1 for unique indexes
AVG_DATA_BLOCKS_PER_KEY - Average number of blocks in the table that are pointed to by a distinct key
CLUSTERING_FACTOR - A count that determines the ordering of the index. Index is ordered if count is closer to the number of blocks, i.e entries in single leaf tend to point to rows in same blocks in the table. Index is randomly ordered if closer to the number of rows, i.e., entries in single leaf are pointing to rows spread across multiple blocks
NUM_ROWS - Number of rows indexed
SAMPLE_SIZE - Sample size provided for ESTIMATE statistics. Equal to NUM_ROWS if COMPUTE
LAST_ANALYZED - Date when the table was last analyzed
GLOBAL_STATS - For partitioned indexes, YES - statistics collected as a whole, NO - statistics are estimated from statistics
USER_STATS - Set to YES if user has explicitly set the statistics for the index
PCT_DIRECT_ACCESS - For secondary indexes on IOTs, percentage of rows with valid guess
Statistics for individual partitions of indexes can be seen from DBA_IND_PARTITIONS and DBA_IND_SUBPARTITIONS.

For Columns: DBA_TAB_COLUMNS
NUM_DISTINCT - Number of distinct values
LOW_VALUE - Lowest value
HIGH_VALUE - Highest value
DENSITY - Density of the column
NUM_NULLS - Number of records with null value for the concerned column
NUM_BUCKETS - Number of buckets in histograms. Refer Histograms section
SAMPLE_SIZE - Sample size provided for ESTIMATE statistics. Equal to total rows if COMPUTE
LAST_ANALYZED - Date when the table was last analyzed
DBA_TAB_COL_STATISTICS shows similar data. Partitioned Table column statistics can be viewed from DBA_PART_COL_STATISTICS and DBA_SUBPART_COL_STATISTICS

Important Initialization parameters that affect the CBO
Apart from generating statistics, the setup of the below mentioned parameters play a very crucial role in dictating how your system will work. Setting these will depend a lot on what kind of environment you want to create, Online, Batch processing, Data warehousing or a combination of more than one.

1) OPTIMIZER_MODE
This will decide the mode in which the optimizer engine should run in. Valid values are RULE, CHOOSE, ALL_ROWS, FIRST_ROWS (_n).
Optionally, the OPTIMIZER_MODE can be set to CHOOSE. This is kind of an intermediate option between RBO and CBO. In fact, it tries to run the query in either CBO or RBO depending on the availability or unavailability of statistics. Therefore, if the tables present in the query have statistics generated on them, CBO (ALL_ROWS only) is preferred or RBO is taken up.
e.g.: optimizer_mode = first_rows
For an OLTP system, FIRST_ROWS would be the ideal option for fast response time. ALL_ROWS is meant for batch-oriented applications. Note, a plan that produces the first n rows with the fastest response time might not be an optimal plan if requirement is to obtain the entire result, so decide as per the need of the application

2) OPTIMIZER_INDEX_COST_ADJ
Optimizer_Index_Cost_Ad takes a value between 1 and 10000 (default 100). Setting this value makes the index more or less friendly to the optimizer. For example setting it to 50 tells the optimizer that an execution plan that uses index(es) is half as expensive as other execution plans that are not using them. The lower the value (less than 100), the less full table scan executions will take place in the system.
Use of indexes in OLTP is a crucial factor for deciding the response time, but sometimes, full table scans provide better throughput. Nested loop joins are influenced by this parameter value. Set this parameter to a value that is optimal for your application. I recommend set this parameter between 10 - 20 for OLTP and 50 for DSS Systems..
If you are having slow performance because the CBO first_rows optimizer mode is favoring too many full-table scans, you can reset the optimizer_index_cost_adj parameter to immediately tune all of the SQL in your database to favor index scans over full-table scans. This is a "silver bullet" that can improve the performance of an entire database in cases where the database is OTLP and you have verified that the full-table scan costing is too low.

3) OPTIMIZER_INDEX_CACHING
This tells optimizer to favor nested loops and IN-list iterators over sort-merge and hash joins. The default value is 100 and makes nested loops and IN-list operator look less expensive and more appealing to Optimizer. The value basically indicates the percentage of index blocks that the optimizer should assume are present in cache. I recommend set this parameter to 85.

4) DB_FILE_MULTIBLOCK_READ_COUNT
This parameter determines the number of database blocks read in one input/output operation during a full table scan. The value set is used in computing the cost of full table scans. OLTP systems will not benefit by setting a high value for this parameter as the computed cost for full table scan execution plans would reduced. The maximum size is Operating system dependent.
e.g.: db_file_multiblock_read_count = 1 (may be set to a higher value)

5) SORT_AREA_SIZE
This parameter defines the maximum memory space allocated per user process that requires sorting of data and for insert and updates to bitmap indexes. Optimizer uses the set value for determining the cost of sorting in a query. This defaults to 64K. Normally a size of 64K to 1M is appropriate for OLTP systems.
Sort-merge joins are influenced by this parameter value. The bigger the size the more appealing will be sort-merge joins over nested-loop and hash joins. The lower the size the more sorting will take place in temporary segments. So assign a value that has been evaluated for your setup.
Oracle recommends the use of PGA_AGGREGATE_TARGET instead of this parameter from Oracle 9i.

6) SORT_MULTIBLOCK_READ_COUNT
This specifies the number of database blocks to be read each time a sort performs a read from a temporary segment. The default value is 2. General recommendation is to keep the default value. However, if you intend to hike up the SORT_AREA_SIZE considerably, you may evaluate the improvement by changing this parameter.
e.g.: sort_multiblock_read_count = 2

7) HASH_JOIN_ENABLED
Hash joins are available only in CBO. In hash joins, a hash table is created on the join key of the smallest sized table. It then joins the other tables to find the match. Hash joins may prove to be faster than other type of joins in some conditions, especially when the index is missing or search criteria is not very selective. Hash joins require a large amount of memory as the hash tables are retained; this may sometimes result in memory swapping.
Nested-loop joins return the first row faster than sort-merge and hash joins and are preferred for OLTP, but other types of joins cannot be ignored for running other aspects of the applications.
e.g.: hash_join_enabled = true

8) HASH_AREA_SIZE
This specifies the maximum amount of memory in bytes to be used for a hash join per process. It is defaulted to 2 times SORT_AREA_SIZE.
Oracle recommends the use of PGA_AGGREGATE_TARGET instead of this parameter from Oracle 9i.
e.g.: hash_area_size = 2097152
Setting this to a very low number may sometimes result in the following error.
ORA-6580: Hash Join ran out of memory while keeping large rows in memory.

9) HASH_MULTIBLOCK_IO_COUNT
This specifies how many sequential blocks a hash join reads and writes in one Input/Output activity. Oracle recommends not changing or assigning a value to this parameter; this will let oracle decide on the appropriate value for each individual query. In such casees, the value of the parameter will appear as 0 in the V$PARAMETER view.
This parameter is renamed to _HASH_MULTIBLOCK_IO_COUNT as of Oracle 9i.
e.g.: hash_multi_block_io_count = 0

10) BITMAP_MERGE_AREA_SIZE
This parameter is relevant for systems using bitmap indexes. It specifies the amount of memory Oracle uses to merge bitmaps retrieved from a range scan of a bitmap index. The default value is 1 MB, which is considered sufficient for most setups.
Oracle recommends use of PGA_AGGREGATE_TARGET instead of this parameter from Oracle 9i.
e.g.: bitmap_merge_area_size = 1048576

11) QUERY_REWRITE_ENABLED
This parameter is relevant for systems using Materialized views, Function based indexes and stored outlines. Setting this parameter enables query rewrite for the database. The materialized views should also be query rewrite enabled to allow the optimizer to redirect a query to it rather than the actual tables listed in the FROM clause. Query rewriting is internally done by the optimizer based on what results are expected and whether these are already present in materialized view form.
e.g.: query_rewrite_enabled = true

12) QUERY_REWRITE_INTEGRITY
This parameter determines the extent to which the optimizer must enforce query rewriting to Materialized views. This determines the accuracy of the query rewrite. It can be set to ENFORCED, TRUSTED or STALE_TOLERATED. ENFORCED option uses Materialized views if they contain fresh data, thus guarantying the integrity and consistency of data. TRUSTED uses relationships that have been defined in the system for rewriting the query. STALE_TOLERATED allows the use of relationships that are not enforced and may use stale data. For OLTP systems, ENFORCED would be the ideal setup, as reports need to be up-to-date.
e.g.: query_rewrite_integrity = enforced

13) ALWAYS_ANTI_JOIN
This parameter specifies the join method for anti-joins, for example when a NOT IN operator is present in your query. It can be set to NESTED_LOOPS, MERGE or HASH. It is defaulted to NESTED_LOOPS in Oracle 8i and CHOOSE in Oracle 9i.
This parameter is renamed to _ALWAYS_ANTI_JOIN as of Oracle 9i.
e.g.: always_anti_join = nested_loops

14) ALWAYS_SEMI_JOIN
This parameter specifies the join method for semi-joins. These types of joins are carried out by Optimizer after transforming a query. In such joins, duplicate values from the inner table are removed and then the type of join specified in the parameter is used to perform a semi-join. It can be set to NESTED_LOOPS, MERGE or HASH. In Oracle 8i, it is defaulted to STANDARD and in Oracle 9i it is defaulted to CHOOSE, to pick up an appropriate join.
This parameter is renamed to _ALWAYS_SEMI_JOIN as of Oracle 9i.
e.g.: always_semi_join = nested_loops

15) STAR_TRANSFORMATION_ENABLED
This specifies whether query transformation will be applied to star queries. It can be set to TRUE, FALSE or TEMP_DISABLE (transformation will take place but will not use temporary tables). I presently set it to FALSE due to some known issues of sub-optimal queries being generated. If you intend to use this, please upgrade your version to 8.1.7.4 and above.
e.g.: star_transformation_enabled = false

16) PARALLEL_BROADCAST_ENABLED
This parameter refers to parallel executions in cluster databases. It is meant for improving hash and sort-merge join operations where a very large result set is joined with a very small result set. When this option is enabled, the optimizer broadcasts a copy of all rows in the smaller result set to all cluster databases that are processing some rows of the larger result set.
It is obsolete in release 9.2.0.
e.g.: parallel_broadcast_enabled = false

17) OPTIMIZER_DYNAMIC_SAMPLING
This parameter is introduced in release 9i. It is meant for situations where tables are not analyzed. As CBO depends heavily on statistics, the parameter tells the optimizer to sample the unanalyzed tables that are being used in a query. A level of 0 to 10 can be specified, the higher the value the more time optimizer spends in sampling.
e.g.: optimizer_dynamic_sampling = 1

18) PARTITION_VIEW_ENABLED
This parameter is meant for backward compatibility to support partition views. Oracle recommends use of partition tables rather than partition views. If you are migrating to CBO, chances are that you may not be using partition views.
e.g.: partition_view_enabled = false

19) CURSOR_SHARING
This parameter determines what kind of SQL statements can share the same cursor. It can be set to FORCE, SIMILAR or EXACT. FORCE will try to squeeze statements that may differ in some literals to share the same cursor. SIMILAR is somewhat the same but will try to maintain the plan optimization for identical statements. EXACT allows statements with exact identical text to share a cursor.
Using FORCE may sometimes result in unexpected results.
e.g.: cursor_sharing = exact

20) PGA_AGGREGATE_TARGET
Introduced in Oracle 9i, this parameter specifies the aggregate PGA memory available to all server processes attached to an instance. This parameter can be set for automatic sizing of SQL working areas. It replaces other existing parameters like SORT_AREA_SIZE, BITMAP_MERGE_AREA_SIZE and HASH_AREA_SIZE.
It can be set to a value between 10 MB to 4000 GB, depending on the setup requirement.
1- Make a first estimate for PGA_AGGREGATE_TARGET based on the following rule:
PGA_AGGREGATE_TARGET = (Total Physical Memory * 0.16) - For OLTP systems (16% of Server Physical Mem)
PGA_AGGREGATE_TARGET = (Total Physical Memory * 0 .4 ) - For DSS systems (40% of Server Physical Mem)
More Information HERE

Gatering Statistics for Tables
Oracle strongly recommends that you use the DBMS_STATS package rather than ANALYZE to collect optimizer statistics. That package lets you collect statistics in parallel, collect global statistics for partitioned objects, and fine tune your statistics collection in other ways. Further, the cost-based optimizer will eventually use only statistics that have been collected by DBMS_STATS.

However, you must use the ANALYZE statement rather than DBMS_STATS for statistics collection not related to the cost-based optimizer, such as:
* To use the VALIDATE or LIST CHAINED ROWS clauses
* To collect information on freelist blocks

In order to use the DBMS_STATS Package, the user needs to get the execute privilege. Run the following as SYS:
grant execute on DBMS_STATS to user_xyz;

DBMS_STATS.GATHER_TABLE_STATS gathers statistics for a table and its columns, and optionally the associated indexes. Syntax:
dbms_stats.gather_table_stats(ownname, tabname, partname, estimate_percent, block_sample,
method_opt, degree, granularity, cascade, stattab, statid, statown);

The first two parameters are mandatory, the rest are defaulted to a value.
* ownname - owner
* tabname - table name
* partname - partition name
* estimate_percent - sample percent ratio
* block_sample - consider random blocks sampling rather than rows sampling. TRUE/FALSE
* method_opt - method options. FOR ALL COLUMNS/FOR ALL INDEXED COLUMNS. Append the phase SIZE 1 if it is required to generate statistics in parallel.
* degree - degree of parallelism.
* granularity - for partitioned tables. DEFAULT/SUBPARTITION/PARTITION/GLOBAL/ALL.
* cascade - gather statistics for indexes also. TRUE/FALSE
* stattab, statid, statown - required for user statistics.

Collect statistics for a table and its indexes:
EXEC DBMS_STATS.gather_table_stats(USER, 'LOOKUP', cascade => TRUE);

Collect stats for a table, estimating 20%, with parallelism
execute dbms_stats.gather_table_stats (ownname => 'SCOTT', tabname => 'DEPT', partname=> null, estimate_percent => 20, degree => 5, cascade => true, options => 'GATHER AUTO');

Estimate statistics for columns in a table, this will also generate statistics for tables.
exec dbms_stats.gather_table_stats(ownname => 'SCOTT', tabname => 'DEPT', estimate_percent => 5, method_opt => 'FOR ALL COLUMNS');

SQL Source - Dynamic Method for TABLES
DECLARE
sql_stmt VARCHAR2(1024);
BEGIN
FOR tab_rec IN (SELECT owner,table_name
FROM all_tables WHERE owner like UPPER('&1') ) LOOP
sql_stmt := 'BEGIN dbms_stats.gather_table_stats(ownname => :1, tabname => :2, partname => null, estimate_percent => 10, degree => 3 ,cascade => true); END;' ;
EXECUTE IMMEDIATE sql_stmt USING tab_rec.owner, tab_rec.table_name ;
END LOOP;
END;
/

Collecting Stats in a Procedure
CREATE OR REPLACE PROCEDURE analyze_any_schema ( p_inOwner IN all_users.username%TYPE)
IS
BEGIN
FOR v_tabs IN (SELECT owner, table_name
FROM all_tables
WHERE owner = p_inOwner
AND temporary <> 'Y')
LOOP
DBMS_OUTPUT.put_line ('EXEC DBMS_STATS.gather_table_stats('''||v_tabs.owner||
''','''||v_tabs.table_name||''',NULL,1);' );
BEGIN
DBMS_STATS.gather_table_stats(v_tabs.owner,v_tabs.table_name,NULL,1);
DBMS_OUTPUT.put_line ('Analyzed '||v_tabs.owner||'.'||table_name||'... ');
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.put_line ('Exception on analysis of '||v_tabs.table_name||'!');
DBMS_OUTPUT.put_line (SUBSTR(SQLERRM,1,255));
END;
END LOOP;
END;
/
CREATE OR REPLACE Procedure DB_Maintenance_Weekly is
sql_stmt varchar2(1024);
v_sess_user varchar2(30);
BEGIN
select sys_context('USERENV','SESSION_USER') into v_sess_user
from dual ;
--Analyze all Tables
FOR tab_rec IN (SELECT table_name
FROM all_tables
WHERE owner = v_sess_user
and table_name not like 'TEMP_%') LOOP
sql_stmt := 'BEGIN dbms_stats.gather_table_stats
(ownname => :1,
tabname => :2,
partname => null,
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
degree => 3 ,
cascade => true); END;' ;
EXECUTE IMMEDIATE sql_stmt USING v_sess_user, tab_rec.table_name ;
END LOOP;
EXCEPTION
WHEN NO_DATA_FOUND THEN
NULL ;
end;
/

Force Statistics to a Table
You can use the following sentence to force statistics to a Table:
exec dbms_stats.set_table_stats( user, 'EMP', numrows => 1000000, numblks => 300000 );

No statistics are collected for Global Temporary Tables; handle these with care. Make use of hints to drive queries on these. Provide explicit statistics by using DBMS_STATS.SET_TABLE_STATS if your temporary table is used for processing huge amounts of data.

How often to collect stats?
When there is a 10-20% change in data, the general convention is to generate fresh statistics. You can start of with a general rule of estimating statistics on a weekly basis. If the tables are giving real bad hits because of heavy activity, you may consider using the DML Monitoring option to update statistics every few hours for such tables. Statistics are not incremental and are regenerated every time. If there is no considerable change in data, there is no advantage in generating statistics too frequently.

DML Monitoring
Used by dbms_stats to identify objects with "stale" statistics
- On by default in 10g, not in 9i
alter table monitoring;
- For the Full Database:
dbms_stats.alter_database_tab_monitoring(monitoring=>true, sysobjs=>false);

- Tracked in [DBA|ALL|USER]_TAB_MODIFICATIONS
- 9i and 10g use 10% change as the threshold to gather stats
In Oracle 10g, Oracle automatically gathers index statistics whenever the index is created or rebuilt.

Gathering statistics for a Schema, a Database and System Stats
Collect statistics for my Schema (All the tables and Indexes), estimating 10%
execute dbms_stats.gather_schema_stats(ownname => 'SCOTT', estimate_percent => 10, degree => 5, cascade => true);

Collect statistics for the Database:
execute dbms_stats.gather_database_stats(estimate_percent => 20, degree => 5, cascade => true);

Generating System Statistics
Gather statistics during the day. Gathering ends after 720 minutes and is stored in the mystats table:
exec DBMS_STATS.GATHER_SYSTEM_STATS(gathering_mode => 'interval', interval => 720, stattab => 'mystats', statid => 'OLTP');

Gather statistics during the night. Gathering ends after 720 minutes and is stored in the mystats table:
DBMS_STATS.GATHER_SYSTEM_STATS(gathering_mode => 'interval', interval => 720, stattab => 'mystats', statid => 'OLAP');

If appropriate, you can switch between the statistics gathered. It is possible to automate this process by submitting a job to update the dictionary with appropriate statistics. During the day, the following jobs import the OLTP statistics for the daytime run:
VARIABLE jobno number;
BEGIN
DBMS_JOB.SUBMIT(:jobno, 'DBMS_STATS.IMPORT_SYSTEM_STATS(''mystats'',''OLTP'');', SYSDATE, 'SYSDATE + 1');
COMMIT;
END;
/

During the night, the following jobs import the OLAP statistics for the night time run:
BEGIN
DBMS_JOB.SUBMIT(:jobno, 'DBMS_STATS.IMPORT_SYSTEM_STATS(''mystats'',''OLAP'');', SYSDATE + 0.5, 'SYSDATE + 1');
COMMIT;
END;
/

STATISTICS FOR THE DATA DICTIONARY
New in Oracle Database 10g is the ability to gather statistics for the data dictionary. The objective is to enhance the performance of queries. There are two basic types of dictionary base tables.
The statistics for normal base tables are gathered using GATHER_DICTIONARY STATISTICS. They may also be gathered using GATHER_SCHEMA_STATS for the SYS schema. Oracle recommends gathering these statistics at a similar frequency as your other database objects.
Statistics for fixed objects (the V$ views on the X$ tables) are gathered using the GATHER_FIXED_OBJECT_STATS procedure. The initial collection of these statistics is normally sufficient. Repeat only if workload characteristics have changed dramatically. The SYSDBA privilege or ANALYZE ANY DICTIONARY and ANALYZE ANY privileges are required to execute the procedures for gathering data dictionary statistics.

The Option parameter
There are several values for the options parameter that we need to know about:
- gather - re-analyzes the whole schema.
- gather empty - Only analyze tables that have no existing statistics.
- gather stale - Only re-analyze tables with more than 10% modifications (inserts, updates, deletes). The table should be in monitor status first.
- gather auto - This will re-analyze objects which currently have no statistics and objects with stale statistics.The table should be in monitor status first.
Using gather auto is like combining gather stale and gather empty.
Note that both gather stale and gather auto require monitoring. If you issue the "alter table xxx monitoring" command, Oracle tracks changed tables with the dba_tab_modifications view.
The most interesting of these options is the gather stale option. Because all statistics will become stale quickly in a robust OLTP database, we must remember the rule for gather stale is > 10% row change (based on num_rows at statistics collection time).
Hence, almost every table except read-only tables will be re-analyzed with the gather stale option. Hence, the gather stale option is best for systems that are largely read-only. For example, if only 5% of the database tables get significant updates, then only 5% of the tables will be re-analyzed with the "gather stale" option.
The CASCADE => TRUE option causes all indexes for the tables to also be analyzed. In Oracle 10g, set CASCADE to AUTO_CASCADE to let Oracle decide whether or not new index statistics are needed.

Important Notes

You should analyze the table after creating a function-based index, to allow Oracle to collect column statistics equivalent information for the expression. Optionally, you can collect histograms for the index expressions by specifying forall hidden columns size number_of_buckets in the METHOD_OPT argument to the DBMS_STATS procedures.
Oracle can gather some statistics automatically while creating or rebuilding a B-tree or bitmap index. The COMPUTE STATISTICS option of CREATE INDEX or ALTER INDEX ... REBUILD enables this gathering of statistics.
Oracle recommends setting the ESTIMATE_PERCENT parameter of the DBMS_STATS gathering procedures to DBMS_STATS.AUTO_SAMPLE_SIZE to maximize performance gains while achieving necessary statistical accuracy. AUTO_SAMPLE_SIZE lets Oracle determine the best sample size for good statistics.
By default DBMS_STATS will generate statistics for the table and not it's indexes (By default CASCADE => FALSE).
If you have a big table (around 2M rows), try to ESTIMATE statistics. In general, a sample size of 5% to 10% generates adequate statistics!
A table that is in the process of being analyzed cannot undergo DDL changes but DML activities can be carried on. Analyzing an index puts a shared lock on the related table; hence, neither DML nor DDL can be performed. Preferably avoid all activities during the statistics generation phase

Verifying Table Statistics
Syntax:
dbms_stats.get_table_stats(ownname, tabname, partname, stattab, statid, numrows, numblks, avgrlen, statown);

declare
l_numrows number;
l_numblks number;
l_avgrlen number;
begin
dbms_stats.get_table_stats(ownname => 'SYS', tabname => 'AM01', numrows => l_numrows, numblks => l_numblks, avgrlen => l_avgrlen);
dbms_output.put_line('No. of rows: ' || l_numrows);
dbms_output.put_line('No. of blks: ' || l_numblks);
dbms_output.put_line('Avg row length: ' || l_avgrlen);
end;
/
No. of rows: 4106860
No. of blks: 6219
Avg row length: 3

Verifying Index Statistics

SELECT INDEX_NAME "NAME", NUM_ROWS, DISTINCT_KEYS "DISTINCT",
      LEAF_BLOCKS, CLUSTERING_FACTOR "CF", BLEVEL "LEVEL",
      AVG_LEAF_BLOCKS_PER_KEY "ALFBPKEY"
FROM DBA_INDEXES
WHERE OWNER = 'SH'
ORDER BY INDEX_NAME;

Typical output is:

NAME                       NUM_ROWS DISTINCT LEAF_BLOCKS      CF   LEVEL   ALFBPKEY
-------------------------- -------- -------- ----------- ------- ------- ----------
CUSTOMERS_PK                  50000    50000         454    4405       2          1
PRODUCTS_PK                   10000    10000          90    1552       1          1
PRODUCTS_PROD_CAT_IX          10000        4          99    4422       1         24
PRODUCTS_PROD_SUBCAT_IX       10000       37         170    6148       2          4
SALES_PROD_BIX                 6287      909        1480    6287       1          1
SALES_PROMO_BIX                4727      459         570    4727       1          1

declare
  l_numrows   number;
  l_numlblks  number;
  l_numdist   number;
  l_avglblk   number;
  l_avgdblk   number;
  l_clstfct   number;
  l_indlevel  number;
begin
  dbms_stats.get_index_stats(ownname => 'SYS', indname => 'AM01_N1',
                 numrows => l_numrows, numlblks => l_numlblks,
                 numdist => l_numdist, avglblk => l_avglblk,
                 avgdblk => l_avgdblk, clstfct => l_clstfct,
                 indlevel => l_indlevel);
  dbms_output.put_line('No. of rows: ' || l_numrows);
  dbms_output.put_line('No. of blks: ' || l_numlblks);
  dbms_output.put_line('No. of distinct values: ' || l_numdist);
  dbms_output.put_line('Avg leaf blocks for distinct keys: ' || l_avglblk);
  dbms_output.put_line('Avg data blocks pointed to in the table: ' || l_avgdblk);
  dbms_output.put_line('Clustering factor: ' || l_clstfct);
  dbms_output.put_line('Index height: ' || l_indlevel);
end;
/
No. of rows: 3819711
No. of blks: 11092
No. of distinct values: 1
Avg leaf blocks for distinct keys: 11092
Avg data blocks pointed to in the table: 14616
Clustering factor: 14616
Index height: 2

Verifying Column Statistics

SELECT COLUMN_NAME, NUM_DISTINCT, NUM_NULLS, NUM_BUCKETS, DENSITY
FROM DBA_TAB_COL_STATISTICS
WHERE TABLE_NAME ="PA_EXPENDITURE_ITEMS_ALL"
ORDER BY COLUMN_NAME;

This returns the following data:

COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS NUM_BUCKETS    DENSITY
------------------------------ ------------ ---------- ----------- ----------
BURDEN_COST                            4300      71957           1 .000232558
BURDEN_COST_RATE                        675    7376401           1 .001481481
CONVERTED_FLAG                            1   16793903           1          1 poor
COST_BURDEN_DISTRIBUTED_FLAG              2      15796           1         .5 poor
COST_DISTRIBUTED_FLAG                     2          0           1         .5 poor
COST_IND_COMPILED_SET_ID                 87    6153143           1 .011494253
EXPENDITURE_ID                      1171831          0           1 8.5337E-07 GOOD
TASK_ID                                8648          0           1 .000115634
TRANSFERRED_FROM_EXP_ITEM_ID        1233787   15568891           1 8.1051E-07 GOOD


declare
   l_distcnt number;
   l_density number;
   l_nullcnt number;
   l_srec   dbms_stats.statrec;
   l_avgclen number;
begin
  dbms_stats.get_column_stats(ownname => 'SYS', tabname => 'AM01',
              colname => 'COL1', distcnt => l_distcnt, density => l_density,
              nullcnt => l_nullcnt, srec => l_srec, avgclen => l_avgclen);
   dbms_output.put_line('No. of distinct values: ' || l_distcnt);
   dbms_output.put_line('Density: ' || l_density);
   dbms_output.put_line('Count of nulls: ' || l_nullcnt);
   dbms_output.put_line('Avg. column length: ' || l_avgclen);
end;
/
No. of distinct values: 2
Density: .5
Count of nulls: 0
Avg. column length: 3

Exporting and importing statistics with DBMS_STATS

DBMS_STATS also includes routines for gathering statistics and storing them outside the dictionary. This does not influence the optimizer. Most of the procedures in this package as mentioned above have three common parameters - STATID, STATTAB and STATOWN that are related to user processing of statistics.

Advantages of this feature:

1. Estimated statistics at different percentages could be stored and used for testing.

2. Statistics generated on one database could be transferred to another database.

Using Histograms

The cost-based optimizer can use data value histograms to get accurate estimates of the distribution of column data. A histogram partitions the values in the column into bands, so that all column values in a band fall within the same range. Histograms provide improved selectivity estimates in the presence of data skew, resulting in optimal execution plans with nonuniform data distributions.

One of the fundamental tasks of the cost-based optimizer is determining the selectivity of predicates that appear in queries. Selectivity estimates are used to decide when to use an index and the order in which to join tables. Some attribute domains (a table's columns) are not uniformly distributed.

The cost-based optimizer uses height-based histograms on specified attributes to describe the distributions of nonuniform domains. In a height-based histogram, the column values are divided into bands so that each band contains approximately the same number of values. The useful information that the histogram provides, then, is where in the range of values the endpoints fall.

When to Use Histograms

Histograms can affect performance and should be used only when they substantially improve query plans. Histogram statistics data is persistent, so the space required to save the data depends on the sample size. In general, create histograms on columns that are used frequently in WHERE clauses of queries and have a highly skewed data distribution. For uniformly distributed data, the cost-based optimizer can make fairly accurate guesses about the cost of executing a particular statement without the use of histograms.

Histograms, like all other optimizer statistics, are static. They are useful only when they reflect the current data distribution of a given column. (The data in the column can change as long as the distribution remains constant.) If the data distribution of a column changes frequently, you must recompute its histogram frequently.

Histograms are not useful for columns with the following characteristics:

All predicates on the column use bind variables.
The column data is uniformly distributed.
The column is unique and is used only with equality predicates.

Creating Histograms

You generate histograms by using the DBMS_STATS package. You can generate histograms for columns of a table or partition. For example, to create a 10-bucket histogram on the SAL column of the emp table, issue the following statement:

EXECUTE DBMS_STATS.GATHER_TABLE_STATS ('scott','emp', METHOD_OPT => 'FOR COLUMNS SIZE 10 sal');

The SIZE keyword declares the maximum number of buckets for the histogram. You would create a histogram on the SAL column if there were an unusually high number of employees with the same salary and few employees with other salaries. You can also collect histograms for a single partition of a table.

Oracle Corporation recommends using the DBMS_STATS package to have the database automatically decide which columns need histograms. This is done by specifying SIZE AUTO.

Example Using a Histogram to Improve an Execution Plan

UPDATE so_lines l
SET open_flag=null, s6=10, s6_date=sysdate,
WHERE l.line_type_code in ('REGULAR','DETAIL','RETURN')
AND l.open_flag = 'Y' AND NVL(l.shipped_quantity, 0)=0
OR NVL(l.shipped_quantity, 0) != 0
AND l.shipped_quantity +NVL(l.cancelled_quantity, 0)= l.ordered_quantity))
AND l.s6=18;

This query shows the skewed distribution of data values for s6. In this case, there are two distinct non-null values: 10 and 18. The majority of the rows consists of s6 = 10 (1,589,464), while a small number of rows consist of s6 = 18 (13,091).

S6: COUNT(*)
======================
10 1,589,464
18 13,091
NULL 21,889

The selectivity of column s6, where s6 = 18:

S = 13,091 / (13,091 + 1,589,464) = 0.008

If No Histogram is Used: The selectivity of column s6 is assumed to be 50%, uniformly distributed across 10 and 18. This is not selective; therefore, s6 is not an ideal choice for use as an index.

If a Histogram is Used: The data distribution information is stored in the dictionary. This allows the optimizer to use this information and compute the correct selectivity based on the data distribution. In this example, the selectivity, based on the histogram data, is 0.008. This is a relatively high, or good, selectivity, which leads the optimizer to use an index on column s6 in the execution plan.

Tuning by Statistics

As you know, is not good to resolve tuning issues using the old method of "ratios" only. So we will start understanding several views from Oracle that will help you to tune your database based on its use.

Parsing and Read Consistency
The view v$sysstat is the accumulated statistics for the complete system since the instance has started except for what is active now and shown in v$sesstat. The numbers in this view are held in "tens of milliseconds," (sounds like hundredths of seconds to me). So to convert to seconds, we divide the value by 100. Note that TIMED_STATISTICS must be set to true in the init.ora for these views to have timing numbers. So to see overall parsing and read consistency waits, we can use the following:

column parsing heading 'Parsing|(seconds)'
column total_cpu heading 'Total CPU|(seconds)'
column waiting heading 'Read Consistency|Wait (seconds)'
column pct_parsing heading 'Percent|Parsing'

select total_CPU, parse_CPU parsing,
      parse_elapsed-parse_CPU waiting,
      trunc(100*parse_elapsed/total_CPU,2) pct_parsing
 from (select value/100 total_CPU
         from v$sysstat where name = 'CPU used by this session'),
      (select value/100 parse_CPU
         from v$sysstat where name = 'parse time cpu'),
      (select value/100 parse_elapsed
         from v$sysstat where name = 'parse time elapsed');



Total CPU    Parsing Read Consistency    Percent
(seconds)  (seconds)   Wait (seconds)    Parsing
---------- ---------- ---------------- ----------
1236263327   21072.43          4277.93          0

We can see from this output that this instance has consumed a lot of CPU seconds, but a relatively small amount of that was used for parsing. Of the parsing, about 20 percent of the time was waiting for read consistency, but since the parsing percent is so small, this is not impacting our system performance.

Finding Free Buffers
If we need to scan buffers to find a free one, then we are wasting CPU resource. This can be checked with the script below:

select inspected,requested,(inspected/requested) ratio
  from  (select value inspected from v$sysstat where name = 'free buffer inspected') a,
        (select value requested from v$sysstat where name = 'free buffer requested') b;


INSPECTED  REQUESTED      RATIO
---------- ---------- ----------
5100       12288112   .000415035

We see that, for the vast majority of the time, the first buffer we look at is available (low ratio). So there is no problem with getting free buffers as we need them. Let's look further.

Active Waiting and Latch Contention
Active waiting can be compared to waiting at the betting window with your two dollars in your hand. You are not just passively standing there, but are focused on all the windows to see when a resource becomes available. We can see these active waits in v$system_event with the following:

select event,time_waited/100 seconds
  from v$system_event
  where event not in
        ('rdbms ipc message'
         ,'SQL*Net more data to client'
         ,'SQL*Net message from client'
         ,'SQL*Net message to client'
         ,'pmon timer'
         ,'smon timer'
         ,'wakeup time manager')
  order by seconds;


EVENT                                       SECONDS
---------------------------------------- ----------
buffer deadlock                                   0
. . .
latch free                                  3422.98
. . .
control file parallel write                80356.43
log buffer space                          106812.85

The previous code shows me that we have waited the most time for log buffer space. I will check that I have the log buffer sized correctly. This is still a small fraction of the CPU seconds used by the instance as we saw above; it was over one billion, so we are way under one percent of the time spent waiting. This also shows us that we have been waiting for latches. So let's see which latch we are not getting when needed with the following code:

Column name format a35
SELECT name, gets, misses, immediate_gets, immediate_misses
  FROM v$latch l
  WHERE (misses > .1*gets)
     or (immediate_misses > .1*(immediate_misses+immediate_gets));



NAME                                GETS   MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
--------------------------------- ------ -------- -------------- ----------------
channel operations parent latch    86845        6          43300            78881

This is a latch that we can't tune and we also see that the number of misses is not increasing as we rerun this query, so we will leave it alone.
Given this, we have to conclude that the instance is not using too much CPU resource for its own maintenance — so let's check the users.

User Code Execution
In our instance, we have checked that the environment for the instance is not forcing us to consume more CPU resources. So we will check the code that is being executed. Most often, the CPU consumption problem is caused by what a user is doing.

Highest Consuming Sessions
We can see which user is consuming the most CPU at this time by creating a table with the cumulative values so we can pull out the current usage:

drop table t_CPU;
create table t_CPU storage
 (initial 64k next 64k pctincrease 0) pctfree 0
 as select a.sid,a.value
      from v$sesstat a,v$statname b
      where b.name = 'CPU used by this session'
        and a.statistic# = b.statistic#;

column name format a25
column username format a20
column CPU heading 'CPU|Seconds'

select a.sid, c.username,b.name,(a.value-t.value)/100 CPU
  from v$session c,v$sesstat a,v$statname b,t_CPU t
  where c.sid = a.sid
    and c.sid=t.sid
     and b.name = 'CPU used by this session'
     and a.statistic# = b.statistic#
     and a.value > 0
     and a.value - t.value > 0
   order by a.value-t.value;

drop table t_CPU;
                                                             CPU
  SID USERNAME             NAME                         Seconds
------ -------------------- ------------------------- ----------
   52                      CPU used by this session         .01
   . . .
   23 SYSTEM               CPU used by this session       35.82
  177 ORADBA               CPU used by this session       38.26
  167 ORADBA               CPU used by this session       52.07

You can see that SID 167 is the current largest consumer. What we are looking for, then, is code that is in a tight loop in memory, just processing over and over. If this CPU is being consumed by normal usage, then no changes are needed. So let's see what SID 167 is running:

select c.executions,trunc(a.last_call_et/60) min,b.sql_text
  from v$session a, v$sqltext b, v$sqlarea c
  where a.sid=167
    and a.sql_address = b.address
    and a.sql_address = c.address
  order by b.piece;


EXECUTIONS    MIN SQL_TEXT
---------- ------ ----------------------------------------------------------------
    94784      0 select daily_notes_id, center_code, TO_CHAR(day,'MM/DD/YYYY')
                 day, notes from daily_notes where center_code = :1
                 and day = TO_DATE(:2,'MM/DD/YYYY') order by 1

There is nothing wrong with this code and after watching for a while we see that this user is not running any code that jumps out as bad. We conclude that this session is just doing the most work.

Highest Consuming Statements
We can also check for the statements with the most buffer gets. These are probably working the most in memory:

set linesize 140
column executions format 999,999,999 heading "Executions"
column disk_reads format 999,999,999 heading "Block Reads"
column buffer_gets format 999,999,999 heading "Buffer Reads"
column sql_text format a90 heading "Statement"
break on executions on buffer_gets skip 1

--we want the 2 largest consumers

variable max_buffer_read number
begin
select max(buffer_gets) into :max_buffer_read
  from v$sqlarea
  where parsing_user_id >0;

select max(buffer_gets) into :max_buffer_read
  from v$sqlarea
  where parsing_user_id >0
    and buffer_gets < :max_buffer_read;
end;
/

select a.executions,a.buffer_gets,b.sql_text
  from v$sqlarea a,v$sqltext b
  where a.buffer_gets >= :max_buffer_read
    and a.address = b.address
    and a.parsing_user_id > 0
  order by a.buffer_gets desc, piece asc;


Executions Buffer Reads Statement
---------- ------------ -------------------------------------------------------------
6,300,634   51,221,671 SELECT DISTINCT B.FIRST_NAME || ' '  || B.LAST_NAME
                       FROM BLOCK_PHYSICIAN A,PHYSICIAN B
                       WHERE A.PHYSICIAN_FID=B.HYSICIAN_ID AND A.BLOCK_TIME_FID=:b1

   75,265   27,227,798 select client_id,greeting_lu,first_name,last_name
                       ,home_phone,work_phone,current_status_lu,center_code
                       ,start_time
                       ,TO_CHAR(start_time, 'MM/DD/YYYY HH:MI AM') start_datetime
                       ,appointment_type_lu, appointment_id , number_attending
                       ,offsite, offsite_location,treatment_info, bgcolor
                       ,(TO_NUMBER(TO_CHAR(start_time,'HH24'))*60)
                       +TO_NUMBER(TO_CHAR(start_time,'MI')) time,physician
                       ,TO_CHAR(start_time, 'HH:MI AM') start_display_time,note
                       from appointment_view_no_phys
                       where center_code = :1
                       and start_time between TO_DATE(:2,'MM,DD,YYYY')
                                          and TO_DATE(:3,'MM,DD,YYYY')+1
                       and appointment_type_lu in
                           (:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14) 
                       and (current_status_lu not in
                           ('Cancelled','Rescheduled','Attended','NoShowed'))
                       order by start_time,last_name,first_name

Remember that these numbers are those that are emerging since the instance has come up.
We see that the first statement is doing less than 10 buffer reads per execution on an average. Because I know that these tables are in the keep buffers, this makes sense. The second statement is doing about 361 buffer gets per execution. This will be one to investigate and see how this view is joining. Maybe we can make this more efficient.

Search This Blog

World of Oracle DBA