Selection Of Indexes By Optimizer

Table of contents
Types of indexes
    Clustered Index
    Non-clustered Index
Selection of indexes by Optimizer
    Search Arguments

    Index Selection
    Index Distribution Stats
    Index Density
    Index Covering
Optimization of Cursors
Optimization of temp tables

Some general rules for optimization
Some useful commands

Types of indexes

Clustered Index (CI)
The data in this case is physically stored in the order of the index. The leaf level of the index is the same as the data pages.

There can be only one CI on a table, as the data can physically be sorted in only one order. Selects are extremely efficient with a CI. The CI is extremely efficient in the following cases:
a) where f_name like 'Ram%'
b) where author_id between 1 and 7
c) where Price > 345.34
d) group by author_id
e) order by author_name

Non-clustered index (NI)
The data in this case is not stored in the order of the index. The leaf level of the index contains the index keys and a pointer to each row as a row ID (page no. + row offset). There can be up to 249 NIs on a table. An NI should be used when:
a) The number of rows returned is small.
b) The where clause limits the number of rows (usually the '=' operator).
c) The query can be covered.
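As a point of reference, here is a minimal Transact-SQL sketch of the two index types; the Authors table and the index names are illustrative assumptions, not taken from a real schema:

create clustered index author_ci on Authors (author_id)
go
create nonclustered index author_ni on Authors (l_name)
go
-- The CI serves range and ordering queries well:
select * from Authors where author_id between 1 and 7
-- The NI suits queries that return only a few rows:
select * from Authors where l_name = 'Gray'
go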

Selection of indexes by Optimizer

SEARCH ARGUMENTS (SARG)
These are the expressions in the where clause that compare a column with a constant. They act as a kind of (dis)incentive for the optimizer to use the index on the column. Some search arguments are:
where author_id = '13'
where f_name like 'Ram%'
where Price > 2347.32
Some expressions that are not valid SARGs, together with valid rewrites, are:
Invalid                                  Valid
Price * 1.5 = 1000                       Price = 1000 / 1.5
Qty + 10 = 200                           Qty = 200 - 10
f_name + ' ' + l_name = 'John Gray'      f_name = 'John' and l_name = 'Gray'
substring(f_name, 1, 3) = 'KIR'          f_name like 'KIR%'
isnull(l_name, 'N') = 'N'                l_name is null
The index might not be used in the case of the following SARGs:
1) No starting point for the index, e.g. where l_name like '%abc'
2) Non-matching datatypes. In SQL Server, null and not null columns are held differently; a char null column is the same as varchar.

So when a char null column is compared with a char not null column, the optimizer has to convert the datatype implicitly, which it cannot do at planning time. In both the above cases, distribution statistics are not used.

INDEX SELECTION
The optimizer first checks whether the columns in the where clause match the columns of any index. If yes, it proceeds further. It then checks whether the where clause contains any SARG. If there is a valid SARG, the optimizer looks up the distribution statistics of the index, as in:
select distribution from sysindexes where id = object_id('Authors')
If there are no statistics for the index, the optimizer defaults to a fixed percentage of rows depending upon the operator in the where clause:
Equality         10%
Closed interval  25%
Open interval    33%
The index distribution statistics will also not be used when the optimizer does not know the value of the RHS of a where clause at planning time, as in the following:
where Price = 2000 * 12
where author_id = @au_id
where author_id in (select author_id from...

In all these cases, index density is used to estimate the number of expected rows.

INDEX DISTRIBUTION STATS
These give an estimated number of qualifying rows. The distribution page contains a fixed number of samples with a fixed step size. The values are calculated as:
Number of samples = (Page size - (No. of keys * 2)) / (Size of index key + Overhead)
The overhead is 2 bytes for fixed-length columns and 7 bytes for variable-length columns. Up to a maximum of 256 samples are allowed in this page. The sample step size is calculated as:
Sample step size = No. of records in the table / Number of samples
The more samples there are, the smaller the sample step size and the better the statistics.
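To make the arithmetic concrete, here is a rough worked example; the figure of 2,000 usable bytes on the distribution page and the 100,000-row table are assumptions for illustration only, and the exact usable space varies by server version:

For a single-column int index (key size 4 bytes, fixed-length overhead 2 bytes, so the "No. of keys * 2" term is just 2):
    Number of samples = (2000 - 2) / (4 + 2) = 333, capped at 256
    Sample step size  = 100,000 / 256 = about 391 rows per step
For a single-column char(12) index (key size 12 bytes, fixed-length overhead 2 bytes):
    Number of samples = (2000 - 2) / (12 + 2) = about 142
    Sample step size  = 100,000 / 142 = about 704 rows per step
The int key therefore gets finer-grained statistics than the char(12) key.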

As the number of samples depends on the key size, everything else remaining the same, an int key is always better than a char(12) key. Also, because of the limitation of the page size, the index stats are calculated only on the first column of a composite index. Hence, the most selective column should be the first column of a composite index.

INDEX DENSITY
It is a measure of the amount of duplication in the key values of an index.

It is the average percentage of duplicate values for the index: the smaller it is, the more unique the data.

INDEX COVERING
An NI provides a special kind of optimization called index covering. When the columns selected, as well as those used in the where clause, are all included in the index, the optimizer knows that it does not have to go to the actual data page to retrieve the data. For example, suppose the NI author_in_2 on Authors is on (author_id, f_name, l_name).

So the query
select l_name from Authors where author_id = 12 and f_name = 'John'
does not have to go to the data page, as the information is stored in the leaf level of the NI. Even if we select some column that is not in the NI, the SQL Server has to make just one read of the data page to get the data. Index covering works even when there is no SARG. For example, the query
select count(*) from Authors
will use the NI with the least number of leaf-level pages and scan the leaf level of the NI rather than scanning at the data page level.
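A quick way to confirm this behaviour is to use the commands described later in this document; the exact showplan wording varies by server version:

set showplan on
set statistics io on
go
select l_name from Authors where author_id = 12 and f_name = 'John'
go
-- The plan and the I/O counts should show that only leaf pages of the
-- nonclustered index are read, with no access to the data pages of Authors.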

NOTE: A CI is, by definition, covering, although a scan at that level is the same as a table scan.

Optimization of Cursors

Memory use for cursors
Cursors use memory and require locks on tables, data pages and index pages. The memory and lock uses for the different cursor commands are:
1) declare cursor: Memory is allocated to the cursor and to store the query plan that is generated. If, however, the cursor is being used in a stored procedure, no additional memory is required and the memory available for the procedure is utilized.

2) open cursor: Intent table locks are held on the tables.
3) fetch cursor: The server holds page locks on the data pages that store the rows, thereby locking out updates by other processes.

Cursor vs. non-cursor performance
Cursors require locks for their operations, which can cause concurrency problems in a high-transaction environment. In addition, cursors require much more network activity, because the application program needs to communicate with the SQL Server about every result row of the query.

So a cursor almost always takes more time to complete than the equivalent plain query. Moreover, cursors incur the overhead of processing instructions. Cursors should therefore be used only when they are absolutely necessary, especially in these cases:
1) When the volume of data is so large that intermediate processing cannot be done in temporary tables.
2) When the data processed is critical, such that very frequent commits are needed.
3) When the data processed is sensitive, because tempdb is always visible to all users.
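For reference, a minimal Transact-SQL cursor skeleton looks roughly like this; the table and column names are illustrative, and the exact deallocate and status-checking syntax varies between Sybase and Microsoft versions:

declare au_cursor cursor for
    select author_id, l_name from Authors where l_name like 'G%'
    for read only          -- state the intent explicitly (see the tips below)
go
open au_cursor
fetch au_cursor
-- ... fetch in a loop, checking the server's fetch-status variable, until no rows remain ...
close au_cursor
deallocate cursor au_cursor
go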

Tips for optimization of cursors
1) The intent of the cursor should be specified as 'read only' or 'for update of <column list>'. This gives greater control over the concurrency implications. If the intent is not specified, the SQL Server decides it itself and very often uses updateable cursors. Updateable cursors use update locks, thereby preventing other update or exclusive locks.
2) Optimize the cursor select queries using the cursor itself and not ad hoc queries.

A standalone select query may be optimized very differently than the same select statement in a cursor.
3) Use 'union' instead of 'or' or 'in'.
4) The column names should be specified in the 'for update' clause. If not, the SQL Server acquires update locks on all the tables referenced in the from clause.
5) Keep cursors open across commit and rollback. If you keep the transaction short to avoid holding locks for a long time, opening and closing the cursors can hurt throughput, since the SQL Server needs to rematerialize the result set each time the cursor is opened.

If, however, you choose to keep the transaction long, there can be concurrency problems. So decide as per your requirements.

Optimization of temp tables
Locking in tempdb should be avoided as much as possible. Locking can be caused by frequent creation and dropping of objects. For example, creating and dropping tables and indexes in tempdb always acquires an exclusive lock on system tables such as sysobjects, syscolumns and sysindexes, as an entry has to be made in these tables. If multiple user processes are creating and dropping tables in tempdb, heavy contention can occur on the system tables.

Logging should also be avoided as much as possible. The various ways in which queries or stored procedures using temporary tables can be optimized are:
1) Logging can be minimized by using 'select into' rather than create table and insert. As soon as a temporary table is no longer needed, the occupied space can be cleared using truncate. Just like select into, truncate is a non-logged command.
2) Try to use shorter rows in the queries. That is, instead of using select *, use only the column names that you require.

3) Limit the number of rows to just the rows that the application requires.
4) Indexes should be created on temporary tables. The index must exist at the time the query using it is optimized, so you cannot create an index and then use it in the same stored procedure.
5) The index should be created only after there is some data in the table. If you create an index on an empty table, no statistics page is created for it. (A short sketch combining points 1-3 and 5 follows.)
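A rough sketch of these points; the temporary table name and the 'state' filter column are illustrative assumptions:

select author_id, l_name                      -- short rows: only the columns required (point 2)
into #ca_authors                              -- select into: minimally logged (point 1)
from Authors
where state = 'CA'                            -- limit the rows to what is needed (point 3)
go
create index ca_authors_idx on #ca_authors (author_id)   -- created after the table has data (point 5)
go
-- ... use #ca_authors in further queries ...
truncate table #ca_authors                    -- non-logged cleanup when no longer needed
drop table #ca_authors
go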

6) If a temporary table is created in a stored procedure, the optimizer does not know its size beforehand and assumes it to hold 100 rows in 10 data pages. Any queries referring to this table are optimized assuming that amount of data. If the table actually contains much more data than that, the query plan that was created could be sub-optimal. Therefore, if possible, the table should be created in the outer procedure and modified in the inner one.

Some general rules for optimization
1) Try to use '>=' instead of '>', as in the second case all the rows qualifying the '=' case will be picked up first and then separated from the rest before the result is returned.

2) Try to use EXISTS instead of NOT EXISTS, as the former will stop searching as soon as it finds any matching row.
3) OR vs. UNION: The 'OR' clause is optimized if there is a valid SARG in the where clause or when there are no joins. The 'UNION' is optimized when there are joins in the where clause. So the following should be the practice:
select * from Authors where f_name = 'John' OR l_name = 'Gray'
select a.* from Authors a, Titles t
where a.author_id = t.author_id and t.title_name like 'abc%'
UNION
select a.* from Authors a, Titles t
where a.author_id = t.author_id and t.type = 'literature'
and not like this:
select * from Authors where f_name = 'John'
UNION
select * from Authors where l_name = 'Gray'
select a.* from Authors a, Titles t
where a.author_id = t.author_id and (t.title_name like 'abc%' OR t.type = 'literature')
However, the results of queries involving OR and UNION might differ: UNION ALL allows all duplicate rows, UNION removes them, and OR might return some duplicate values.
4) IN vs. OR: IN is treated the same as a series of ORs. For example, the query
select * from Authors where author_id = '12' or author_id = '13' or author_id = '14'
is the same as the query
select * from Authors where author_id in ('12', '13', '14')
5) The optimizer should be given as much information in the where clause as possible.

For example, the query
select a.* from Authors a, Titles t
where a.author_id = '12' and t.author_id = '12' and a.author_id = t.author_id and t.title_name like 'abc%'
will always give more flexibility to the optimizer during the planning phase than the query
select a.* from Authors a, Titles t
where a.author_id = '12' and a.author_id = t.author_id and t.title_name like 'abc%'
In the second case, the optimizer does not know at planning time what values the column t.author_id could take. This is more pronounced when an input parameter of a stored procedure is used to match columns in different tables.
6) There is no particular order in which the columns should be specified in a where clause for an index to be used. The optimizer is smart that way.

However, a where clause must contain the leading column of a composite index for the index to be used.
7) While creating a composite index, the most restrictive column should be the first column in it. This will be the column on which the distribution statistics are calculated.
8) Index entries should be kept as small as possible. Variable-length columns should not be included if possible, as they require a lot of overhead; including them results in fewer rows per page of the index and consequently more pages at the leaf level and sometimes at intermediate levels too.

9) When two or more indexes look exactly the same to the optimizer, the one that was created earlier might be used. For example, if there are two indexes on Authors, created in this order:
author_in_1 (author_id, f_name, l_name)
author_in_2 (author_id, address)
then the query
select address from Authors where author_id = '13'
might use the first index, as the index stats are calculated only on the first column of a composite index and so the two appear the same to the optimizer. In this case, either index can be forced, like:
select address from Authors (author_in_2) where author_id = '13'
Now the optimizer knows that it has to use the second index. However, the stress should be on finding out why the optimizer did not use the second index by itself.
10) JOIN processing: For every qualifying row of the outer table, the inner tables are accessed. So if we have three tables A, B, C in a join, with A the outermost table and C the innermost, then for every qualifying row of A (say 15) table B will be accessed, and for every qualifying row of B (say 20) table C will be accessed.

So table C will be accessed 15 * 20 = 300 times. The tables used in joins should therefore be indexed properly.
11) There is no optimum sequence of the tables in the from clause if you are not forcing the plan using the 'set forceplan' command. However, if you are forcing a plan, then in the clause
from A, B, C, D, E
A will be the outermost table and E the innermost. So maintain the order accordingly.
12) If a variable has to be used as a SARG, a stored procedure should be created instead.

For example, the batch
declare @au_id int
select @au_id = author_id from Authors where l_name = 'Gray'
select * from Titles where author_id = @au_id
should, if possible, be converted into
declare @au_id int
select @au_id = author_id from Authors where l_name = 'Gray'
exec diff_proc @au_id
where the procedure is
create proc diff_proc @au_id int as
select * from Titles where author_id = @au_id
13) If there are a number of tables in a from clause, the optimizer will still consider only four tables at a time to create the best plan for the query. If you think that your plan is better than what the optimizer chooses, you can change the number of tables considered by the optimizer using the 'set tablecount' command.

Some useful commands
1) set showplan on: This shows the plan that the optimizer chooses. It is often used in conjunction with set noexec on to avoid cluttering the plan with the selected data.
2) set forceplan on: This forces the plan you want to use.

The stress should, however, be on finding out why the optimizer is not using the plan that you want it to use, as the optimizer knows the internals better than we do.
3) set statistics io on: This gives you the number of physical as well as logical reads made by the query. Physical reads are reads from disk; logical reads are reads from the cache. Before reading any page, the SQL Server first brings it from disk into the cache, so the number of logical reads is always greater than or equal to the number of physical reads.

4) set statistics time on: This gives the elapsed time, in milliseconds, for parsing, planning and executing the query.
5) set tablecount: This forces the table count to the number you specify.
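A small sketch of how these commands are typically combined in a session; the query itself is illustrative:

set showplan on
go
set noexec on            -- look at the chosen plan without actually running the query
go
select l_name from Authors where author_id = '13'
go
set noexec off
go
set showplan off
go
set statistics io on
set statistics time on   -- now run the query and examine the reads and timings
go
select l_name from Authors where author_id = '13'
go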