A bloke from Microsoft delivered a talk on statistics, or more specifically on the possibility of comparing a histogram back to the actual data it describes. The general consensus, and I admit it was one I shared, was that there was no obvious benefit to the exercise; however, 24 hours later I am no longer so sure.
As discussed previously, as the size of a table grows, the frequency at which its statistics are automatically updated decreases. This does not always lead to the existing statistics becoming inaccurate; much depends on the profile of the data. If the data being added has a similar profile to the existing data, there is every chance that the statistics will still work just as well as when they were first created. However, if the new data is different enough to skew the overall data profile, it could lead to the query optimiser making bad decisions. But how do we know whether the data profile has been skewed? Or whether we have introduced values that are not even covered by the histogram? This is where a direct comparison between the data and the histogram could be useful.
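Before going to the trouble of that comparison, a rough feel for how much a statistic might have drifted can be had from its modification counter. This is just a hedged preliminary check, not part of the comparison itself, and it assumes sys.dm_db_stats_properties is available (SQL Server 2008 R2 SP2 / 2012 SP1 or later):
-- How many rows have changed in the leading statistics column since each statistic was last updated
Select OBJECT_NAME(st.object_id) Table_Name,
st.name Stat_Name,
sp.last_updated,
sp.rows,
sp.modification_counter
from sys.stats st
cross apply sys.dm_db_stats_properties(st.object_id, st.stats_id) sp
where OBJECTPROPERTY(st.object_id, 'IsUserTable') = 1
order by sp.modification_counter desc
Go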
For the purposes of this demonstration, I have set up a new database, and am turning off AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS:
USE [master]
GO
ALTER DATABASE [StatTest] SET AUTO_CREATE_STATISTICS OFF WITH NO_WAIT
GO
ALTER DATABASE [StatTest] SET AUTO_UPDATE_STATISTICS OFF WITH NO_WAIT
GO
USE [StatTest]
GO
Create table dbo.StatTest1
(
TestValue int
)
GO
insert dbo.StatTest1
Select FLOOR(1000000 * rand())
Go 50000
Create statistics s_TestValue on dbo.StatTest1 (TestValue)
Go
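As a quick sanity check (optional, and not part of the comparison itself), the header of the new statistic can be inspected to confirm how many rows were sampled and how many histogram steps were built:
DBCC show_Statistics('StatTest1','s_TestValue') with Stat_Header
Go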
-- Capture the histogram steps returned by DBCC SHOW_STATISTICS into a temp table
Create table #Histogram
(
ID int identity(1,1),
Range_Hi_Key int,
Range_Rows int,
EQ_Rows int,
Distinct_Range_Rows int,
Avg_Range_rows float
)
insert #Histogram
exec('DBCC show_Statistics(''StatTest1'',''s_TestValue'') with Histogram')
go
-- Turn the histogram steps into explicit low/high ranges by joining each step to the
-- previous one, then count any rows whose value falls outside every range
;With Ranges
as
(
Select ISNULL(b.Range_Hi_Key +1, 0) Range_Low_Key,
a.Range_Hi_Key,
a.Range_Rows + a.EQ_Rows Range_Rows_Total,
a.Distinct_Range_Rows
from #Histogram a
left join #Histogram b on a.ID = b.ID + 1
)
Select COUNT(*) Total_Rows,
SUM(case when r.Range_Low_Key is null then 1 else 0 end) OutOfRange_Rows
from dbo.StatTest1 s
left join Ranges r on s.TestValue between r.Range_Low_Key and r.Range_Hi_Key
Total_Rows      | 50000
OutOfRange_Rows | 0
So, let's sour the data by adding some out-of-range values:
insert dbo.StatTest1
Select 1000000 + FLOOR(1000 * rand())
Go 50000
Total_Rows      | 100000
OutOfRange_Rows | 50000
The statistics would clearly now be inaccurate, so it would be wise to update them and re-run the check.
Update statistics dbo.StatTest1
go
Truncate table #Histogram
go
insert #Histogram
exec('DBCC show_Statistics(''StatTest1'',''s_TestValue'') with Histogram')
go
;With Ranges
as
(
Select ISNULL(b.Range_Hi_Key +1, 0) Range_Low_Key,
a.Range_Hi_Key,
a.Range_Rows + a.EQ_Rows Range_Rows_Total,
a.Distinct_Range_Rows
from #Histogram a
left join #Histogram b on a.ID = b.ID + 1
)
Select COUNT(*) Total_Rows,
SUM(case when r.Range_Low_Key is null then 1 else 0 end) OutOfRange_Rows
from dbo.StatTest1 s
left join Ranges r on s.TestValue between r.Range_Low_Key and r.Range_Hi_Key
go
Total_Rows      | 100000
OutOfRange_Rows | 0
So, how do we know whether new data has skewed the profile within the existing data ranges? The following query examines two key elements of the histogram: the number of rows per range and the number of distinct values per range. If the estimates within the histogram prove to be wrong, it has a significant detrimental effect on the effectiveness of the statistics.
-- Rebuild the ranges from the refreshed histogram and compare, range by range, the
-- actual row and distinct value counts against the histogram's estimates
;With Ranges
as
(
Select ISNULL(b.Range_Hi_Key +1, 0) Range_Low_Key,
a.Range_Hi_Key,
a.Range_Rows + a.EQ_Rows Range_Rows_Total,
a.Distinct_Range_Rows
from #Histogram a
left join #Histogram b on a.ID = b.ID + 1
),
Summary as
(
Select ISNULL(cast(Range_Hi_Key as varchar(10)), 'Out of Range') Range_Hi_Key,
COUNT(*) Actual_Rows,
isnull(max(Range_Rows_Total),0) Estimated_Rows,
COUNT(*) - isnull(max(Range_Rows_Total),0) Row_Variance,
COUNT(distinct s.TestValue) Actual_Distinct_Rows,
isnull(MAX(Distinct_Range_Rows),0) Est_Distinct_Rows,
COUNT(distinct s.TestValue) - isnull(MAX(Distinct_Range_Rows),0) DistinctVariance
from Ranges r
left join dbo.StatTest1 s on s.TestValue between r.Range_Low_Key and r.Range_Hi_Key
group by ISNULL(cast(Range_Hi_Key as varchar(10)), 'Out of Range')
)
-- Roll the per-range figures up into overall accuracy measures
Select sum(Actual_Rows) Actual_Rows,
sum(Estimated_Rows) Estimated_Rows,
max(Row_Variance) Max_Row_Variance,
avg(Row_Variance) Avg_Row_Variance,
sum(Actual_Distinct_Rows) Actual_Distinct_Rows,
sum(Est_Distinct_Rows) Estimated_Distinct_Rows,
max(DistinctVariance) Max_DistinctVariance,
avg(DistinctVariance) Avg_DistinctVariance
from Summary
Actual_Rows             | 100000
Estimated_Rows          | 100000
Max_Row_Variance        | 0
Avg_Row_Variance        | 0
Actual_Distinct_Rows    | 49722
Estimated_Distinct_Rows | 49592
Max_DistinctVariance    | 1
Avg_DistinctVariance    | 1
It would appear that the estimates within the histogram are very accurate. Now let's skew the data by inserting 50,000 new records into a small number of existing ranges:
insert dbo.StatTest1
Select 700000 + FLOOR(50000 * rand())
Go 50000
Actual_Rows             | 150000
Estimated_Rows          | 100000
Max_Row_Variance        | 22302
Avg_Row_Variance        | 384
Actual_Distinct_Rows    | 79836
Estimated_Distinct_Rows | 49592
Max_DistinctVariance    | 13501
Avg_DistinctVariance    | 232
As you can see, there is now a significant variance in both the rows per range and the distinct values per range, which would most likely result in below-par execution plans.
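If you want to see the consequence for yourself, one hedged way is to run a range query against the skewed area and compare the row estimate shown in the actual execution plan with the count returned. The boundaries below simply mirror the insert above, and the exact figures will vary with the random data:
-- Illustrative only: the optimiser's estimate for this predicate still comes from
-- the old histogram steps, while the real count reflects the newly skewed data
Select COUNT(*) Actual_Rows_In_Skewed_Range
from dbo.StatTest1
where TestValue between 700000 and 750000
Go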
Conclusions
What I hope to have shown is that it is possible to audit statistics for accuracy. The queries above are far from a complete audit solution, but I hope they will prove a useful starting point for anyone wanting to build one.
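As one possible starting point, the histogram capture could be parameterised with a little dynamic SQL so that the same comparison queries can be pointed at any table and statistic. The procedure below is purely an illustrative sketch; its name and parameters are my own invention rather than part of the original demo:
Create procedure dbo.CaptureHistogram
@TableName sysname,
@StatName sysname
as
begin
-- Build the DBCC command for the requested table and statistic
declare @sql nvarchar(max) =
N'DBCC SHOW_STATISTICS (' + quotename(@TableName, '''')
+ N', ' + quotename(@StatName, '''') + N') with Histogram'

-- #Histogram must already exist in the calling session (created as above);
-- the explicit column list skips its identity column
insert #Histogram (Range_Hi_Key, Range_Rows, EQ_Rows, Distinct_Range_Rows, Avg_Range_rows)
exec (@sql)
end
Go
-- Example usage: refresh the captured histogram for the demo statistic
Truncate table #Histogram
exec dbo.CaptureHistogram 'StatTest1', 's_TestValue'
Go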