A bloke from Microsoft delivered a talk on statistics, or more specifically on the possibility of comparing a histogram back to the actual data it describes. The general consensus, and I admit it was one I shared, was that there was no obvious benefit to the exercise; however, 24 hours later I am no longer so sure.
As discussed previously, as the size of a table grows, the frequency at which its statistics are automatically updated decreases. This does not always lead to the existing statistics becoming inaccurate; much depends on the profile of the data. If the data being added has a similar profile to the existing data, there is every chance that the statistics will still work just as well as when they were first created. However, if the new data is different enough to skew the overall data profile, it could lead to the query optimiser making bad decisions. But how do we know whether the data profile has been skewed? Or whether we have introduced values that are not even covered by the histogram? This is where a direct comparison between the data and the histogram could be useful.
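Before going to the trouble of that comparison, a rough feel for how much a statistic might have drifted can be had from its modification counter. This is just a hedged preliminary check, not part of the comparison itself, and it assumes sys.dm_db_stats_properties is available (SQL Server 2008 R2 SP2 / 2012 SP1 or later):
-- How many rows have changed in the leading statistics column since each statistic was last updated
Select OBJECT_NAME(st.object_id) Table_Name,
st.name Stat_Name,
sp.last_updated,
sp.rows,
sp.modification_counter
from sys.stats st
cross apply sys.dm_db_stats_properties(st.object_id, st.stats_id) sp
where OBJECTPROPERTY(st.object_id, 'IsUserTable') = 1
order by sp.modification_counter desc
Go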
For the purposes of this demonstration, I have set up a new database, and am turning off AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS:
USE [master]
GO
ALTER DATABASE [StatTest] SET AUTO_CREATE_STATISTICS OFF WITH NO_WAIT
GO
ALTER DATABASE [StatTest] SET AUTO_UPDATE_STATISTICS OFF WITH NO_WAIT
GO
USE [StatTest]
GO
Create table dbo.StatTest1
(
TestValue int
)
GO
insert dbo.StatTest1
Select FLOOR(1000000 * rand())
Go 50000
Create statistics s_TestValue on dbo.StatTest1 (TestValue)
Go
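As a quick sanity check (optional, and not part of the comparison itself), the header of the new statistic can be inspected to confirm how many rows were sampled and how many histogram steps were built:
DBCC show_Statistics('StatTest1','s_TestValue') with Stat_Header
Go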
-- Capture the histogram steps returned by DBCC SHOW_STATISTICS into a temp table
Create table #Histogram
(
ID int identity(1,1),
Range_Hi_Key int,
Range_Rows int,
EQ_Rows int,
Distinct_Range_Rows int,
Avg_Range_rows float
)
insert #Histogram
exec('DBCC show_Statistics(''StatTest1'',''s_TestValue'') with Histogram')
go
-- Turn the histogram steps into explicit low/high ranges by joining each step to the
-- previous one, then count any rows whose value falls outside every range
;With Ranges
as
(
Select ISNULL(b.Range_Hi_Key +1, 0) Range_Low_Key,
a.Range_Hi_Key,
a.Range_Rows + a.EQ_Rows Range_Rows_Total,
a.Distinct_Range_Rows
from #Histogram a
left join #Histogram b on a.ID = b.ID + 1
)
Select COUNT(*) Total_Rows,
SUM(case when r.Range_Low_Key is null then 1 else 0 end) OutOfRange_Rows
from dbo.StatTest1 s
left join Ranges r on s.TestValue between r.Range_Low_Key and r.Range_Hi_Key
Total_Rows      | 50000
OutOfRange_Rows | 0
So, let's sour the data by adding some out-of-range values:
insert dbo.StatTest1
Select 1000000 + FLOOR(1000 * rand())
Go 50000
Total_Rows      | 100000
OutOfRange_Rows | 50000
The statistics would clearly now be inaccurate, so it would be wise to update them and re-run the check.
Update statistics dbo.StatTest1
go
Truncate table #Histogram
go
insert #Histogram
exec('DBCC show_Statistics(''StatTest1'',''s_TestValue'') with Histogram')
go
;With Ranges
as
(
Select ISNULL(b.Range_Hi_Key +1, 0) Range_Low_Key,
a.Range_Hi_Key,
a.Range_Rows + a.EQ_Rows Range_Rows_Total,
a.Distinct_Range_Rows
from #Histogram a
left join #Histogram b on a.ID = b.ID + 1
)
Select COUNT(*) Total_Rows,
SUM(case when r.Range_Low_Key is null then 1 else 0 end) OutOfRange_Rows
from dbo.StatTest1 s
left join Ranges r on s.TestValue between r.Range_Low_Key and r.Range_Hi_Key
go
Total_Rows      | 100000
OutOfRange_Rows | 0
So, how do we know whether new data has skewed the profile within the existing data ranges? The following query examines two key elements of the histogram: the number of rows per range and the number of distinct values per range. If the estimates within the histogram prove to be wrong, it has a significant detrimental effect on the effectiveness of the statistics.
-- Rebuild the ranges from the refreshed histogram and compare, range by range, the
-- actual row and distinct value counts against the histogram's estimates
;With Ranges
as
(
Select ISNULL(b.Range_Hi_Key +1, 0) Range_Low_Key,
a.Range_Hi_Key,
a.Range_Rows + a.EQ_Rows Range_Rows_Total,
a.Distinct_Range_Rows
from #Histogram a
left join #Histogram b on a.ID = b.ID + 1
),
Summary as
(
Select ISNULL(cast(Range_Hi_Key as varchar(10)), 'Out of Range') Range_Hi_Key,
COUNT(*) Actual_Rows,
isnull(max(Range_Rows_Total),0) Estimated_Rows,
COUNT(*) - isnull(max(Range_Rows_Total),0) Row_Variance,
COUNT(distinct s.TestValue) Actual_Distinct_Rows,
isnull(MAX(Distinct_Range_Rows),0) Est_Distinct_Rows,
COUNT(distinct s.TestValue) - isnull(MAX(Distinct_Range_Rows),0) DistinctVariance
from Ranges r
left join dbo.StatTest1 s on s.TestValue between r.Range_Low_Key and r.Range_Hi_Key
group by ISNULL(cast(Range_Hi_Key as varchar(10)), 'Out of Range')
)
-- Roll the per-range figures up into overall accuracy measures
Select sum(Actual_Rows) Actual_Rows,
sum(Estimated_Rows) Estimated_Rows,
max(Row_Variance) Max_Row_Variance,
avg(Row_Variance) Avg_Row_Variance,
sum(Actual_Distinct_Rows) Actual_Distinct_Rows,
sum(Est_Distinct_Rows) Estimated_Distinct_Rows,
max(DistinctVariance) Max_DistinctVariance,
avg(DistinctVariance) Avg_DistinctVariance
from Summary
Actual_Rows             | 100000
Estimated_Rows          | 100000
Max_Row_Variance        | 0
Avg_Row_Variance        | 0
Actual_Distinct_Rows    | 49722
Estimated_Distinct_Rows | 49592
Max_DistinctVariance    | 1
Avg_DistinctVariance    | 1
It would appear that the estimates within the histogram are very accurate. Now let's skew the data by inserting 50,000 new records into a small number of existing ranges:
insert dbo.StatTest1
Select 700000 + FLOOR(50000 * rand())
Go 50000
Actual_Rows             | 150000
Estimated_Rows          | 100000
Max_Row_Variance        | 22302
Avg_Row_Variance        | 384
Actual_Distinct_Rows    | 79836
Estimated_Distinct_Rows | 49592
Max_DistinctVariance    | 13501
Avg_DistinctVariance    | 232
As you can see, there is now a significant variance in both the rows per range and the distinct values per range, which would most likely result in below-par execution plans.
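If you want to see the consequence for yourself, one hedged way is to run a range query against the skewed area and compare the row estimate shown in the actual execution plan with the count returned. The boundaries below simply mirror the insert above, and the exact figures will vary with the random data:
-- Illustrative only: the optimiser's estimate for this predicate still comes from
-- the old histogram steps, while the real count reflects the newly skewed data
Select COUNT(*) Actual_Rows_In_Skewed_Range
from dbo.StatTest1
where TestValue between 700000 and 750000
Go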
Conclusions
What I hope to have shown is that it is possible to audit statistics for accuracy. The queries above are far from a complete audit solution, but I hope they will prove a useful starting point for anyone wanting to build one.
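As one possible starting point, the histogram capture could be parameterised with a little dynamic SQL so that the same comparison queries can be pointed at any table and statistic. The procedure below is purely an illustrative sketch; its name and parameters are my own invention rather than part of the original demo:
Create procedure dbo.CaptureHistogram
@TableName sysname,
@StatName sysname
as
begin
-- Build the DBCC command for the requested table and statistic
declare @sql nvarchar(max) =
N'DBCC SHOW_STATISTICS (' + quotename(@TableName, '''')
+ N', ' + quotename(@StatName, '''') + N') with Histogram'

-- #Histogram must already exist in the calling session (created as above);
-- the explicit column list skips its identity column
insert #Histogram (Range_Hi_Key, Range_Rows, EQ_Rows, Distinct_Range_Rows, Avg_Range_rows)
exec (@sql)
end
Go
-- Example usage: refresh the captured histogram for the demo statistic
Truncate table #Histogram
exec dbo.CaptureHistogram 'StatTest1', 's_TestValue'
Go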