보미네 :: i/o

i/o

Uncategorized / 2011. 7. 8. 02:49

In this post from last year, I discussed how random I/Os are slower than sequential I/Os (particularly for conventional rotating hard drives). For this reason, SQL Server often favors query plans that perform sequential scans of an entire table over plans that perform random lookups of only a portion of a table. (See the last example in this post for a simple demonstration.) In other cases, instead of performing a sequential scan, SQL Server introduces a sort operator whose sole purpose is to convert random I/Os into sequential I/Os.
Let's look at an example of such a sort. To measure the performance effects, we'll need a reasonably large table. The following script creates a 25.6 million row table that consumes about 3 GBytes of storage.

CREATE DATABASE IOTest
    ON ( NAME = IOTest_Data, FILENAME = '...\IOTest_Data.mdf', SIZE = 4 GB )
    LOG ON ( NAME = IOTest_Log, FILENAME = '...\IOTest_Log.ldf', SIZE = 200 MB )
GO
ALTER DATABASE IOTest SET RECOVERY SIMPLE
GO
USE IOTest
GO
CREATE TABLE T (
    PK INT IDENTITY PRIMARY KEY,
    RandKey INT,
    Flags TINYINT,
    Data INT,
    Pad CHAR(100))
GO
SET NOCOUNT ON
DECLARE @I INT
SET @I = 0
WHILE @I < 100000
BEGIN
    WITH
      X2 (R) AS ( SELECT RAND() UNION ALL SELECT RAND() ),
      X4 (R) AS ( SELECT R FROM X2 UNION ALL SELECT R FROM X2 ),
      X8 (R) AS ( SELECT R FROM X4 UNION ALL SELECT R FROM X4 ),
      X16 (R) AS ( SELECT R FROM X8 UNION ALL SELECT R FROM X8 ),
      X32 (R) AS ( SELECT R FROM X16 UNION ALL SELECT R FROM X16 ),
      X64 (R) AS ( SELECT R FROM X32 UNION ALL SELECT R FROM X32 ),
      X128 (R) AS ( SELECT R FROM X64 UNION ALL SELECT R FROM X64 ),
      X256 (R) AS ( SELECT R FROM X128 UNION ALL SELECT R FROM X128 )
    INSERT T (RandKey, Flags, Data, Pad)
        SELECT R * 1000000000, 0xFF, 1, '' FROM X256
    SET @I = @I + 1
END
GO
CREATE INDEX IRandKey on T (RandKey, Flags)
GO

Due to the fixed width Pad column, each row of T consumes 113 bytes (plus overhead). Roughly 65 rows fit on a single 8 Kbyte page. (The Flags column is unused in this example, but I will make use of it in a subsequent post.)
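
As a rough check of these figures, the following sketch reads the row and page counts for each index on T from the sys.dm_db_partition_stats view. The column aliases are illustrative, and the exact counts will depend on your build and on how the load went.

```sql
USE IOTest;
GO
-- Rows and in-row data pages per index on T (index_id 1 = clustered index).
SELECT i.name                    AS index_name,
       ps.index_id,
       ps.row_count,
       ps.in_row_data_page_count,
       -- Average rows per leaf page; NULLIF guards against division by zero.
       CAST(ps.row_count AS FLOAT)
           / NULLIF(ps.in_row_data_page_count, 0) AS rows_per_page
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID('T');
GO
```

For the clustered index, rows_per_page should land near the estimate of 65 rows per page given above.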

The RandKey column, as the name suggests, contains random values. Notice that we have a non-clustered index on this column. Given a predicate on the RandKey column, SQL Server can use this index to fetch qualifying rows from the table. However, because the values in this column are random, the selected rows will be scattered randomly throughout the clustered index.

If we select just a few rows from the table using a filter on RandKey, SQL Server will use the non-clustered index:

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000

|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END))
       |--Stream Aggregate(DEFINE:([Expr1011]=COUNT_BIG([T].[Data]), [Expr1012]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1010]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (1000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

The non-clustered index seek selects a few rows (the use of random keys means that the exact number may vary each time the table is loaded) and looks them up in the clustered index to get the value of the Data column for the SUM aggregate. The non-clustered index seek is very efficient - it likely touches only one page - but the clustered index seek generates a random I/O for each row.
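
The plans shown here have the shape of SHOWPLAN_TEXT output. Assuming that is how they were captured, a sketch like the following displays the estimated plan for a query without actually running it. SET SHOWPLAN_TEXT has to be the only statement in its batch, hence the GO separators.

```sql
USE IOTest;
GO
-- Return the estimated plan as text instead of executing the query.
SET SHOWPLAN_TEXT ON;
GO
SELECT SUM(Data)
FROM T
WHERE RandKey < 1000;
GO
-- Switch back to normal execution.
SET SHOWPLAN_TEXT OFF;
GO
```

Expression names such as Expr1011 and the generated primary key name may differ on your system.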

If we select a large number of rows, SQL Server recognizes that the random I/Os are too expensive and switches to a clustered index scan:

SELECT SUM(Data)
FROM T
WHERE RandKey < 10000000

This query touches only 1% of the data. Still, the query is going to touch more than half of the pages in the clustered index so it is faster to scan the entire clustered index than to perform on the order of 256,000 random I/Os.

Somewhere in between these two extremes things get a little more interesting:

SELECT SUM(Data)
FROM T
WHERE RandKey < 2500000

|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
       |--Stream Aggregate(DEFINE:([Expr1010]=COUNT_BIG([T].[Data]), [Expr1011]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1009]) WITH UNORDERED PREFETCH)
                 |--Sort(ORDER BY:([T].[PK] ASC))
                 |    |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (2500000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

This query touches a mere 0.25% of the data. The plan uses the non-clustered index to avoid unnecessarily touching many rows. Yet, performing 64,000 random I/Os is still rather expensive so SQL Server adds a sort. By sorting the rows on the clustered index key, SQL Server transforms the random I/Os into sequential I/Os. Thus, we get the efficiency of the seek - touching only those rows that qualify - with the performance of the sequential scan.

It is worth pointing out that sorting on the clustered index key will yield rows that are in the logical index order. Due to fragmentation or due simply to the multiple layers of abstraction between SQL Server and the actual hard drives, there is no guarantee that the physical order on disk matches the logical order.

In my next post, I'll run some of these queries and demonstrate the performance implications of the sort.

In my last post, I discussed how SQL Server can use sorts to transform random I/Os into sequential I/Os. In this post, I'll demonstrate directly how such a sort can impact performance. For the following experiments, I'll use the same 3 GByte database that I created last week.

The system I'm using to run this test has 8 GBytes of memory. To exaggerate the performance effects and simulate an even larger table that does not fit in main memory, I'm going to adjust the 'MAX SERVER MEMORY' SP_CONFIGURE option to allow SQL Server to use just 1 GByte of memory. I'm going to use CHECKPOINT to ensure that the newly created database is completely flushed to disk before running any experiments. Finally, I'm going to run DBCC DROPCLEANBUFFERS before each test to ensure that none of the data is cached in the buffer pool between tests.

CHECKPOINT

EXEC SP_CONFIGURE 'SHOW ADVANCED OPTIONS', '1'
RECONFIGURE
EXEC SP_CONFIGURE 'MAX SERVER MEMORY', '1024'
RECONFIGURE

DBCC DROPCLEANBUFFERS

Note that you will NOT want to run these statements on a production server.
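
It also helps to record the memory limit before changing it, so that it can be restored once the experiments are done. The sketch below assumes the server was at the default limit of 2147483647 MB; substitute whatever value the first query reports on your system.

```sql
-- Before the experiment: note the configured and running values.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'max server memory (MB)';
GO
-- After the experiment: restore the limit.
-- 2147483647 MB is the default (an assumption about your server);
-- use the value recorded above if yours was different.
-- Requires 'SHOW ADVANCED OPTIONS' to be 1, as set above.
EXEC SP_CONFIGURE 'max server memory (MB)', 2147483647;
RECONFIGURE;
GO
```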

As I discussed last week, SQL Server can use one of three plans for the following query depending on the value of the constant:

SELECT SUM(Data)
FROM T
WHERE RandKey < constant

To recap, if the constant is small, SQL Server uses a non-clustered index seek and a bookmark lookup. If the constant is large, SQL Server uses a clustered index scan to avoid performing many random I/Os. Finally, if the constant is somewhere in the middle, SQL Server uses the non-clustered index seek but sorts the rows prior to performing the bookmark lookup to reduce the number of random I/Os. You can review last week's post to see examples of each of these plans. I'm going to focus on the third and final plan with the sort.

To demonstrate the benefit of the sort, I need to be able to run the same query with and without the sort. A simple way to make SQL Server remove the sort is to use the following UPDATE STATISTICS statement to trick SQL Server into believing that the table is really small. To ensure that I still get the plan with the non-clustered index seek and the bookmark lookup, I need to add an INDEX hint. I'm also adding a RECOMPILE query hint to ensure that SQL Server generates a new plan after I've altered the statistics.

UPDATE STATISTICS T WITH ROWCOUNT = 1, PAGECOUNT = 1

SELECT SUM(Data)
FROM T WITH (INDEX (IRandKey))
WHERE RandKey < constant
OPTION (RECOMPILE)

I can also reset the statistics using the following statement:

UPDATE STATISTICS T WITH ROWCOUNT = 25600000, PAGECOUNT = 389323

Here is an example of the default plan with the real statistics and with the sort:

|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
       |--Stream Aggregate(DEFINE:([Expr1010]=COUNT_BIG([T].[Data]), [Expr1011]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1009]) WITH UNORDERED PREFETCH)
                 |--Sort(ORDER BY:([T].[PK] ASC))
                 |    |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (2000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

Here is an example of the plan after running UPDATE STATISTICS and without the sort:

|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END))
       |--Stream Aggregate(DEFINE:([Expr1009]=COUNT_BIG([T].[Data]), [Expr1010]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK]))
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (2000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)
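
The post does not say how the execution times were measured. One possible approach, sketched here with the 2,000,000 constant, is to clear the buffer pool and then run the query with SET STATISTICS TIME and SET STATISTICS IO enabled, which report elapsed time and page reads in the Messages output.

```sql
USE IOTest;
GO
-- Start from a cold cache. Do NOT run this on a production server.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO
-- Report CPU/elapsed time and logical/physical reads for each statement.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
GO
SELECT SUM(Data)
FROM T WITH (INDEX (IRandKey))
WHERE RandKey < 2000000
OPTION (RECOMPILE);
GO
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
GO
```

Running the same batch once with the real statistics and once after the ROWCOUNT = 1, PAGECOUNT = 1 trick gives the two timings to compare.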

Here are my results running this query with two values of the constant both with and without the sort. Keep in mind that these results depend greatly on the specific hardware. If you try this experiment, your results may vary.

Constant                    Execution Time    Execution Time    % Increase
                            with Sort         without Sort
2,000,000 (0.2% of rows)    91 seconds        352 seconds       286%
4,000,000 (0.4% of rows)    97 seconds        654 seconds       574%
% Increase: 100%            6%                86%

There are two points worth noting regarding these results. First, it should be very clear that the plan with the sort is significantly faster (up to nearly 7 times faster) than the plan without the sort. This result clearly shows the benefit of sequential vs. random I/Os. Second, doubling the number of rows touched had hardly any effect on the execution time for the plan with the sort but nearly doubled the execution time for the plan without the sort. Adding additional I/Os to the plan with the sort adds only a small incremental cost since the I/Os are sequential and the disk head will pass over the required data exactly once either way. Adding additional I/Os to the plan without the sort adds additional disk seeks and increases the execution time in proportion to the increase in the number of rows. In fact, if the constant is increased further, the execution time of the plan with the sort will continue to increase only gradually, while the execution time of the plan without the sort will continue to increase rapidly.
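
The percentages in the results can be recomputed from the raw timings. The sketch below hard-codes the four measurements reported above; the column names are illustrative, and the VALUES table constructor requires SQL Server 2008 or later.

```sql
-- Timings in seconds, copied from the results above.
SELECT r.Constant,
       r.WithSortSec,
       r.WithoutSortSec,
       -- Relative slowdown when the sort is removed.
       100.0 * (r.WithoutSortSec - r.WithSortSec) / r.WithSortSec AS PctIncrease
FROM (VALUES (2000000, 91, 352),
             (4000000, 97, 654)) AS r (Constant, WithSortSec, WithoutSortSec);
```

The same arithmetic applied down the columns (97 vs. 91 seconds and 654 vs. 352 seconds) gives the 6% and 86% increases for doubling the constant.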

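In my past two posts, I explained how SQL Server may add a sort to the outer side of a nested loops join and showed how this sort can significantly improve performance. In an earlier post, I discussed how SQL Server can use random prefetching to improve the performance of a nested loops join. In this post, I'm going to explore one more nested loops join performance feature. I'll use the same database that I used in my two prior posts. Let's start with the following simple query:

SELECT SUM(Data)
FROM T
WHERE RandKey < 1000

|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END))
       |--Stream Aggregate(DEFINE:([Expr1011]=COUNT_BIG([T].[Data]), [Expr1012]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1010]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (1000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)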

Notice that the nested loops join includes an extra keyword: OPTIMIZED. This keyword indicates that the nested loops join may try to reorder the input rows to improve I/O performance. This behavior is similar to the explicit sorts that we saw in my two previous posts, but unlike a full sort it is more of a best effort. That is, the results from an optimized nested loops join may not be (and in fact are highly unlikely to be) fully sorted.

SQL Server only uses an optimized nested loops join when the optimizer concludes based on its cardinality and cost estimates that a sort is most likely not required, but where there is still a possibility that a sort could be helpful in the event that the cardinality or cost estimates are incorrect. In other words, an optimized nested loops join may be thought of as a "safety net" for those cases where SQL Server chooses a nested loops join but would have done better to have chosen an alternative plan such as a full scan or a nested loops join with an explicit sort. For the above query which only joins a few rows, the optimization is unlikely to have any impact at all.

Let's look at an example where the optimization actually helps:

SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000 AND
    Flags & 0x1 = 0x1 AND
    Flags & 0x2 = 0x2 AND
    Flags & 0x4 = 0x4 AND
    Flags & 0x8 = 0x8

|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1014]=(0) THEN NULL ELSE [Expr1015] END))
       |--Stream Aggregate(DEFINE:([Expr1014]=COUNT_BIG([T].[Data]), [Expr1015]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK], [Expr1013]) OPTIMIZED WITH UNORDERED PREFETCH)
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)), WHERE:(([T].[Flags]&(1))=(1) AND ([T].[Flags]&(2))=(2) AND ([T].[Flags]&(4))=(4) AND ([T].[Flags]&(8))=(8)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

The Flags column contains the value 0xFF in every row. Thus, every one of the bitwise AND predicates evaluates to true and this query returns about 2.5 million rows or 10% of the table. Ordinarily, when faced with a query like this one, SQL Server would resort to a sequential scan of the entire table. Indeed, if you try this query without the extra bitwise filters, you will get a sequential scan. However, SQL Server does not realize that these predicates are always true, estimates a much lower cardinality of less than 10,000 rows, and chooses a simple nested loops join plan. Note that I would generally recommend against using predicates like these ones in a real world application precisely because they will lead to cardinality estimation errors and poor plans.
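
One way to observe this estimation error is to run the query with SET STATISTICS PROFILE enabled and compare the Rows (actual) and EstimateRows columns for the Index Seek operator. This is only a sketch; note that it executes the query, which can take several minutes on a cold cache.

```sql
USE IOTest;
GO
-- Return the executed plan with actual (Rows) and estimated (EstimateRows)
-- row counts for each operator.
SET STATISTICS PROFILE ON;
GO
SELECT SUM(Data)
FROM T
WHERE RandKey < 100000000 AND
    Flags & 0x1 = 0x1 AND
    Flags & 0x2 = 0x2 AND
    Flags & 0x4 = 0x4 AND
    Flags & 0x8 = 0x8;
GO
SET STATISTICS PROFILE OFF;
GO
```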

To see what effect the optimized nested loops join has, let's compare the above plan with an "un-optimized" nested loops join. We can eliminate the optimization by using the following UPDATE STATISTICS statement to trick SQL Server into believing that the table is very small:

UPDATE STATISTICS T WITH ROWCOUNT = 1, PAGECOUNT = 1

I'll compare the above query with the following simpler query which uses essentially the same plan and touches the same data but has an "un-optimized" nested loops join:

SELECT SUM(Data)
FROM T WITH (INDEX (IRandKey))
WHERE RandKey < 100000000

|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1009]=(0) THEN NULL ELSE [Expr1010] END))
       |--Stream Aggregate(DEFINE:([Expr1009]=COUNT_BIG([T].[Data]), [Expr1010]=SUM([T].[Data])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T].[PK]))
                 |--Index Seek(OBJECT:([T].[IRandKey]), SEEK:([T].[RandKey] < (100000000)) ORDERED FORWARD)
                 |--Clustered Index Seek(OBJECT:([T].[PK__T__...]), SEEK:([T].[PK]=[T].[PK]) LOOKUP ORDERED FORWARD)

We can reset the statistics using the following statement:

UPDATE STATISTICS T WITH ROWCOUNT = 25600000, PAGECOUNT = 389323

As in my last post, I'm going to simulate a larger table by reducing the memory available to the server to 1 GByte with SP_CONFIGURE 'MAX SERVER MEMORY' and I'm also going to flush the buffer pool between runs with DBCC DROPCLEANBUFFERS.

Note that you will NOT want to run these statements on a production server.

I ran both of the above queries with three different constants. Here are my results. Keep in mind that these results depend greatly on the specific hardware. If you try this experiment, your results may vary.

Constant                       Execution Time    Execution Time    Increase
                               OPTIMIZED         "un-OPTIMIZED"
10,000,000 (1% of rows)        6.5 minutes       26 minutes        4x
100,000,000 (10% of rows)      10.4 minutes      4.3 hours         25x
250,000,000 (25% of rows)      11.3 minutes      10.6 hours        56x

Clearly the optimized nested loops join can have a huge impact on performance. Moreover, as the plan touches more rows the benefit of the optimization grows dramatically. Although a full scan or a nested loops join with an explicit sort would be faster, the optimized nested loops join really is a safety net protecting against a much worse alternative.


Posted by 보미아빠
