By this point, we’ve had our new hardware racked ‘n’ stacked and have run a series of disk subsystem tests, all based on DiskSpd, Microsoft’s successor to SQLIO. The numbers look promising, but we can’t overlook the fact that it wasn’t a SQL query stressing the system; it was essentially the operating system pegging the disk subsystem to read and write from disk as quickly as it could.
We need to run SQL queries to see how effective the new disk subsystem will be in a real-world setting. But we want to be fair and use a proper benchmark so that we can test different hardware configurations and use the one that’s most effective for our data warehouse workload.

Maximum Potential

One of the foundations of the Fast Track Data Warehouse initiative, and of hardware performance testing across the server industry, is the use of Transaction Processing Performance Council (TPC) benchmarks. The TPC is a nonprofit organization that defines a set of standards for data retrieval for database engines. Through its toolkits, it provides data models, data generation scripts for those models, and query generators to test them. For data warehousing, the relevant benchmark here is TPC-H. It uses a relatively small third normal form sales order data model. While 3NF doesn’t sound appropriate for a data warehouse (you may be thinking of a Ralph Kimball star schema instead), the Order and OrderDetail tables in TPC-H are quite good for testing sequential read throughput on our new hardware. The key is to build a fact table large enough to surpass the SAN’s cache, so that you’re reading from disk rather than from cache, which would skew the read performance numbers for the SAN.

Fast Track indicates two baseline tests to run when you’re testing vendor-defined gear as part of a Proof of Concept, or when you’re building your own configuration.

The first baseline test is to calculate the Maximum Consumption Rate (MCR) per core on the host. This is done by observing the throughput of the host while it runs a table-scan query against a table that’s already in the buffer cache. It’s a measure of how well the CPUs on the host perform, ignoring the disk subsystem.

The second baseline test is to calculate the Benchmark Consumption Rate (BCR), which is a real-world test of a cross-section of queries you anticipate running in your data warehouse. Just like in the real world, your data won’t sit in memory forever, so it shouldn’t be cached in the buffer pool. You’ll issue a DBCC DROPCLEANBUFFERS command to clear the buffer pool so you get a proper end-to-end test from the disk subsystem to the CPUs.
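
Clearing the buffer pool between runs is just a couple of lines; the CHECKPOINT isn’t part of the calculation itself, but it’s commonly issued first so that dirty pages are written to disk and can actually be evicted:

-- Write dirty pages to disk so they become clean and can be dropped
CHECKPOINT;
-- Drop the clean pages from the buffer pool so the next query reads from disk
DBCC DROPCLEANBUFFERS;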

Both Consumption Rate calculations are based on tabulating the number of logical reads for the query and dividing by the CPU time expended. The basic formula is this…

 (Total Logical Reads / (CPU time in milliseconds / 1,000)) * 8,192 / 1,024 / 1,024 = MB/sec/core

The logical reads are reported to you in pages, which are 8K (8,192 bytes) in size apiece. After multiplying by the page size, we divide twice by 1,024 to go from bytes to KB and then to MB, giving MB per second per core.

More Stats than Fantasy Baseball

To capture the logical reads for the query and CPU time utilized, you’ll need to issue two statements at the start of the session.
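
In this case they’re the standard statistics settings, along these lines:

SET STATISTICS TIME ON;  -- reports CPU time and elapsed time in milliseconds
SET STATISTICS IO ON;    -- reports logical, physical and read-ahead reads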

The former statement will capture CPU time in milliseconds. The latter will capture your logical reads from the buffer pool, physical reads from disk and, most importantly for Enterprise Edition users, read-ahead reads. While most SQL Server read activity is either 8K or 64K in size, read-ahead reads can be up to 512K, which is where we get our optimal throughput.

Your first baseline test, the MCR, can be done on the order detail table from the TPC-H benchmark or a comparable fact table, provided the table fits within the buffer pool as bounded by the instance’s Max Server Memory setting and the physical RAM onboard. You’ll run the query once to load the table into the buffer pool, then run it at least 5 times (to weed out the outliers) at a MAXDOP of 4 to determine the base throughput capability of a single core on the host.

I compiled the TPC-H dbgen utility to build a data set and figured I’d run an initial test on a 1.7 GB TPC-H Orders table on my laptop to try out the methodology. My laptop, an ASUS G751JT desktop-equivalent gaming rig running SQL Server 2014 Developer Edition with 16 GB RAM and a quad-core Intel i7-4720HQ processor @ 2.6 GHz, is not exactly your typical business laptop.

Here’s the first query…
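
In essence, it’s a straight aggregate over the entire Orders table with parallelism capped at 4; something along the lines of this sketch (the aggregate column choice is illustrative):

-- Run once to load the table into the buffer pool, then repeat at least 5 times
SELECT COUNT_BIG(*)      AS order_count,
       SUM(O_TOTALPRICE) AS total_price
FROM   dbo.Orders
OPTION (MAXDOP 4);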

And here are the results from the SET statements in the Messages tab…

And here is the execution plan verifying the DB engine performed a full-table scan.

The output is chock full of information for our calculations.

  1. Scan count: this reflects the number of threads in play for parallelization. With a MAXDOP setting of 4, scan count will be 4 worker threads + 1 for the controller thread that consolidates the workloads, totaling 5.
  2. Logical Reads: For the DB engine to work on any data pages, they are always read from the buffer pool. This represents the total reads performed for the query.
  3. Physical Reads: If any data wasn’t in the buffer pool, the reads from disk to load it into the buffer pool show up here. For an MCR test, the data needs to be pre-buffered, so the 0 reading here means we were set up correctly.
  4. Read-Ahead Reads: For Enterprise edition users, the 512K reads would be indicated here. For the MCR test, this was 0 as expected.
  5. CPU time: the actual work time spent by the CPU, not including any resource wait time.
  6. Elapsed time: this is the time elapsed as experienced by the client (in this case SSMS on my laptop) from the moment the query started to the display of the last row in the result set.

Using these numbers above, let’s calculate our MCR…

(Total Logical Reads / (CPU time in milliseconds / 1,000)) * 8,192 / 1,024 / 1,024 = MB/sec/core
(221,180 / (5,187 / 1,000)) * 8,192 / 1,024 / 1,024 = 333.13 MB/sec/core

Based on this single test, we were getting up to 333 MB/sec per core in sequential throughput. Considering the rule of thumb indicated in the reference guides of ~200-250 MB/sec/core in throughput for a simple query, we’re outperforming it, but with an enhanced laptop and a relatively small fact table, that’s not unexpected.

The key thing is to know your MCR before you run your next series of tests for your benchmark consumption rate (BCR).

Getting Off the Bench

Your BCR uses the same formula as the MCR, but the stats should come from running a cornerstone query or basket of queries that represents the common workload for your data warehouse. They should also be run with no data pre-cached in the buffer pool. In fact, issuing a DBCC DROPCLEANBUFFERS command between each BCR test run is indicated, as it clears the buffer pool and tests not only the CPU, RAM and other host components but also the disk subsystem end-to-end. Also, according to the 2012 Fast Track reference guide, testing should start at MAXDOP 8 and go up incrementally to compare CPU utilization with overall runtime. This would be done on a query-by-query basis because, unlike the controlled test with a single table and a simple aggregate for the MCR, your BCR is based on real-world complexities. You can see a consumption rate of 50 MB/sec/core or lower if you’re doing multiple joins and in-line expressions.

Let’s run two queries based on the Orders table as before, but with a twist or two…

Test 1:
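
In outline, it goes something like this (the grouping and aggregate columns are illustrative; the pieces that matter are the buffer clear, the ROWCOUNT cap and the MAXDOP 8 hint):

DBCC DROPCLEANBUFFERS;   -- start from a cold buffer pool
SET ROWCOUNT 105;        -- cap the rows returned to the client at 105

SELECT O_ORDERDATE,
       COUNT_BIG(*)      AS order_count,
       SUM(O_TOTALPRICE) AS total_price
FROM   dbo.Orders
GROUP BY O_ORDERDATE
OPTION (MAXDOP 8);

SET ROWCOUNT 0;          -- reset the cap for the next run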

Note the clearing of the buffer pool and the increase in MAXDOP. There’s also a ROWCOUNT statement to short-circuit the return of the resultset to the client; it doesn’t affect the overall query workload, but ends the test as soon as 105 rows are written to the results grid so you can reset for the next test faster.

Here are the stats and BCR calculations…

(221,180 / (9,481 / 1,000)) * 8,192 / 1,024 / 1,024 = 182.25 MB/sec/core

Now this is more in line with the rule of thumb, and the drop is probably due to CXPACKET wait overhead. Note also that my physical reads were entirely of the read-ahead variety, so throughput was maximized.

Test 2:
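
Same sketch as Test 1, with the extra grouping column added:

DBCC DROPCLEANBUFFERS;
SET ROWCOUNT 105;

SELECT O_ORDERDATE,
       O_CLERK,           -- the additional grouping column
       COUNT_BIG(*)      AS order_count,
       SUM(O_TOTALPRICE) AS total_price
FROM   dbo.Orders
GROUP BY O_ORDERDATE, O_CLERK
OPTION (MAXDOP 8);

SET ROWCOUNT 0;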

This is essentially Test #1 with an additional grouping on O_CLERK. What could possibly happen?

(221,180 / (22,109 / 1,000)) * 8,192 / 1,024 / 1,024 = 7.83 MB/sec/core

Hello!? Where did this come from? A second ago we were cruising along at a decent 182 MB/sec/core and now it’s dropped to less than 8. Query complexity has a big impact on your BCR, and it will determine where you could sustain a bottleneck.

If your BCR is consistently lower than your MCR, you’ll probably end up being CPU-bound. This makes sense: for my second query, even though I just added one more grouping, it required a lot more CPU time to aggregate the same number of pages. You’ll end up running out of cycles before you use up all your disk throughput. If you can customize your solution, you can either tone down the disk subsystem to match the CPU or get higher clock speed CPUs, or more CPUs (with more SQL licensing) to achieve proper alignment.

On the flip side, if your BCR is consistently higher than your MCR, you’ll probably end up being IO bound. Your CPUs can handle the query complexity handily, but can your disk subsystem keep up? If you can customize, you’ll look at improving disk throughput to the SAN with HBA cards, adding more RAM to keep more data in the buffer for longer, or toning down the core count.

In ideal conditions, your BCR should be at least 80% of the MCR. The key is to test your hardware configuration with as realistic a query workload as possible. When you’re spending six figures on servers, SAN storage and SQL licensing, you want all the components matched to each other, with none of them underutilized.

Pole Position

For our new server, I ran a battery of tests using a 10 GB TPC-H Order Detail table. Here are the pertinent stats for the MCR testing, along with a simple BCR test of the same query with the higher MAXDOP and empty buffer pool…

The new server’s MCR is in the target zone indicated in the reference guide for simple query throughput. With 24 cores deployed, we’ll be able to get up to 5,900 MB/sec of throughput on simple fact table scans. Coupled with the fact that the DiskSpd tests indicated we can push up to 10,400 MB/sec, we’re confident it’ll handle the load for both the Production host and a UAT clone concurrently…as long as the data in the data warehouse is loaded in a fashion that maximizes the benefits of this hardware configuration…