Performance – Kendra Little's Blog

Confused by sp_who2 (Dear SQL DBA Episode 30)


This week’s ‘Dear SQL DBA’ question gets us down to the essentials

Recently, when I was checking if there are any hanging transactions in my database via the “sp_who2” procedure…

  • A transaction is “AWAITING COMMAND”
  • LastBatch date is more than a week ago
  • ProgramName is “SQLAgent – Generic Refresher”

Is the transaction hanging?

Learn the answer in this 15 minute video, or scroll down to read a written version of the answer. You can subscribe to my YouTube channel or the podcast (and I would love it if you left a review).

Let’s talk about sp_who2

I started out using sp_who2, also! And I was often confused by sp_who2.

sp_who2 is a built-in stored procedure in SQL Server.

  • Shows a lot of sessions, even on an idle instance
  • Doesn’t tell you much about what it’s doing
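
For reference, running it is a one-liner. You can optionally pass ‘active’ to leave out sessions that are just sitting there awaiting a command:

EXEC sp_who2;

--Or filter out idle sessions
EXEC sp_who2 'active';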

Here’s what an idle SQL Server looks like in sp_who2

This is my dev SQL Server instance. I have a few sessions open, but the only one which was executing anything was the one you see, where I ran sp_who2. I don’t have any agent jobs scheduled on this thing. It’s just waiting for something to happen.

It’s hard to know what to look at, because we see so much. And only 19 of the 49 sessions are on the screen, too.

Scrolling down, I can see a session similar to the one our question is about

Session 52 is a very similar case: it’s awaiting command, and it’s part of the SQL Server Agent.

I just started up my instance, so the “Last Batch” column isn’t long ago. But even if it’s more recent, can we tell if this is causing a problem? Does it have an open transaction?

We can use additional old school commands like DBCC INPUTBUFFER and DBCC OPENTRAN to find out

When I learned sp_who2, I also learned to use these two commands:

  • DBCC INPUTBUFFER (SPID) – what’s the last statement from the client? It returns only NVARCHAR(4000).
  • DBCC OPENTRAN – what’s the oldest open transaction in a database?

And these commands work. They’re worth knowing about, just for times when they come up. (Foreshadowing: I’ll show you a better way than all of this soon.)

DBCC INPUTBUFFER

I can plug the session id from the SQL Agent activity I saw in sp_who2 into this and get an idea of the last thing it ran.

But wow, this is inconvenient when I have more than one thing I’m interested in! And also, I still don’t know if it has an open transaction or not.
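
For example, to check on the SQL Server Agent session I spotted above, it’s just this (52 is the session id from my sp_who2 output):

--Show the last statement the client sent on session 52
DBCC INPUTBUFFER (52);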

DBCC OPENTRAN

I saw in sp_who2 that session 52 was in the msdb database. I can run DBCC OPENTRAN and check what the oldest active transaction is. In this case it tells me that there’s no open transaction, so session 52 seems like it’s OK.
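
That check looks something like this:

USE msdb;
GO
--Report the oldest active transaction in the current database, if there is one
DBCC OPENTRAN;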

That was a lot of steps, and it was pretty clunky.

The problem isn’t you. The problem is sp_who2. There’s a better way!

All the commands we’ve been talking about so far are from the SQL Server 2000 days. They’re old enough to drive.

In SQL Server 2005, Microsoft gave us a better way to do these things. They gave us Dynamic Management Objects. (There are views and functions.)

Microsoft regularly improves and updates these Dynamic Management Objects. They’re awesome! Commands like sp_who2 and friends are still there for backwards compatibility.

  • Major pros to using Dynamic Management Objects: way more information
  • Small downside: complex to write your own queries
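
To give you a taste of that downside, here’s a bare-bones roll-your-own sketch against a couple of the DMVs. (This is just my illustration of the idea; it’s nowhere near as useful as the tool below.)

--A minimal look at user sessions and whatever they're currently running
SELECT
	s.session_id,
	s.status,
	s.host_name,
	s.program_name,
	r.command,
	r.wait_type
FROM sys.dm_exec_sessions AS s
LEFT JOIN sys.dm_exec_requests AS r ON r.session_id = s.session_id
WHERE s.is_user_process = 1;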

Meet sp_WhoIsActive

That downside isn’t really a downside: Adam Machanic is a SQL Server expert, and he has a great free stored procedure that gives you all the DMV queries you need to see what’s running in your SQL Server.

Free download: whoisactive.com

Right out of the gate, sp_WhoIsActive doesn’t show you the stuff you don’t need to worry about

  • Only shows you a ‘sleeper’ if it has an open transaction
  • Unlike DBCC OPENTRAN, sp_WhoIsActive is instance level: you don’t have to run it per database

sp_WhoIsActive immediately shows that my idle instance has nothin’ goin’ on

Here’s what my lazy dev instance looks like in sp_WhoIsActive:

Nothing is actually running. It’s easy to see, and that’s great — because when something really is running, it makes it easy to see.

We can see sleepers if we want — even if they don’t have an open transaction

Sp_WhoIsActive has a boatload of parameters. If you want to see user processes who are sleeping and who don’t have an open transaction, use @show_sleeping_spids = 2.
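
That call looks like this:

--Include sleeping sessions, even ones without an open transaction
EXEC sp_WhoIsActive @show_sleeping_spids = 2;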

If we want to see the sql_text they ran, we can click on that column. No need to run DBCC INPUTBUFFER, it’s right there.

And if we scroll to the right, we can see all sorts of things like the calling program, the status (sleeping in this case), and lots more info. It’s everything sp_who2 shows you, plus more.

Sleepers are usually OK

A sleeping session without an open transaction…

  • Isn’t holding locks on a table
  • Uses very little resources
  • May be re-used by the application (by something like connection pooling)

What if I have hundreds or thousands of sleepers?

I have seen a few cases where something was going wrong in connection pooling, and there were hundreds and hundreds of sleeping sessions. The longer the SQL Server was on, the more there would be.

If these sessions were to become active all at once, then things could go very badly, so it’s kind of creepy to find thousands of sleepers: it’s like looking out at a field of zombies. It’s worth addressing, and usually means connection pooling isn’t configured properly on the application servers.

SQL Server does have a maximum number of connections, by the way: 32,767.

But I’m getting a little far away from the question, which was about one sleeping session.

What if my sleeper does have an open transaction?

I cover how to diagnose and treat “problem” sleepers who have open transactions in my new course, “Troubleshooting Blocking & Deadlocks for Beginners.”

This is a new course on SQLWorkbooks.com, and a 6 month enrollment is currently free. I can’t promise that it’ll be free forever, so get in there and enroll.

Thanks to Adam Machanic for writing sp_WhoIsActive and supporting it for many years!

Got a question for Dear SQL DBA?

I’d love to hear it! Submit it anytime at https://www.littlekendra.com/dearsqldba.


Index Maintenance and Performance (Dear SQL DBA Episode 38)


They made their index maintenance job smarter, and their queries got slower in production afterward. Could the index maintenance have harmed performance? In this 29 minute episode…

  • 00:50 Thinking about plan freezing in Query Store and multi-team process
  • 03:15 This week’s question about index maintenance and query performance

Subscribe to my YouTube channel, or check out the audio podcast to listen anywhere, anytime. Links from this episode are in this post below the video and in the YouTube description.

Links and further reading from the show this week…

Free, configurable index maintenance options:

You may enjoy reading…

Are Bad Statistics Making My Query Slow? (Dear SQL DBA Episode 39)


An important query is suddenly slow. Is it because statistics are out of date? This is tricky to figure out, and updating statistics right away can make troubleshooting even harder. Learn how to use query execution plans to get to the heart of the question and find out if stats are really your problem, or if it’s something else.

In this 35 minute episode:

  • 00:39 SQL Server 2017 Announced
  • 01:10 New video from Microsoft’s Joe Sack demonstrating Adaptive Query Processing
  • 03:05 This week’s question: Are bad stats making my query slow?
  • 05:26 Demo of finding plan in cache and analyzing stats begins
  • 28:17 What to do when stats ARE the problem

Code samples: https://gist.github.com/LitKnd/f07848d59cedc61fd057d12ab966f703

Related links

SQL Server 2017 Adaptive Query Processing video by Joe Sack

Michael J Swart on finding Dark Matter Queries

Slow in the Application, Fast in SSMS? An SQL text by Erland Sommarskog

Got a question for Dear SQL DBA? Ask!

Finding Plans and Stats for Queries like '%something%'


I often need to find a query plan in the cache for a process that has run long overnight. Typically I’ll be able to figure out from our logging some of the tables involved in the query. Sometimes I will have most of the executing text but won’t know exactly what dates or reference points were included.

Even when I have enough information to get an estimated plan, it’s usually really helpful if I can pull the actual plan out of the cache along with runtime statistics.

The query below is what I use at this point to try to find these plans– I also sometimes use it just to look for long running queries in general.

One note to remember: the last_execution_time field is the time the plan last started executing. So if you’re looking for a query that ran for an hour, this time marks the beginning of that execution. (The logging on my systems is done after a batch of activities completes, so I always have to do a bit of work to figure out approximately when the activity would have started and look around that time for the plan.)

--Query plans and text looking for a given pattern

SELECT TOP 100
	qs.Plan_handle
	, cp.objtype
	, qs.last_execution_time
	, cp.useCounts
	, st.text
	, qp.query_plan
	, lastElapsedTimeMinutes = cast(qs.last_elapsed_time/1000000./60. as decimal(10,2))
	, maxElapsedTimeMinutes= cast(qs.max_elapsed_time/1000000./60. as decimal(10,2))
	, totalElapsedTimeMinutes= cast(qs.total_elapsed_time/1000000./60. as decimal(10,2))
	, totalWorkerTimeMinutes=cast(qs.total_worker_time/1000000./60. as decimal(10,2))
	, lastWorkerTimeMinutes=cast(qs.last_worker_time/1000000./60. as decimal(10,2))
	, qs.total_physical_reads
	, qs.total_logical_reads
	, qs.total_logical_writes
	, qs.last_physical_reads
	, qs.last_logical_reads
	, qs.last_logical_writes
FROM sys.dm_exec_query_stats AS qs
JOIN sys.dm_exec_cached_plans cp on
	qs.plan_handle=cp.plan_handle
CROSS APPLY sys.dm_exec_sql_text (qs.sql_handle) as st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) as qp
where
	st.text not like '%sys.dm_exec_query_stats%' --ignore queries looking for the plan
	and st.text like '%InvalidPlacementAdDay%'  -- look for queries against this table
	-- and cp.objtype <> 'Proc' --optional restriction by type
	-- and cast(qs.max_elapsed_time/1000000./60. as decimal(10,2)) > 10 --optional restriction by longest time run
	--and last_execution_time > dateadd(hh,-1,getdate())
ORDER BY
	last_execution_time DESC
GO

Average Daily Job Runtime


Here’s a query I found useful today– this week we moved many of our production datamart servers to SQL 2K5 SP3 CU4, and today, in the course of chasing other issues, I wanted to take a look at my job runtimes to see if they were noticeably slower or faster than prior runs. I’m often in a similar situation after deploying significant changes to our codebase.

Since most of my processing runs in SQL agent jobs, looking at average runtime per day is a pretty convenient index of performance. However, the load in processing varies by day of week, so it’s frequently useful to check activity for only a certain day of the week.

This script allows for both. I usually want to tweak the conditions, so I don’t set them in variables at the top, I edit them within the query itself each time:

use msdb;

select
	d.jobname
	,d.servername
	, avgDurationMinutes=avg(d.durationMinutes)
	, daydate=convert(char(10),startdatetime,101)
from (
	select
		jobname=j.name
		,servername=server
		,startdatetime=
			CONVERT (DATETIME, RTRIM(run_date))
			+ (
				run_time * 9
				+ run_time % 10000 * 6
				+ run_time % 100 * 10
			) / 216e4
		, durationMinutes=
				(CAST(SUBSTRING((right('0000000' + convert(varchar(7), run_duration), 7)),1,3) AS INT) * 60 * 60
				 + CAST(SUBSTRING((right('0000000' + convert(varchar(7), run_duration), 7)),4,2) AS INT) * 60
				 + CAST(SUBSTRING((right('0000000' + convert(varchar(7), run_duration), 7)),6,2) AS INT)
				)/60.

		,enddatetime =
		dateadd
			(ss,
				(CAST(SUBSTRING((right('0000000' + convert(varchar(7), run_duration), 7)),1,3) AS INT) * 60 * 60
				 + CAST(SUBSTRING((right('0000000' + convert(varchar(7), run_duration), 7)),4,2) AS INT) * 60
				 + CAST(SUBSTRING((right('0000000' + convert(varchar(7), run_duration), 7)),6,2) AS INT)
				)
			,
			(CONVERT (DATETIME, RTRIM(run_date))
			+ (
				run_time * 9
				+ run_time % 10000 * 6
				+ run_time % 100 * 10
			) / 216e4 )
			)
		, retries_attempted
	from sysjobs j (nolock)
	join sysjobhistory h  on
		h.job_id = j.job_id
		and h.step_id = 0 -- look only at the job outcome step for the total job runtime
	where
		j.name in ('JobName')  -- Set the job name here

) d
where
	datepart(dw,startdatetime)=7 -- Set  your day of week here if desired. 7=Saturday
group by
	d.jobname
	,servername
	,convert(char(10),startdatetime,101)
order by
	d.jobname
	,servername
	,cast(convert(char(10),startdatetime,101)as datetime) desc

Who’s Using All that Space in tempdb, and What’s their Plan?


Whatcha Doing in My TempDb???

This post contains a script that I adapted from the fantastic SQL Server Storage Engine Blog.

It comes in handy in my job all the time! Sometimes tempdb is filling up, but sometimes I just want to monitor the amount of tempdb space in use and check out execution plans of heavy tempdb users while watching performance on a server. It just really comes in handy more frequently than I would have thought before I started using it.

Note: This script returns space used in tempdb only, regardless of the db context it’s run in, and it only works for tempdb.

The Sample Code

and here it is…

--Modified from http://blogs.msdn.com/sqlserverstorageengine/archive/2009/01/12/tempdb-monitoring-and-troubleshooting-out-of-space.aspx

select
    t1.session_id
    , t1.request_id
    , task_alloc_GB = cast((t1.task_alloc_pages * 8./1024./1024.) as numeric(10,1))
    , task_dealloc_GB = cast((t1.task_dealloc_pages * 8./1024./1024.) as numeric(10,1))
    , host= case when t1.session_id <= 50 then 'SYS' else s1.host_name end
    , s1.login_name
    , s1.status
    , s1.last_request_start_time
    , s1.last_request_end_time
    , s1.row_count
    , s1.transaction_isolation_level
    , query_text=
        coalesce((SELECT SUBSTRING(text, t2.statement_start_offset/2 + 1,
          (CASE WHEN statement_end_offset = -1
              THEN LEN(CONVERT(nvarchar(max),text)) * 2
                   ELSE statement_end_offset
              END - t2.statement_start_offset)/2)
        FROM sys.dm_exec_sql_text(t2.sql_handle)) , 'Not currently executing')
    , query_plan=(SELECT query_plan from sys.dm_exec_query_plan(t2.plan_handle))
from
    (Select session_id, request_id
    , task_alloc_pages=sum(internal_objects_alloc_page_count +   user_objects_alloc_page_count)
    , task_dealloc_pages = sum (internal_objects_dealloc_page_count + user_objects_dealloc_page_count)
    from sys.dm_db_task_space_usage
    group by session_id, request_id) as t1
left join sys.dm_exec_requests as t2 on
    t1.session_id = t2.session_id
    and t1.request_id = t2.request_id
left join sys.dm_exec_sessions as s1 on
    t1.session_id=s1.session_id
where
    t1.session_id > 50 -- ignore system unless you suspect there's a problem there
    and t1.session_id <> @@SPID -- ignore this request itself
order by t1.task_alloc_pages DESC;
GO

SQL PASS Day 1: To Free or Not To Free the Proc Cache?


Yesterday was day 1 of SQL PASS 2009. I am attending a variety of sessions on execution plans this year, and along the way I heard three very different opinions yesterday on managing the procedure cache in presentations.

Rule of Thumb: The “it depends” answer is usually right.

Opinion 1: Never Ever Clear the Proc Cache on a Production Server

This first opinion came in a good, solid presentation on using execution plans for troubleshooting. There were some good examples of when you want sql to look at the statistics and trigger generating a new plan, and when you don’t. (AKA when parameter sniffing is a good or a bad thing.) But the speaker was wholeheartedly against clearing the proc cache in production.

While I can definitely see this being true for some systems, I have definitely seen advantages of clearing the proc cache on others (more to come below), so I already knew this was too simple an answer for me– at least until I’ve solved the problems I have with out of date statistics on frequently modified large tables.

(Thanks to Grant Fritchey for a great presentation.)

Opinion 2: Be Free, Procedure Cache, be Free!

This second opinion came in a session on using DMVs to troubleshoot performance. This session was even geared toward OLTP systems, and the speaker said he regularly frees the procedure cache on his production sql servers at a given interval. He sees slight CPU pressure after doing so, but has the benefit of being able to capture and trend exactly what procedures go into the cache using the DMVs afterward (with the benefit of clean timestamps).

So in his environment, he has no issues clearing the proc cache.

(Thanks to Dr.DMV for a great talk!)
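
For reference, freeing the entire plan cache is a single command. Expect a burst of compilations (and CPU) right afterward, so measure that cost before making it a habit:

--Removes all plans from the plan cache
DBCC FREEPROCCACHE;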

Opinion 3: It Depends: Check the Size of Your Proc Cache, Free if You Need To (and can handle the CPU for Recompilations)

The third speaker (Maciej Pilecki) talked about looking at the total size of the proc cache, and stressed that as this cache grows, it can steal space from the buffer pool. For each system, you should look at the size of the procedure cache and the amount of execution plan reuse you are getting on the system.

There are two main performance benefits to plan reuse (whether parameterized adhoc queries or procedure queries):

  • Speed: (recompiling takes time and CPU resources)
  • Smaller proc cache / More room for buffer pool to hold data in memory

Bonus: Maciej also mentioned how the ‘optimize for ad hoc workloads’ option in SQL Server 2008 can help alleviate bloat of the adhoc portion of the plan cache. When enabled, this will only cache ad hoc plans on their second run– for the first run, SQL Server will just store a small stub recording that the query was executed once.
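
If you want to try it, it’s a standard sp_configure setting. (A sketch; test the impact on your own workload first.)

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

--Cache only a small stub for an ad hoc plan until its second execution
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;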

I really enjoyed these sessions, and one of the great things about PASS is the opportunity to hear and synthesize different perspectives on these topics.

Love it!

See also: Maciej Pilecki’s post on clearing only the adhoc part of the cache
Lara Rubbelke’s post on memory pressure and the proc cache

SQLPASS Day 2- Optimization Timeouts and All about TLogs


SQLPass unfortunately can’t last forever, but happily it’s still going strong. Here’s some highlights from my Day #2.

Paul Randal Knows Exactly What’s Going on in Your Transaction Log…

A definite highlight of day 2 was attending Paul Randal‘s session on Logging and Recovery in SQL Server. I’ve read Paul’s blog posts on this topic and attended his classes before, but even being familiar with the material I find I always learn something from his talks. You just can’t beat being strong on the basics!

I took a lot of notes in the session, this is my favorite excerpt from my notes:

  • SQL Server must reserve space in the TLOG so that it can roll back the active transactions, if needed.
  • Once a VLF no longer contains log records that are required, it can be cleared
  • This is done by a log backup in the full or bulk_logged recovery models, or by a checkpoint in simple
  • All that happens when a VLF is “cleared” is that it is marked as inactive
    • Nothing is cleared at that time
    • Nothing is truncated
    • Nothing is overwritten
    • The log file size does not change
    • The only thing that happens is that whole VLFs are marked inactive if possible (no active transactions)

Ben Nevarez asks, “How You Doing, Optimizer?”

One of my favorite pieces of information on day 2 was in Ben Nevarez‘s talk on how the query optimizer works. He mentioned this DMV, which I hadn’t used before yesterday:

sys.dm_exec_query_optimizer_info (check it out!)

The other useful bit of info is that the timeout flag is recorded in the xml for the sql plans, so plans which the optimizer finds so complicated that it times out on compilation can be queried from the cache!

SQLPASS homework assignment: Write and test this query, determine how to automate running it and collecting the information.
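
One way to start on that homework might look like the sketch below. This is my own untested starting point, not the finished assignment. Note that querying the plan XML across a large cache can be expensive, so be careful where you run it.

--Look for cached plans where optimization ended early due to a timeout
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP 50
	cp.usecounts,
	cp.objtype,
	st.text,
	qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE qp.query_plan.exist('//StmtSimple[@StatementOptmEarlyAbortReason="TimeOut"]') = 1;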

Sample Queries

This sample from BOL helps find excessive compiles/recompiles:

select *
from sys.dm_exec_query_optimizer_info
where counter = 'optimizations'
or counter = 'elapsed time'

See Also…

Ben Nevarez on the Phases of Query Optimization

Conor Cunningham’s Blog on sys.dm_exec_query_optimizer_info— excerpt:

“The other use of the DMV is to get a good statistical picture of a running system.  Say that I’m a DBA and I want to know how many queries in my application have hints or _need_ hints to work well.  Well, this will tell you.  Granted, it doesn’t separate recompiles from compiles, and if you have a system where plans are getting kicked out of the cache things may be a bit skewed, but I can tell you that this is far better than simply guessing.  Often the DB application developer doesn’t realize that they’ve built an application that requires a lot of hinting or a lot of compilations, and you can see this in more detail than you get with the performance counters”


Review: A day of doing many things at once with @AdamMachanic


A day of doing many things

At SQLPass this year I was fortunate to attend “A day of doing many things at once: Multitasking, Parallelism, and Process distribution” by Adam Machanic (blog | twitter). This was a day long post-conference.

So, how was it?

This was a fantastic seminar. There was a really good flow to the talk, which started with CPU background and architecture, then moved through Windows Internals, SQL Server internals, and on to specifics of parallelism in queries. Then we finally moved on to administration topics, as well as different methods of process distribution. A full outline of the day is here.

I think the presentation worked very well because of the balance of theory and practice. Essentially, there was a very good ratio between ‘what’, ‘why’, and ‘how’.

I’ll look back at the outline for this seminar when designing longer presentations myself.

Did I learn anything useful?

Yes! The information on plan shapes and tricks to manipulate them was incredibly interesting, and is something I know will be useful. I also learned some interesting specifics about how the DAC works, and have a much more holistic view of how SQL Server uses processors and parallelism. Check out my tweets below for a little more insight into what my day was like!

Free webcasts. Yep, free.

Adam has some webcasts on parallelism available for download which you can watch for free.

My tweetstream from the session…

Here’s what my day was like, according to Twitter.

  • Postcon fun with @AdamMachanic today! #sqlpass Processes do not run, *threads* do.
  • Quick discussion of fiber mode for SQL Server: very limiting (http://bit.ly/bn6RoK)
  • Thread starvation: pre-emption by high priority threads can prevent some threads from ever running.
  • Threads running on client OS get a smaller amount of quantum units than on a server os (more frequent interrupt frequency)
  • Three types of processor affinity: none, hard affinity, and ideal affinity
  • Lots of love for sysinternals (http://bit.ly/WPxha) and the CPU-Z tool (w/ props to @BrentO for recommending http://bit.ly/1iBcg6)
  • Interrupt counts include not just when a quantum expires, but also when a thread finishes.
  • Lots of cool WMI queries being run from inside SSMS
  • Mine is still getting even better 🙂 RT @whimsql: Amen Tom! RT @SQLRockstar Best. Summit. Ever.
  • Meeting the SQLOS! It’s a “cooperative” scheduling system: everyone’s equal
  • SQLOS provides an abstraction layer so storage engine, qp, etc can all talk to it instead of directly to the OS
  • Proc Affinity at sql server level may be worth testing w/ multi instances. With virtualization taking predominance, is less common.
  • Differences between resource waits and signal waits being explained
  • 484 Wait types in SQL Server 2008– plug for #sqlhelp hash tag for those with limited documentation.
  • I totally just got called on in a “what feature uses a hidden scheduler” Pop Quiz. #FAIL
  • @PaulRandal yep, we were all “so THAAAAAAAT’S how that works.”
  • Don’t think of operators in QPs as being parallelized. Think more of each set of rows as being parallelized.
  • Very few iterators are actually parallel-aware. Most do not need to be, even if being used by parallel streams.
  • OH: “I trust myself, but I don’t know if you should.” <– always an appropriate comment when referring to production environment
  • And now we return to our discussion of the “Big O” and the Query Processor.
  • We just covered tempdb spills and @crysmanson ‘s old enemy, the resource_semaphore wait type.
  • Few outer rows demo showing repartitioning scheme and rows redistributed on threads– very cool
  • Verrrrrrry interesting stuff with CROSS APPLY and parallelism
  • Cost threshold for parallelism default is still what it was set originally in 7.0, for many contemporary systems it may be too low.
  • And that makes me happy to hear since we do raise the default cost threshold for parallelism on our prod servers 🙂
  • @AdamMachanic just actually turned it up to 11.
  • If you hit THREADPOOL waits, don’t just up the max worker threads permanently, find the root cause for the situation.
  • Finishing up with a monitoring parallelism section — really nice flow to the talk today!
  • Piles o’ DMV fun, including the reason sys.dm_exec_requests has some funkiness: it shows wait state only for the root task
  • @AdamMachanic is demoing how sp_whoisactive will display your wait types, find your tempdb contention, and wash your dishes.
  • Demo of manipulating memory grants to cause a query to spill to tempdb purposefully… we’re not in kansas anymore.
  • @TheSQLGuru I’ve enjoyed it a ton– great combo of really interesting demos and information.


The 9th Day of SQL: Things Aren't as Simple as They Seem


The 12 days of SQL

Brent Ozar (blog | twitter) had an idea: a group of people should blog about writing which they’ve loved this year by people in the SQL community. For each “day of SQL,” someone picks a blog which they thought was great and writes about it.

Yesterday was Day 8, when Karen Lopez talked about a post by Louis Davidson and asked “What is your over/under?”  Karen is a great speaker, an inspiring writer, and just an incredibly interesting person. Check out her post!

On the 9th Day of SQL the engine gave to me: Something a little different than I expected.

Day 9: The Day of Paul White

This day of SQL is not about nine ladies dancing. (Sorry Karen!) Instead, it’s devoted to one New Zealander writing: his name is Paul White (blog | twitter).

First off, let me say that Paul White’s blog, “Page Free Space,” is just plain awesome. When I see Paul’s written a new post I know to allocate some time for it and read it through slowly, and that I should expect to have to think about what I’m reading to understand it.

I swear I can sometimes feel things moving around in my head when I read Paul’s posts. Apply the warning about overhead bins during flight: be careful, contents may shift while you’re reading Paul White’s blog.

So What’s My Favorite Post of the Year?

I picked Paul’s post, The Case of the Missing Shared Locks.

There’s a lot to love about this post. It is a great demonstration that things aren’t as simple as they seem.

Paul starts the post with the question:

If I hold an exclusive lock on a row, can another transaction running at the default read committed isolation level read it?

The answer to that would seem fairly straightforward. But in fact, things are pretty complicated. However, if you go through it slowly and really look at the examples, it can help you understand a lot about locking.

This is good.

Why is it Good that Things Are So Complicated? It’s CONFUSING.

Have you ever said something along these lines? “I’d like to give a presentation sometime, but I don’t have anything to talk about.”

Or, “I’m not sure that I have anything that interesting to fill a whole hour.”

Well, take a look at Paul’s post. He took something small, and he looked very closely at it. He played with it a couple of different ways, and he worked on it to see how it behaved. He stepped through it in a series of short, straightforward steps.

You can do the same thing with many things you’re familiar with. You can take a topic, or a feature, or a method of doing something and distill it into an interesting question. You can then look closely at the question and work with it carefully. Use it as a chance to explore something. You’re probably familiar with it, but by taking the time to write about it or present it, you’ll have the opportunity to get to know it better than you ever thought you could.

Who’s Next?

I’m handing the dreidl off to Crys Manson (blog | twitter) for Day 10.

Crys is a seriously great DBA, a fantastic friend, and she sometimes makes me snort liquid through my nose laughing.

Tag, Crys, you’re it!

How’d We Get Here?

If you want to check out where we’ve been so far, we’ve had:

A Little Present

You don’t need to be Jewish for this to be your favorite holiday song this year. Rock on with the Maccabeats, y’all. (You will need to click the “watch on YouTube” link.)

http://www.youtube.com/watch?v=qSJCSR4MuhU

Date Rounding Tactics and the Tiny Devil of SMALLDATETIME


Tiny Devils

With every new year I think a little bit about time and dates. This post looks a little more at that in TSQL.

Rounding Dates: Which way is best?

Sometimes in TSQL you need to round a datetime value to the precision of either a day, hour, minute, or second.

I realized recently that I have a few ways I know how to do this, but I wasn’t sure which was the most efficient.

I did a little searching and didn’t find anything super conclusive. I had a little chat with Jeremiah Peschka (blog | twitter) and he told me which way he thought was fastest and why.

And so I decided to run some tests. Jeremiah has a way of being right about these things, but I had to see for myself.

I’ll go ahead and tell you: He was totally right, and I’ll show you why. But I learned a couple things along the way.

Reference: Types and Storage Sizes

To get started, let’s review some of our friends, the SQL Server datatypes. Hi friends!

Rounding to the Day

The most frequent case in which I need to round dates is to the day level. So instead of ‘1/4/2011 6:15:03.393921’, I want just ‘1/4/2011’.

SQL 2008’s date type made this a lot easier for everyone– now we can just cast a datetime or datetime2 value as a date, and we’ve got what we need. PLUS, our new value is nice and small, weighing in at 3 bytes.

I think most everyone agrees, we like this!
SELECT CAST('1/1/2010 23:59:59.000' AS DATE) AS [I'm a date!]

Rounding to Hours, Minutes, or Seconds:
Beware the tiny devil of SMALLDATETIME

This is still a bit more complicated. When you start thinking about these and different datatypes, you need to make sure you understand what you mean by rounding.

In SQL Server, our datatypes actually have some different opinions about what rounding means. Check this out:

SELECT
CAST('1/1/2010 23:59:59.000' AS DATETIME) AS [I'm a DATETIME!],
CAST('1/1/2010 23:59:59.000' AS DATETIME2(0))  AS [I'm a DATETIME2(0)!'],
CAST('1/1/2010 23:59:59.000' AS SMALLDATETIME) AS [I'm a SMALLDATETIME, and I'm very confused.],
CAST('1/1/2010 23:59:59.000' AS DATE) AS [I'm a DATE!]

This returns:

The SMALLDATETIME value rounds this up to January 2nd, instead of January 1. The Date datatype does not.

In considering whether or not to use SMALLDATETIME, you need to establish whether you want minute and date values rounded up. To take a different example: if something occurred at 12:30:31 AM, should that be represented as having happened in the 12:30 minute, or at 12:31?

Most of us actually want to round down. We want the largest minute number which is less than or equal to the datetime value. This is similar to what FLOOR does for integers. You could also call this truncating the portion of the datetime value you don’t want.  This is not, however, what SMALLDATETIME gives you, so use it with care.

So this is what I’m saying:

Like, seriously, SMALLDATETIME: you are SO messed up.

Comparing Methods of Rounding Dates

So given that warning, let’s actually round some date values, and let’s compare the efficiency of each method.

To start out with, let’s create a table and toss in a bunch of date values. We’ll run queries against these dates and measure SQL Server’s abilities to work with it.

To make up a bunch of datetime data, I’m using my trusty recursive CTE from my prior post.

--Populate a table with some data
CREATE TABLE dbo.Donuts ( DonutTime DATETIME2(7) )

DECLARE
@startDate DATETIME2(7)= '2010-12-01 00:00:00' ,
@endDate DATETIME2(7)= '2010-12-11 01:30:00' ;

WITH MyCTE AS (
	SELECT @startDate AS [Makin' the Donuts]
	UNION ALL
	SELECT DATEADD(ms, 1225, [Makin' the Donuts])
	FROM MyCTE
	WHERE [Makin' the Donuts] < @endDate
)
INSERT dbo.Donuts
SELECT [Makin' the Donuts]
FROM MyCTE
OPTION ( MAXRECURSION 0 ) ;

SELECT @@ROWCOUNT
--We now have 709716 rows of DonutTime

Now let’s look at different methods to manipulate datevalues. For our examples I’ll be rounding to the minute.

Contestant 1 –
DATEPART: isolate each part of the date, then concatenate

As we learn TSQL, this is the first method that occurs to us. We know DATEPART will return part of a date (great name!), so we can chop apart the bits. However, to get them back together properly, we have to turn each part into a string and glue them back together. And then if we want to treat it like a date (which we pretty much always do), we have to cast it back.

Just look at this baby. It’s pretty ugly.

SELECT
CAST(CAST(DATEPART(YY, DonutTime) AS CHAR(4)) + '-' + CAST(DATEPART(MM, DonutTime) AS NVARCHAR(2)) + '-'
+ CAST(DATEPART(DD, DonutTime) AS NVARCHAR(2)) + '  ' + CAST(DATEPART(hh, DonutTime) AS NVARCHAR(2)) + ':'
+ CAST(DATEPART(mi, DonutTime) AS NVARCHAR(2)) + ':00.000' AS DATETIME2(0)) AS [Wow, that was a lot of typing.]
FROM
dbo.Donuts

Running this (after cleaning out buffers), I got these results:

Contestant 2 –
Subtracting what you don’t want

There’s a couple of variations on contestant #2. I’ll take the one I like best, which is casting to a smaller byte size by using DATETIME2(0), which is 6 bytes rather than 8 and effectively truncates to the second. Then I’ll subtract the seconds.

SELECT
DATEADD(ss, -DATEPART(ss, DonutTime), CAST (DonutTime AS DATETIME2(0)))
FROM
dbo.Donuts

Running this one (yes, I cleaned out the buffers), I got these results:

Well now, that’s much lower CPU time there.

NB: I did test, and in all my trials it was lower CPU time to cast into DATETIME2 rather than using a nested DATEADD function to subtract milliseconds.

Contestant 3-
Convert to a shorter character string, then back to date

This contestant is near and dear to my heart. I like it because it’s easy for me to remember. You take a short trip into CHAR() with the 121 date format and set the length to chop off the parts of the date you don’t want. Then you cast or convert back to a DATETIME2(0).

I think I like this one because it feels just a little bit violent. But not in a bad way. It’s like roller derby.

SELECT
CAST(CONVERT(CHAR(16), DonutTime, 121) AS DATETIME2(0))
FROM
dbo.Donuts

Oh, sad. This one didn’t do very well. It’s definitely better than Contestant #1, at least.

Contestant 4-
Use DATEADD to calculate the minutes since a given date, then add them back

Here’s the method Jeremiah suggested to me. The way he described it was “Just figure out the number of minutes since the beginning of time, and use that.”

Being a philosophy major, I of course asked “So, when was the beginning of time?”

Being a developer, he answered, “Just call it zero.”

SELECT
DATEADD(mi, DATEDIFF(mi, 0, CAST(DonutTime AS DATETIME2(0))), 0)
FROM
dbo.Donuts

Here are the results (clean buffers, as usual):

Ooo, check out the CPU time on that one.

Note: I ran a few trials and this is faster on the CPU when you cast as DATETIME2(0) before doing your maths. I did that to make all things equal with the other contestants, who had the same benefit.

Who Won, and Why

Here’s a recap of how everyone performed:

Why did contestants 2 and 4 do so well?

Jeremiah pointed out that datetime values are stored internally as two four-byte integers. (BOL reference: see “Remarks”) Performing mathematic functions on an integer value is a nice fast activity on the CPU.
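
One quick way to peek at those two integers (my own illustration): cast a datetime to BINARY(8). The first four bytes are the number of days since 1900-01-01, and the last four are the number of 1/300-second ticks since midnight.

--Peek at the raw 8 bytes behind a datetime value
SELECT CAST(CAST('2011-01-04 06:15:03' AS DATETIME) AS BINARY(8)) AS datetime_bytes;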

Performing conversions back and forth to character based datatypes, however, is not so natural, nor so fast.

What’s the internal storage format of DateTime2?  Well, I’m not sure about that one. BOL isn’t so up-front about these things anymore. If you happen to know, please tell me in the comments. I can tell, however, that it’s something that enjoys mathematics.

Read from the Right End of the Index: BACKWARD Scans


Optimizing queries is the most fun when you don’t need to add indexes. There’s nothing quite so nice as finding a way to make reading data faster, without slowing down writes or creating new data structures that need to be maintained.

Here’s one way you can use BACKWARD scans to do this.

The Scenario: Clustered index on an increasing integer, and you’d like recently created rows

This is a common enough situation: you have a table with a clustered index on an integer value which increases with each row. You have another column which records the date the row was created.

You frequently want to query the most recently created rows over some period of time.

The table has very frequent inserts, so for performance reasons you want to use the minimal indexes required. (And in general, this is the best practice.)

Question: Do you need to add a nonclustered index on the column containing the date the row was created?

Answer: Maybe not!

Getting the right clustered index scan

Say we’re working with the following table, which we have filled with five million rows of Tweetie birds. (Note: the data generation uses a tally table population technique which I found on Stack Overflow, attributed to Itzik Ben-Gan.)

CREATE TABLE dbo.Birds (
    birdId INT NOT NULL ,
    birdName NVARCHAR(256) NOT NULL,
    rowCreatedDate DATETIME2(0) NOT NULL )
GO	

--Insert 5 million Tweetie birds
--Make them as if they were all created a minute apart.
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  Tally as (select row_number() over(order by C) as Number from Pass5)
INSERT dbo.Birds (birdId, birdName, rowCreatedDate)
SELECT Number AS birdId ,
    'Tweetie' AS birdName ,
    DATEADD(mi, number, '2000-01-01')
FROM Tally
WHERE Number <= 5000000

--Cluster on BirdId. We won't add any other indexes.
CREATE UNIQUE CLUSTERED INDEX cxBirdsBirdId ON dbo.Birds(BirdId)

Say we would just like to see the maximum value in the rowCreatedDate column.

The most basic way to get this row is with this query:

SELECT MAX(rowCreatedDate)
FROM dbo.Birds

However, that leads to a table scan. We get lots of reads: 22,975 logical reads and 201 physical reads.

If we know we have a strong association between the BirdId column and the RowCreatedDate column, and that the highest ID in the table is the most recent row, we can rewrite the query like this:

SELECT MAX(rowCreatedDate)
FROM dbo.Birds
WHERE birdId = (SELECT MAX(birdId) FROM dbo.Birds)

This query still does a clustered index scan, yet it does only 3 logical reads and 2 physical reads.

Looking in the execution plan, our query was able to use the extra information we provided it to scan the index backwards. It stopped when it had everything it needed, which was after a short distance– after all, it only needed recent rows, and those are all at one end of the table.

This backwards scan can be very useful, and it can make the MAX aggregate very efficient.

But you usually need more than just the max value…

To see a bit more about how you extend this logic, compare these three queries:

Query A

This makes you think you need that non-clustered index: it does 22,975 logical reads, 305 physical reads, and 22968 read-ahead reads.

--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS

SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE rowCreatedDate >= '2009-07-01 05:00:00'

Query B

We can introduce the backwards scan by adding an ORDER BY BirdId DESC to the query. Now we get 23019 logical reads, 47 physical reads, and 22960 read-ahead reads.

--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS

SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE rowCreatedDate >= '2009-07-01 05:00:00'
ORDER BY birdid desc

Query C

This last query gives the optimizer extra information about using BirdId to do a BACKWARD scan to grab the maximum BirdId, and then use that to do a BACKWARD seek of the clustered index in nested loops to get the data. It does only 50 logical reads, 4 physical reads, and 817 read-ahead reads.

--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS

SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE birdId >=
	(SELECT MAX(birdId)
	FROM dbo.Birds
	WHERE rowCreatedDate <= '2009-07-01 05:00:00')
AND rowCreatedDate >= '2009-07-01 05:00:00'
ORDER BY birdId DESC

Be Careful Out There

The examples I’m using work because there is a correlation between the integer field and the date field. Not all tables may be like this. As with all queries, you need to be familiar with your data.

Consider Your Options– Even the Ones You Don’t Think Are Great

I’m quite sure BACKWARD index reads are covered in some talks and publications on tuning. But I learned about this by considering multiple approaches, even those I didn’t think would work at first. It pays to try things out, and you can learn a lot by looking carefully at execution plans (including the properties) and your STATISTICS IO output.

What this means to me: it’s good to keep an open mind.

Dirty Pages and Statistics IO


Warning: The DROPCLEANBUFFERS command referenced in this post is appropriate for test systems only and impacts the entire SQL instance. If you are new to SQL Server, please use this command with care, and be careful to read the linked Books Online documentation. Happy testing!

You were hoping for a picture, right?

The other day I was running some test queries and looking at the number of reads, and I noticed something funny.

I was dropping clean buffers prior to running a query, but I would sometimes see that there had been no physical reads.

No physical reads? Where was the data coming from?

I was working on a small number of rows, but it still bothered me.

The output looked like this:

The Set-Up

Here’s a simple simulation of what I was doing. First, create a database and insert some values.

SET NOCOUNT ON;
SET STATISTICS IO OFF;
create database dirtyBuffers
GO
USE dirtyBuffers
GO
--Create a table and insert some values
create table dbo.testme (
	i int identity,
	j char(2000) default 'baroo'
)
GO
insert dbo.testme default values
GO 20

Then, turn on Statistics IO so we can see read information. Drop clean buffers, so data isn’t in memory. Then run a query.

SET STATISTICS IO ON;

DBCC DROPCLEANBUFFERS
GO

--Select some rows
select * from dbo.testme

It should read it from disk, right?

What I Forgot

I was forgetting about dirty pages. In order to get a “cold cache”, you need to first run a CHECKPOINT command to flush dirty pages to disk, then run DBCC DROPCLEANBUFFERS to remove everything from the bufferpool. This is very well documented in Books Online.

This was easy to forget because typically I test execution of queries against a restored copy of a production database, or a dataset which isn’t changing.

What I Hadn’t Realized

I don’t think I ever specifically realized that dirty pages could be immediately re-used for query results– but it makes perfect sense. I had only thought about clean pages, which were read in by one query, being available for re-use.

I felt a little silly when I realized this. Shouldn’t I have known this? But after thinking about it I realized: there’s little gaps like this in most everyone’s knowledge. Sometimes it takes a little bit of extra experience to notice the gap and fill it in. It happens to us all.

After rerunning the commands and including a CHECKPOINT with DBCC DROPCLEANBUFFERS, I see the expected output– a physical read.
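
In script form, the fix is simply:

CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO

--Select some rows: this time the data really does come from disk
select * from dbo.testme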

The Magic of the Self-Assigned Lab (SQLSkills Immersion Training Days 2 and 3)

$
0
0

The duck has obtained scuba gear-- it can go to new depths on its own.

Today, more blog from SQLSkills Immersion Training on Internals and Performance in Dallas, TX. For more about the training, see my prior post.

Where We’ve Been

We’ve moved through most of Module 8. We’ve been through the land of the transaction log, locking and blocking, isolation and versioning, table design and partitioning, and many neighborhoods of index internals.

A Sample Case

I love learning about how the transaction log works. I know a bit about how the log operates, and as a DBA I keep this knowledge pretty fresh. So I was looking forward to yesterday’s session on transaction logging quite a bit.

This week, I’ve learned a lot about the internals of how the log works with crash recovery, as well as how log buffers work in general. (And I mean a lot. Pages and pages of notes.)

Looking over my notes, what makes me happy is how much more I was able to note down and absorb than I have in previous trainings. This is attributable to three things:

  • A lot of time devoted to the topic;
  • Plenty of room for questions (and there were lots!) with lots of rich data in the answer;
  • A good context for the information.

I’m really proud that I know enough now that I’m able to understand and note this level of detail. That’s a great feeling.

The magic of the Self Assigned Lab

When I’m in a great session or a good conversation and I learn something that works differently than I thought, or something very interesting, I make a note to myself with a “To Do”. These are basically self-assigned lab assignments: I’ve learned from blogging and presentations that I learn a ton by setting up my own scenarios and working to show that something works (and to look at how it works), or the opposite.

I already have enough self-driven labs to keep me learning, and blogging about the best parts, for several months. Here are a few:

  1. ToDo: find an automation opportunity with DBCC PAGE WITH TABLERESULTS
  2. ToDo: create a scenario where you can’t get rid of a secondary filegroup without unusual operations.
  3. ToDo: Look at/experiment with transaction savepoints

There are lots more good ones, but I’m hoarding quite a few of them. (I have 41 at the time of this writing.)

That’s the magic of great training– not only do you pick up a lot and you receive pre-designed labs you can learn from, but you also find paths you hadn’t imagined to explore and create tools on your own. And you’re inspired to go there.

How To: Automate Continuous SQL Server Activity with Stored Procedures and PowerShell Jobs


The Goal

It’s often useful to be able to run a bunch of stored procedures in the background over a period of time against a test instance.

This can be nice for:

  • Demos and presentations.
  • Populating DMVs with data you can slice and dice.
  • Learning to use things like extended events and server side trace (which are much more interesting with something to look at).
  • Testing a variety of automation scripts.

This post shows you how to create several stored procedures for AdventureWorks2008R2 which will provide different result sets and have slightly different run times when run with a variety of parameters– in this case, individual letters of the alphabet.

You can then run PowerShell commands which start jobs in the background. Each job runs a stored procedure and loops through all letters of the alphabet, providing each one as a parameter. You can set the job to do that loop a configurable number of times (the commands are set to 100). In other words, as given, each stored procedure will be run 2600 times. Since you’re running multiple jobs and they’re all going asynchronously in their own threads, you’ll have a variety of commands trying to run at the same time.

Optional: you can start the PowerShell jobs under different credentials if you need.

Alternatives: In the past, I’ve typically done things like this with T-SQL loops (often with dynamic SQL) and multiple Management Studio windows. This works OK, but it’s a little time consuming to open each window, paste everything in (or open multiple files), and start them all up. I find it much more convenient now to use scripts.

Step 1: Create Stored Procedures with a single alphabet-based parameter

Let’s get one thing clear: these procedures aren’t designed to run optimally, and they aren’t coded nicely.

You’ll notice these procedures have all sorts of problems. And that’s by design– my goal is to test things around them, so it’s really a little better for me if they don’t play perfectly nice.

In other words, these sure ain’t for production. 🙂

/****************
Jump in the kiddie pool
********************/
USE AdventureWorks2008R2;
go

/****************
CREATE THE SCHEMA
********************/
IF SCHEMA_ID(N'test')  IS NULL
	EXEC sp_executesql N'CREATE SCHEMA test AUTHORIZATION dbo'
GO

/****************
CREATE Silly Stored Procedures in the Schema
********************/
IF OBJECT_ID(N'test.EmployeeByLastName', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.EmployeeByLastName as return 0'
GO
ALTER PROCEDURE test.EmployeeByLastName
	@lName nvarchar(255)
AS
	SELECT @lName = N'%' + @lName + N'%'

	select *
	FROM HumanResources.vEmployee
	WHERE LastName LIKE @lName
GO

IF OBJECT_ID(N'test.EmployeeByFirstName', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.EmployeeByFirstName as return 0'
GO
ALTER PROCEDURE test.EmployeeByFirstName
	@fName nvarchar(255)
AS
	SELECT @fName = '%' + @fName + '%'

	select *
	FROM HumanResources.vEmployee
	WHERE FirstName LIKE @fName
GO

IF OBJECT_ID(N'test.EmployeeDepartmentHistoryByLastName', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.EmployeeDepartmentHistoryByLastName as return 0'
GO
ALTER PROCEDURE test.EmployeeDepartmentHistoryByLastName
	@lName nvarchar(255)
AS
	SELECT @lName = N'%' + @lName + N'%'

	select *
	FROM HumanResources.vEmployeeDepartmentHistory
	WHERE LastName LIKE @lName
GO

IF OBJECT_ID(N'test.EmployeeDepartmentHistoryByFirstName', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.EmployeeDepartmentHistoryByFirstName as return 0'
GO
ALTER PROCEDURE test.EmployeeDepartmentHistoryByFirstName
	@fName nvarchar(255)
AS
	SELECT @fName = '%' + @fName + '%'

	select *
	FROM HumanResources.vEmployeeDepartmentHistory
	WHERE FirstName LIKE @fName
GO

IF OBJECT_ID(N'test.ProductAndDescriptionByKeyword', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.ProductAndDescriptionByKeyword as return 0'
GO
ALTER PROCEDURE test.ProductAndDescriptionByKeyword
	@keyword nvarchar(255)
AS
	SELECT @keyword = '%' + @keyword + '%'

	select *
	FROM Production.vProductAndDescription
	WHERE Name LIKE @keyword OR ProductModel like @keyword OR description LIKE @keyword
GO

Once you’ve got the procedures written, you just need to set up your PowerShell commands.

Step 2: Create PowerShell Jobs to Run the Procedures in Loops

These commands use PowerShell background jobs.

Even if you don’t know PowerShell, if you look at these commands you can pretty easily pick out where the 1 to 100 loop is, where the a to z loop is, and what commands are being run.

Since the jobs are running to create load in the background and I don’t care about collecting query results, I pipe the output all to Out-Null.

#test.EmployeeByLastName
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.EmployeeByLastName '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

#"test.EmployeeByFirstName"
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.EmployeeByFirstName '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

#"test.EmployeeDepartmentHistoryByFirstName"
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.EmployeeDepartmentHistoryByFirstName '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

#"test.EmployeeDepartmentHistoryByLastName"
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.EmployeeDepartmentHistoryByLastName '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

#"test.ProductAndDescriptionByKeyword"
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.ProductAndDescriptionByKeyword '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

Each command will start an asynchronous background job.

Step 3: Manage Jobs (if needed)

Once the jobs are running in the background, you may want to check on their status. You can do so by running:

get-job

If you want to remove a job from the list, you can use Remove-Job with the job number, or you can remove all jobs (whether or not they are running) with:

Remove-Job * -Force

If you want to see the output of a job, you can use Receive-Job and supply the job number. If you’re troubleshooting and want to see errors, you probably want to remove | Out-Null from the command that starts the job, and use fewer loops. Then you can receive the job’s output and see any errors.

Receive-Job JOBNUMBER

There’s more than one way to skin an eggplant: Using APPLY for calculations


Choices, choices

Here’s a little TSQL snack. I picked this up in a presentation by Itzik Ben-Gan at the PNWSQL user group recently, and it’s become a fast favorite.

CROSS APPLY and OUTER APPLY- another use

The APPLY operator is perhaps more flexible than you think. You may already know that you can use it to inline a function, or to replace a join.

But wait, there’s more! You can also use APPLY to perform calculations and simplify your query syntax– this is because the APPLY operator allows you to express a calculation that can be referred to:

  • in further joins (which may or may not use APPLY)
  • by columns
  • in the where clause
  • in the group by

This is really helpful, because normally you can’t refer to a computation aliased in the column list from anywhere else but the ORDER BY. This is because of the order in which the parts of the statement are evaluated.

I know this sounds confusing. It’ll make more sense in an example.

A sample query– the ‘before’ version

Here is a query written for the AdventureWorks sample database. There’s all sorts of examples that are possible for this, but I decided to go with one grouping data by month, using my favorite formula to round dates.

It shows the total quantity of orders by Product for an entire order month, for orders placed on or after 2004-07-01.

SELECT  DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0) AS OrderDateMonth,
        p.Name AS ProductName,
        SUM(orderQty) AS totalQuantity
FROM    sales.SalesOrderHeader oh
JOIN    Sales.SalesOrderDetail od
        ON oh.SalesOrderID = od.SalesOrderID
JOIN    production.Product p
        ON od.ProductID = p.ProductID
WHERE   oh.OrderDate >= '2004-07-01'
GROUP BY DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0),
        p.Name
ORDER BY OrderDateMonth,
        p.Name

Notice that to group the date at the month level, we need to include the calculation in the column list as well as in the GROUP BY clause.

The query rewritten using APPLY for the calculation

This can be rewritten with CROSS apply to move the calculation into the JOIN area and only specify it once.

The benefits: this simplifies your syntax and reduces the chance of typos and errors, particularly when you need to go in and change the calculation. In cases where you’re displaying a sum in one column and a percentage based on it in another column, this trick is *fantastic*. (Query numbers from the DMVs a lot? You’ll love this.)

Here, the calculation on the date is moved into the cross apply. It can be referenced as oh1.OrderDateMonth in both the list of columns, and in the GROUP BY portion of the query without rewriting the calculation.

SELECT  oh1.OrderDateMonth,
        p.Name AS ProductName,
        SUM(orderQty) AS totalQuantity
FROM    sales.SalesOrderHeader oh
CROSS APPLY ( SELECT    DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0) AS OrderDateMonth ) AS oh1
JOIN    Sales.SalesOrderDetail od
        ON oh.SalesOrderID = od.SalesOrderID
JOIN    production.Product p
        ON od.ProductID = p.ProductID
WHERE   oh.OrderDate >= '2004-07-01'
GROUP BY oh1.OrderDateMonth,
        p.Name
ORDER BY OrderDateMonth,
        p.Name

What does the execution plan look like?


The execution plans for these two queries are identical.

In this case, the optimizer looks at these two queries and realizes the activities it needs to do will be the same.

Other options

You can create further CROSS APPLY or OUTER APPLY joins that refer to computations in prior joins.

You can also refer to the resulting computation in the where clause.
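
For instance, here’s a minimal sketch of that chaining idea (the second APPLY and the quarter calculation are my own additions for illustration, not part of the original example): the second CROSS APPLY reuses oh1.OrderDateMonth to derive a quarter, and both aliases are then available to the GROUP BY.

SELECT  oh2.OrderDateQuarter,
        oh1.OrderDateMonth,
        SUM(od.OrderQty) AS totalQuantity
FROM    sales.SalesOrderHeader oh
-- First APPLY: round the order date down to the month
CROSS APPLY ( SELECT DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0) AS OrderDateMonth ) AS oh1
-- Second APPLY: reuse the month from oh1 to round down to the quarter
CROSS APPLY ( SELECT DATEADD(QQ, DATEDIFF(QQ, 0, oh1.OrderDateMonth), 0) AS OrderDateQuarter ) AS oh2
JOIN    Sales.SalesOrderDetail od
        ON oh.SalesOrderID = od.SalesOrderID
WHERE   oh.OrderDate >= '2004-07-01'
GROUP BY oh2.OrderDateQuarter,
        oh1.OrderDateMonth
ORDER BY oh2.OrderDateQuarter,
        oh1.OrderDateMonth;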

But be careful….

As with anything, you want to make sure you’re getting a good execution plan, and not shooting yourself in the foot with a new trick.

One big area to watch: although you can refer to these computations conveniently in the WHERE clause, you still want to be careful you’re using appropriate criteria.

For instance, if we were to change the example above to refer to the result from the CROSS APPLY oh1 in the where clause like this:

SELECT  oh1.OrderDateMonth ,
        p.Name AS ProductName ,
        SUM(orderQty) AS totalQuantity
FROM    sales.SalesOrderHeader oh
CROSS APPLY ( SELECT    DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0) AS OrderDateMonth ) AS oh1
JOIN    Sales.SalesOrderDetail od
        ON oh.SalesOrderID = od.SalesOrderID
JOIN    production.Product p
        ON od.ProductID = p.ProductID
WHERE   oh1.OrderDateMonth >= '2004-07-01'  ---Don't do this!
GROUP BY oh1.OrderDateMonth ,
        p.Name
ORDER BY OrderDateMonth ,
        p.Name

… then in this case the query would not be able to use an index on OrderDate on the sales.SalesOrderHeader table, if one exists.

This is not specifically because of the CROSS APPLY, but because we are forcing SQL Server to apply the functions to every value to identify if it satisfies the criteria. That prevents a seek.
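
If you want to see that difference for yourself, here’s a hedged sketch: the index name ix_SalesOrderHeader_OrderDate is hypothetical (something you’d create just for the test), and the two COUNT queries compare the sargable predicate against the wrapped one.

CREATE INDEX ix_SalesOrderHeader_OrderDate
    ON Sales.SalesOrderHeader (OrderDate);
GO

-- Predicate on the raw column: can seek on the new index
SELECT COUNT(*)
FROM Sales.SalesOrderHeader oh
WHERE oh.OrderDate >= '2004-07-01';
GO

-- Predicate on the APPLY computation: the expression is evaluated for every row,
-- so you'll typically see a scan instead of a seek
SELECT COUNT(*)
FROM Sales.SalesOrderHeader oh
CROSS APPLY ( SELECT DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0) AS OrderDateMonth ) AS oh1
WHERE oh1.OrderDateMonth >= '2004-07-01';
GO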

DBCC USEROPTIONS: See Your Session Settings in SQL Server


This is a super old command, but it still comes in handy when working with SQL Server.

Want to know your default isolation level in the current database? Run this. (If optimistic locking is turned on in your current database context, your default will be “read committed snapshot”)

Want to know what your ANSI settings are, or how arithabort is set? (Those settings can impact your query results, and determine whether you can successfully use a filtered index, an indexed computed column, or an indexed view.) DBCC USEROPTIONS helps out!

The biggest limitation: this tells you the settings for your current session– but not for anyone else’s session. That might help you figure out why something is slow in the application and fast in SSMS, but you’ve still got to do some legwork to figure out what the other session’s settings are. (Perhaps by using sys.dm_exec_requests?)

DBCC USEROPTIONS
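
If you need to check another session’s settings rather than your own, here’s a sketch against sys.dm_exec_sessions (the session id 52 is just a placeholder for the session you’re investigating):

SELECT session_id,
    transaction_isolation_level,  -- 2 = read committed, 3 = repeatable read, 4 = serializable, 5 = snapshot
    ansi_nulls,
    ansi_padding,
    arithabort,
    quoted_identifier
FROM sys.dm_exec_sessions
WHERE session_id = 52;  -- placeholder session id
GO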

Did My Query Eliminate Table Partitions in SQL Server?


Working with table partitioning can be puzzling. Table partitioning isn’t always a slam dunk for performance: heavy testing is needed. But even getting started with the testing can be a bit tricky!

Here’s a (relatively) simple example that walks you through setting up a partitioned table, running a query, and checking if it was able to get partition elimination.

In this post we’ll step through:

  • How to set up the table partitioning example yourself
  • How to examine an actual execution plan to see partition elimination and which partitions are accessed. Spoiler: you can see exactly which partitions were used / eliminated in an actual execution plan.
  • Limits of the information in cached execution plans, and how this is related to plan-reuse
  • A wrap-up summarizing facts we prove along the way. (Short on time? Scroll to the bottom!)

How to Get the Sample Database

We’re using the FactOnlineSales table in Microsoft’s free ContosoRetailDW sample database. The table isn’t very large. Checking it with this query:

SELECT 
    index_id, 
    row_count, 
    reserved_page_count*8./1024. as reserved_mb
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('FactOnlineSales');
GO

Here’s the results:

(Screenshot: index_id, row_count, and reserved_mb results for the table)

The table has 12.6 million rows and only takes up 363 MB. That’s really not very large. We probably wouldn’t partition this table in the real world, and if we did we would probably use a much more sophisticated partition scheme than we’re using below.

But this post is just about grasping concepts, so we’re going to keep it super-simple. We’re going to partition this large table by year.

First, Create the Partition Function

Your partition function is an algorithm. It defines the intervals you’re going to partition something on. When we create this function, we aren’t partitioning anything yet — we’re just laying the groundwork.

CREATE PARTITION FUNCTION pf_years ( datetime )
    AS RANGE RIGHT
    FOR VALUES ('2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01', '2010-01-01');
GO

Unpacking this a bit…

DATETIME data type: I haven’t said what column (or even table) I’m partitioning yet — that comes later. But I did have to pick the data type of the columns that can use this partitioning scheme. I’ll be partitioning FactOnlineSales on the DateKey column, and it’s an old DateTime type.

RANGE RIGHT: You can pick range left or range right when defining a partition function. By picking range right, I’m saying that each boundary point I listed here (the dates) will “go with” the rows in the partition to its right.

This means that the boundary point ‘2007-01-01’ will be included in the partition with the dates above it. That’s the rest of the dates for 2007.

Usually with date related boundary points, you want RANGE RIGHT. (We don’t usually want the first instant of the month, day, or year to be with the prior year’s data.)

VALUES: Why doesn’t the partition function go to present day? Well, the Contoso team apparently decided to use some other database after the end of 2009. That’s the latest data we have.
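
If you want to double-check the boundary points after creating the function, a quick sketch against the standard catalog views does the trick:

SELECT pf.name AS partition_function,
    prv.boundary_id,
    prv.value AS boundary_point
FROM sys.partition_functions pf
JOIN sys.partition_range_values prv
    ON pf.function_id = prv.function_id
WHERE pf.name = 'pf_years'
ORDER BY prv.boundary_id;
GO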

Second, Create the Partition Scheme and Map it to the Function

A partition scheme tells SQL Server where to physically place the partitions mapped out by the partition function. Let’s create that now:

CREATE PARTITION SCHEME ps_years 
    AS PARTITION pf_years
    ALL TO ([PRIMARY])
GO

Let’s talk about “ALL TO ([PRIMARY])”. I’ve done something kind of awful here. I told SQL Server to put all the partitions in my primary filegroup.

You don’t always have to use a fleet of different filegroups on a partitioned table, but typically partitioned tables are quite large. Dumping everything in your primary filegroup doesn’t give you very many options for a restore sequence.

But we’re keeping it simple.
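
If you do spread a scheme across multiple filegroups later, here’s a sketch (standard catalog views again) to confirm where each partition will land. For our ps_years scheme, every row will just say PRIMARY:

SELECT ps.name AS partition_scheme,
    dds.destination_id AS partition_number,
    fg.name AS filegroup_name
FROM sys.partition_schemes ps
JOIN sys.destination_data_spaces dds
    ON ps.data_space_id = dds.partition_scheme_id
JOIN sys.filegroups fg
    ON dds.data_space_id = fg.data_space_id
WHERE ps.name = 'ps_years'
ORDER BY dds.destination_id;
GO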

Now Partition the Table on the Partition Scheme

This is where it gets real. Everything up to this point has been metadata only.

Currently, the FactSales table has a clustered Primary Key on the SalesKey column and no nonclustered indexes. We’re going to partition the table by the DateKey column. The first step is to drop the clustered PK, like this:

ALTER TABLE dbo.FactSales 
  DROP CONSTRAINT PK_FactSales_SalesKey;
GO

Now partition the table by creating a unique clustered index on the partition scheme, like this:

CREATE UNIQUE CLUSTERED INDEX cx_FactSales
  on dbo.FactSales (SalesKey, DateKey)
ON [ps_years] (DateKey)
GO

We made a couple of important changes. The table used to have a clustered PK on SalesKey, but we replaced this with a unique clustered index on TWO columns: SalesKey, DateKey. There’s a reason for this: if we’re partitioning on DateKey and we try to create a unique clustered index on just SalesKey, we’ll get this message:

Msg 1908, Level 16, State 1, Line 31
Column 'DateKey' is partitioning column of the index 'cx_FactSales'. Partition columns for a unique index must be a subset of the index key.

DateKey is elbowing its way into that clustered index, whether I like it or not.

All right, now that we have a partitioned table, we can run some queries and see if we get partition elimination!
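
Before querying, you can sanity-check that the rows really did spread out across the partitions. A quick sketch against sys.partitions:

SELECT partition_number,
    rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.FactSales')
    AND index_id = 1  -- the new clustered index
ORDER BY partition_number;
GO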

Query the Partitioned Table and Look at the Actual Execution Plan

Our example query is this stored procedure:

CREATE PROCEDURE dbo.count_rows_by_date_range
  @s datetime,
  @e datetime 
AS
  SELECT COUNT(*)
  FROM dbo.FactSales
  WHERE DateKey between @s and @e;
GO

exec dbo.count_rows_by_date_range '2008-01-01', '2008-01-02';
GO

If we run that call to dbo.count_rows_by_date_range with “Actual Execution Plans” enabled, we get the following graphic execution plan:

It’s a clustered index scan, but don’t jump to conclusions.

We have a clustered index scan operator on the fact sales table. That looks like it’s scanning the whole thing– but wait, we might be getting partition elimination! This is an actual execution plan, so we can check.

Hovering over the Clustered Index Scan operator on Fact Sales, a tooltip appears!

(Screenshot: tooltip for the Clustered Index Scan, showing partition information)

Partitioned = True!

It knows the FactSales table is partitioned, and “Actual Partition Count” is 1. That’s telling us that it only accessed a single partition. But which partition?

To tell that, we need to right click on the Clustered Index Scan operator and select “properties”:

(Screenshots: Clustered Index Scan properties, showing which partitions were accessed)

Decoding this: The clustered index scan accessed only one partition. This was partition #4.
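
You can confirm that mapping without touching an execution plan: the built-in $PARTITION function reports which partition number a given value falls into for a partition function. A quick sketch for our pf_years function:

SELECT $PARTITION.pf_years('2008-01-01') AS partition_number;
-- Returns 4: with our RANGE RIGHT boundaries, dates in 2008 land in partition #4
GO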

Let’s re-run our query to make it access more than one partition! We’re partitioning by year, so this should touch two partitions:

exec dbo.count_rows_by_date_range '2007-12-31', '2008-01-02';
GO

Running this query with actual execution plans on, right clicking the Clustered Index Scan, and looking at properties, this time we see it accessing two partitions, partition #3 and partition #4:

(Screenshot: properties showing two partitions accessed)

Just because you see “Clustered Index Scan” doesn’t mean you didn’t get partition elimination. However, even if you did get partition elimination, it may have needed to read from multiple partitions.

Can You See Partition Elimination in the Cached Execution Plan?

So far we’ve been looking at Actual Execution plans, where I’ve run the query in my session. What if this code was being run by my application, and I wanted to check if it was getting partition elimination?

If the execution plan was cached, I could find information on its execution and cached plan with this query:

SELECT 
  eqs.execution_count,
  CAST((1.)*eqs.total_worker_time/eqs.execution_count AS NUMERIC(10,1)) AS avg_worker_time,
  eqs.last_worker_time,
  CAST((1.)*eqs.total_logical_reads/eqs.execution_count AS NUMERIC(10,1)) AS avg_logical_reads,
  eqs.last_logical_reads,
    (SELECT TOP 1 SUBSTRING(est.text,statement_start_offset / 2+1 , 
    ((CASE WHEN statement_end_offset = -1 
      THEN (LEN(CONVERT(nvarchar(max),est.text)) * 2) 
      ELSE statement_end_offset END)  
      - statement_start_offset) / 2+1))  
    AS sql_statement,
  qp.query_plan
FROM sys.dm_exec_query_stats AS eqs
CROSS APPLY sys.dm_exec_sql_text (eqs.sql_handle) AS est 
JOIN sys.dm_exec_cached_plans cp on 
  eqs.plan_handle=cp.plan_handle
CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle) AS qp
WHERE est.text like '%FROM dbo.FactSales%'
OPTION (RECOMPILE);
GO

Here’s the results for our query:

(Screenshot: query stats and cached plan results)

sys.dm_exec_query_stats has great info! The difference between the average logical reads and the last logical reads shows us that this query sometimes reads more than at other times– that’s because the first time we ran it, it had to scan one partition, and the second time we ran it, it had to read two. If it was always scanning the whole table, we’d see the same number of logical reads for the average and the last.

We can also see that the same execution plan was reused for both queries. Clicking on the cached query plan to open it up, we see something similar… but it doesn’t have all the same info.

(Screenshot: the cached execution plan)

The clustered index scan is the same…

(Screenshot: cached plan properties with no partition counts)

But in the properties, we can only see that SQL Server knows the table is partitioned.

The cached execution plan does not contain information on the number of partitions accessed or which ones were accessed. We can only see that in the Actual Execution plan.

TLDR; (Too long, didn’t eliminate partitions)

Here’s a quick rundown of what we did and saw:

  • We partitioned the FactSales table by creating a partition function and partition scheme, then put a unique Clustered Index on the SalesKey and DateKey columns
  • When we ran our query with actual execution plans enabled, we could see how many partitions were accessed and the partition number
  • When we looked at the cached execution plan, we could see that the same plan was reused across multiple runs, even though:
    • It was a parameterized stored procedure
    • The query accessed a different number of partitions on each run (one partition on the first run, two partitions on the second run)
  • The cached execution plan did not contain the number of partitions accessed. (Makes sense, given the plan re-use!)
  • We could see the average and last number of logical reads from sys.dm_exec_query_stats, which could give us a clue as to whether partition elimination was occurring

Super simple, right? 🙂

If you liked this post and you’re ready for something more challenging, head on over to Paul White’s blog and read about a time when partition elimination didn’t work.

Does OPTION (RECOMPILE) Prevent Query Store from Saving an Execution Plan?


Recompile hints have been tough to love in SQL Server for a long time. Sometimes it’s very tempting to use these hints to tell the optimizer to generate a fresh execution plan for a query, but there can be downsides:

  • This can drive up CPU usage for frequently run queries
  • This limits the information SQL Server keeps in its execution plan cache and related statistics in sys.dm_exec_query_stats and sys.dm_exec_procedure_stats
  • We’ve had some alarming bugs where recompile hints can cause incorrect results. (Oops! and Whoops!)
  • Some queries take a long time to compile (sometimes up to many seconds), and figuring out that this is happening can be extremely tricky when RECOMPILE hints are in place

The new SQL Server 2016 feature, Query Store, may help alleviate at least some of these issues. One of my first questions about Query Store was whether recompile hints would have the same limitations as in the execution plan cache, and how easy it might be to see compile duration and related information.

Let’s Turn on Query Store

I’m running SQL Server 2016 CTP3. To enable Query Store, I open the database properties, where there’s a Query Store tab for the feature. I choose “Read Write” as my new operation mode so that it starts collecting query info and writing it to disk:

Query Store: ACTIVATE!

If you script out the TSQL for that, it looks like this:

USE [master]
GO
ALTER DATABASE [ContosoRetailDW] SET QUERY_STORE = ON
GO
ALTER DATABASE [ContosoRetailDW] 
SET QUERY_STORE (OPERATION_MODE = READ_WRITE, 
CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 367), 
DATA_FLUSH_INTERVAL_SECONDS = 900, 
INTERVAL_LENGTH_MINUTES = 60, 
MAX_STORAGE_SIZE_MB = 100, 
QUERY_CAPTURE_MODE = ALL, 
SIZE_BASED_CLEANUP_MODE = AUTO)
GO

And Now Let’s Test Drive that RECOMPILE Hint

Now that Query Store’s on, I make up a few queries with RECOMPILE hints in them and run them– some once, some multiple times. After a little bit of this, I check what Query Store has recorded about them:

SELECT 
  qsq.query_id,
  qsq.query_hash,
  qsq.count_compiles,
  qrs.count_executions,
  qsq.avg_compile_duration,
  qsq.last_compile_duration,
  qsq.avg_compile_memory_kb,
  qsq.last_compile_memory_kb,
  qrs.avg_logical_io_reads,
  qrs.last_logical_io_reads,
  qsqt.query_sql_text,
  CAST(qsp.query_plan AS XML) AS mah_query_plan
FROM sys.query_store_query qsq
JOIN sys.query_store_query_text qsqt on qsq.query_text_id=qsqt.query_text_id
JOIN sys.query_store_plan qsp on qsq.query_id=qsp.query_id
JOIN sys.query_store_runtime_stats qrs on qsp.plan_id = qrs.plan_id
WHERE qsqt.query_sql_text like '%recompile%';
GO

Note: I’ve kept it simple here and am looking at all rows in sys.query_store_runtime_stats. That means that if I’ve had query store on for a while and have multiple intervals, I may get multiple rows for the same query. You can add qrs.runtime_stats_interval_id to the query to see that.
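
If you do want the per-interval breakdown, here’s a sketch (standard Query Store views) that joins over to sys.query_store_runtime_stats_interval so you can see the time window each row covers:

SELECT qrs.plan_id,
    qrs.runtime_stats_interval_id,
    rsi.start_time,
    rsi.end_time,
    qrs.count_executions,
    qrs.avg_logical_io_reads
FROM sys.query_store_runtime_stats qrs
JOIN sys.query_store_runtime_stats_interval rsi
    ON qrs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
ORDER BY rsi.start_time, qrs.plan_id;
GO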

Here’s a sample of the results:

(Screenshot: Query Store results for the queries run with RECOMPILE hints)

YAY! For all my queries that were run with RECOMPILE hints, I can see information about how many times they were run, execution stats, their query text and plan, and even information about compilation.

And yes, I have the execution plans, too — the “CAST(qsp.query_plan AS XML) AS mah_query_plan” totally works.

Want to Learn More about Query Store and Recompile?

In this post, I just talked about observing recompile overhead with Query Store. Grant Fritchey has an excellent post that addresses the question: what if you tell Query Store to freeze a plan for a query with a recompile hint? Will you still pay the price of recompile? Read the answer on Grant’s blog here.
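
For reference, freezing (forcing) a plan in Query Store is just a system stored procedure call. A minimal sketch, where the query_id and plan_id values are placeholders you’d look up in sys.query_store_query and sys.query_store_plan:

-- Placeholder ids: look up the real query_id and plan_id first
EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;
GO

-- To stop forcing the plan later:
EXEC sys.sp_query_store_unforce_plan @query_id = 42, @plan_id = 7;
GO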

Joins, Predicates, and Statistics in SQL Server


Joins can be tricky. And where you put your ‘where’ clause may mean more than you think!

Take these two queries from the AdventureWorksDW sample database. The queries are both looking for data where SalesTerritoryCountry = ‘NA’ and they have the same joins, but the first query has a predicate on SalesTerritoryCountry while the second has a predicate on SalesTerritoryKey.

/* Query 1: Predicate on SalesTerritoryCountry */
select 
  ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, 
  fis.SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, 
  UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, 
  TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, 
  OrderDate, DueDate, ShipDate
from dbo.FactInternetSales fis
join dbo.DimSalesTerritory st on 
  fis.SalesTerritoryKey=st.SalesTerritoryKey
where st.SalesTerritoryCountry = N'NA'
GO

/* Query 2: Predicate on SalesTerritoryKey (for the exact same country) */
select 
  ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, 
  fis.SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, 
  UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, 
  TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, 
  OrderDate, DueDate, ShipDate
from dbo.FactInternetSales fis
join dbo.DimSalesTerritory st on 
  fis.SalesTerritoryKey=st.SalesTerritoryKey
where st.SalesTerritoryKey = 11;
GO

Take a look at the difference in their estimated execution plans:

(Screenshot: estimated execution plan comparison for the two queries)

Although these queries return the same data, the plans and performance are very different. Query 1 (predicate written against SalesTerritoryCountry) estimates too high and chooses a much larger plan than it needs. It doesn’t have a clue that there are zero rows for SalesTerritoryCountry = ‘NA’.

(Screenshot: side-by-side comparison of the two queries' plans and performance)

Hash joins aren’t necessarily bad, but we don’t need one for this query. Why do the heavy lifting for no rows?

Where is Query #1 Getting That 6,039.8 Row Estimate?

SQL Server uses statistics for estimates. It’s using them for both of these queries, just in different ways. For the query “where st.SalesTerritoryCountry = N’NA'”, it uses two statistics:

dbo.DimSalesTerritory: This is a small dimension table. SQL Server uses a column statistic on the SalesTerritoryCountry column. It’s able to look the value NA up in a detailed histogram that describes the data distribution to see that there’s just one row for that value in the table. Super simple!

(Screenshot: histogram from the statistics on SalesTerritoryCountry)

dbo.FactInternetSales: Things get more complicated here. The FactInternetSales table doesn’t know anything about SalesTerritoryCountry. It only has the column SalesTerritoryKey.

And although it’s joining on the column, it doesn’t understand that SalesTerritoryCountry = ‘NA’ is the same thing as SalesTerritoryKey = 11.

Query optimization has to be fast, and SQL Server has to figure everything out before it begins executing the query. It doesn’t have the ability to go run a query like “SELECT SalesTerritoryKey from dbo.DimSalesTerritory WHERE SalesTerritoryCountry = N’NA'” before it can even optimize the query.

So it needs to make a guess about how many rows an unknown Country has in FactInternetSales.

It does this using a part of the statistics called the “Density Vector”. SQL Server has statistics on an index that I created on the SalesTerritoryKey column in this case. The density vector describes how many rows on average any given SalesTerritoryKey has associated with it in the fact table.

(Screenshot: density vector from the statistics on SalesTerritoryKey)

The average density is .1 and there are 60398 rows in the table. 60398 * 0.1 = 6039.8 … there’s our row estimate!
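
You can look at that density vector yourself with DBCC SHOW_STATISTICS. A sketch, where the statistics name ix_FactInternetSales_SalesTerritoryKey is a stand-in for whatever the index on SalesTerritoryKey is actually called on your system:

DBCC SHOW_STATISTICS ('dbo.FactInternetSales', ix_FactInternetSales_SalesTerritoryKey)
    WITH DENSITY_VECTOR;
-- "All density" for SalesTerritoryKey * 60398 rows = 6039.8, the estimate in the plan
GO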

In this case, 6,039.8 rows is enough that SQL Server decides that many nested loop lookups would be a drag. It decides to build some hash tables and figure it out in memory. Honestly, it’s not a terrible choice in this case. Yeah, it needs a memory grant, but it gets the work done in just a few milliseconds and calls it a day.

If this was just one part of a much larger and more complex plan, it could have much bigger consequences, and make a more significant difference in runtime.

One Cool Thing About Query #2

Notice that on Query #2, I wrote the predicate against the dimension table, not the fact table. It was able to see that I joined on those columns and use that predicate against the fact table itself to get a very specific estimate.

That’s pretty cool!

What Does this Mean for Writing Queries?

Whenever you have a chance to simplify a query, it can be beneficial.

In this case, if we’re writing a predicate against the SalesTerritoryKey column, it’s fair to ask if we need to join the two tables at all. If we have a trusted foreign key that ensures every SalesTerritoryKey has a matching parent row in DimSalesTerritory, and we don’t actually want to return any columns from DimSalesTerritory, we don’t even need to do the join.
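
As a sketch of that simplification (assuming the foreign key is in place and trusted, and assuming we truly need nothing from DimSalesTerritory), the join simply disappears:

select 
  ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, 
  fis.SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, 
  UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, 
  TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, 
  OrderDate, DueDate, ShipDate
from dbo.FactInternetSales fis
where fis.SalesTerritoryKey = 11;  -- same rows as before, no join to DimSalesTerritory needed
GO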

In complex situations when performance is important, thinking carefully about how you write queries and where you put predicates can sometimes help you tune.
