Copy
Buckle up: time to learn more about SQL Server,
or whatever I'm obsessed with this week.

WHERE GETDATE() BETWEEN StartDate AND COALESCE(CancelDate, EndDate) Is Even Harder to Tune.

In my last post, we started with a memberships table, and each membership had start & end dates. I’m going to create the table and populate it with everyone having an active membership – their StartDate is the same as their Stack Overflow account creation date, and their EndDate is around a year or two from now:

DropIndexes;
GO
USE StackOverflow;
GO
DROP TABLE IF EXISTS dbo.UsersMemberships;

CREATE TABLE dbo.UsersMemberships
  (Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
   UserId INT NOT NULL,
   StartDate DATETIME NOT NULL,
   EndDate DATETIME NOT NULL,
   CancelledEarlyDate DATETIME NULL);
GO
INSERT INTO dbo.UsersMemberships(UserId, StartDate, EndDate)
  SELECT Id, CreationDate,
  DATEFROMPARTS(YEAR(GETDATE()) + 2,MONTH(LastAccessDate), DAY(LastAccessDate))
  FROM dbo.Users
  WHERE NOT(MONTH(LastAccessDate) = 2 AND DAY(LastAccessDate) > 28); /* Avoid Feb 29 problems */
GO
CREATE INDEX IX_StartDate_EndDate ON dbo.UsersMemberships(StartDate, EndDate) INCLUDE (CancelledEarlyDate);
CREATE INDEX IX_EndDate_StartDate ON dbo.UsersMemberships(EndDate, StartDate) INCLUDE (CancelledEarlyDate);
GO

If I run either of these queries to check how many active users we have:

SELECT Id, StartDate, EndDate
  FROM dbo.UsersMemberships
  WHERE GETDATE() BETWEEN StartDate AND EndDate;
GO
SELECT Id, StartDate, EndDate
  FROM dbo.UsersMemberships
  WHERE StartDate <= GETDATE() 
    AND EndDate >= GETDATE();
GO

The actual plans show that SQL Server does a great job of accurately estimating that all 9 million of the rows in the table will match:

But now let’s introduce that CancelledEarlyDate.

I’m going to modify my reporting query to include a COALESCE, taking the first non-null column between CancelledEarlyDate and EndDate. If the user cancelled their membership early, we’ll use that cancellation date – otherwise, use EndDate. This is similar to the ISNULL function, but COALESCE lets me pass in multiple values rather than just two. (ISNULL performs the same in this case.)

SELECT Id, StartDate, EndDate
  FROM dbo.UsersMemberships
  WHERE GETDATE() BETWEEN StartDate AND COALESCE(CancelledEarlyDate, EndDate);
GO
SELECT Id, StartDate, EndDate
  FROM dbo.UsersMemberships
  WHERE StartDate <= GETDATE() 
    AND COALESCE(CancelledEarlyDate, EndDate) >= GETDATE();
GO

Note that I haven’t updated anyone’s CancelledEarlyDate yet – they’re all still null. SQL Server’s actual execution plans do a beautiful job of estimating that all 9M rows will still match our search. Note that we’re getting index seeks here (without key lookups) because I included the CancelledEarlyDate in my indexes this time around.

And if I update 50% of the rows to have a CancelledEarlyDate of yesterday:

UPDATE dbo.UsersMemberships
  SET CancelledEarlyDate = DATEADD(dd, -1, GETDATE())
  WHERE Id % 2 = 0;
GO

Now our estimates go off the rails.

Then the actual plans show that SQL Server believes all of the rows will match our filter – even though only half of them do. He estimates 9M rows will come back, but in reality, only 4.5M do:

In the last post, I had to introduce a more real-world data distribution to show how estimation went off the rails as your population started to shift. Here, I don’t – even a very simple population causes our query plans to go wildly wrong.

Now let’s take a breather here for a second because I can almost hear the comments rolling in.

First, you might be thinking that this particular illustration isn’t really all that bad: SQL Server uses the right access method (an index seek rather than a table scan), and it shouldn’t have approached this problem in any other way. However, in the real world, this incorrect row estimation has a huge impact on things like memory grants, how SQL Server approaches other tables in the query, and the order of operations in a query.

Second, you might be typing, “Brent, how might different indexing strategies fix this problem?” That’s why I use the free Stack Overflow database for my demos, and why I give you all of the scripts. You can go test out different approaches for yourself to learn.

Back to the stuff I wanna teach you here. I don’t think it’s fair for us to ask SQL Server to estimate this particular part of the query on the fly:

WHERE StartDate <= GETDATE() 
    AND COALESCE(CancelledEarlyDate, EndDate) >= GETDATE();

That’s just too hard. Sure, SQL Server may have statistics on each of those columns individually, but it doesn’t understand how those columns relate to each other on particular rows. He just has a rough idea of each one’s date distribution. In my simple scenario, it might be theoretically doable, but in real-world data distributions with memberships & policies spread across decades, it’d be really tough. You would want whole separate statistics based on the relationships between columns (and no, multi-column stats aren’t anywhere near that powerful.)

We can help by combining the two columns.

While SQL Server isn’t able to do this itself automatically, we can build our own computed column that performs the calculation for us:

ALTER TABLE dbo.UsersMemberships
  ADD CoalescedEndDates
   AS COALESCE(CancelledEarlyDate, EndDate);
GO

And suddenly – with no other changes, like no indexes or no query modifications – the actual plan’s estimates are way better:

That’s because when we add a computed column, SQL Server automatically adds a statistic on that column, and it can use that statistic when estimating rows.

It’s still not perfect because the plan has a variety of issues:

  • We’re doing an index scan here, not a seek
  • The scan reads all 9M rows of the index by doing 37,591 logical reads
  • We’re calculating the COALESCE every row, every time we do the query, burning about 1 second of CPU time

In this case, indexing the computed column solves all 3 problems.

I’ll create two indexes – one that leads with StartDate, and one that leads with the computed column – because SQL Server may choose different ones depending on our data distribution:

CREATE INDEX IX_StartDate_CoalescedEndDates_Includes
  ON dbo.UsersMemberships(StartDate, CoalescedEndDates) INCLUDE (EndDate);

CREATE INDEX IX_CoalescedEndDates_StartDate_Includes
  ON dbo.UsersMemberships(CoalescedEndDates, StartDate) INCLUDE (EndDate);
GO

The new actual plans of our reporting queries show that we fixed ’em:

We get an index seek, we only read the rows we need to output, and we don’t have to compute the coalesced dates on every row. Yay!

But only in this case.

All of our reporting queries better be exactly the same. Remember how I said earlier that I’m using COALESCE because it takes multiple arguments, and how ISNULL performed the same in that query when I was discussing it? Now, not so much. If I try ISNULL here, SQL Server ignores the computed columns altogether.

And even if they’re all the same, I’m still back to the same problems I had in the last post: as our data distribution becomes more real-world, with a mix of expired, current, and future memberships, I’m going to have estimation problems.

For more about the computed column technique, including when to persist them, when not to, and how to level up with indexed views, check out the Mastering Index Tuning class module on filtered indexes, indexed views, and computed columns.

This week's sponsor: Learn powerful ways to assess the performance of your queries. Join Kevin Kline for our upcoming #S1Webinar on “Intro to Query Tuning on #Microsoft #SQLServer.” Attend in North America on March 4th.


 

Agree? Disagree? Leave a comment.

 
sfs_icon_twitter.png
sfs_icon_forward.png
icon_feed.png
Copyright © 2020 Brent Ozar Unlimited®, All rights reserved.