Basics of Common Table Expressions and Windowing Functions

10 March 2015

Most developers are aware of how to use basic SQL constructs (Select, Update, Insert) even if their level of database programming is very minimal. I was fortunate that before I was a fulltime developer, I worked in a role that required me to read, write, and troubleshoot an intense amount of SQL (and actually wasn’t a DBA job). Over the course of time, I became pretty fair at SQL and others used to bring me random SQL problems for help. Most of these were straightforward enough problems, but occasionally I would get one that was really difficult, and I really enjoyed those. Last night a current colleague of mine emailed me about a problem he was having doing some integration work and the solution was a set of SQL Server features that I think deserve to be talked about more: common table expressions (CTE) and windowing functions. There’s a lot of power there and it can be overwhelming so the point of this post is to simply describe the problem and then illustrate how CTE and windowing functions made the solution pretty trivial.

Here’s the problem: we have a table of data about employment applicants (i.e. people who have applied for jobs) in a legacy system. We’re moving to a new system so we need to take the data from the first system and transform it into a format required by the second system so it can be imported.

First, here’s the data in the legacy system in a single table (greatly stripped down for illustrative purposes, the real one has about 250+ columns):

NamePhoneEyeColorPositionIDTitle
Franklin1212Blue123General Manager
Franklin1212Blue67Salesman
Pierce1313Blue234Digger
Pierce1313Blue456Porter
Hoolihan1414Green123General Manager
Bob1515Blue86Maid
Bob1515Blue90Electrician

We’ll call Name, Phone, and EyeColor the “key” fields. That’s a bit of a misnomer since these keys don’t uniquely identify a single record, but I’m using the term simply to mean that those values are the same for a single person. We need to sequentially number each record with the same key (and we’ll call it PositionCount). Franklin has two records so the first is 1 and the second is 2, but Hoolihan has only a single record so its sole record is 1. We only want to print the key fields for the first position record; otherwise we want to print blanks for the key fields, but still print the position data. Here’s how we want the data to look for loading into the target system.

NamePhoneEyeColorPositionIDTitlePositionCount
Franklin1212Blue123General Manager1
67Salesman2
Pierce1313Blue234Digger1
456Porter2
Hoolihan1414Green123General Manager1
Bob1515Blue86Maid1
90Electrician2

The first part to tackle: how can we determine the position counts? Most people can instantly think of an imperative solution (just loop through the records, adding an incrementing value to that column, resetting for each person), but in SQL we want to think relationally. A Group By would let us group the records by our keys, but at best we could obtain a count of the records in the group, not quite what we want. It’s here that a windowing function really shines.

At the most basic level, a windowing function lets us define a partition (which is similar to a group) against which we can execute some statement. Below is the SQL that will do the job for me:


select *, row_number() over (partition by [Name] order by [Name] asc) as [PositionCount] from Persons)

Probably not terribly hard to understand, but the OVER function is the windowing function. We instruct it to partition our set based on Name, then order it by Name (which is required), then apply SQL Server’s row_number() function within each partition. This will have the effect we want: the first record in each partition will get the value 1, the next 2, and so on, but the counts reset for each partition. By modifying the partition columns, you can control to what level the records are numbered. See the documentation on the OVER function on MSDN to learn more. The shown SQL is now capable of providing this result set:

NamePhoneEyeColorPositionIDTitlePositionCount
Franklin1212Blue123General Manager1
Franklin1212Blue67Salesman2
Pierce1313Blue234Digger1
Pierce1313Blue456Porter2
Hoolihan1414Green123General Manager1
Bob1515Blue86Maid1
Bob1515Blue90Electrician2

From here, what we want to achieve is conceptually pretty easy: we only want to print the value of the key fields when PositionCount is 1, otherwise we want to print a blank (or null). We could of course make a new table (either temporary or real) and put our newly formed result set in that, then operate against it, but a common table expression (a CTE) is for me faster to develop (albeit with some caveats I’ll cover at the end).

What is a common table expression (CTE)? Without cribbing directly from Microsoft on the subject, it’s in many ways an in-memory view. You provide it with a name and you can thereafter use that name for the result set, same as you would a query. The CTE can be used recursively in other queries and combined in some very powerful ways. They are incredibly useful if you’ll be referencing the same table or result set multiple times within a query as you can apply a name to that set and just use the name. And due to how they’re defined, they make queries a lot easier to read as they basically build from the top down in an organized fashion, which is genearlly far easier to decipher than a query with a lot of nested queries.

A CTE is defined in either of the forms:


WITH (CTE Name> AS (<query definition>) 

WITH <CTE Name> (<column names>) AS (<query definition>)

The first form defines a CTE with the given name and its columns will be all the columns from the query definition. The second allows you to specify the column names you want available in the CTE; these are drawn from the query definition. Only those columns defined in the CTE are then avaiable. So what’s the point of the CTE? You can then use it like a table:


with PositionsCounted as (select *, row_number() over (partition by [Name] order by [Name] asc) as [PositionCount] from Persons)
select * from PositionsCounted

The above SQL is equivalent to just executing the query definition directly, all we’ve really done at this point is given the result set a name we can operate against. But for the final piece, we can do this to produce our final result set:


with PositionsCounted as (select *, row_number() over (partition by [Name] order by [Name] asc) as [PositionCount] from Persons)
select case when [PositionCount]>1 then '' else [Name] end,
	   case when [PositionCount]>1 then '' else [Phone] end,
	   case when [PositionCount]>1 then '' else [EyeColor] end,
	   [PositionID], [Title], [PositionCount] 
	   from PositionsCounted

The trick to this SQL is that for each of our key fields, we inspect the value of PositionCount and that determines what we should output (either the value of the column or a blank). Otherwise, we simply want the non-key fields returned as-is. Obviously this approach is difficult to scale if there are hundreds of key fields (as each would have to be added manually as a case expression), but in this case when there’s only a minimal number of key fields, it’s trivial.

CTEs can be combined as well, such as the below, which simply puts the output of the first CTE into the second, a trivial but hopefully illustrative example:


with PositionsCounted as (select *, row_number() over (partition by [Name] order by [Name] asc) as [PositionCount] from Persons),
OutputReady as 
(select case when [PositionCount]>1 then '' else [Name] end [Name],
	   case when [PositionCount]>1 then '' else [Phone] end [Phone],
	   case when [PositionCount]>1 then '' else [EyeColor] end [EyeColor],
	   [PositionID], [Title], [PositionCount] 
	   from PositionsCounted)
select * from OutputReady

So we’ve seen some benefits of CTE, what are the downsides? First, they are only in scope for a single query. If you want to reuse them, they have to be copied and pasted into all places that need to reference them. If you have such a need, temp tables or something more permanent (or semi-permanent) is a better bet. Second, because they do not exist in tempdb, you cannot add indexes or anything that would improve the performance (and constraints and stats are likewise not available). Again, if that’s your need, a temp table would be more in order.

I hope you’ve enjoyed this basic look at common table expressions and windowing functions in SQL Server. This has only scratched the surface of these features so I definitely suggest working through some examples to get a feel for how and when each can be applied in real life situations.


Introduction to Custom Specimen Builders in AutoFixture

7 March 2015

It is no secret at work that I greatly admire Mark Seemann. Anyone that comes to me and asks about software design I steer them right to his book “Dependency Injection in .NET”. In some respects, the title does it a bit of a disservice as at least half of it discusses some critical component design practices that apply regardless of whether one wants to use dependency injection. Don’t get me wrong, he covers dependency injection quite comprehensively, but there’s a lot more to DI than just picking a container. Beyond his book, Mark is an incredibly accessible guy in a couple respects. One, he blogs pretty regularly and has a lot of incredible posts on design, I find myself going back to them again and again. And two, he is the author of AutoFixture, which needs almost no introduction in the .NET world. And it is this library that is the subject of this post.

When I explain AutoFixture to other developers, I start by emphasizing its most straightforward use: when writing unit tests, it will provide you with test data with almost no setup, as if by magic. Whether you needs integers or strings or collections or even complex objects, AutoFixture will provide them to you in a semi-random fashion. Within the AutoFixture codebase, the components that handle creation of some value are termed specimen builders and they inherit ISpecimenBuilder. A few common examples:

  • Int32Generator is the specimen builder that generates integers, beginning at 1 and incrementing by 1 for each subsequent integer that the fixture creates.
  • StringGenerator generates strings and the default randomization is to append a GUID to the name of the property or field
  • RandomRangedNumberGenerator generates a random number within a given range, but doesn’t repeat numbers in that range until all values have been used (shameless plug: I submitted this to AutoFixture)

And of course there are plenty more beyond that that do a variety of things. Mark has an older but still valid post from 2010 on how to approach creating a custom specimen builder, but I think there’s some nuances to it that aren’t entirely clear to new users of AutoFixture. I know when I introduced AutoFixture at work we used it in a sub-optimal manner because it wasn’t always clear how to best customize it for certain situations and I want to save others some effort in the future since there isn’t always time to dig into new libraries like one would hope. This in no way takes away from how incredible AutoFixture is or how responsive Mark and others (Adam Chester and Nikos Baxevanis among them) are to questions about it on Stack Overflow or elsewhere, just my attempt to contribute some first-hand experience and perspective.

The ISpecimenBuilder interface has the following signature (taken directly from the AutoFixture codebase):

	public interface ISpecimenBuilder
    {        
        object Create(object request, ISpecimenContext context);
    }

Here, request is something that needs to be created (and I’ll come back to that in a minute). The context can be used if necessary to generate other values from the fixture, but is not always needed. For example, Int32Generator has no need of the context as it contains all the knowledge for satisfying its creation goals. But suppose you’re creating a specimen builder for a complex type and in the course of that, you want AutoFixture to provide some value (of any type)? In that case you can use the context to obtain it (I’m intentionally omitting why you get ISpecimenContext and not IFixture).

Request is ambiguously typed as object - what is it? Typically, request is one of:

  • A Type
  • A PropertyInfo
  • A ParameterInfo

And that brings us finally to the point of this post: why on earth is that and what does it mean for us trying to create a custom specimen builder?

We’ll start with the most obvious, Type. At the most basic level, use of AutoFixture generally looks like:

	var fixture = new Fixture();            
    var contact = fixture.Create<Contact>();

We create a Fixture, then we ask it for the type Contact, which is defined as:

	 public class Contact
    {
        public Contact(string personId)
        {
            PersonId = personId;
        }

        public string PersonId { get; private set;  }
    }

AutoFixture now attempts to satisfy the request for a contact by running it through a variety of specimen builders in an implementation of the powerful chain of responsibility pattern. The chain in this case is (among other things) a collection of specimen builders that individually assess whether they are capable of providing a value of type Contact. The request to create it falls through each builder in the chain until one finally returns an actual Contact (each returns a NoSpecimen if it cannot handle the request). But what does it mean to create a Contact?

As shown, AutoFixture will find that Contact has one constructor and that this constructor takes a string. So AutoFixture will automatically attempt to resolve creation of a string, then use that to create the Contact via its constructor. This would be true regardless of how many dependencies the constructor had or their type. And if there are multiple constructors, AutoFixture selects among them (by default, taking the one with the fewest parameters). So from that view, AutoFixture creating an instance of an arbitrary complex type is simply the creation of any dependencies of that type until the entire graph is constructed.

Now let’s say for the sake of argument that PersonId, despite being typed as a string, is actually numeric. We’ve already seen earlier that AutoFixture, by default, will make strings whose value are the property name and a GUID so we know that the default is not going to work for us here. The first and easiest approach to this is:

	var fixture = new Fixture();
    fixture.Register(() => "12345678");
    var contact = fixture.Create<Contact>();

This solves one problem (PersonId on the Contact will be “12345678”) except this will cause AutoFixture to use “12345678” for every string it creates. So we’ve overreached a bit here. Of course we could use some method that generates a random integer and uses that instead of a hard-coded one, but again, it will use that random numeric string everywhere. At some point, it might occur to us “What if we could customize the creation of Contact as a whole instead of at the string level?” Enter a custom specimen builder.

	public class ContactGenerator : ISpecimenBuilder
    {
        public object Create(object request, ISpecimenContext context)
        {
            var type = request as Type;

            if (type == null)
                return new NoSpecimen(request);

            if (type != typeof(Contact))
                return new NoSpecimen(request);

            return new Contact("12345678");
        }
    }

So here, what is request? Well remember, we’re asking the Fixture to resolve Contact for us and our specimen builder’s purpose is to create an entire Contact so request would be a Type. But it’s not just any arbitrary Type, our builder would only care about Contact. The builder above follows a standard form: attempt to cast request into what it’s able to operate on, ensure that we did indeed get a request we can handle (returning NoSpecimen if we can’t), and then create our desired instance. In the above, we’re hardcoding “12345678” as the parameter to the Contact constructor. What if we want AutoFixture to make a random integer for us to use for that purpose? Simple enough to use the context to resolve that for us:

	return new Contact(context.Create<int>().ToString());

Now we’re able to create a Contact with a properly numeric PersonId whose value is being provided by the fixture’s normal integer rules. The final step is to tell AutoFixture about the specimen builder:

	fixture.Customizations.Add(new ContactGenerator());

Now you have a specimen builder capable of constructing a given type for you and perhaps that is sufficient for your needs. But let’s say Contact gets a bit more interesting:

	public class Contact
	{
		public Contact(string personId, string firstName, string lastName, int level)
		{
			/// elided
		}
		
		public string PersonId { get; private set; }
		public int Level { get; private set; }
		public string FirstName { get; private set; }
		public string LastName { get; private set; }
	}

So here our contact has a PersonId (using the same numeric string rules as before), a level (an integer from 1 to 3), and first name and last name. Let’s assume that names are not particularly interesting, but there are valid reasons to use different levels in your testing, but again, they must be constrained within a specified range.

We can let AutoFixture just provide us with random strings for the names, but we cannot simply ask it for an integer for level as we must constrain it to a certain range. How can we support that? Of course one way is to extend our ContactGenerator class to now include rules for constraining levels. That might look like:

	 return new Contact(context.Create<int>().ToString(),
                               context.Create<string>(), context.Create<string>(),
                               new Random().Next(1, 3));

This seems to work fine, but something about it feels a little off. For one, we’ve got a lot of rules about our system and its valid inputs buried within this builder instead of having them be explicit. Sometimes, this suggests the need to make a concept more explicit, possibly by introducing PersonId and Level as value objects that properly communicate our intent. Let’s say we introduce such types as shown below:

	public class PersonId
    {
        public PersonId(string personId)
        {
            if (!personId.IsNumeric())
                throw new ArgumentException("personId");

            this.Id = personId;
        }

        public string Id { get; private set; }
    }

    public class ContactLevel
    {
        public ContactLevel(int level)
        {
            if (level <= 0 || level > 3)
                throw new ArgumentException("level");

            this.Level = level;
        }

        public int Level { get; private set; }
    }

That changes our Contact constructor to:

	public Contact(PersonId personId, string firstName, string lastName, ContactLevel level)

Now asking AutoFixture to resolve Contact will cause it to resolve PersonId’ and ContactLevel. But aren’t we back where we started? Again, PersonId must take a numeric string and ContactLevel must take an integer between 1 and 3. Should we extend ContactGenerator to include that logic? In my opinion, definitely not. ContactGenerator doesn’t have any special rules around the PersonId and ContactLevel it can take, it just requires that it receive instances of each so imbuing it with the rules for creating a Contact don’t really make sense. Of course we can create additional specimen builders for ContactLevel and PersonId and put the rules for each in those and that will work fine, but there is another main type of specimen builder that we need to discuss and that will be the subject of a future post.