Introduction to SQL Window Functions

Introduction

SQL is one of the most important skills to have as a data professional. With SQL we retrieve data from a database and format the data in a way that is useful for processing or analysis. Very important as a data professional right?

When we start learning SQL we focus on the basics. How to select data, filter data, and do some aggregations. We even learn how to filter on these aggregations. At last we learn how to connect multiple source tables together with join statements. Many stop here as they now poses all the skills to select data from a database and deform it into a certain format useful for further processing or analysis. But there are more ways of doing advanced aggregations in SQL. We can do this with window functions. They can be tricky to understand and therefore many people do want to learn them. But let me tell you that they are one of the most important parts of SQL as they allow you to do one of the most advanced aggregations with a much simpler format. So let's dive deep into what window functions are.

What are window functions?

Window functions, also known as analytics functions, are used to calculate functions on a subset of rows based on the current row. For each row a frame window is determined and a calculation is made based on the rows this frame. This calculation returns a value for each row

When to use?

GROUP BY clause with aggregates can also do calculations over the rows, but the main difference is that they collapse the output value. So, you don't consist of the original rows anymore. This is a problem when you want to calculate multiple functions simultaneously in one query. This is the most important reason to use window functions!

Structure of a window function

Window functions consist of a defined structure where different input parameters can be applied.

window_function ( expression ) 
	over ( 
		partition 
		ordering 
		frame 
	)

So let's start explaining which type of window functions there are available.

Types of Window functions

There are 3 types of window functions that can be used for calculations.

aggregates: these functions aggregate rows in the partition to calculate sum, count, averages, etc. These functions are the same for the group by clause, but do return a single value for each row.

ranking: A single rank value is calculated based on the rows in the window.

Analytical/value: To select values of prior or following rows in the window.

The expression defined in the structure above, is actually the input column that must be applied for the function.

Determine window frame

To determine the subset of rows that will be applied over the window function, we have to define it in the OVER clause. For each row, the OVER clause will be determined and can be different. In the OVER clause we can define a partition, which is a method to group the data. When no partition is set, all the rows in the table will be used for the window.

Partitioning is similar to the group by clause, but in a window function it's called a PARTITION BY clause. When we partition the data we create different smaller groups, and we actually creating different subset of rows where the window function will operate over. This window will slide trough the consecutive rows from the current row the next row. Because of this sliding window, ordering matters. For example, we want to calculate the cumulative sum, this calculation is dependent on the ordering of rows because a mixed partition will lead to incorrect results. We do this ordering with the ORDER BY clause.

As last we have the frame clause. This clause is used to determine the bounds of the window. In simple words, it is used to determine the start and end row for each window. There are two options of defining a window frame clause. You can use the BETWEEN ROW or RANGE frame as they behave somewhat differently. ROWS frame clause is used to determine the rows based on a defined number. In contrast, RANGE clause is used to determine the rows based a on range of values. E.g. ROWS selects all the rows between index 0 and 10. RANGE selects all rows based on the value between 0 and 10. Because this is a one of choice, the frame clause is also sometimes called the rows range clause.

The RANGE or ROWS do not always need a defined value that determines the rows. For example, if you want to select all the rows in the partition prior to the current index, we have to know the start value of the partition. This value also changes as the window slides up trough the rows. There is a special keyword that specifies the beginning of the partition. UNBOUNDED PRECEDING will select all rows in the partition prior to the current row. To select the last row in the partition, we use the keyword UNBOUNDED FOLLOWING. But sometimes we only need data from the start of a partition to the current row and following rows must be discarded. We can get this by using the last special keyword, CURRENT ROW.

Not every window function needs a frame clause. When we discard the frame clause, and the function accept a frame clause, a default window is actually applied. If a ORDER BY clause is specified the frame is defined by RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, meaning all the values between the current and the prior rows. Otherwise the frame is defined as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, which means select all the rows from start to end in the partition.

Know we have a good understanding of the anatomy of a window function, let's practice with an example.

Figure

Below a visual example how window functions work.

Example with prior and following rows

In our example we want to calculate the sum of 1 row prior to the current row and 1 row following the current row per colour. We also want to do this ordered by the price. Know we have to select the first input column as this is the value that will be summed. So know we have a window function and the expression. The next step is to determine how to select the rows in the window. We do this by defining an OVER clause. As we would like to have the summed value per colour, we need to partition our data by the column colour, and because we want to order it by the price, we have to add the price as our ordering column. The last step is to determine the bounds of the window. In our example we want the one prior and one following row based on the current index. The bounds of the window is a defined number and not a range, so we have to select the ROW frame clause. In the frame we have to select 1 preceding and 1 following row.

If we want to specify the query, we would have something like:

SUM( price ).OVER( 
	PARTITION BY colour 
	ORDER BY price 
	BETWEEN 1 PRECEDING AND 1 FOLLOWING 
)

Very easy right?

What if we want to have the cumulative sum of rows. It's almost the same as the example above, but instead of having following 1 row, we want all the rows in prior to the current row. We also do not want to have following rows after the current row. The only difference between the examples is in the frame clause.

If we want to select all the rows prior the current row we have to use UNBOUDED PRECEDING clause. To select the current row we have to use the CURRENT ROW clause. Specified:

SUM( price ).OVER( 
	PARTITION BY colour 
	ORDER BY price 
	BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 
)

If we create a full SQL statement we have something like:

SELECT
	*,
	SUM( price ).OVER( 
		PARTITION BY colour 
		ORDER BY price 
		BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 
	)
FROM table

Conclusion

As we can see from our example, advanced calculations can be applied using a mix of a variety of functions, partitions, and row bounds. To individually change these inputs as parameters, we can create a complex calculations with a simple query syntax.

The examples are not complete, so I challenge you to start experimenting with different window functions and frame clauses to see how it affects the output.

Let me know if the diagram helps! And, please share your views or suggestions in the comment section.