Implementing high-performance multiple sort-rules query in SQL

Here I will talk how to execute multiples order rules according to the row index in the same query

Hello everyone! I will talk about SQL in this article, more precisely on how to have two completely different order rules for your returning records according to the index. In my case, I was using PostgreSQL, but this will work in any database. You only need to do some small changes according to your database.

So, let’s get started!

What is the problem?

Recently, I got a peculiar task that asked to add two sort rules. The top 3 rows should be order by one specific rule, let’s name it top_3_rows_order_rule. The rest of the rows, from the fourth onwards, should be ordered by another specific rule, let’s name it rest_of_the_rows_order_rule.

I never wrote any SQL query like that before. Usually, we just have one sort rule that may have multiple conditions that will be applied to all rows, but in this scenario, it’s a little bit different.

Something like this:

SELECT *
FROM <table_name>
ORDER BY
  <table_name>.column_1,
  <table_name>.column_2,
  ...

Sometimes we need to add a CASE or write a more complex ORDER BY statement, but for this task was different.

Just to give an example, imagine we have a employments table with name (text), department (text), and salary (integer) columns.

And we want the top 3 rows to return the top 3 salary where department is development and the rest of the rows to return the name ordered alphabetically.

In this case, we need two completely different logics to order. A SQL query should be like this:

SELECT *
FROM <table_name>
ORDER BY
    CASE WHEN <is_top_3_row>
        THEN <top_3_rows_order_rule>
        ELSE <rest_of_the_rows_order_rule>
    END

At first, you may think this is just writing the two logic like the above sample, but we have a problem here: how to identify the 3 first rows?

What I mean is that we don’t know the row number, so we don’t have a way to figure out this: is_top_3_row.

My first attempt was to use the UNION operator so, let’s try it out.

Using union

You may think that using UNION or UNION ALL may be the best solution. Because you can write two queries:

  1. Create a query for the top 3 rows.
  2. Create another query for the rest of the rows.
  3. Combine both using UNION ALL.

Example:

-- Query for the Top 3 rows
SELECT *
FROM <table_name>
ORDER BY <top_3_rows_order_rule> 
LIMIT 3

UNION ALL

-- Query for the rest of the rows
SELECT *
FROM <table_name>
ORDER BY <rest_of_the_rows_order_rule>

But unfortunately, this will not solve the issue.

I can mention at least two problems here.

Duplicated rows problem

The second query is not ignoring the rows from the first SELECT, and this will lead to duplicated rows.
In other words, the second query (rest_of_the_rows_order_rule) is not ignoring the rows from the first query (top_3_rows_order_rule).

To fix that, you would have to remove the result from the first query in the second query, like:

-- Query for the Top 3 rows
SELECT *
FROM <table_name>
ORDER BY <top_3_rows_order_rule>
LIMIT 3

UNION ALL

-- Query for the rest of the rows
SELECT *
FROM <table_name>
WHERE id NOT IN (
    -- The first query
    SELECT id
    FROM <table_name>
    ORDER BY <top_3_rows_order_rule>
    LIMIT 3
)
ORDER BY <rest_of_the_rows_order_rule>

The problems are:

Ordering rows problem

Another problem is that the UNION or UNION ALL does not ensure that the order will be the same.

In other words, if the first SELECT returns:

  • row 1
  • row 2
  • row 3

And the second SELECT returns:

  • row 4
  • row 5

There is no guarantee that the result will be:

  • row 1
  • row 2
  • row 3
  • row 4
  • row 5

To fix that, you need to add an extra "order" column inside each query. Then, wrap the two queries inside another one and use this "order" column to sort again!

You can check this Stack Overflow answer for more info: https://stackoverflow.com/a/21181766/4538055

The downsides are:

  • We need to add an extra column to sort.
  • Even after ordering each sub-query, we need to order everything again.

Using ROW_NUMBER()

The ROW_NUMBER() window function allows us to create a new column with the number of each row.

Inside this window function, you must provide a ORDER BY, and it will return the row number starting with 1.

You can check more about ROW_NUMBER() here: SQL Server ROW_NUMBER Function

If you want to know more about window function, check this article.

Now, going back to our problem. We can use this window function to solve it.

Let me share the SQL query, and explain each part of it:

-- Query A
SELECT * FROM (
    -- Query B
    SELECT *
        ROW_NUMBER() OVER (
            -- Query B - Order By
            ORDER BY <top_3_rows_order_rule>
        ) AS top_3_order_row_number
    FROM <table_name>
) temp_table
-- Query A - Order By
ORDER BY
    CASE WHEN top_3_order_row_number <= 3 THEN top_3_order_row_number ELSE 4 END,
    <rest_of_the_rows_order_rule>;

Inside of ROW_NUMBER(), we are ordering by top_3_rows_order_rule ( Query B - Order By). So, the top_3_order_row_number column with the values 1, 2 and 3 will have the top 3 rows we want to show first.

Then, we add a SELECT (Query A) around the original SELECT (Query B), because this way we can use top_3_order_row_number column name. In other words, the column top_3_order_row_number is not accessible inside Query B.

The last ORDER BY (Query A - Order By) is where the magic happens:

-- Query A - Order By
ORDER BY
    CASE WHEN top_3_order_row_number <= 3 THEN top_3_order_row_number ELSE 4 END,
    <rest_of_the_rows_order_rule>;

Since we already know that the rows with top_3_order_row_number equals 1, 2, and 3 are the top rows, we just return the top_3_order_row_number if it’s bellow or equals 3, otherwise, we will return 4.

This will pass the values 1, 2, 3 and 4 for ORDER BY, that will order it.

So, if top_3_order_row_number is higher than 3, this rule will be "ignored", since all will have the same value (4):

CASE WHEN top_3_order_row_number <= 3 THEN top_3_order_row_number ELSE 4 END

Then, the rest of the rows will be ordered by another rule:

<rest_of_the_rows_order_rule>;

And that will solve all of our problems! 🚀

Now, let me show an example so you can understand it better.

Test database

I created a simple table so you can see better what is going on and even test locally.

Here is the gist with the SQL to create and populate the database: Gist

This script will add a employments table with the following columns:

  • id -> Auto-increment (integer).
  • name -> The employee name (text).
  • department -> The employee’s department (text).
  • salary -> The employee’s salary (integer).

Using multiple sorts

Let’s say we want to show first the top 3 development salaries and the higher salaries regardless of which department is in.

Using ROW_NUMBER() we can create this SQL Query:

SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            ORDER BY
                CASE department
                WHEN 'development' THEN 1
                ELSE 2
            END,
            salary DESC
        ) AS top_dev_salaries_row_number
    FROM employments
) employments
ORDER BY
    CASE WHEN top_dev_salaries_row_number <= 3 THEN top_dev_salaries_row_number ELSE 4 END,
    salary DESC;

Again, let’s check each part of this query!

We will put the top 3 ORDER BY logic inside ROW_NUMBER():

ROW_NUMBER() OVER (
    ORDER BY
        CASE department
        WHEN 'development' THEN 1
        ELSE 2
    END,
    salary DESC
) AS top_dev_salaries_row_number

First, we are checking the department. If the department is development we are going to return 1, otherwise, 2 so, the ORDER BY will put first the rows where department = 'development'.

This is the same thing we’ve done with CASE WHEN top_dev_salaries_row_number <= 3 THEN top_dev_salaries_row_number ELSE 4 END.

After sorting first the development, we also need to order by salary. Because of that, we added salary DESC.

Then, we have the last ORDER BY:

ORDER BY
    CASE WHEN top_dev_salaries_row_number <= 3 THEN top_dev_salaries_row_number ELSE 4 END,
    salary DESC;

Again, the top_dev_salaries_row_number will have the top 3 order so, we just want to preserve that order. Otherwise, it returns 4 for the rest of the rows.

Then, all the rest of the rows will have the same weight, and the next ORDER BY clause will affect the order: salary DESC.

In other words, the rest of the rows will be ordered by salary DESC.

Avoiding the SELECT wrapper

If you don't want to create an extra SELECT to wrap your original query only to have access to row_number column, I have two news:

The good news is that there is a way to prevent it.
The bad news is that it's not so pretty 😅.

You just need to replace all row_number inside ORDER BY by the actual ROW_NUMBER() logic.

In order words, copy the whole ROW_NUMBER() logic and paste it in ORDER BY:

SELECT *,
    ROW_NUMBER() OVER (
        ORDER BY
            CASE department WHEN 'development' THEN 1 ELSE 2 END,
            salary DESC
        ) AS top_dev_salaries_row_number
FROM employments
ORDER BY
    CASE WHEN
        ROW_NUMBER() OVER (
            ORDER BY
                CASE department WHEN 'development' THEN 1 ELSE 2 END,
                salary DESC
        ) <= 3
    THEN ROW_NUMBER() OVER (
            ORDER BY
                CASE department WHEN 'development' THEN 1 ELSE 2 END,
                salary DESC
            )
    ELSE 4 END,
    salary DESC;

As you can see, the query is a little messy, but it’s good to know you can do that.

Adding more sorting rules

Here I'm using two different sorting rules, but you can use as many as you want to.

You only need to add more ROW_NUMBER() and using them in your final ORDER BY, like:

SELECT * FROM (
    SELECT *
        ROW_NUMBER() OVER (
            ORDER BY <rule_1>
        ) AS row_number_rule_1,
        ROW_NUMBER() OVER (
            ORDER BY <rule_2>
        ) AS row_number_rule_2,
        ROW_NUMBER() OVER (
            ORDER BY <rule_3>
        ) AS row_number_rule_3
    FROM <table_name>
) temp_table
ORDER BY
    CASE WHEN row_number_rule_1 <= 3 THEN row_number ELSE 4 END, -- Rows 1-3 using rule 1
    CASE WHEN row_number_rule_2 <= 2 THEN row_number ELSE 3 END, -- Rows 4-5 using rule 2
    CASE WHEN row_number_rule_3 <= 5 THEN row_number ELSE 6 END, -- Rows 6-8 using rule 3
    <rule_4>; -- Rows 9~ using rule 3

Performance

When I was creating this query in my client against a huge table, I didn’t see any performance issues.

My change was adding the ROW_NUMBER() and the last ORDER BY in an existing query, and the time to execute the query didn’t increase too much.

I also ran EXPLAIN ANALYZE and the cost didn’t increase much.

But if you are adding this logic to a vital part of your application or a huge table, I recommend you check it.

Using our database

Here I will run some queries with EXPLAIN ANALYZE. But keep in mind that this is a really small database, and it didn’t have an "original query" like in my client repository.

What I mean by "original query" is the current query we were using, with a bunch of INNER/LEFT JOINs, WHERE‘s, and custom SELECT‘s columns.

In other words, in my client, I had to add more rules in an existing complexity query, and here we don’t have that. It’s just a simple database to help us to understand the problem.

So, I will share here some queries with EXPLAIN ANALYZE just to have a clue how PostgreSQL will handle it that.

Simple SELECT

EXPLAIN ANALYZE SELECT *
FROM employments;

Simple SELECT + ORDER BY

EXPLAIN ANALYZE SELECT *
FROM employments
ORDER BY salary DESC;

Simple SELECT + ROWS_NUMBER

EXPLAIN ANALYZE SELECT *,
    ROW_NUMBER() OVER (
        ORDER BY
            CASE department
            WHEN 'development' THEN 1
            ELSE 2
        END,
        salary DESC
    ) AS top_dev_salaries_row_number
FROM employments;

Final Query

EXPLAIN ANALYZE SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            ORDER BY
                CASE department
                WHEN 'development' THEN 1
                ELSE 2
            END,
            salary DESC
        ) AS top_dev_salaries_row_number
    FROM employments
) employments
ORDER BY
    CASE WHEN top_dev_salaries_row_number <= 3 THEN top_dev_salaries_row_number ELSE 4 END,
    salary DESC;

As you can see, the "Planning Time" and "Execution Time" increases when we add more logic (obviously), but it didn’t increase drastically our performance.

Again, you must check that in your database with your query to see if this logic increases or not the query’s performance.

Conclusion

I really liked how this order works. In the beginning, I thought this task would be really hard but doing some deep research and mainly, talking with my coworkers help me a lot.

If you have any thoughts or better solutions, let us know 😉!

See you in the next post! 🚀

We want to work with you. Check out our "What We Do" section!