CockroachDB Range Splitting Using ntile()

This blog post shows how to use value-based splitting to manually split tables into ranges of approximately equal size when the data in a CockroachDB table is unevenly distributed. This is an advanced technique: CockroachDB automatically manages ranges, so you usually don’t need to take manual control. Further, even when you do want to take control, there are other approaches that often work better. However, value-based manual splitting has its place, and the goal of this post is to show one technique to answer the question, “What values should I split the table at?”

Along the way we will look at:

  • CockroachDB’s mission and core features
  • How ranges support those features
  • How ranges are automatically split and combined
  • Options to take manual control of range splitting
  • Value-based splitting as one option
  • Identifying the value-based range split points using SQL

Mission and Core Features

CockroachDB’s mission is “To simplify how businesses build and operate world-changing applications.”

To support this mission, CockroachDB provides an interlocking set of capabilities:

  • High availability (seamlessly surviving a configurable level of loss with zero RPO and near-zero RTO),
  • Horizontal scalability (both reads and writes, both in-region and across regions),
  • Multi-region support (providing an elegant global single logical database, while respecting regulatory requirements for where data is stored),
  • Always-consistent fully-ACID transactions, across both AZs and regions (eliminating obscure application errors caused by databases that use “eventual consistency”),
  • SQL query language, for application simplicity, expressive power, and to leverage ecosystem tools and developer skillsets,
  • Built-in change-data-capture, to support modern event-driven application architectures using an efficient push model instead of polling,
  • Low-friction deployment with low TCO, on-prem and in the cloud (including multi-cloud!), both self-hosted and Cockroach Labs-managed solutions, with SOC 2 Type II and PCI DSS compliance available.

These capabilities enable CockroachDB to be the trusted global database to run mission-critical applications for the world’s most important businesses.

Automatic Sharding

One part of the CockroachDB architecture is so fundamental that it touches on most of the above capabilities. Under the hood, CockroachDB automatically shards its data. That is, CockroachDB automatically breaks tables into units called “ranges”. (Further, CockroachDB makes copies of these ranges, called “replicas”, and distributes them among the cluster nodes for both high availability and performance.)

I want to emphasize that CockroachDB does this automatically!

  • You don’t have to manually choose a shard key ahead of time.
  • You don’t have to manually specify a sharding function.
  • You don’t have to manually re-shard when adding or removing nodes.
  • You don’t have to manually direct queries to the nodes where the shards reside.

As you can imagine, CockroachDB constantly manages how ranges are created. It regularly splits large ranges into smaller ranges and combines smaller ranges into larger ranges.

CockroachDB balances a number of goals as it decides to split or combine ranges:

  • Ranges need to be large enough to support efficient computation and minimize the amount of internal metadata that CockroachDB uses to track ranges.
  • Ranges need to be small enough to be efficiently moved between nodes, in response to query load changes or infrastructure failures.
  • Ranges should have balanced query access, to evenly distribute the computation among nodes and fully leverage the computational power of CockroachDB’s multi-node architecture. To do this, CockroachDB can split a range into two ranges that are not equal in size, but have relatively equal query load.

Knobs and Buttons

While CockroachDB does this automatically, it provides manual controls that enable you to influence how CockroachDB splits, replicates, and distributes ranges. Regarding splitting, you can specify things like:

  • Minimum and maximum range size
  • Query load threshold that makes a range eligible for load-based splitting (you can specify this load in terms of either queries per second or CPU usage)
  • Where to split tables into ranges, based on the values of keys in the table

These controls help you increase query performance and reduce “hot spots”. A hot spot is one or more ranges that have higher load than other ranges. It can lead to uneven CPU levels among the nodes in the cluster.
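
For example, load-based splitting is governed by cluster settings. Here is a minimal sketch (the setting names and values below are assumptions based on recent CockroachDB versions; verify them against the docs for your version):

-- Enable load-based splitting (it is on by default).
SET CLUSTER SETTING kv.range_split.by_load_enabled = true;

-- Make a range eligible for splitting once it sustains more than
-- this many queries per second.
SET CLUSTER SETTING kv.range_split.load_qps_threshold = 2500;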

Value-based Splitting

We’re going to dig deeper into manually specifying the split points for a table. I want to emphasize that this is an advanced technique, because, in the majority of situations, other techniques will do a better job and will require less ongoing maintenance. For example, making a table’s ranges smaller is usually a better technique for distributing load and reducing hot spots. That approach lets CockroachDB pick the split points for a table.

However, sometimes it makes sense to pick the split points yourself. This can be useful when you have insight into the distribution of the values in the key space for a table. This might be especially the case when a table is small, and you want to pre-split the table into multiple ranges. Later, as the table grows, CockroachDB will not have to repeatedly split ranges to create new ranges.

The CockroachDB docs describe value-based range splitting, including the ALTER TABLE ... SPLIT AT statement that we will use below.

However, they don’t address the question of “What values should I split the table at?”

One approach is to split the table at values that are equally-spaced in the key space for the table. For example, if a key column has values from 1 to 1000000, you could define 4 ranges that are equally-spaced in key values:

  • 1 to 250000
  • 250001 to 500000
  • 500001 to 750000
  • 750001 to 1000000

Note that these ranges are equally sized in the keyspace. The ranges may still be very different in size from each other if the distribution of values in that keyspace is uneven. That would be the case if, e.g., the number of rows with key values from 1 to 250000 was much higher than the number of rows with key values from 250001 to 1000000. (This assumes all rows are roughly equal to each other in size.)

Using ntile()

A solution to this issue is to let the data itself tell you what the split points should be. The ntile() window function in SQL is useful for this, because it calculates how to break unevenly-distributed values into equally-sized groups.

You get to specify how many groups:

  • four groups are commonly called “quartiles”
  • ten groups are commonly called “deciles”
  • and the familiar statistical value of the “median” is simply the split point when you have just two groups.
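
As a tiny illustration before we get to the real example (a sketch; it uses generate_series as a table expression, which CockroachDB supports), ntile(2) over five rows puts the first three rows in group 1 and the last two in group 2, so the boundary between the groups approximates the median:

-- Split five rows into two nearly equal groups.
SELECT x, ntile(2) OVER (ORDER BY x) AS the_group
FROM generate_series(1, 5) AS g(x);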

Of course, this approach assumes you have enough data to calculate the split points using ntile(). This may be difficult if you are starting out with an extremely small table. But, assuming that is not an issue, here is an example of how to do it:

Example

Note: The following example uses CockroachDB’s built-in command-line SQL client, which you run with the cockroach sql command.

First, we need to create a sample dataset. It needs to have an uneven distribution of values in the keyspace to demonstrate how ntile() creates roughly-equal splits.

Let’s start with the table definition:

CREATE TABLE t (
   part INTEGER NOT NULL,
   id UUID NOT NULL DEFAULT gen_random_uuid(),
   val CHAR(200),
   CONSTRAINT "pk" PRIMARY KEY (part, id));

This table has three columns:

  • part – a part number
  • id – a random UUID
  • val – a payload

The part and id columns together are the primary key. The purpose of part is to give us a value that we will cause to be unevenly distributed. We need the id column to be in the primary key to allow duplicate values for the part column. (Yes, this is a contrived example.) The purpose of the val column is to simply occupy space, so, later, we can see how this space is distributed.

Now let’s populate this table. A simple approach to creating unevenly-distributed values is to have ascending values of part, starting at 1. If we had just one row for each value of part then the values would be equally distributed in the keyspace, but that is not what we want. Instead, for each value of part, we will create that many rows. That is:

  • 1 row with part=1
  • 2 rows with part=2
  • 3 rows with part=3
  • etc.

The following INSERT statement does this:

INSERT INTO t (part, val) (
   WITH rowcounts AS (
      SELECT 
         generate_series(1,1000) AS n
   ), 
   rows AS (
      SELECT rowcounts.n, generate_series(1, rowcounts.n) AS gen 
      FROM rowcounts
   ) 
   SELECT rows.n AS part, repeat((rows.n % 10)::char, 200) AS val 
   FROM rows 
   ORDER BY rows.n
);

So this will insert 1000 different values of part into our test table, but with different numbers of rows for each value of part. Let’s look at the table and verify this:

> SELECT count(*) FROM t;
  count
----------
  500500
(1 row)

So we have well over 1000 rows! Let's look at the first few rows. (Note that I’m using the \x command in the cockroach sql client to change the display format. This is a useful tip: it toggles between the table display format and the records display format. I won’t show the \x command after this, but you can tell from the output when it is in effect.)

 > \x
 > SELECT part, id, left(val, 10) || '...' || right(val, 10) AS val_shortened
-> FROM t
-> ORDER BY part
-> LIMIT 11;
-[ RECORD 1 ]
part          | 1
id            | 76a8a52e-0666-4a8a-ace6-84b232e1adc3
val_shortened | 1111111111...1111111111
-[ RECORD 2 ]
part          | 2
id            | 675928bc-42b7-4260-bd3e-e3b1eeee96ab
val_shortened | 2222222222...2222222222
-[ RECORD 3 ]
part          | 2
id            | f9c18ee7-0aef-49e7-9445-7f4e7b59694f
val_shortened | 2222222222...2222222222
-[ RECORD 4 ]
part          | 3
id            | 1d8d8bf6-b363-40aa-bf76-71b55e8b4314
val_shortened | 3333333333...3333333333
-[ RECORD 5 ]
part          | 3
id            | 3a20167b-1372-4ba2-81ab-4fd1ce9c7b0b
val_shortened | 3333333333...3333333333
-[ RECORD 6 ]
part          | 3
id            | 930a341c-3ccb-487a-bda1-28e23d89549f
val_shortened | 3333333333...3333333333
-[ RECORD 7 ]
part          | 4
id            | 4608e0e4-148b-4215-bd63-03186767493c
val_shortened | 4444444444...4444444444
-[ RECORD 8 ]
part          | 4
id            | 7ccae83a-347e-4585-9a38-e05daeecf6de
val_shortened | 4444444444...4444444444
-[ RECORD 9 ]
part          | 4
id            | 987b4a50-2fb0-47f6-b839-a852d419d00c
val_shortened | 4444444444...4444444444
-[ RECORD 10 ]
part          | 4
id            | e55baf23-44dd-4e65-ac02-22ed3277c1ad
val_shortened | 4444444444...4444444444
-[ RECORD 11 ]
part          | 5
id            | 689b53bc-934a-43ad-8cc1-faaff9655e97
val_shortened | 5555555555...5555555555

As you can see, there is one row with part=1, two rows with part=2, etc.

Let’s verify that the number of rows for each value of part is equal to that value of part:

 > SELECT part, count(part) AS the_count
-> FROM t
-> GROUP BY part
-> ORDER BY part
-> LIMIT 10;
  part | the_count
-------+------------
     1 |         1
     2 |         2
     3 |         3
     4 |         4
     5 |         5
     6 |         6
     7 |         7
     8 |         8
     9 |         9
    10 |        10
(10 rows)

So that verifies the first 10 values of part. What about the last 10 values of part?

 > SELECT part, count(part) AS the_count
-> FROM t
-> GROUP BY part
-> ORDER BY part DESC
-> LIMIT 10;
  part | the_count
-------+------------
  1000 |      1000
   999 |       999
   998 |       998
   997 |       997
   996 |       996
   995 |       995
   994 |       994
   993 |       993
   992 |       992
   991 |       991
(10 rows)

So far so good. And, to be complete, the following query verifies the correct number of rows for all values of part:

 > SELECT part, count(part)
-> FROM t
-> GROUP BY part
-> HAVING part <> count(part);
  part | count
-------+--------
(0 rows)

The above query returns no rows, as expected.

Ranges for the Example Table

We have created our test table, so now let’s see how CockroachDB created one or more ranges for this table — remember it does this automatically.

To do this we use the SHOW RANGES statement. Because the behavior of this statement changed in CockroachDB version 23.1, we will need to change a cluster setting to see the range size and other details. Let’s do that first:

 > SET CLUSTER SETTING sql.show_ranges_deprecated_behavior.enabled = 'FALSE';
SET CLUSTER SETTING

Now let’s use the SHOW RANGES statement to see the ranges for our test table:

> SELECT start_key, end_key, range_id, range_size_mb FROM [SHOW RANGES FROM TABLE t WITH DETAILS];
   start_key   |   end_key    | range_id |     range_size_mb
---------------+--------------+----------+------------------------
  …/<TableMin> | <after:/Max> |      148 | 132.27215300000000000
(1 row)

Only one row was returned, which means the entire test table is stored in just one range. This range is about 132 MiB, as you can see from the range_size_mb column of the output above.

By default, CockroachDB only splits a range once it grows to about 512 MiB, so our 132 MiB table fits in a single range. That may be OK, but having just one range for a table can be a source of performance issues. Even though the table is relatively small, most queries against this table (more specifically, all writes and all reads of the current values in this table) will be automatically directed by CockroachDB to a single node, which is the node that is currently responsible for this range.
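
If you want to confirm the size thresholds in effect for a table, you can inspect its zone configuration. A minimal sketch (on recent versions; the output includes range_min_bytes and range_max_bytes, and the default range_max_bytes is 536870912 bytes, i.e. 512 MiB):

-- Show the zone configuration that applies to the table.
SHOW ZONE CONFIGURATION FROM TABLE t;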

This can cause what is called a “hot spot”. A hot spot can cause the CPU for one node in the CockroachDB cluster to be higher than the CPU for other nodes in the cluster. It can also limit the performance of queries on this table to the computing power of a single node, instead of being able to leverage the power of multiple nodes working concurrently. CockroachDB has a sophisticated distributed query planner that tries to spread the workload around multiple nodes. But, in this case, with a single range, the query planner cannot create distributed query plans for this table.

A typical solution to hot spots is to have more ranges, so the data can be distributed among more nodes. One way to achieve this is to lower the maximum range size for the table, so that CockroachDB automatically splits it into more, smaller ranges. This is “size-based splitting”.

That approach works well if the table will stay small indefinitely, but if the table grows larger over time, this can result in CockroachDB creating too many small ranges. That can have a different kind of negative performance impact, because CockroachDB would have to manage more metadata to keep track of many ranges.
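
As a sketch of what size-based splitting looks like (the byte values here are illustrative assumptions, not recommendations), you could lower the table's size thresholds via its zone configuration:

-- Hypothetical example: cap this table's ranges at 64 MiB.
-- range_min_bytes must remain below range_max_bytes.
ALTER TABLE t CONFIGURE ZONE USING
   range_min_bytes = 16777216,  -- 16 MiB
   range_max_bytes = 67108864;  -- 64 MiB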

We’re going to use a different approach: value-based range splitting. Value-based range splitting will also create more ranges, which is good to avoid a hot spot. But, unlike size-based splitting, CockroachDB will let the ranges grow larger.

Simple Value-Based Range Splitting

Suppose we want to split this table into 4 ranges. Since the key space goes from 1 to 1000, a simple approach is to split the keyspace into even divisions, that is, to split at:

  • 250
  • 500
  • 750

We can do this with ALTER TABLE ... SPLIT AT, using a VALUES clause:

 > ALTER TABLE t SPLIT AT VALUES (250), (500), (750);
       key       | pretty |    split_enforced_until
-----------------+--------+-----------------------------
  \xf67e89f6fa   | /250   | 2262-04-11 23:47:16.854776
  \xf67e89f701f4 | /500   | 2262-04-11 23:47:16.854776
  \xf67e89f702ee | /750   | 2262-04-11 23:47:16.854776
(3 rows)

The output shows that CockroachDB split the table at three places, to split it into four ranges. Let’s get more information on those ranges, using the SHOW RANGES statement:

 > SELECT start_key, end_key, range_id, range_size_mb FROM [SHOW RANGES FROM TABLE t WITH DETAILS];
   start_key   |   end_key    | range_id |     range_size_mb
---------------+--------------+----------+------------------------
  …/<TableMin> | …/1/250      |      148 | 8.1815570000000000000
  …/1/250      | …/1/500      |      164 | 24.720681000000000000
  …/1/500      | …/1/750      |      165 | 41.225520000000000000
  …/1/750      | <after:/Max> |      166 | 58.144395000000000000
(4 rows)

This shows that the ranges have very different sizes. They are:

  • 8.2 MiB
  • 24.7 MiB
  • 41.2 MiB
  • 58.1 MiB

This reflects how we created the table, with different numbers of rows for different parts of the keyspace.

But we want approximately equal-size ranges, so the simple approach we just used does not work.

A Better Approach

To create equal-size ranges, we need to pick the split points better. Fortunately, SQL provides a function to help us do just that: ntile().

Let’s again split the table into four ranges, so we will be using ntile(4). This tells us which quartile each row belongs in, based on the value of part. To see ntile(4) in action, look at these two queries:

 > SELECT
->    part,
->    ntile(4) OVER (ORDER BY part) AS my_ntile
-> FROM t
-> ORDER BY part
-> LIMIT 10;
  part | my_ntile
-------+-----------
     1 |        1
     2 |        1
     2 |        1
     3 |        1
     3 |        1
     3 |        1
     4 |        1
     4 |        1
     4 |        1
     4 |        1
(10 rows)

This tells us that the low values of part are in quartile 1, as expected.

 > SELECT
->    part,
->    ntile(4) OVER (ORDER BY part) AS my_ntile
-> FROM t
-> ORDER BY part DESC
-> LIMIT 10;
  part | my_ntile
-------+-----------
  1000 |        4
  1000 |        4
  1000 |        4
  1000 |        4
  1000 |        4
  1000 |        4
  1000 |        4
  1000 |        4
  1000 |        4
  1000 |        4
(10 rows)

This tells us that the high values of part are in quartile 4, also as expected.

That is helpful, but it does not give us the split points, which are the places where:

  • quartile 1 changes to quartile 2
  • quartile 2 changes to quartile 3
  • quartile 3 changes to quartile 4

We can use the row_number() function to order the rows within each quartile. Later, this will let us find the largest value of part in each of quartiles 1, 2, and 3. Here are a couple of examples. They use the above query in a CTE (common table expression). The first query shows the row numbers for the first 10 rows in quartile 2:

 > WITH cte_ntiles AS (
->    SELECT       
->       part,                             
->       ntile(4) OVER (ORDER BY part) AS my_ntile
->    FROM t                
-> )                                                
-> SELECT    
->    part,                              
->    my_ntile,
->    row_number() OVER (PARTITION BY my_ntile ORDER BY part) AS my_row_number
-> FROM cte_ntiles                              
-> WHERE my_ntile = 2
-> ORDER BY my_row_number        
-> LIMIT 10;
  part | my_ntile | my_row_number
-------+----------+----------------
   500 |        2 |             1
   500 |        2 |             2
   500 |        2 |             3
   500 |        2 |             4
   500 |        2 |             5
   500 |        2 |             6
   500 |        2 |             7
   500 |        2 |             8
   500 |        2 |             9
   500 |        2 |            10
(10 rows)

The next query shows the row numbers for the last 10 rows in quartile 2. It shows that there are 125125 rows in quartile 2:

 > WITH cte_ntiles AS (
->    SELECT        
->       part,                             
->       ntile(4) OVER (ORDER BY part) AS my_ntile
->    FROM t                
-> )                                                
-> SELECT    
->    part,                               
->    my_ntile,
->    row_number() OVER (PARTITION BY my_ntile ORDER BY part) AS my_row_number
-> FROM cte_ntiles                            
-> WHERE my_ntile = 2
-> ORDER BY my_row_number DESC 
-> LIMIT 10;
  part | my_ntile | my_row_number
-------+----------+----------------
   707 |        2 |        125125
   707 |        2 |        125124
   707 |        2 |        125123
   707 |        2 |        125122
   707 |        2 |        125121
   707 |        2 |        125120
   707 |        2 |        125119
   707 |        2 |        125118
   707 |        2 |        125117
   707 |        2 |        125116
(10 rows)

Note: As you may have noticed, in the interest of simplicity, we are ignoring the id column and focusing just on the part column. As an “exercise for the reader”, you can tweak these examples to take both the part and id columns into account when creating ntiles and when sorting.
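
For instance, a minimal sketch of that variation (not run here) would order the window function by both columns:

-- Hypothetical variant: break ties on part by also ordering on id,
-- so rows with the same part value are ordered deterministically.
SELECT
   part,
   id,
   ntile(4) OVER (ORDER BY part, id) AS my_ntile
FROM t;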

Now we can build on the above code to find the values of part at the split points:

 > WITH cte_ntiles AS (
->    SELECT  
->       part,                    
->       ntile(4) OVER (ORDER BY part) AS my_ntile        
->    FROM t       
-> ),                                      
-> cte_rank_by_ntile AS (
->    SELECT                        
->       part,                                        
->       my_ntile,
->       row_number() OVER (PARTITION BY my_ntile ORDER BY part) AS my_row_number
->    FROM cte_ntiles                                        
-> )                  
-> SELECT                           
->    part,
->    my_ntile,              
->    my_row_number,                                 
->    max(my_row_number) OVER (PARTITION BY my_ntile) AS my_max
-> FROM cte_rank_by_ntile             
-> ORDER BY part, my_ntile, my_row_number                   
-> LIMIT 10;           
  part | my_ntile | my_row_number | my_max
-------+----------+---------------+---------
     1 |        1 |             1 | 125125
     2 |        1 |             2 | 125125
     2 |        1 |             3 | 125125
     3 |        1 |             4 | 125125
     3 |        1 |             5 | 125125
     3 |        1 |             6 | 125125
     4 |        1 |             7 | 125125
     4 |        1 |             8 | 125125
     4 |        1 |             9 | 125125
     4 |        1 |            10 | 125125
(10 rows)

We can now use the above query to find the split points. These are the values of part where my_row_number is the same as my_max.

 > WITH cte_ntiles AS (
->    SELECT
->       part,          
->       ntile(4) OVER (ORDER BY part) AS my_ntile
->    FROM t
-> ),                           
-> cte_rank_by_ntile AS (       
->    SELECT
->       part,                       
->       my_ntile,
->       row_number() OVER (PARTITION BY my_ntile ORDER BY part) AS my_row_number      
->    FROM cte_ntiles                           
-> ),    
-> cte_adding_max_rank AS (   
->    SELECT                                             
->       part,    
->       my_ntile,                        
->       my_row_number,
->       max(my_row_number) OVER (PARTITION BY my_ntile) AS my_max                              
->    FROM cte_rank_by_ntile                          
-> )          
-> SELECT                             
->    part                                                    
-> FROM cte_adding_max_rank
-> WHERE my_row_number = my_max AND my_ntile < 4
-> ORDER BY my_ntile;
  part
--------
   500
   707
   866
(3 rows)

We have found the split points! (Note this assumes we are OK with splitting solely on the value of part. As mentioned above, for simplicity, we are ignoring the id column. The resulting ranges will be a little less equal in size to each other, but not by much.)

We are almost ready to use the above query to tell CockroachDB to split the test table again, at the split points we computed. But first we have to remove the previously specified split points. There are a couple of ways to do this. One is the ALTER TABLE t UNSPLIT ALL statement. However, CockroachDB will not immediately combine the ranges back into a single range; it may take its time doing so. There are ways to force it to do so sooner but, for this example, it is much simpler to drop the table, re-create it, and re-populate it. In other words, we repeat what we did above:

 > DROP TABLE t;  

[...]

 > CREATE TABLE t (

[...]

 > INSERT INTO t (part, val) (

[...]

We are now back to having our test table in a single range:

 > SELECT start_key, end_key, range_id, range_size_mb FROM [SHOW RANGES FROM TABLE t WITH DETAILS];
   start_key   |   end_key    | range_id |     range_size_mb
---------------+--------------+----------+------------------------
  …/<TableMin> | <after:/Max> |      175 | 132.29213400000000000
(1 row)

We are finally ready to use the above query to tell CockroachDB to split the test table again, at the split points we computed. Note that, instead of listing the split points explicitly, we can provide a SELECT statement:

 > ALTER TABLE t SPLIT AT
-> WITH cte_ntiles AS (     
->    SELECT                                         
->       part,
->       ntile(4) OVER (ORDER BY part) AS my_ntile
->    FROM t                                                  
-> ),                  
-> cte_rank_by_ntile AS (                         
->    SELECT
->       part,                    
->       my_ntile,                                        
->       row_number() OVER (PARTITION BY my_ntile ORDER BY part) AS my_row_number 
->    FROM cte_ntiles                      
-> ),        
-> cte_adding_max_rank AS (   
->    SELECT                                        
->       part,
->       my_ntile,                   
->       my_row_number,                                      
->       max(my_row_number) OVER (PARTITION BY my_ntile) AS my_max
->    FROM cte_rank_by_ntile                      
-> )    
-> SELECT                      
->    part                                             
-> FROM cte_adding_max_rank
-> WHERE my_row_number = my_max AND my_ntile < 4
-> ORDER BY my_ntile;                                       
       key       | pretty |    split_enforced_until
-----------------+--------+-----------------------------
  \xf67e89f701f4 | /500   | 2262-04-11 23:47:16.854776
  \xf67e89f702c3 | /707   | 2262-04-11 23:47:16.854776
  \xf67e89f70362 | /866   | 2262-04-11 23:47:16.854776
(3 rows)

Now we can see the ranges that were created at these split points:

 > SELECT start_key, end_key, range_id, range_size_mb FROM [SHOW RANGES FROM TABLE t WITH DETAILS];
   start_key   |   end_key    | range_id |     range_size_mb
---------------+--------------+----------+------------------------
  …/<TableMin> | …/1/500      |      179 | 32.902092000000000000
  …/1/500      | …/1/707      |      180 | 32.959506000000000000
  …/1/707      | …/1/866      |      181 | 33.000005000000000000
  …/1/866      | <after:/Max> |      182 | 33.430587000000000000
(4 rows)

This shows that the ranges are now similar in size! They are:

  • 32.90 MiB
  • 32.96 MiB
  • 33.00 MiB
  • 33.43 MiB

Voilà!

If you use this technique, be sure to monitor the size of the ranges and re-split as necessary. And, when the overall size of the table gets large enough, consider removing all manual splitting and letting CockroachDB handle the splitting automatically.
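
A simple way to keep an eye on this is to periodically re-run the SHOW RANGES query we used above, for example:

-- Check whether the manually split ranges have drifted apart in size.
SELECT range_id, start_key, end_key, range_size_mb
FROM [SHOW RANGES FROM TABLE t WITH DETAILS]
ORDER BY range_size_mb DESC;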

Summary

We have seen how to use value-based splitting to split a table into approximately equal-size ranges when the rows are not equally distributed throughout the keyspace. While manual splitting is an advanced technique that is not the best solution in many situations, it can be a useful tool in your toolbox.
