Beyond creating indexes – Part 1A – query-optimization.com

When we talk about optimizing a bad query, the first solution that comes to mind is creating indexes. In an act of desperation, we create indexes on every single column in the query, hoping to hit the magic key that will reduce execution time by 100x.

Indexes are a great mechanism to speed up queries, but there are many scenarios where they just won’t help. For this post series, I wanted to focus on those cases and explore what alternatives or workarounds we can implement.

Today it is the turn of low-cardinality columns.

What is cardinality and why are we talking about it?

Cardinality (a set theory term) refers to the number of different values a given column has. Let’s say we have a real estate database with a table for properties, where we store which part of the country each property is in: North, South, West, East, North-East, etc. The number of different values for that column will be 8 at most, and it will never go beyond that maximum.

The maximum number of different values for the country_region column is 8:
N, W, E, S, NE, SE, NW and SW

Because of how standard (B-Tree) indexes work, the higher the number of different values in a column, the more efficient using the index is. This is because an index is very good at locating a specific value, and that pays off when the value points to only a handful of rows. However, if that value appears on 10,000 rows, the index will give you back a list of references to those 10,000 rows.
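One rough way to put a number on this is selectivity: the count of distinct values divided by the total row count. The query below is only a sketch, assuming the properties table and country_region column from the example in this post:

```sql
-- Selectivity = distinct values / total rows.
-- Close to 1: almost every row has its own value (index-friendly).
-- Close to 0: many rows share each value (low cardinality).
-- Note: COUNT(DISTINCT ...) ignores NULLs.
SELECT COUNT(DISTINCT country_region)            AS distinct_values,
       COUNT(*)                                  AS total_rows,
       COUNT(DISTINCT country_region) / COUNT(*) AS selectivity
FROM properties;
```

Keep in mind that this reads the whole column, so it can be slow on a large table.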

Going back to our properties table example, let’s say we have the following distribution of regions:

N: 11,000 properties
NE: 20,000 properties
SE: 6,000 properties
S: 8,000 properties
E: 17,000 properties

If I write a query like this:

SELECT property_id FROM properties WHERE country_region = 'E'

it is very likely that the server will decide to use the index, but it will give me back a list of 17 thousand rows! Also, if the query involves columns that are not in the index, the database server needs to visit each row in the list returned by the index to retrieve the additional values.
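To illustrate those extra row visits, here is a sketch under a few assumptions that are not part of the original example: a hypothetical price column, hypothetical index names, an InnoDB table, and property_id being the primary key (InnoDB stores the primary key inside every secondary index).

```sql
-- Hypothetical single-column index on the region.
CREATE INDEX idx_region ON properties (country_region);

-- price is not part of idx_region, so after the index lookup the server
-- still has to read every matching row to fetch it.
SELECT property_id, price FROM properties WHERE country_region = 'E';

-- A composite index that also contains price lets the server answer the
-- same query from the index alone; EXPLAIN then reports "Using index"
-- in the Extra column.
CREATE INDEX idx_region_price ON properties (country_region, price);
```

A covering index like this removes the per-row lookups, but the query still has to walk through all the index entries for 'E', so it does not fix the low-cardinality problem itself.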

To make it more interesting, if the server notices that a given value accounts for a high percentage of the table (the threshold used to be 30%, but now it is computed dynamically), it will avoid the index altogether and scan the whole table instead…

This sometimes confuses people: they see in the execution plan that the index is being used, yet they don’t understand why the query is still slow.
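If you want to check what the server actually decided, EXPLAIN shows the plan without running the query. A minimal sketch, reusing the query from the example above:

```sql
-- Columns worth reading in the output:
--   type : "ALL" means a full table scan; "ref" means an index lookup
--          on a non-unique index.
--   key  : the index the optimizer chose (NULL when no index is used).
--   rows : the optimizer's ESTIMATE of how many rows it must examine.
EXPLAIN SELECT property_id FROM properties WHERE country_region = 'E';
```

A plan that names the index in key but shows a rows estimate in the tens of thousands is exactly the "index is used, query is still slow" situation.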

How to detect low cardinality columns?

If an index already exists on the column, you can just check the SHOW INDEXES IN <table_name> output as follows:

mysql> SHOW INDEXES IN airplane\G
*************************** 1. row ***************************
        Table: airplane
   Non_unique: 0
     Key_name: PRIMARY
 Seq_in_index: 1
  Column_name: airplane_id
    Collation: A
  Cardinality: 5583
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
*************************** 2. row ***************************
        Table: airplane
   Non_unique: 1
     Key_name: type_id
 Seq_in_index: 1
  Column_name: type_id
    Collation: A
  Cardinality: 13
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
2 rows in set (0.00 sec)

You will get a row for each column included in an index. The above output is a simple one, as only two indexes exist (PRIMARY and type_id) and they have one column each. We will look at a more complex output in a minute.

The cardinality for the PRIMARY index (or key, as MySQL calls it) gives you the estimated total number of rows in the table. If there is no PRIMARY or UNIQUE index to use as a reference, you can run SELECT count(*) FROM <table> to obtain that value.

Then we have the cardinality for the type_id column, which is 13. That means that only 13 different type IDs exist across the 5583 rows in the table. Keep in mind these values are estimates and not exact numbers. They are based on the statistics computed by the server and are used to decide which execution plan is better, but we will leave that for another time.
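Since these figures come from stored statistics, they can go stale. As a sketch, using the airplane table from the output above, you can ask the server to recompute them and then read the same values from information_schema, which is handy when you want to filter or sort them:

```sql
-- Recompute the index statistics for the table.
ANALYZE TABLE airplane;

-- Same cardinality estimates as SHOW INDEXES, but queryable.
SELECT index_name, seq_in_index, column_name, cardinality
FROM information_schema.STATISTICS
WHERE table_schema = DATABASE()   -- the currently selected schema
  AND table_name = 'airplane'
ORDER BY index_name, seq_in_index;
```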

Reading cardinality for multi-column indexes

Let’s see what happens when the index has more than one column (I’ve removed a few trivial columns from the output for readability):

mysql> SHOW INDEXES IN weatherdata;
+-------------+------------+----------+--------------+-------------+-------------+
| Table       | Non_unique | Key_name | Seq_in_index | Column_name | Cardinality |
+-------------+------------+----------+--------------+-------------+-------------+
| weatherdata |          0 | PRIMARY  |            1 | log_date    |        4017 |
| weatherdata |          0 | PRIMARY  |            2 | time        |     1191465 |
| weatherdata |          0 | PRIMARY  |            3 | station     |     4764415 |
+-------------+------------+----------+--------------+-------------+-------------+

What the server reports is the aggregated cardinality for that column and the columns that precede it in the index: this means that for log_date there are only 4017 different values, but when combined with time you get 1191465 different values. Following the same logic, the columns log_date, time and station combined produce 4764415 different values.
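If you want to compare those estimates against exact numbers, MySQL accepts several expressions inside COUNT(DISTINCT ...). The query below is a sketch against the weatherdata table shown above. It scans every row, so it will be expensive on a table of this size, and the results may differ somewhat from the estimated figures.

```sql
-- Exact prefix cardinalities, in the same order as the index columns.
-- Rows where any listed column is NULL are not counted (not an issue
-- here, since primary key columns cannot be NULL).
SELECT COUNT(DISTINCT log_date)                  AS log_date_only,
       COUNT(DISTINCT log_date, `time`)          AS log_date_and_time,
       COUNT(DISTINCT log_date, `time`, station) AS all_three_columns
FROM weatherdata;
```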

When no indexes exist for the column

In this case, we can use a few queries to obtain cardinality information, but keep in mind that they will be resource-intensive on large tables.

To obtain the number of different values, you can use DISTINCT:

mysql> SELECT DISTINCT (type_ID) FROM airplane;
+---------+
| type_ID |
+---------+
|       6 |
|      18 |
|      21 |
|      38 |
|      40 |
|      41 |
|      48 |
|      60 |
|      75 |
|     228 |
|     232 |
|     301 |
|     316 |
+---------+
13 rows in set (0.01 sec)
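When you only need the number and not the list of values, a variation like the following returns a single row. It is a sketch against the same airplane table; the column aliases are just illustrative names.

```sql
-- One row: how many different type IDs exist, next to the table size.
-- COUNT(DISTINCT ...) skips NULL values.
SELECT COUNT(DISTINCT type_id) AS distinct_types,
       COUNT(*)                AS total_rows
FROM airplane;
```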

Furthermore, if you want to see how values are distributed, you can use GROUP BY:

mysql> SELECT type_id,count(*) FROM airplane GROUP BY type_id;
+---------+----------+
| type_id | count(*) |
+---------+----------+
|       6 |      466 |
|      18 |      450 |
|      21 |      448 |
|      38 |      430 |
|      40 |      436 |
|      41 |      443 |
|      48 |      400 |
|      60 |      439 |
|      75 |      438 |
|     228 |      443 |
|     232 |      409 |
|     301 |      414 |
|     316 |      367 |
+---------+----------+
13 rows in set (0.00 sec)

Applying what we learned above, if we write a query requesting rows where type_id = 18, the index will be used, but it will return 450 rows, which account for 8% of the table. The lower the percentage, the more efficient using the index is.
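To get that percentage for every value at once, the GROUP BY query can be extended along these lines. This is a sketch against the same airplane table; the aliases are illustrative, and the subquery is used so that it also runs on MySQL versions without window functions.

```sql
-- Share of the table that each type_id accounts for, largest first.
SELECT type_id,
       COUNT(*) AS row_count,
       ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM airplane), 1) AS pct_of_table
FROM airplane
GROUP BY type_id
ORDER BY row_count DESC;
```

Values at the top of this list are the ones for which an index lookup helps the least.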

Wrapping up

Uff.. this topic took longer than I initially expected. I’ll publish part B soon to keep the post size under control. A few takeaways:

Cardinality tells you the number of different values that exist for a given column
Creating indexes on low-cardinality columns may not improve performance greatly. The server might even decide to ignore the index if the value you are asking for appears in a large percentage of the table
If you create an index on multiple columns, the combined cardinality should be considered
You can learn about a column’s cardinality using SHOW INDEXES IN (if an index already exists for the target column) or by running count(*), DISTINCT and GROUP BY as described earlier

By the way, we built a tool that helps you detect performance problems caused by low-cardinality columns, among many others. Try it for free!
