Beyond creating indexes – Part 1A

Sun Over Earth (NASA, International Space Station Science, 11:22:09)

Cardinality (a set theory term) refers to the number of different values a given column has. Let’s say we have a real state database with a table for properties where we store on which part of the country they are in: North , South, West, East, North-East, etc. The number of different values for that column will be 8 at its best and it will never go beyond that maximum.

Properties table

Because of how standard (B-Tree) indexes work, the higher the number of different values for a column, the more efficient using the index is. This is because indexes “index” unique values and they are very good at locating them. However, if that value appears on 10.000 rows, the index will give you back a list of references to those 10.000 rows.

Going back to our properties table example, let’s say we have the following distribution of regions:

  • N: 11.000 properties
  • NE: 20.000 properties
  • SE: 6000 properties
  • S: 8000 properties
  • E: 17.000 properties

If I write a query like this:

it is very likely that the server will decide to use the index, but it will give me back a list of 17 thousand rows!. Also, if the remaining columns involved in the query are not in the index, the database server needs to visit each row in the list returned by the index to retrieve the additional values.

To make it more interesting, if the server notices that a given value accounts for a high percentage of the table (it used to be 30%, but now it is actually dynamic) it will avoid the index altogether

This sometimes confuses people as they see the index is being used in the execution plan, although they don’t understand why the query is still slow.

How to detect low cardinality columns?

If we have indexes created for the column, you can just check the SHOW INDEXES IN <table_name> output as follows:

You will get a row for each column included in an index. The above output is a simple one, as only two indexes exist (PRIMARY and type_id) and they have one column each. We will look at a more complex output in a minute.

The cardinality for the PRIMARY index (or Key as MySQL calls them) will give you the total number of rows for the table. If there are no PRIMARY or UNIQUE indexes to use as reference, you can run SELECT count(*) FROM <table> to obtain that value.

Then we have cardinality for the type_id column, which is 13. That means that only 13 different type IDs exist for the 5583 rows in the table. Keep in mind these values are estimations and not exact numbers. They are based on the statistics computed by the server and are used to decide which execution plan is better, but we will leave that for another time.

Let’s see what happens when the index has more than one column (I’ve removed a few trivial columns from the output for readability)

What the server reports is the aggregated cardinality for that column and the columns above: this means that for log_date, there are only 4017 different values, although when combined with time, you get 1191465 different values. Following the same logic, columns log_date, time and station combined produce 4764415 different values.

In this case, we can use a few queries to obtain cardinality info although keep in mind they will be resource-intensive for large tables.

To obtain the amount of different values you can use DISTINCT

Furthermore, if you want to see how values are distributed, you can use a GROUP BY

Applying what we learned above, if we write a query requesting rows where type_id = 18, the index will be used but it will return 450 rows, which accounts for 8% of the table. The lower the percentage, the more efficient using the index was.

Wrapping up

Uff.. this topic took longer from what I initially expected. I’ll publish part B soon with the idea of keeping the post size under control. A few takeaways:

  • Cardinality tells you the amount of different values that exist for a given column
  • Creating indexes on low-cardinality columns may not improve performance greatly. The server might even decided to ignore the index if the value you are asking for appears on a large percentage of the table
  • If you create an index on multiple columns, combined cardinality should be considered
  • You can learn about columns cardinality using SHOW INDEXES IN (if indexes already exists for the target column) or running count(*), distinct and GROUP BY as described earlier

Leave a Reply

Discover more from query-optimization.com

Subscribe now to keep reading and get access to the full archive.

Continue reading