
On our previous post, we were discussing what were low cardinality columns, why indexing them is not very efficient and how to detect them. Let’s now focus of what can we do about it.
Histograms: a game changer?
In Part A of this post, we said that, with low cardinality columns, values are distributed across a small set of buckets. We used cardinal points as an example but this is also very common in columns used for status, gender, types among others.
We said that an index can still help in these cases and that index efficiency will depend on the how the values are distributed. If I have a status column with 3 different values (let’s say A,B and C), and most rows are in status C, using the index won’t be efficient for the server, as the index will return a list of pointers to almost all table rows. If, on the other hand, we want to retrieve all rows with status A or B, then using the index will make sense to the server.
Furthermore, even if no indexes exist for the low cardinality column, the server could make better decisions (i. e. join order) if it knows beforehand what percentage of the total rows match a given value.
So how do we tell the server that almost all values for status are equal to C? That’s what histograms were introduced for: giving more information to the optimizer about how data is distributed for a column or group of columns so the server can generate more efficient execution plans.
Creating histograms in MySQL / Percona Server
As opposed to MariaDB, MySQL and Percona server allow you to specify the amount of buckets you want to compute. This is useful when you know beforehand how many different values exist for your low cardinality column.
Histograms can also be used to determine the distribution of values for other type of columns like dates. In that case, you can omit the number of buckets and let the server to calculate them based on data range.
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS;
Creating histograms in MariaDB
As mentioned above, the syntax is slightly different for MariaDB the BUCKETS clause us not supported.
ANALYZE TABLE tbl_name PERSISTENT FOR COLUMNS (col1,col2,...) INDEXES (idx1,idx2,...);
If values are evenly distributed or if you need to query data for the values that appear the most, let’s see what else can be done.
We need more filters
When server-level changes are not enough, we need to look at our query and get creative. To speed up a query we should somehow reduce the amount of work required by the server to generate the result. That involves working on two main factors: number of rows and amount of operations (scans, joins, sorts, etc).
Below is a set of checks you can do to detect hidden opportunities:
Is the low cardinality column part of an existing multi-column index?
If it is, we could try to leverage contiguous columns to improve cardinality and obtain less rows from the index. Let’s see how that would work using the table from our previous post:
mysql> SHOW INDEXES IN weatherdata;
+-------------+------------+----------+--------------+-------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Cardinality |
+-------------+------------+----------+--------------+-------------+-------------+
| weatherdata | 0 | PRIMARY | 1 | log_date | 4017 |
| weatherdata | 0 | PRIMARY | 2 | time | 1191465 |
| weatherdata | 0 | PRIMARY | 3 | station | 4764415 |
+-------------+------------+----------+--------------+-------------+-------------+
Above index statistics are telling us that our table has around 4764415 rows. We know this because it is the PRIMARY index and the total cardinality roughly matches the total number or rows (roughly because these statistics are estimations).
The following query will probably use the index and return around 1186 rows (4764415÷4017) on average:
SELECT * FROM weatherdata WHERE log_date = '2023-02-01'
If I add the time column to the query, cardinality will improve around 296X and the query will return 4 rows (4764415÷1191465) on average.
SELECT * FROM weatherdata WHERE log_date = '2023-02-01' AND time = '10:00'
Ok, I know what you are thinking: specifying a single date will probably defeat the purpose of the query. True, but even if we specify a range of dates, our cardinality will greatly improve.
SELECT * FROM weatherdata WHERE log_date = '2023-02-01' AND time BETWEEN '10:00' AND '22:00'
Is it possible to Include other conditions based on indexed columns with better cardinality?
Check other existing indexes: do you see any with better cardinality that the one for our problematic column? Any chances we can include a condition using that indexed column?
When you have two conditions on indexed columns, the server optimizer will use columns cardinality to decide what index to use for that table. Adding a condition on a better-cardinality column or combination of columns will make the server to use that index first, to then filter out any rows not matching the condition for the lower cardinality column.
Is it possible to Include other conditions based on NON-indexed columns with better cardinality?
This strategy works when the table with the low cardinality column is then joined with another table or tables. The concept is the following: the less rows are retrieved from the first table in the join, they less iterations are performed on the second table, and so on.
Adding conditions on non-indexed columns won’t help with table access for that table but it will prevent the problem from spreading to the join operations.
Last resort: split the result
If we really can’t add more conditions or leverage columns of existing multi column indexes, there is one more option to explore: retrieving the result in chunks.
There are cases where the result set is long and it is for a human being to consume, so there is really no point in sending a 10k rows result all at once. For this we can either leverage the LIMIT clause or paginate the result.
LIMIT will cause the query to exit early after the number of requested results is reached. It is not perfect as certain queries (specially those including ORDER BY, GROUP BY, subqueries, etc) cannot exit before processing most for the data. You can use EXPLAIN statement in tree format to confirm if LIMIT will work or not.
There is pretty detailed post about this subject by Planetscale, and it covers exactly what I wanted to say about this.
If you have any questions, comments or objections about this post, please don’t hesitate to leave a comment or reach out. I’ll be happy to clarify or add more information.

Leave a Reply