Beyond creating indexes – Part 1B

Beyond creating indexes in mysql and mariadb

On our previous post, we were discussing what were low cardinality columns, why indexing them is not very efficient and how to detect them. Let’s now focus of what can we do about it.

Histograms: a game changer?

In Part A of this post, we said that, with low cardinality columns, values are distributed across a small set of buckets. We used cardinal points as an example but this is also very common in columns used for status, gender, types among others.

We said that an index can still help in these cases and that index efficiency will depend on the how the values are distributed. If I have a status column with 3 different values (let’s say A,B and C), and most rows are in status C, using the index won’t be efficient for the server, as the index will return a list of pointers to almost all table rows. If, on the other hand, we want to retrieve all rows with status A or B, then using the index will make sense to the server.

Furthermore, even if no indexes exist for the low cardinality column, the server could make better decisions (i. e. join order) if it knows beforehand what percentage of the total rows match a given value.

So how do we tell the server that almost all values for status are equal to C? That’s what histograms were introduced for: giving more information to the optimizer about how data is distributed for a column or group of columns so the server can generate more efficient execution plans.

As opposed to MariaDB, MySQL and Percona server allow you to specify the amount of buckets you want to compute. This is useful when you know beforehand how many different values exist for your low cardinality column.

Histograms can also be used to determine the distribution of values for other type of columns like dates. In that case, you can omit the number of buckets and let the server to calculate them based on data range.

As mentioned above, the syntax is slightly different for MariaDB the BUCKETS clause us not supported.

If values are evenly distributed or if you need to query data for the values that appear the most, let’s see what else can be done.

When server-level changes are not enough, we need to look at our query and get creative. To speed up a query we should somehow reduce the amount of work required by the server to generate the result. That involves working on two main factors: number of rows and amount of operations (scans, joins, sorts, etc).

Below is a set of checks you can do to detect hidden opportunities:

If it is, we could try to leverage contiguous columns to improve cardinality and obtain less rows from the index. Let’s see how that would work using the table from our previous post:

Above index statistics are telling us that our table has around 4764415 rows. We know this because it is the PRIMARY index and the total cardinality roughly matches the total number or rows (roughly because these statistics are estimations).

The following query will probably use the index and return around 1186 rows (4764415÷4017) on average:

If I add the time column to the query, cardinality will improve around 296X and the query will return 4 rows (4764415÷1191465) on average.

Ok, I know what you are thinking: specifying a single date will probably defeat the purpose of the query. True, but even if we specify a range of dates, our cardinality will greatly improve.

Check other existing indexes: do you see any with better cardinality that the one for our problematic column? Any chances we can include a condition using that indexed column?

When you have two conditions on indexed columns, the server optimizer will use columns cardinality to decide what index to use for that table. Adding a condition on a better-cardinality column or combination of columns will make the server to use that index first, to then filter out any rows not matching the condition for the lower cardinality column.

This strategy works when the table with the low cardinality column is then joined with another table or tables. The concept is the following: the less rows are retrieved from the first table in the join, they less iterations are performed on the second table, and so on.

Adding conditions on non-indexed columns won’t help with table access for that table but it will prevent the problem from spreading to the join operations.

If we really can’t add more conditions or leverage columns of existing multi column indexes, there is one more option to explore: retrieving the result in chunks.

There are cases where the result set is long and it is for a human being to consume, so there is really no point in sending a 10k rows result all at once. For this we can either leverage the LIMIT clause or paginate the result.

LIMIT will cause the query to exit early after the number of requested results is reached. It is not perfect as certain queries (specially those including ORDER BY, GROUP BY, subqueries, etc) cannot exit before processing most for the data. You can use EXPLAIN statement in tree format to confirm if LIMIT will work or not.

There is pretty detailed post about this subject by Planetscale, and it covers exactly what I wanted to say about this.

If you have any questions, comments or objections about this post, please don’t hesitate to leave a comment or reach out. I’ll be happy to clarify or add more information.

Leave a Reply

Discover more from query-optimization.com

Subscribe now to keep reading and get access to the full archive.

Continue reading