Beyond creating indexes – Part 3A

Moonrise on the International Space Station

One factor that contributes to query execution time is the number of rows involved. Now, does all of the data on our hottest tables really needs to be there? Is there an escalable and secure way to move stale/invalid rows out of the way?
That’s what today’s post is all about.

The sad truth: core tables will always grow

If your business or project is doing well (and we hope it is) the core tables around which your data model revolves will be ever growing. Orders, customers, products, payments are examples of this type of entities.

Large tables end up becoming a focus of attention for database administrators: they put pressure on disk space, schema changes take longer and query performance starts to suffer. Because they are usually core tables, they will have many indexes, which exacerbates the space problem, while eroding write performance.

Parallel schema changes are only available in MySQL version >= 8.0.14 and there are still no implementations of built-in parallel queries. That gives us another two good reasons to keep our data sets sizes under control.

Extra row = Extra iteration

Back to query performance, when tables are joined in a query, there is an amplification effect that could affect execution speed.

To resolve a Join, the server will first pick one table as the “parent” or “driving” table and retrieve all rows that match the conditions specified for it. Then, it will pick the next table in the join and it will perform a lookup for each of the rows in the parent table. These “nested iterations” continue for each participating table.

This is even worst with LEFT Joins, as all rows from the left subset are kept, as opposed to INNER joins, where only matching rows matter.

Below is a simplified output of EXPLAIN that shows this effect:

We can see that the flight table is the parent in this case, and that after retrieving the rows and applying all filters, 51183 rows were kept.

Next, we see that table booking is being joined with flight (we know this because of the ref key, which points to column flight_id for this table)

The rows_examined_per_scan key for booking is telling us that 110 rows were retrieved, but that is for each of the 51183 rows in flight!

rows_produced_per_join shows the actual impact of the join: 5670049 rows were produced (roughly, 110 x 51183)

Even though indexes are probably used for lookups (if a full scan is done for every row in the parent table, we have a serious problem) execution time accumulates with each row.

Archiving/Deleting data in a escalable way

There are two strategies we can implement to archive/eliminate any data that may not need to be online, or exist at all. We will cover nibbling today and leave the second strategy for part B of this post.

The challenge with removing data from big, core, tables is that we want to keep impact to the minimum. We also want to make sure the process we use scales for tables of any sizes.

Fortunately, a tool already exists for this, it is open source and it’s been around long enough to trust it our precious data. It is called pt-archiver, a Perl super script that:

  • Allows you to move or delete data based on a SQL WHERE clause
  • Data can be moved to a table on the same server or a different one. Furthermore, data can be sent to a file simultaneously (how cool is that?)
  • It does this by scanning the table in fixed-size chunks, by index, keeping locking and transaction size under control. This makes it very efficient for any table sizes.

Let’s say, for example, we want to remove all rows for category_id 4 from the products table in the inventory database, on server dbsrv1. With pt-archiver, we could easily do so with the following command:

We added some additional parameters to control the chunk size (–limit), to force a commit on each chunk (–commit-each ), and to print the progress every 5000 rows (–progress). That’s it. You can seat down and watch it work.

Because the tool will be deleting data from your table, it is always better to make sure our last backup is sound and to test your command with the –dry-run option. This will print the commands the tool will execute and exit.

If you really, really want to know exactly what the tool is about to do, you can enable massive verbosity by using PTDEBUG as follows:

One other cool feature about pt-archiver is that you can ask it to keep an eye on slave lag, if your server has replicas. This way, the archiving process will halt until lag is below the value specified. This is how you would enable monitoring for a replica running on dbsrv2, making pt-archiver to pause when lag exceeds 60 seconds:

The pt-archiver script can be downloaded directly to a server using wget:

or it can be installed as part of the Percona toolkit, using your favorite package manager:

Nibbling relies on the ability to either efficiently find and delete rows matching the condition recurrently or to be able to split the table into unique nibbles or chunks. This requieres either an appropriate index for the –where condition or at least that a Primary key exists. If non of these are present in our table, we might need to consider a different approach.

Finally, always start with a dry run and make sure the queries the tool is printing are leveraging an index. Please leave me a message below if you need help with that!

Wrapping up

It could be as simple as having a pt-archiver script scheduled to prune our tables on a regular basis.


Leave a Reply

Discover more from query-optimization.com

Subscribe now to keep reading and get access to the full archive.

Continue reading