Booleans and B-tree Indexes: A Cautionary Tale

Are you sure your indexes are being used?

By Felipe "Bidu" Rodrigues

Booleans and B-tree Indexes: A Cautionary Tale

Boolean fields are commonplace in most databases. They even featured prominently in my last post, about computed boolean flags . However, when it comes to indexing, they can be a bit tricky. Specially if your database is MySQL.

Indexing - In Theory

On a very high level, a database index allows you to speed up queries by reducing the search space and allowing you to traverse it faster. The mechanics behind them will vary from platform to platform and from index type to index type.

The most common index type for relational databases is a B-tree . Suppose you have an is_active column in the table that tracks your User objects. Let’s assume that 80% of your users are active, having is_active = True. A regular B-tree index on that column would split your search space into two parts - one containing 20% of the records and the other containing 80% of the records:

Search space split by a proper index showing 20% of the users on the 'false' part

With proper indexing and index usage, searching through inactive users will involve dealing only with the little red region, instead of the whole search space

If you search for records with a False value, for example, the database will traverse only 20% part of the index to find the records you are looking for.

Indexing - In Practice

Now, B-trees are great, but the devil sometimes lies in the details. When we added an index for a boolean column we saw radically different performance between queries similar to:

SELECT * FROM items WHERE is_deleted = true;

and

SELECT * FROM items WHERE is_deleted IS true;

You can see a benchmark of that query over random data below, on a test performed by Thomas Jost, a fellow Doister who worked with me on this analysis:

RunIS FALSE= FALSE
Run 13,3831,472
Run 23,6371,349
Run 33,2771,394
Run 43,0861,378
Run 53,1081,395
Run 63,0661,451
Run 73,0201,456
Run 83,1481,556
Run 93,2271,574
Run 103,1561,563
Average3,28821,4588-55,64%
Median3,19151,4535-54,46%
Std deviation0,26920,0821

From that data, we can infer that something is definitely off with our query. But what? Let’s build a simple example table:

CREATE TABLE `items` (
    `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
    `is_deleted` tinyint(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    KEY `idx_is_deleted` (`is_deleted`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

and take a look at the relevant part of the EXPLAIN for the two queries:

Query #1

EXPLAIN SELECT * FROM items WHERE is_deleted IS FALSE;
possible_keyskeyrowsfilteredExtra
idx_is_deleted1100Using where; Using index

Query #2

EXPLAIN SELECT * FROM items WHERE is_deleted = FALSE;
possible_keyskeyrowsfilteredExtra
idx_is_deletedidx_is_deleted1100Using index

View on DB Fiddle

The second query uses the index, while the first one does not. Why is that? The answer lies in the semantics of the IS operator.

Identity vs Equality

In Computer Science it is sometimes important to distinguish between two types of comparisons: identity and equality. The former is the strictest form of comparison, where two objects are equal if and only if they are the same object. The latter is a more relaxed check, where two objects are equal if they have the same value.

This may seem like a trivial distinction, a splitting of hairs that is only relevant for theoreticians and overzealous CS practitioners, but it is quite important. If you ask me for a beer, and I give you a beer, we can say that the values are equal.

However, if you ask me for exactly the beer you bought yesterday, and I give you a beer from my fridge, the identity check fails, even if I give you a beer of the exact same brand you bought.

In practice, the semantics of equalities and identities will vary between different programming languages. If you ponder the complexities behind the question “Is this object the same as that object?”, you can see why.

The difficulties of telling apart identity and equality are more pronounced if you consider how easy it is for a program to copy parts of its own memory bit by bit. If x = 10 and y = x, are x and y pointing to the same memory address? What if a = 10 and b = 10? What if instead of dealing with trivial primitives like integers, we are dealing with complex objects? The list of questions goes on and on.

What is a Boolean for MySQL

In short, they don’t exist. MySQL does not have a native boolean type.

Instead, it uses the TINYINT type to represent boolean values. As a TINYINT may range from -128 to 127, MySQL will consider any value that is not 0 as true, and 0 as false.

As one of our engineers, Artyom Pervukhin, pointed out, the nuance here is that a query like is_deleted IS True crosses two different worlds - the numeric realm where is_deleted lives and the boolean realm where True lives.

While IS and = will behave the same for values 0 and 1, they will differ for any other:

MySQL [todoist_data]> select 0 is true, 0 = true;
+-----------+----------+
| 0 is true | 0 = true |
+-----------+----------+
|         0 |        0 |
+-----------+----------+
1 row in set (0.00 sec)

MySQL [todoist_data]> select 1 is true, 1 = true;
+-----------+----------+
| 1 is true | 1 = true |
+-----------+----------+
|         1 |        1 |
+-----------+----------+
1 row in set (0.00 sec)

MySQL [todoist_data]> select 2 is true, 2 = true;
+-----------+----------+
| 2 is true | 2 = true |
+-----------+----------+
|         1 |        0 |    <====== Aha!
+-----------+----------+
1 row in set (0.00 sec)

MySQL’s documentation states :

Comparison of dissimilar columns (comparing a string column to a temporal or numeric column, for example) may prevent use of indexes if values cannot be compared directly without conversion.

That is, this crossing between realms will prevent the use of the index.

Besides that, there is a nuanced difference when using expressions like IS TRUE or IS FALSE. While it may seem that IS TRUE is directly comparing against the TRUE as an alias for 1, that’s not the complete picture. MySQL provides three flavors for the IS operator :

When you use column_name = TRUE, you’re applying the equals (=) operator between the column value and the literal 1. In contrast, column_name IS TRUE applies the IS TRUE operator, where TRUE is not a standalone literal value but part of the syntactic operator.

Conclusion

The conclusion is simple - MySQL will not engage the index if you use IS instead of =. If you have boolean columns in your codebase and they are indexed, make sure that your queries and the ones made by your ORM are using the appropriate operator.

The screenshot below showcases the number of slow queries before and after we applied a system-wide fix, changing all IS operators to =:

Number of slow queries with booleans before and after the fix

The number of slow queries with booleans before and after the fix

This small fix alone was enough to eliminate some performance issues we were seeing under specific loads.

While surprising, this behavior is a known issue . Despite verified by the MySQL team, it is unclear if a fix will ever be implemented.

Lastly, it is worth noting that our tests further showed that:

Acknowledgments

I would like to thank Thomas Jost, Artyom Pervukhin, Alex Rodrigues, and Gonçalo Silva for their help in researching and fixing this issue, as well as for their feedback on this post.