One of our clients came to us with a performance issue on their Drupal 6 multi-site installation. Views were taking ages to save, the admin pages seemed unnecessarily sluggish, and clearing the cache put the site in danger of going down. They reported that the issue was most noticeable in their state-of-the-art hosting environment, yet was not reproducible on a local copy of the site — a baffling scenario as their 8 web heads and 2 database servers were mostly idle while the site struggled along.

Our performance analysis revealed two major issues. After implementing fixes, we saw the average time to save a Drupal view drop from 2 minutes 20 seconds to 4.6 seconds — a massive improvement. Likewise, the time to load the homepage on a warm cache improved from 2.3 seconds to 621 milliseconds. The two bugs that accounted for these huge gains turned out to be very interesting:

1. Intermediary system causes MySQL queries to slow down

Simple queries that are well indexed and cached, can see significant lag when delivering packets through an intermediary. This actually has nothing to do with Drupal, as it is reproducible from the MySQL command line utility. (It’s probably a bug in the MySQL libraries but we’re not entirely sure.) It could also be a problem with the intermediary but we’ve reproduced it in two fairly different systems: F5‘s load balancer proxy and VMWare Fusion’s network drivers/stack.

For example:

SELECT cid, data, created, expire, serialized FROM cache_menu WHERE cid IN (x)

A query like this one should execute in a millisecond or less. In our client’s case, however, we found that 40ms was being added to the query time. The strange part is that this extra delay only occurred when the size of the data payload returned was above a certain threshold, so most of the time, similar queries returned quickly, but around 10—20 of these simple queries had 40ms or more added to their execution time, resulting in significant slowdowns.

We briefly debugged the MySQL process and found it to be waiting for data. Unfortunately, we didn’t pursue this much further as the simple workaround was apparent: reroute the MySQL traffic directly to the database instead of through the F5 load balancer. The net change from applying this simple modification is that the time to save a view was reduced to 25.3 seconds.

2. Database prefixing isn’t designed to scale as the number of prefixes increases

Drupal can be configured to share a database with another system or Drupal install. To do this, it uses a function called db_prefix_tables() to add prefixes to table names so they don’t collide with other applications’ table names. Our client was using the table prefixing system to allow various sub-sites to share data such as views and nodes, and thus they had 151 entries in the db_prefixes list.

The problem is that db_prefix_tables() is not well optimized for this implementation edge case. It will run an internal PHP function called strtr() (string token replace) for each prefix, on every database query string. In our case, saving a view executed over 9200 queries, meaning strtr() was called more than 1.4 million times!

We created a fix using preg_replace_callback() which resulted in both the number of calls and execution time dropping dramatically. Our view could now be saved in a mere 10.3 seconds. The patch is awaiting review in the Drupal issue queue, and there’s a patch for Pressflow 6, too, in case someone needs it before it lands in core.

The final tweaks included disabling the administration menu and the query logger. At that point, we finally reached a much more palatable 4.6 seconds for saving a view — still not as fast as it could be, but given the large number of views in the shared system, a respectable figure.