Integrity constraint violations on the Drupal 7 block table

One of our Drupal 7 sites was giving a lot of white pages of death (500 errors). Looking through the Drupal logs, we saw a lot of these entries that were the cause of the white pages:

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'spock-system-user-menu' for key 'tmd': INSERT INTO {block}

It’s obviously a problem with the block table, so let’s have a look at it:

[user@server01 www]$ drush sql-query 'describe block'
bid        int(11)      NO PRI NULL auto_increment
module     varchar(64)  NO
delta      varchar(32)  NO 0
theme      varchar(64)  NO MUL
status     tinyint(4)   NO 0
weight     int(11)      NO 0
region     varchar(64)  NO
custom     tinyint(4)   NO 0
visibility tinyint(4)   NO 0
pages      text         NO NULL
title      varchar(255) NO
cache      tinyint(4)   NO 1
css_class  varchar(255) NO

There is no field called ‘tmd’, so it must be an index on the table:

[user@server01 www]$ drush sql-query 'show index from block'
block 0 PRIMARY 1 bid     A 36 NULL NULL BTREE
block 0 tmd     1 theme   A 12 NULL NULL BTREE
block 0 tmd     2 module  A 36 NULL NULL BTREE
block 0 tmd     3 delta   A 36 NULL NULL BTREE
block 1 list    1 theme   A 12 NULL NULL BTREE
block 1 list    2 status  A 18 NULL NULL BTREE
block 1 list    3 region  A 36 NULL NULL BTREE
block 1 list    4 weight  A 36 NULL NULL BTREE
block 1 list    5 module  A 36 NULL NULL BTREE

Yep, there is an index called ‘tmd’ and it’s a unique one (the 0 in the second column is the Non_unique value, so 0 means the index is unique). We can also see it consists of the 3 columns theme, module and delta (hence the name tmd).

The problem

The most common cause here, and this is well known in the Drupal community, is that overly long delta values make the block query fail. Take the 2 block names below, which end up as the value for the delta field in the query:

  • feature_multilingual_frontpage_switcher_with_select
  • feature_multilingual_frontpage_switcher_with_links

They are both longer than 32 characters. Since the delta column is only a varchar(32), both values get truncated to 32 chars, turning them into the exact same string:

  • feature_multilingual_frontpage_s
  • feature_multilingual_frontpage_s

With the module and theme values being identical for both blocks, the second insert now violates the unique ‘tmd’ index, which gives us the duplicate key error from the watchdog log.
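You can see the truncation for yourself in a MySQL prompt (just an illustration, using the block names from above):

-- Both long deltas collapse to the same 32-character value
SELECT SUBSTRING('feature_multilingual_frontpage_switcher_with_select', 1, 32);
SELECT SUBSTRING('feature_multilingual_frontpage_switcher_with_links', 1, 32);
-- Both return 'feature_multilingual_frontpage_s', so the second INSERT
-- violates the unique (theme, module, delta) index.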

The fix?

The easiest fix for that problem is to simply increase the length of the module, delta or theme column, depending on which one is giving problems. In our case that would be the delta column:

ALTER TABLE block
CHANGE delta delta VARCHAR(128)
CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '0'
COMMENT 'Unique ID for block within a module.';

Here we raised the length of the column to 128 chars, which should be a lot better than the default 32. We can’t raise it much beyond that size though, as the maximum length of the index key on InnoDB tables is limited by MySQL (see the CREATE INDEX documentation for the full explanation).

Sadly this did not fix our problem: we kept seeing the error in the logs after raising the column size. You can also see that in the error posted at the start of this post none of the key values exceed the length of their columns, so that was already an indication that something else was causing the problem.

The real problem gets revealed

If we dive into the Drupal code (simply searching for pieces of the query string from the error), we find the query that’s giving the exception is being called inside the function _block_rehash.

At the top of the Drupal 7 API page for this function we can see a list of all functions that call this function, and one of them is block_flush_caches:

function block_flush_caches() {
  // Rehash blocks for active themes. We don't use list_themes() here,
  // because if MAINTENANCE_MODE is defined it skips reading the database,
  // and we can't tell which themes are active.
  $themes = db_query("SELECT name FROM {system} WHERE type = 'theme' AND status = 1");
  foreach ($themes as $theme) {
    _block_rehash($theme->name);
  }

  return array('cache_block');
}

This function is an implementation of hook_flush_caches, a function that gets called at every full cache flush.

So in short: this _block_rehash function is called at every cache flush. And that’s exactly where our problem lies with this site.

The site we are reviewing is a very busy Drupal 7 site with a lot of concurrent visitors. Drupal cron is scheduled to run every minute as this is needed for scheduled content publishing and sending out push notifications to mobile devices (but that’s not really important for this issue).

A cron run does a partial cache flush (called in system_cron) and after that the first visitor of the site triggers a cache rebuild. But because this is a very busy site, the _block_rehash function can end up being called more than once at the same time, by different visitors.

As the _block_rehash function does not start a transaction or take any kind of lock on the block table, we can end up in a situation where the same row gets inserted more than once. Luckily the database rejects that because of the unique index, saving us from corrupt data. But the user gets a 500 error returned and sees a white page instead of the content they expected.

The real fix

For now we simply added a try/catch block around the query in the _block_rehash function, so that when it fails it won’t give a white screen of death anymore:

try {
  drupal_write_record('block', $block, $primary_keys);
}
catch (Exception $e) {
  watchdog('block_rehash', 'Trying to insert an existing block with message: @message', array('@message' => $e->getMessage()), WATCHDOG_WARNING);
}

The proper solution would be for Drupal to use a transaction or locking mechanism that prevents this kind of problem from happening again. A shoutout to Jasper Knops for helping me figure out this issue (and proofreading this post :)
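To give an idea of what a more structural fix could look like (a sketch only, not a core patch), the insert could be turned into an upsert with db_merge(), keyed on the columns of the unique ‘tmd’ index, so a concurrent rehash updates the existing row instead of inserting a duplicate:

// Sketch only: merge on the columns of the unique 'tmd' index so a second
// concurrent rehash becomes an UPDATE instead of a duplicate INSERT.
db_merge('block')
  ->key(array(
    'theme'  => $block['theme'],
    'module' => $block['module'],
    'delta'  => $block['delta'],
  ))
  ->fields($block)
  ->execute();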

About Drupal cron

I mentioned above that this site runs a Drupal cron every minute, which is not a good idea on any busy site. If you have jobs that need to run every minute you should invoke individual cron commands instead of running a full Drupal cron.
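As a sketch (the paths, user and queue name are made up here, and queue-run needs a Drush version that ships it), a cron.d entry could run the full Drupal cron far less often and only trigger the time-critical queue every minute:

# /etc/cron.d/drupal-example -- full Drupal cron every 15 minutes
*/15 * * * * www-data /usr/bin/drush --root=/var/www/example cron
# Only the time-critical queue every minute
*    * * * * www-data /usr/bin/drush --root=/var/www/example queue-run example_push_queue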

Reverse proxy configuration for Apache or nginx with varnish

A while ago I wrote a blog post about how you should tell your Drupal 7 site that the webserver is behind a reverse proxy, to make sure the correct ip address of visitors gets sent to Drupal.

If you want to have the correct ip address in your Apache or nginx logs too, you need to add some configuration. Two things you need to take care of:

  • The Varnish vcl needs to add an X-Forwarded-For header to backend requests
  • Apache or nginx needs to use this header for logging instead of the client ip address

Varnish vcl configuration

This is an example for a varnish 3.x vcl, where you need to add the X-Forwarded-For header in the vcl_recv function:

sub vcl_recv {
  if (req.restarts == 0) {
    if (!req.http.X-Forwarded-For) {
      set req.http.X-Forwarded-For = client.ip;
    }
  }
}

Sidenote: If you want a fully working vcl for Drupal or WordPress, you should take a look at Mattias’ github project for varnish 3.x or varnish 4.x templates.

Apache configuration

After the X-Forwarded-For header has been added by Varnish, we can tell Apache to use it in the logs. This is an example from an Ubuntu 12.04 LTS server and the file /etc/apache2/apache2.conf; other setups will be similar.

Our vhosts use the “combined” log format, so that’s actually the only one you need to update. But it’s nice to keep everything consistent and update all the log formats that used %h before:

# The following directives define some format nicknames for use
# with a CustomLog directive (see below).
# If you are behind a reverse proxy, you might want to change %h
# into %{X-Forwarded-For}i
#
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %O" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent

In the example above I already replaced the %h variable with the X-Forwarded-For variable.
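After changing the log formats, check the syntax and reload Apache so the new formats get picked up (commands for the same Ubuntu-style setup, adjust them to your distribution):

apache2ctl configtest
service apache2 reload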

Nginx configuration

Nginx needs only a very simple configuration change. In your vhost config, add this code inside the server block:

server {

  ...

  # Pass the real ip from varnish
  set_real_ip_from 127.0.0.1;
  real_ip_header X-Forwarded-For;

  ...

}

This of course assumes that Varnish runs on the same server as the webserver. If Varnish runs on another server, change 127.0.0.1 to its actual ip address.

If you’ve done all of this correctly, you should now see the actual client ip addresses appear in your webserver logs.
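A quick way to verify the nginx part (the URL and log path are just examples): test and reload the configuration, do a request through Varnish and look at the newest access log line:

# Test the configuration and reload nginx
nginx -t && service nginx reload

# Request the site through Varnish and check the latest log entry
curl -s -o /dev/null http://www.example.com/
tail -n 1 /var/log/nginx/access.log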

Performance debugging basics: the MySQL slow log

In this blog post I’ll show you how simple and useful the MySQL slow query log can be for finding performance problems, in this case in a Drupal 7 site running a complex import task as a drush batch process.

Look at the graph below, showing a Zabbix graph of the CPU utilization of a staging server for one of our projects:

Zabbix CPU utilization graph during a performance problem

On the left side, between midnight and 6:30, you can see a lot of colors in the graph. The green one is idle time and the blue one is a PHP application doing some work, but the dark yellow and purple ones are bad: they indicate that there’s way too much disk I/O going on while the PHP application runs.

As this server has no New Relic APM Pro account, doing a deep code inspection wasn’t possible. I did however see that most of the processing time was spent on MySQL, as you can see in the standard (free) New Relic APM graph below:

New Relic APM graph of a drush command

The first and most basic thing you then have to do is turn on the MySQL slow log on the server. The slow log is useful to log 2 things:

  • queries that are taking longer than a set amount of time to execute
  • queries that are not using indexes where they should be using one

In MySQL you simply have to add these statements to your my.cnf file. This example config is for a MariaDB 5.5 server; if you have a different version or fork of MySQL, check the documentation for the proper slow log configuration as it might differ:

# Slow logging
slow_query_log = OFF
log_slow_queries = /var/log/mysql/slow.log
long_query_time = 2
log-queries-not-using-indexes

In this example config I’ve set the query cut-off time to 2 seconds, so anything that takes longer than 2 seconds to execute gets logged. Make sure you create the directory (if it doesn’t exist yet) and touch an empty slow.log file:

# Create the directory
mkdir -p /var/log/mysql
chown mysql:mysql /var/log/mysql

# Create an empty log file
touch /var/log/mysql/slow.log
chown mysql:mysql /var/log/mysql/slow.log

Restart your MySQL server to load the configuration changes.
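On most distributions that is a one-liner (the service may be called mysql, mysqld or mariadb depending on your setup):

service mysql restart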

This config does not turn on slow logging yet; it just configures the log location and sets the options. To actually enable it, you need to run this command in a MySQL prompt with admin privileges.

The example below uses a Plesk server (which we use for our shared development environment), but it could be as simple as running “mysql” as root on your system:

[root@acc01 ~]# mysql -uadmin -p`cat /etc/psa/.psa.shadow`
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 2849
Server version: 5.5.42-MariaDB-log MariaDB Server
Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [mysql]> SET GLOBAL slow_query_log = 'ON';

We don’t enable this by default for the server as the log can fill up quite fast. So do not forget to turn it back off when you are done, by running the same command with the value OFF again:

SET GLOBAL slow_query_log = 'OFF';

The benefit of using this command is that you can run it without having to reload or restart the server. So once the configuration is in place, enabling or disabling the slow log can be done with just this simple command.
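If you want to double-check the current state, you can look at the relevant variables at any time:

SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';
SHOW GLOBAL VARIABLES LIKE 'long_query_time';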

Right, now that we have our slow log turned on, let’s see what a tail -f /var/log/mysql/slow.log shows while our PHP application runs. I’ll filter out all the irrelevant queries (and there will probably be a lot of them) and go straight to the performance killer:

# Time: 150625 16:46:17
# User@Host: user[user] @ localhost []
# Thread_id: 16 Schema: db_prod QC_hit: No
# Query_time: 1.522648 Lock_time: 0.000053 Rows_sent: 0 Rows_examined: 200343
SET timestamp=1435243577;
UPDATE user_member_status SET is_invoiced=0;

This query would show up every 2 seconds, meaning the PHP script probably calls it every time it starts a new batch of things to process. MySQL logs this query because it doesn’t use an index and therefore has to scan all the rows in the table, which leads to a lot of blocks being read from disk and is something you want to avoid at all costs in databases. There are just over 200 000 rows and the query takes 1.522 seconds to run, which is a very long time for a query.
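Sidenote: if the slow log grows faster than you can read it, the mysqldumpslow tool that ships with MySQL/MariaDB can summarize it for you, for example showing the 10 queries with the highest query time:

mysqldumpslow -s t -t 10 /var/log/mysql/slow.log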

If you look at the query, you can see that it really doesn’t make sense to run it over and over again. It sets the is_invoiced column to 0 for every row in the table, so running it more than once is useless.

It could also be missing a where clause, but that’s something only the developer who wrote the code can tell.

So, with this information I went back to the developer, who then confirmed it was a bug: this query should only be run at the start of the process. He made a fix, pushed it to git, made a new release with our Bamboo deployment server, pushed it to acc01 and within 15 minutes of my “discovery” the performance fix was live.

If we look back at the Zabbix graph from above, we can see that at the right side there is now again a lot of CPU action, but only good things. The disk I/O bottleneck is gone as we removed the “bad” query locking up the server:

Bad query locking up the server

That’s it, another performance problem found and fixed using some basic devops tools. I hope you learned something from this blog post, and remember to turn off your slow log when you are done!

Apache, Varnish, mod_deflate and 503 errors

About 90% of the unexplained Varnish 503 errors I’ve seen in the last year have been caused by Apache having mod_deflate enabled while there’s a Varnish cache in front of it. It mostly shows up as PDF files that fail to download: they just give the infamous white screen of death with nothing being logged.

The solution is simple:

  • disable mod_deflate in Apache
  • enable gzip compression of Varnish output

This is probably also true for nginx servers, where you then need to disable gzip compression.
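On a Debian/Ubuntu-style Apache disabling mod_deflate is a one-liner, and from Varnish 3.x onwards Varnish can handle the gzipping itself when you tell it to in vcl_fetch. Both snippets below are sketches, so adjust them to your own setup:

# Disable mod_deflate in Apache (Debian/Ubuntu style) and restart
a2dismod deflate
service apache2 restart

And on the Varnish side:

# Let Varnish handle the compression (vcl_fetch, Varnish 3.x)
sub vcl_fetch {
  if (beresp.http.content-type ~ "text|javascript|json|xml") {
    set beresp.do_gzip = true;
  }
}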

One hour of New Relic to solve a critical performance issue

The screenshot below, taken from New Relic’s APM tool, shows the backend processing times for a WordPress site with a big performance issue.

Using the New Relic APM

In this case there was a query that returned 170 000 rows of data that PHP had to process on about 80% of the requests. It took me 5 minutes to find the problem and 55 minutes to get the fix pushed to production.

Expect a bigger blog post about New Relic very soon.