Emily Newsom

As Development Director, Emily oversees all of Brightflock's development projects. Prior to co-launching Brightflock in 2008, Emily developed enterprise Java systems for the biotechnology industry. Emily is an avid proponent of open source software (but without the beard!) and has contributed to a number of OSS projects including Firefox, Ubuntu, WordPress and Drupal. Emily holds an honors BSc in Computer Science from the University of Victoria.

We recently took on the task of optimizing a high-traffic Drupal 6 site that sees more than 7,000 unique visitors per day.

Now, Drupal isn't often praised for its performance, but there is such a wealth of modules available that it can be turned into a very slick system. I'm not going to cover everything needed for Drupal optimization, but I will discuss, at a high level, how we achieved the largest performance gains.

The site in question had the default Drupal 6 performance settings all enabled and set up correctly. Even so, the database, MySQL, was being hammered and could not keep up with the request volume. Almost all of the traffic was from anonymous users.

The first step was to install Boost. Boost is a great little "drop-in" static caching module that serves static HTML from the filesystem (via Apache rewrite rules), bypassing Drupal altogether. It's a snap to set up and works like a charm. One word of warning, however, for anyone considering Boost for a Drupal 7 site--the D7 module is still in development at the time of this post and is really lacking in its ability to flush individual nodes.
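
To give a sense of the mechanism, here is a simplified sketch of the kind of mod_rewrite rules Boost adds to .htaccess. The cache path and cookie name follow Boost's defaults, but treat this as a sketch of the idea, not the module's exact generated output, which is longer and varies by version and settings:

    # Skip the cache for logged-in users (Boost sets a cookie at login)
    # and for anything other than a simple GET request.
    RewriteCond %{REQUEST_METHOD} !=GET [OR]
    RewriteCond %{HTTP_COOKIE} DRUPAL_UID
    RewriteRule .* - [S=1]

    # If a cached copy of this URL exists on disk, serve it directly
    # and never bootstrap Drupal or touch MySQL at all.
    RewriteCond %{DOCUMENT_ROOT}/cache/normal/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}.html -f
    RewriteRule .* cache/normal/%{HTTP_HOST}%{REQUEST_URI}_%{QUERY_STRING}.html [L]

Because the file-exists check and the redirect happen entirely inside Apache, a cache hit costs roughly the same as serving any other static file.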

Once Boost was installed, we saw a large load reduction. However, MySQL was still unable to keep up with the many dynamic search requests that couldn't be served from the static cache. This particular site's schema was built to hold all of its data in CCK fields, and as a result had many fragmented content_type_xyz and content_field_xyz_abc tables that often needed to be queried together. Using CCK to hold non-Drupal data is a great way to take advantage of all the UI that Drupal offers for maintaining data. The problem with this approach is that CCK is user-defined--it has to be flexible enough to hold anything, and when something is flexible, it tends not to be performant.
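
To make the problem concrete, a search that touches just a few CCK fields generates something like the following query (the table and field names here are hypothetical, following the content_type_xyz / content_field_xyz_abc pattern above):

    -- Every CCK field adds another join, and each 1-to-many field
    -- table multiplies the number of rows the query has to scan.
    SELECT n.nid, n.title, p.field_price_value, c.field_color_value
    FROM node n
    INNER JOIN content_type_product p ON p.vid = n.vid
    LEFT JOIN content_field_color c ON c.vid = n.vid  -- 1-to-many
    WHERE n.status = 1
      AND p.field_price_value < 100;

Now multiply that by roughly 30 fields per node, and the joins themselves become the bottleneck--which is exactly what the denormalization described next is meant to eliminate.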

The second step was to create a denormalized query table that duplicated all of the data in the CCK fields. The site had approx. 200,000 nodes, each node had approx. 30 CCK fields, and several of those fields were 1-to-many. So once all the CCK fields were joined together, we were dealing with a multi-million-row table--no wonder the queries against these fields were slow. A table that size also takes a while to build, but it was acceptable for the search results to be slightly stale, so we built the massive table in a regularly-run background cron process, which removed any need to keep it up-to-date on the fly.

Once the denormalized CCK query table was in place, the last step was to point all of the search requests at it. Like a lot of Drupal sites, this site was using Views to perform its searches. Using hook_views_pre_execute(&$view), we were able to rewrite the query to run against our new denormalized table (check out $view->build_info['query'], $view->build_info['count_query'] and $view->build_info['query_args'] to do this--maybe this will be another post in the future!). After rewriting the views to query our new table, the load on the server was dramatically reduced; a sketch of the approach follows.
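
As a rough sketch of how the two pieces could fit together in Drupal 6: the module, table, view and field names below are all made up for illustration, but hook_cron(), hook_views_pre_execute() and the $view->build_info keys are the real core and Views 2 APIs.

    <?php
    /**
     * Implements hook_cron().
     *
     * Rebuilds the denormalized search table in the background, so the
     * expensive multi-join against the CCK tables runs once per cron
     * run instead of once per search request.
     */
    function mymodule_cron() {
      // Rebuild from scratch; slightly stale results are acceptable here.
      db_query('TRUNCATE TABLE {mymodule_search}');
      db_query("INSERT INTO {mymodule_search} (nid, title, price, color)
        SELECT n.nid, n.title, p.field_price_value, c.field_color_value
        FROM {node} n
        INNER JOIN {content_type_product} p ON p.vid = n.vid
        LEFT JOIN {content_field_color} c ON c.vid = n.vid
        WHERE n.status = 1 AND n.type = 'product'");
    }

    /**
     * Implements hook_views_pre_execute().
     *
     * Swaps the query Views built (a pile of CCK joins) for an
     * equivalent query against the flat denormalized table.
     */
    function mymodule_views_pre_execute(&$view) {
      if ($view->name != 'product_search') {
        return;
      }
      $view->build_info['query'] =
        "SELECT nid FROM {mymodule_search} WHERE color = '%s' ORDER BY title";
      $view->build_info['count_query'] =
        "SELECT COUNT(*) FROM {mymodule_search} WHERE color = '%s'";
      // db_query() substitutes these into the %s placeholders above;
      // this assumes the view takes a single argument (the color).
      $view->build_info['query_args'] = array($view->args[0]);
    }

The nice property of this split is that the expensive multi-join runs in the background on cron, while every live search request hits a single flat, easily indexed table.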

These two steps--Boost plus the denormalized CCK search table--essentially eliminated the load that we were seeing at the start of this process. The server for this site is now more idle than not, and it continues to serve the same volume as it did before, with no change to the user or admin experience--the kind of performance solution I like!