Successful large-scale site builds are no accident. Planning a high-performance, high-traffic site, regardless of the backend framework, requires more than adequate server memory and a proper hosting setup.

When such projects are evaluated, performance must be planned into every aspect of the architecture from day one. Otherwise, development teams are likely to spend significant pre-launch time ripping out already-built features that fail "launch critical" performance reviews. With a little foresight and pre-planning, especially in Drupal 8, you can avoid this from the first sprint onward.

1. Ensure proper testing environment content

Setting up a QA team for long-term success is essential for any project, and it takes more than well-written test cases and well-documented tickets. Testing can only surface problems you've anticipated as possibilities to begin with, and more often than not QA environments are not provisioned with the large databases and high content counts of a production site. If we want results that reflect post-launch conditions, it is important to ensure all development and QA environments have content counts as close as possible to the final production launch setup.

An afternoon spent preparing a QA environment with millions of content nodes will save weeks of failed performance tests right before launch.

Inconveniently, QA tests against low-content sites will pass even in under-provisioned environments. The ugly side of non-performant queries and libraries usually only manifests itself at the wrong time and in the wrong situation, like right before an important stakeholder review.

Improper joins against a 10-million-record database table only slow down components and blocks if the table being joined actually has 10 million records. If testing is done against 10 records, the same query will often pass with flying colors, through no fault of the tester.

It’s essential to preload all environments with enough content to mirror actual production launch counts, be that 200 or 2 million nodes. At Mediacurrent, we firmly believe in the value of great QA and will go above and beyond in setting up QA tests and QA environments for successful end-to-end testing.
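As a minimal sketch of what that seeding can look like (the article content type and the count are stand-ins, and a real run would use the Devel module's content generator or a batched migration rather than a single loop), core APIs can create placeholder nodes directly:

```php
<?php

use Drupal\node\Entity\Node;

// Seed a QA environment with placeholder content. For millions of nodes,
// run this in batches (e.g. via a drush php:script) rather than in one
// request, or use the Devel module's devel_generate submodule instead.
for ($i = 0; $i < 100000; $i++) {
  Node::create([
    'type' => 'article', // Hypothetical content type.
    'title' => "QA seed node $i",
    'status' => 1,
  ])->save();
}
```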

2. Don't use entityTypeManager for queries

This one is Drupal-specific: the entityTypeManager method for database selections is extremely inefficient when used against large content database tables and, in our opinion, should be avoided at all costs. Use Drupal\Core\Database\Connection instead.

There are active discussions in the Drupal core community about the performance of this method, but as of today this built-in core operation has significant costs at large content scales (1+ million nodes), to the point of possible complete database lockouts and floods of MySQL slow-query-log entries on sequential uncached operations, such as a basic page load.

The good news is the Database\Connection API is almost plug-and-play compatible with existing entityTypeManager code, so swaps can be relatively painless. The performance penalty of entityTypeManager on large sites is so pronounced that it amounts to a "do not use" rule for large content sites.
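As a hedged illustration of such a swap (the article content type is a stand-in), the same five-node lookup can be written both ways:

```php
<?php

use Drupal\Core\Database\Database;

// Entity query via entityTypeManager: convenient, but costly at scale.
$nids = \Drupal::entityTypeManager()
  ->getStorage('node')
  ->getQuery()
  ->condition('type', 'article')
  ->condition('status', 1)
  ->range(0, 5)
  ->accessCheck(FALSE)
  ->execute();

// Equivalent lookup through Drupal\Core\Database\Connection.
$connection = Database::getConnection();
$nids = $connection->select('node_field_data', 'n')
  ->fields('n', ['nid'])
  ->condition('n.type', 'article')
  ->condition('n.status', 1)
  ->range(0, 5)
  ->execute()
  ->fetchCol();
```

One design note: the direct select bypasses entity access checks and entity caching, so it suits code paths where raw IDs are all that's needed.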

3. Be mindful of revisions

Unfortunately, there's no magic answer to database bloat due to revisions, aside from "don't use revisions," but that's usually not an option. Revisions can be a hard requirement for content-heavy sites with large editorial requirements.

Setting an appropriate revisions policy, based on X number of revisions or Y amount of time, is key to managing database growth. Do this via the contrib project Node Revision Delete, which is tantamount to required on any large content site with revisions. If a site will have a significant amount of content and millions of nodes, keep in mind that database table row counts will grow at a 1:1 rate with the revisions policy, if not more once field-level revisions are counted.

For example, considering body field content and the cost of storing revisioned blob-type WYSIWYG data, it's easy to see how storage can quickly spiral into out-of-control data sprawl.

Field-level revisions magnify this problem further, creating 1:2, 1:3, or greater database table growth on every change. We generally recommend a maximum of one to two revisions per piece of content, ideally on top of automatic purging based on date.
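As a minimal sketch of what such a retention policy means in practice (the retention count and node ID are hypothetical, and a real implementation would batch this work, as the Node Revision Delete module does), core's node storage can prune old revisions directly:

```php
<?php

use Drupal\node\Entity\Node;

/** @var \Drupal\node\NodeStorageInterface $storage */
$storage = \Drupal::entityTypeManager()->getStorage('node');
$node = Node::load(123); // Hypothetical node ID.

// All revision IDs for this node, oldest first.
$vids = $storage->revisionIds($node);

// Keep the newest two revisions and delete the rest. Each delete is a
// slow, multi-table operation, so batch this on large sites.
foreach (array_slice($vids, 0, -2) as $vid) {
  $storage->deleteRevision($vid);
}
```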

In a “what-if” type exercise, an even better, if slightly more complex, setup would be the ability to save revisions to an external datasource, apart from the main site database connection, similar to storing archival data on a separate server.

Whether revisioning exists for legal requirements or as a peace-of-mind mini backup solution, the day-to-day need for these content pieces is almost nonexistent. Couple this with the outsized storage footprint of revisions compared to the live content, and it makes little sense to have these database table entries used in joins and sitting alongside the main content and site structure, especially once they number in the millions of rows.

While there’s no white paper or real-world example to point to for this, it’s an avenue worth exploring given the tremendous potential benefits.

Lastly, pruning old revisions from a database is an extremely slow operation. Do not underestimate this need, or its server impact, should it arise.

4. Use Search to power content

Overall, databases are incredibly inefficient at contextual content selection. It can be just as inefficient to hand-write logic to pull contextual content as part of a development task, and we should never attempt to reinvent that wheel.

Search platforms, whether Elasticsearch, Apache Solr, or any of the other options out there, have been engineered by their respective communities to provide optimized queries for contextual content. It’s almost literally what they were built to do.

Pulling content lists from these search platforms should be the default architecture for any component on a site. For Drupal, Search API makes this almost as straightforward as a database query. While admittedly not always possible, pulling content this way provides huge performance gains and produces similar, if not identical, results to a database query.

A “Top 5 list of all Technology stories” requirement for a block on a page can often produce the same or similar results whether you query entities through a taxonomy relationship or pull back an Elasticsearch or Solr result with the same parameters. At the scale of millions of nodes, the search backend will be much more performant.
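As a hedged sketch of that block's query (the index machine name content, the field name field_tags, and the term ID are all hypothetical), the Search API code might look like:

```php
<?php

use Drupal\search_api\Entity\Index;

// Load a hypothetical Search API index backed by Solr or Elasticsearch.
$index = Index::load('content');
$technology_tid = 42; // Hypothetical "Technology" term ID.

// "Top 5 Technology stories": filter, sort, and limit on the index.
$results = $index->query()
  ->addCondition('field_tags', $technology_tid)
  ->sort('created', 'DESC')
  ->range(0, 5)
  ->execute();

// Result items reference the underlying entities.
foreach ($results->getResultItems() as $item) {
  $node = $item->getOriginalObject()->getValue();
}
```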

5. Pagination can hurt 

Sometimes what seems like the most insignificant UX component can have serious performance implications. Take, for instance, the pager on views: if your view uses the “full” pager, Drupal first runs a count query to determine the total number of records in the view. On a site with millions of pieces of content, this will bog down the site. A better solution is the “mini” pager, which removes the expensive count query. Note that content includes media as well, so be mindful that the views powering how you add media to your content can also be the culprit.
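As a minimal sketch of that change (the module name, view ID, and display ID here are hypothetical; the same switch can be made in the Views UI or by editing the view's exported config), an update hook can swap the pager plugin:

```php
<?php

use Drupal\views\Views;

/**
 * Swap the full pager for the mini pager on a hypothetical content view.
 */
function mymodule_update_9001() {
  $view = Views::getView('content_admin');
  $view->setDisplay('page_1');

  // Keep the existing items-per-page settings; only the plugin changes.
  // The 'mini' pager pages results without running the count query.
  $pager = $view->display_handler->getOption('pager');
  $pager['type'] = 'mini';
  $view->display_handler->setOption('pager', $pager);
  $view->save();
}
```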

Often this manifests only on the /admin pages of a site, where content lists spanning millions of nodes become problematic for editors. But the database does not discern between admins, content editors, and regular site visitors: slow queries caused by paginated content in the editorial workflow will drag down the entire site.

Conclusion

Complex Drupal projects are a unique breed, often serving many sites on a platform. It takes an experienced development team to “wrangle the beast” and execute a supersized project.

From custom editorial workflows to robust search functionality, Mediacurrent navigates build complexities to bring sites like Weather.com and Mass.gov the right foundation for digital greatness. We want to hear from you: What's the biggest technical hurdle you've faced on a large-scale Drupal build? Tell us in the comments or drop us a line.
