Blog posts from August 2007...

Facebook/Washington Post, Performance Tuning

This final post about my group's work (at Washington Post.Newsweek Interactive) on our Facebook Platform app The Compass is long overdue. But now the time has come! Let's talk Postgresql and Apache performance.

In the first two posts on this subject, I wrote about the Facebook Platform itself and the Compass' architecture. In this post, we'll look at some of the challenges we encountered while serving the app and areas we focused on to improve our Postgresql and Apache performance.

NOTE: All of this is anecdotal, based on my experience with this app. I'm no performance guru and don't hold myself up as such. I think, too, different applications have different needs, and the requirements of something like Facebook could not be optimal for other situations.

Caching Limitations

As I mentioned last time, all of FBML we load into a profile is cached and served by Facebook, but the hits to our application pages are hits to our servers as well. The first thing that comes to mind with Django is, "well, make sure you have caching enabled." There are a couple reasons why this doesn't work as well as one would like.

First, the caching for a Django site is bypassed when the request contains GET or POST data. Every request from Facebook contains POST data. Each callback request has a few fb_sig* parameters that are POSTed to your page to verify the request comes from Facebook. This is great for security and passing data from Facebook back to your application, but it kills the normal caching process for Django-based sites.

Second, each request can potentially be unique. In our case, the only Facebook canvas pages we serve are the one that submits the compass survey questions and the one to display the Flash map of your friends who have installed the compass. It's hard to do much low-level caching of Django querysets because you don't want to inadvertently give the user someone else's data. We do a little of this, though. See, for example, what we do here when we display the compass based on your last answer:


cache_key = 'compass_entries_%s' % facebook.user
compass_entries = cache.get(cache_key)
if not compass_entries:
    compass_entries = Compass.objects.filter(user__exact=facebook.user).order_by('-id')[:10]
    cache.set(cache_key, compass_entries, 60 * 15)

We also reset these entries in the cache when a user resubmits the survey. So we save a few DB hits if the same user retakes the survey a few times back to back. However, there's just not much in common across users to really take advantage of Django's cache. We're pretty well left to raw DB performance.

Bypass the ORM

One of the first things we did to help performance was to bypass Django's ORM. We store the user's answer to each question via a save method on the form that is submitted. Using the ORM this would look something like:


from politicompass.models import Compass
def save(self, uid):
    q1 = self.clean_data.get('q1')
    q2 = self.clean_data.get('q2')
    q3 = self.clean_data.get('q3')
    q4 = self.clean_data.get('q4')
    q5 = self.clean_data.get('q5')
    q6 = self.clean_data.get('q6')
    q7 = self.clean_data.get('q7')
    q8 = self.clean_data.get('q8')
    q9 = self.clean_data.get('q9')
    q10 = self.clean_data.get('q10')

    for i in range(1,11):
        answer = 'q%s' % i
        compass = Compass(user=uid, question_id=i, answer=answer)
        compass.save()

We refactored this before launch to bypass the ORM and excecute the INSERTs in one connection:


from django.db import connection
from politicompass.models import Compass
def save(self, uid):
    q1 = self.clean_data.get('q1')
    q2 = self.clean_data.get('q2')
    q3 = self.clean_data.get('q3')
    q4 = self.clean_data.get('q4')
    q5 = self.clean_data.get('q5')
    q6 = self.clean_data.get('q6')
    q7 = self.clean_data.get('q7')
    q8 = self.clean_data.get('q8')
    q9 = self.clean_data.get('q9')
    q10 = self.clean_data.get('q10')

    sql = ""
    for i in range(1,11):
        answer = 'q%s' % i
        sql += "INSERT INTO facebook_compasses (user, question_id, answer) VALUES (%s, %s, %s);" % (uid, i, answer)

    cursor = connection.cursor()
    cursor.execute(sql)
    connection._commit()

There were other performance-conscious moves we made along these lines, and still, once the app started to grow in popularity, we had users submitting that form in such numbers that our DB server load stayed at a freakishly high level. (NOTE: Prior to Facebook, we normally ran at about a .20-.35 load. Once the Facebook app launched, our load jumped up into the 3.00-4.30 range depending on site activity.)

Tuning Postgresql

I had already tuned Postgresql once for some spikes we had encountered when some of our apps were linked up by MSN and MSNBC. These tunings included raising the max_connections limit and bumping up the amounts for the following settings:


shared_buffers
work_mem
maintenance_work_mem
max_stack_depth

The most significant of these for us was shared_buffers. With the hits we had received from MSN and MSNBC, raising shared_buffers to about 1.6 GB (we have 8 on the box) and increasing max_connections was enough to keep us humming along nicely. With the Facebook traffic we had to increase shared_buffers to about half the available RAM on the box and everything dropped back to a sane level. We are running on Solaris and so we had to have our box increase the amount of shared memory available from the kernel in order to give so much RAM to shared_buffers, but again, once this happened, the load recovered amazingly well.

Hits Under Facebook

Just to toss out some raw numbers, when we first loaded our app to Facebook, we were doing about 5-10 hits a second during peak usage. We ended up doing about 2.5 million hits the first week, just from Facebook alone. We run 4 other sites off the same server. This is a single Postgresql server. We do have our two web servers behind a load balancer, and our static media is served from the normal media.washingtonpost.com setup. Needless to say, there are certainly higher numbers that other sites boast, but the single DB, with some tuning and planning, survived the spike pretty well.

Currently, we're doing about 10 million hits a month from this setup, and we're really at its limits now. To do much more, we'll have to look at replicating the database. Having said that, were it not for the Facebook traffic and a similar Newsweek week app bypassing the cache for reasons outlined above, I think we could easily do twice the traffic on the same setup. Caching really saves on DB load, so use it all you can if possible.

Apache Tuning

Luckily, we never felt the Facebook traffic from an Apache stand point. I will point out, for the sake of LAMP stack completeness, that the best trick I learned for Apache is to set MaxRequestsPerChild to something in the range of 500. This keeps Apache memory size down while also serving a decent amount of traffic per process. And if you don't know this already, never serve a Django-based site with DEBUG=True. Not only is it bad from a security stand point, but Django in DEBUG mode stores the queries run in memory, so you can quickly eat up your RAM if you forget to turn this off.

Again, this is just my experience of tuning our stack, so YMMV, but I hope sharing this info will prove useful.

Link | Posted by deryck on August 15, 2007 | 6 comments