Rolling restarts for mongrel_cluster_ctl

Published on 04/14/09

I have been doing a lot of updating on Net-at-hand over the last couple of months while working on the plugin architecture.

Before the server upgrade I did recently, restarting my cluster of mongrels was kind of dicey. The server was so deep into swap, I guess, that some of the mongrel processes would refuse to restart and would just hang. Often I would have to go in and kill the processes by hand, clear out my pids folder, and start the cluster back up. Needless to say, I am sure plenty of requests saw the old “We’re down for maintenance” sign.
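(For the curious, the by-hand recovery went roughly like the sketch below. The application path is hypothetical and the pidfile names assume mongrel_cluster’s defaults, so treat it as a sketch rather than a recipe.)

    # Kill the mongrels that refuse to die on their own.
    pkill -f mongrel_rails

    # Clear out the stale pidfiles so the cluster will start cleanly.
    # (Hypothetical path; mongrel_cluster names pidfiles mongrel.PORT.pid.)
    rm -f /var/www/app/current/tmp/pids/mongrel.*.pid

    # Bring the whole cluster back up.
    mongrel_rails cluster::start -C /var/www/app/current/config/mongrel_cluster.yml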

Upgrading the server and plugging some of the memory leaks solved most of that. The mongrels restart almost instantly now, but I was still dropping requests because all of the mongrels would get stopped together and then restarted together.

After some googling, I found this patch for mongrel_cluster_ctl that restarts the mongrels one at a time, so requests keep getting served by the working mongrels while each one is taken care of. I tried it for the first time this morning and I am smiling. I did a smattering of page reloads while restarting and only one was a little slow (while that mongrel finished booting). Not one was turned away.
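I haven’t pasted the patch itself here, but the effect is roughly the loop below. The ports and config path are made up for the sketch; --only is mongrel_cluster’s option for limiting a cluster command to a single mongrel.

    # Rolling-restart sketch: cycle one mongrel at a time so the
    # others keep serving requests. Ports are hypothetical.
    for port in 8000 8001 8002 8003 8004 8005; do
      mongrel_rails cluster::restart -C config/mongrel_cluster.yml --only $port
      sleep 5   # give the fresh mongrel a moment to finish booting
    done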

Now, a couple of issues remain with this approach.

  • One is that I can’t perform database migrations with this approach. With a rolling restart, old versions of the application are still running while the new database schema is in play, and I can’t have that. I would rather people see the “site maintenance” page than the “oops!” page.
  • The second issue has to do with my front-end server, nginx. Right now, I am running six mongrel instances and nginx is proxying requests to them. However, nginx uses a round-robin proxying strategy, basically just going down the list in order. This has generally worked OK for me, but if nginx sends a request to a mongrel that is mid-restart (rather than skipping to the next one), I imagine that request gets dropped; see the config sketch after this list. I am planning to fix this in the not-too-distant future because nginx’s proxying also creates issues when someone is uploading a large file (which seems to be happening more regularly now!).
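For what it’s worth, the upstream block I am talking about looks roughly like the sketch below (ports and server name made up). From what I have read, pairing nginx’s proxy_next_upstream directive with max_fails/fail_timeout on each server line should make nginx skip a mongrel that refuses connections mid-restart and retry the next one, which may be most of the fix:

    # Hypothetical ports; the real cluster layout may differ.
    upstream mongrels {
      server 127.0.0.1:8000 max_fails=1 fail_timeout=10s;
      server 127.0.0.1:8001 max_fails=1 fail_timeout=10s;
      server 127.0.0.1:8002 max_fails=1 fail_timeout=10s;
      server 127.0.0.1:8003 max_fails=1 fail_timeout=10s;
      server 127.0.0.1:8004 max_fails=1 fail_timeout=10s;
      server 127.0.0.1:8005 max_fails=1 fail_timeout=10s;
    }

    server {
      listen 80;
      server_name example.com;   # hypothetical
      location / {
        # If a mongrel refuses the connection (say, mid-restart),
        # retry the request on the next server in the list.
        proxy_next_upstream error timeout;
        proxy_pass http://mongrels;
      }
    }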