I really enjoy this kind of work. Each time I do something like this I learn a lot about building robust infrastructure for Ruby on Rails. To think that when Net-at-hand was first making its way into production almost three years ago, I was running the application via FCGI on a shared host.
Now I have my own server with a specially configured web server, application servers, and now process monitoring and load balancing. I have learned a bunch. I'm not going to take the time to flesh out this post with all the details, but I have spent so much energy on this that I have to say something.
The process monitoring system that I have put up watches how much RAM and CPU each of my mongrel instances is using. If either goes over a certain threshold, the process is restarted automatically (killed if need be).
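A tool like monit can express this kind of rule. Here's a rough sketch of what such a check might look like; the names, paths, ports, and thresholds are all made up for illustration, not my actual configuration:

```
# Hypothetical monit check for a single mongrel instance.
# Restart it if it eats too much memory or pegs the CPU.
check process mongrel_8000 with pidfile /var/run/mongrel.8000.pid
  start program = "/usr/bin/mongrel_rails start -d -e production -p 8000"
  stop program  = "/usr/bin/mongrel_rails stop -p 8000"
  if totalmem > 110 MB for 2 cycles then restart
  if cpu > 80% for 3 cycles then restart
```

One such block per mongrel instance covers the whole cluster.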
With the introduction of a plugin architecture for Net-at-hand last year, there are several problem areas within specific plugins that need to be addressed. In the worst-case scenario, a plugin would use up so much RAM and CPU that it would grind my server to a halt. When this happened, I would get a text message and have to manually go in and restart the processes.
This was fine when it didn't happen often, but it had been happening more regularly over the last few weeks, and last Sunday night I slept through one such incident and Net-at-hand was down for about four hours (right when one of my resellers was going live with a big client site).
Now I don’t even have to think about this, because it happens automatically. Of course I am still manually checking the health of the system, but it doesn’t need nearly as much involvement from me.
The web server that I use, nginx, has built-in reverse proxying and simple load balancing. This worked pretty well for the most part, until I introduced the plugin architecture and some requests started taking too long to render. What happens is that nginx happily continues to send requests to a busy mongrel, and they pile up behind the slow one. So Net-at-hand was sporadically slow whenever this happened.
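For context, the basic nginx setup looks something like the fragment below (ports and names are illustrative). The key point is that the `upstream` block just round-robins: each request goes to the next server in the list, whether or not that mongrel is already busy chewing on a slow request:

```
# Illustrative nginx reverse-proxy config, not my exact one.
upstream mongrels {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 80;
    location / {
        # Round-robin across the mongrels, blind to how busy each one is.
        proxy_pass http://mongrels;
    }
}
```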
I spent a day testing a load balancer I had used previously (Pound) before I realized that it didn't do any better. It would skip a backend server that was down (which I wanted), but it would still send requests to one that was up but bogged down with something else.
After a bunch more googling, I found that haproxy can limit the number of concurrent requests it sends to each backend server, effectively letting me queue requests at the load balancer and send them to the backends one at a time. This way, if one backend is busy, faster requests don't get held up behind it; they go to the backends that aren't busy.
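The setting that does the work is `maxconn` on each `server` line. A sketch of the idea, with made-up names and ports rather than my real config:

```
# Illustrative haproxy config. maxconn 1 means each mongrel handles
# at most one request at a time; everything else queues in haproxy.
listen mongrels 127.0.0.1:8100
    balance roundrobin
    server mongrel0 127.0.0.1:8000 maxconn 1 check
    server mongrel1 127.0.0.1:8001 maxconn 1 check
    server mongrel2 127.0.0.1:8002 maxconn 1 check
```

Since mongrel processes a single Rails request at a time anyway, there's no benefit to stacking requests up behind one, and `maxconn 1` lets haproxy dispatch each queued request to the first backend that frees up.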
So far, I am very happy with how it is working. We’ll have to see if I stay happy, but it looks like I should be able to use this setup for a very long time.