This is what I woke up to four days ago.

By now, anyone reading this entry – and thousands of people who aren't reading this entry – has probably seen my Double-Boiling Your Hard Drive article. I wrote that one up because I thought it was fascinating, and submitted it to Slashdot, Digg, and Reddit just for fun. I got a small pile of hits off those submissions, but not many, and I went to bed assuming those posts would simply fade into oblivion.

While I was sleeping, a friend of mine made a Reddit entry that, to say the least, did quite a bit better. I woke up and my site was so hammered that even I couldn't access it.

Since I spent a lot of time trying to figure out what the issue was and (once again) ended up finding a solution myself, I'll explain the problem and eventual solution here along with a lot of technical garbage that most people probably don't care about. I am, however, going to try to make it understandable to non-geeks – so if you want a bit of a view into how webservers and performance work, read along.

The entire problem stems from the basic method that you use to do communications across the Internet. Here's the simplest way you can write an Internet server of any kind:

Repeat forever:
  Wait for a connection
  Send and receive data
  Close the connection

This seems reasonable at first glance, but there's one huge issue – you can only process one connection at a time. If Biff connects from a modem, and it takes me 20 seconds to send a webpage to him, that means there's 20 seconds during which nobody else can communicate with the server. Can't download, can't even connect. Worse than that, even the fastest browsers are going to take a few seconds to load a large complex webpage like this one, and that would kill performance completely.
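For the geeks, here is a minimal sketch of that loop in Python. It is an illustration only, not what Apache does internally: the function name serve_one_at_a_time, the "You said:" reply, and the two-connection limit are all made up for the demo, and the loop is bounded so the example can finish.

```python
import socket
import threading

def serve_one_at_a_time(listener, max_connections):
    """Handle connections strictly one after another."""
    served = 0
    while served < max_connections:      # "repeat forever", bounded so the demo can stop
        conn, _addr = listener.accept()  # wait for a connection
        with conn:                       # leaving this block closes the connection
            request = conn.recv(1024)    # receive data
            conn.sendall(b"You said: " + request)  # send data
        served += 1
    return served

# Demo on the loopback interface; port 0 lets the OS pick a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
port = listener.getsockname()[1]

server = threading.Thread(target=serve_one_at_a_time, args=(listener, 2), daemon=True)
server.start()

replies = []
for message in (b"hello", b"again"):
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(message)
        replies.append(client.recv(1024))

server.join(timeout=5)
listener.close()
print(replies)
```

While the server is inside the with-block talking to one client, nobody is calling accept(), so every other client simply waits in line. That is Biff's 20 seconds.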

There are many, many solutions to this. The most common web server, Apache, seems to have two main modes you can run it in – prefork and worker – which implement two of those solutions.

First off, prefork. In prefork mode, Apache forks the webserver process into a number of processes, which means that there are actually, say, ten copies of Apache running simultaneously, each handling one connection at a time. This isn't nearly as bad as it sounds. Any modern operating system is going to realize that most of the data Apache loads, like the program itself and likely all of its configuration data, doesn't change – and if it doesn't change, there's no reason to make one copy of it for each Apache. It's shared among all the processes. If it's shared, you only need a single copy. Low memory usage and everyone is happy.

Unfortunately, Apache isn't the only thing that goes into a standard webserver now. The journal software that I'm using on this site – WordPress – is written in a scripting language called PHP. Scripting languages need an interpreter to work, and so Apache runs the PHP interpreter – one copy per process. The interpreter code itself gets shared in the same way Apache does, but all the temporary structures it builds and all of its working space aren't shared.

It actually gets worse from here. As part of their normal function, most programs allocate and deallocate memory. If they need to load in a big file, they allocate a lot of memory, then deallocate it when closing the file. When they allocate, they first check to see if there's any "spare space" available that the program has already received but isn't currently using. If not, they request more space from the OS. However, most programs will not return space to the OS at any point. They'll just return that unused space to their local pool, the aforementioned "spare space". This means the OS can't know exactly what the program is or isn't using at any one time. The process eventually stabilizes at whatever its worst-case is – if you have one horrible page that takes eight megs of RAM just on its own, and you have your program load pages randomly, your program is going to reach eight megs and then sit on that eight megs forever – even after it's done dealing with that nasty page.
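Here is a toy model of that behaviour in Python. It is a deliberately simplified sketch: the PooledProcess class and the page sizes are invented for illustration, and real allocators are more subtle than this, but the "never goes back down" effect is the same.

```python
class PooledProcess:
    """Toy model of a process that never hands freed memory back to the OS."""

    def __init__(self):
        self.from_os_mb = 0  # total memory obtained from the OS so far
        self.in_use_mb = 0   # memory the program is actively using right now

    def allocate(self, mb):
        self.in_use_mb += mb
        if self.in_use_mb > self.from_os_mb:
            # The spare space ran out, so ask the OS for more.
            self.from_os_mb = self.in_use_mb

    def free(self, mb):
        # Freed memory goes back to the local pool, not to the OS.
        self.in_use_mb -= mb

    def serve_page(self, mb):
        self.allocate(mb)
        self.free(mb)


proc = PooledProcess()
for page_mb in [2, 3, 8, 1, 2]:  # hypothetical page costs, one nasty 8 meg page
    proc.serve_page(page_mb)

print(proc.in_use_mb, proc.from_os_mb)  # 0 8
```

The process is using nothing at the end, but as far as the OS is concerned it still owns eight megs.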

As a result of this, if you're running ten Apache processes, you will eventually be using ten times the maximum amount of RAM that Apache+PHP could use on any one page on your site. That's painful.

In my case, this site is running on a virtual server with 256 megs of RAM. My average Apache process was eating about 12 megs, and MySQL was consuming another 50 megs, and the OS was taking another chunk. I couldn't get more than ten processes running without absolutely killing my server. (When MySQL crashes due to running out of RAM I really don't care that I can serve error pages 50% faster.)
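As a rough sanity check on those numbers, the arithmetic looks like this. Only the figures quoted above are used; the size of the OS's chunk isn't stated, so the sketch just reports what is left over for it rather than guessing.

```python
TOTAL_RAM_MB = 256
MYSQL_MB = 50
APACHE_PROCESS_MB = 12
PROCESSES = 10

apache_mb = PROCESSES * APACHE_PROCESS_MB
left_for_os_mb = TOTAL_RAM_MB - MYSQL_MB - apache_mb

print(apache_mb, left_for_os_mb)  # 120 86
```

Ten processes plus MySQL already claim 170 of the 256 megs, and every additional Apache process takes another 12 out of whatever the OS and everything else were relying on.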

And this is why, despite the fact that the CPU load was negligible, the site was still completely inaccessible. I had plenty of CPU to generate and send more pages with. I was swimming in spare CPU. But no matter how much CPU I had, I couldn't possibly service more than ten users simultaneously.

Now, back to those Apache modes! What I've just described up there – with one process per connection – is "prefork" mode. There's another newer mode called "worker" mode.

In worker mode, Apache spawns one thread per connection. You can think of threads as sub-processes – they run inside the same process and all have access to the same memory and data.

Remember all that stuff I wrote about programs not returning memory to the OS? They don't return memory to the OS – but they'll gladly return it to the process's own pool. Every copy of PHP that gets run can reclaim the exact same memory and re-use it, even while its siblings are sending and receiving data.

By default, worker mode spawns 25 threads per process, with multiple processes if it needs more connections than that. Under heavy load, each thread spends most of its time sending or receiving data (Biff's crummy modem again). In reality only one or two threads will actually be running PHP at a time – so the memory usage for this single process is, at most, about twice that of a single prefork process. But we can now handle 25 times as many connections.
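The thread-per-connection idea can be sketched in a few lines of Python. This is a simulation under stated assumptions, not Apache's implementation: the function names are made up, time.sleep stands in for waiting on a slow client, and the 0.2 second delay is arbitrary.

```python
import threading
import time

def handle_connection(results, index, send_seconds):
    page = f"page {index}"    # stand-in for the quick PHP step
    time.sleep(send_seconds)  # stand-in for waiting on a slow client
    results[index] = page

def serve_with_threads(connections, send_seconds):
    """Give every connection its own thread and time how long they all take."""
    results = [None] * connections
    threads = [
        threading.Thread(target=handle_connection, args=(results, i, send_seconds))
        for i in range(connections)
    ]
    start = time.monotonic()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, time.monotonic() - start

pages, elapsed = serve_with_threads(25, 0.2)
print(len(pages), f"{elapsed:.2f}s")
```

Handled one at a time, 25 clients at 0.2 seconds each would take 5 seconds. With one thread each, the waits overlap and the whole batch finishes in roughly the time of a single slow client.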

I finally got this mode up and running, and suddenly my site was not just usable, but 100% responsive. No slowdown whatsoever. However, you'll notice I haven't given any kind of detailed instructions on making this work, and there's a good reason for it. This is a terrible long-term solution and I was crossing my fingers the entire time, hoping it wouldn't melt down.

Here's the issue with threading. Imagine you have a blind cook in a kitchen. (I'm avoiding the classic car analogy.) He can cook easily, because he knows where everything is, and he knows where he may have moved things – he can take a pot down, put it on the stove, chop an onion, toss it in the pot, and the pot is still there. He's blind, but it doesn't matter, because nobody is mucking about with his kitchen besides him. No problems.

Now imagine that we have a huge industrial kitchen, with fifty blind cooks, all sharing the same stovetops and equipment. The cooks would get pots mixed up, interfere in each other's recipes, and there would probably be a lot of fingers lost. Threading, unless you're careful, can be equally catastrophic – all the threads work in the same memory space and they can easily stomp all over each other's data.
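In code, "being careful" usually means putting a lock around anything shared. A minimal Python sketch, with a made-up SharedPot class standing in for shared memory:

```python
import threading

class SharedPot:
    """One pot that every cook (thread) adds onions to."""

    def __init__(self):
        self.onions = 0
        self._lock = threading.Lock()

    def add_onion(self):
        # "onions += 1" is really read, add, write. The lock makes sure
        # only one cook is in the middle of that at any moment.
        with self._lock:
            self.onions += 1

def cook(pot, count):
    for _ in range(count):
        pot.add_onion()

pot = SharedPot()
cooks = [threading.Thread(target=cook, args=(pot, 10_000)) for _ in range(8)]
for thread in cooks:
    thread.start()
for thread in cooks:
    thread.join()

print(pot.onions)  # 80000
```

Take the lock away and two cooks can read the same count at the same time, in which case one of their onions can go missing. Whether it actually happens on any given run depends on timing, which is exactly what makes this kind of bug so nasty.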

PHP, in theory, is threadsafe. Some of the libraries that PHP calls are threadsafe. Not all of them. It worked, for a day – but I wouldn't want to rely on it long-term.

There is a solution to this. It's just a horrible bitch.

There's a module called FastCGI that you can use with Apache in worker mode. FastCGI is threadsafe. FastCGI can be set up to call a specially-built version of PHP, and do so in multiple processes so PHP doesn't even have to be threadsafe. To make things even better, FastCGI keeps a small "pool" of PHPs – perhaps three or five – but nowhere near one per connection. This does mean that you can only have five PHP sessions running at once, but remember that PHP processing is fast on our server! Apache is smart enough to read all the input, do all the PHP processing quickly in memory, and then sit there waiting for Biff's modem to acknowledge all the data. Five PHP instances can easily service a few dozen connections.
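The pooling idea, a handful of workers shared by many requests, can be sketched like this in Python. To be loud about the assumptions: FastCGI uses separate PHP processes, while this sketch uses a thread pool purely because it is shorter to demonstrate, and render_page and serve_with_pool are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def render_page(request_id):
    # Stand-in for the PHP step; reports which pool worker handled the request.
    return request_id, threading.current_thread().name

def serve_with_pool(request_count, pool_size=5):
    """Push many requests through a small, fixed-size pool of workers."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(render_page, range(request_count)))

results = serve_with_pool(40)
workers_used = {worker for _request_id, worker in results}
print(len(results), len(workers_used) <= 5)  # 40 True
```

Forty requests get rendered, but never by more than five workers. Requests that arrive while all five are busy wait their turn, which is fine as long as each render is quick.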

Unfortunately, Debian Linux (and likely others) doesn't have particularly good native support for this. All the modules do exist in one form or another (apache2-mpm-worker, libapache2-mod-fastcgi, php5-cgi) but just installing them doesn't do the trick – you need to hook them together. Luckily, the FastCGI FAQ does mention everything you'd need for this (look under "config"). It's annoying to set up, but it's not really difficult – just irritating.

FastCGI on its own doesn't solve all the problems. WordPress is, actually, a CPU-hungry beast. Five PHP instances might be able to service a few dozen connections, but not hundreds – WordPress pages involve a lot of database queries and a lot of work. But solving this issue can be accomplished nicely by installing WP Supercache – it will cache pages as they're displayed and it hugely decreases CPU usage, meaning that those same five PHP instances can now serve well over a hundred connections.
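The general idea behind a page cache is simple enough to sketch in Python. This is only the concept, not how WP Supercache is implemented: the PageCache class and slow_render function are made up for illustration, and a real cache also has to throw pages away when the content changes.

```python
class PageCache:
    """Render each page once, then hand out the stored copy."""

    def __init__(self, render):
        self._render = render
        self._pages = {}
        self.renders = 0  # how many times the expensive work was done

    def get(self, url):
        if url not in self._pages:
            self.renders += 1
            self._pages[url] = self._render(url)
        return self._pages[url]

def slow_render(url):
    # Stand-in for all the database queries and PHP work behind one page.
    return f"<html>contents of {url}</html>"

cache = PageCache(slow_render)
for _ in range(100):
    page = cache.get("/double-boiling")

print(cache.renders)  # 1
```

A hundred visitors hitting the same article cost one render instead of a hundred, which is why the CPU usage drops so sharply when everyone arrives for the same page.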

Before these changes, my server couldn't handle more than ten simultaneous connections. I've used website stresstesting software since – I've managed up to 400, and the server doesn't even break a sweat. I can't do higher because my connection starts dying horribly.

There's no real excuse for any modern server to have trouble with this sort of load, unless it's doing extremely heavy noncacheable processing or getting hundreds of simultaneous connections. Computers are fast, and getting faster all the time – at this point I'd love to see this site get Digged or even Slashdotted, because I'm truly curious what it could stand up to.

I'm hoping that someone with this same problem will find this page and be able to fix it quickly. It's not that hard – it's just kind of annoying.

First I'm going to show you a picture, just to get your attention.

I have a rather old computer case that I've been lugging around for years. It's a Hush Technologies Mini-ITX. I don't think they even make these systems anymore – I got mine many, many years back, and it was one of the first they produced.

The Hush Mini-ITX was a near-silent computer, before silent computers were anywhere close to as easy to build. It used a Mini-ITX board, a small quiet low-power motherboard that frequently had a small fan for cooling, but instead of the standard fan it used a heat pipe connected to the side of the case. The case itself acted as a large heatsink and radiator for the CPU. The hard drive was enclosed in a heat-conductive but noise-silencing frame. Overall, a clever design.

It's been a hardy case. I can't say the same for the components put inside of it. It runs hotter than I really want – so far it's on its third hard drive, its third motherboard, and its second power supply. Last time I swapped the hard drive when it started getting a bit noisy – not "there are things banging around in the hard drive case" noisy, but "its hum is getting louder". I figured the same thing would work this time, so in my recent upgrade I included a spare hard drive for it. Standard replacement deal – turn the system off, plug the extra hard drive in, toss a SystemRescueCd in the drive. And it refused to detect either hard drive.

Eventually I figured out that my old hard drive was deeply, deeply unhappy. It wouldn't show up in BIOS (and neither would any other drive on that IDE chain) and it wouldn't even initialize – it would just sit there and click. Click. Click. Click. Click-whir. Click. Click. Click. It was spinning up just fine . . . although after enough clicks, it would spin down again. It just wasn't showing up as a hard drive to the computer.

I did a lot of research and tried the standard recovery tricks. Apparently there's a rather infamous hard drive Click of Death, but it's more of a general symptom than a specific cause, and the causes can be anything ranging from "your hard drive is somewhat old" to "your drive head is now bent at a ninety degree angle". So that didn't really help me diagnose it, much less solve it.

The tricks are, it must be said, odd, but I tried them anyway. Freezing the drive didn't help – if anything, it made the click noisier. Banging the drive gently didn't help. At this point I had kind of given up, so I tried banging it more emphatically and that didn't help either. That's most of the standard tricks.

So I sat there, with a slowly thawing hard drive sitting on the desk in front of me, and thought.

One of the possible reasons for the Click of Death was that the heads had gotten misaligned, either vertically or horizontally, or in some combination of the two. Another possible reason was that the heads had actually gotten stuck on something. If I could jar the heads loose, or get it started, it might function fine after that. And it had been working just peachy-keen in the computer beforehand – I hadn't even realized it was defective, just old.

So if the heads are just stuck . . . and freezing the drive makes it louder . . . well, brief diversion. If you have a jar that you can't open, there's a trick to getting it open. You run the jar under hot water. The lid expands, and the neck expands, and that also means the gap between the lid and the neck expands. And that makes it easier to open. Now, if I heated up my hard drive, perhaps the same thing would happen. On top of that, the drive had been quite a bit warmer when it was working – it had been encased in that soundproof frame I mentioned before. What if I brought it to near that same temperature before trying?

Obviously I didn't want to melt the drive, or burn it, or get it wet. This is exactly what a double boiler is for, and you can approximate a double-boiler easily using two pots. Thus the picture at the beginning of this entry.

I heated the drive up until it was bordering on "hot to the touch". I figured that was around how hot it was before. I plugged it in, and . . .

. . . well, apparently I've now invented a new way of repairing hard drives. I copied over the most vital stuff, moved it to a different computer quickly (I've never been afraid of a component cooling down before, but I suppose there's a first time for everything) and successfully took a disk image of it. Worked 100% perfectly. I can't find any references to this technique online, so perhaps I really am the first one to try it.

I can't say I recommend this as a standard repair method, and obviously this is no substitute for professional repair services. But if you've tried freezing your hard drive, smacking your hard drive with a hammer, and all the other "normal" tricks . . . maybe it's time to try double-boiling it.

 

On another subject, I will admit that this has little to do with Mandible Games. I've just been kind of busy lately, in entirely uninteresting ways. First off, Mandible Games almost has a logo – I'm just asking for a few minor changes before I finalize it and put it up. Second, I've been doing a lot of work on the interface to D-Net. I want people to be able to change the game's resolution and aspect ratio, and that takes a lot of effort to make the menus work sanely. Third, I got a new computer and almost lost a lot of data – obviously that's a bit of a slowdown as well.

My todo list, however, is getting shorter and shorter. Right now there are only six items left before I actually release a public demo version. The first version is going to be Windows, since that's what I develop on natively, but D-Net builds perfectly fine on both Linux and OSX – all I need to do is figure out how Linux and OSX packaging and installing work.

The first version also isn't going to include online play or single-player play, just to warn you, but it should give a sense of what the game is like, and if you have some friends who want to blow you up in tanks (and, ideally, some USB game controllers), it'll work just fine for that.

I think that's the current State of Mandible. Double-boiling hard drives and writing uninteresting UI code. Yep. That's about the size of it.