6th October
Fsck me

Any Windows user from the days of 3.11 on should remember this fun scenario. You come home from a night out on the town to find your PC shut down when you distinctly remember leaving it on. You hit the power, it starts up, and you’re greeted with the beautiful “scandisk” screen as it does a health check on your drive. This was done because, plain and simple, the FAT file system used by Windows (and still used today on some systems) was dumb. File corruption was all too common, especially after an improper shutdown or power loss. Thankfully, with the more modern NTFS file system it’s not quite as needed or as common.

*nix, for all its supposed superiority, had the same issues for quite a while. In fact, in Linux, with its usual default file system ext3, you’ll see a similar screen (well, it’s black instead of blue) come up occasionally. This is fsck, or the file system check, which does pretty much the same thing. It checks the file system to make sure it’s coherent and consistent with what the system is expecting. This can take anywhere from a short amount of time (journaling in ext3, or UFS on Solaris 9+, etc. can greatly decrease the time) to a very long one, especially on larger volumes. In fact, LWN recently had an issue with this when their file server crashed, taking down the archives of all the mailing list postings. They didn’t post the size of the archive but stated that an fsck of it took over a week.

I recently had to deal with this at work as well. A customer had a 700GB array with about 300GB utilized. Due to a bad startup on the Solaris box, the array was seen as “dirty,” and so the box started to fsck it. And did, for twenty-two hours, until they killed it. When I finally talked her off the ledge, I explained that this was quite common and that, in fact, she probably could have seen an fsck time of a week or more on an array that large. There are other variables as well (file system chosen, file sizes, number of files, array configuration, etc.), but this seems to be an issue now. If an array of that size goes down, there’s little point in running fsck on it. You’re not really going to recover anything that might be “lost,” you’re just going to waste a mammoth amount of time.

What would you do: restore from a backup, taking 5 hours, or wait for an fsck to finish, which could put your machine out of production for a week?
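
For the curious, the back-of-the-envelope math can be sketched in a few lines of Python. This is only a rough illustration: the function name `faster_recovery` is made up for this post, the 5-hour and one-week figures are the ones quoted above, and it assumes downtime is the only thing you care about (it ignores whatever changed on the array since the last backup).

```python
# Hypothetical helper, not part of any real tool: compare two recovery
# options purely by how many hours the machine would be out of production.
def faster_recovery(restore_hours, fsck_hours):
    """Return (option_name, hours_saved) for the option with less downtime."""
    if restore_hours <= fsck_hours:
        return "restore", fsck_hours - restore_hours
    return "fsck", restore_hours - fsck_hours

HOURS_PER_WEEK = 7 * 24  # a week-long fsck, expressed in hours

choice, saved = faster_recovery(restore_hours=5, fsck_hours=HOURS_PER_WEEK)
print(f"{choice} wins by {saved} hours")  # prints "restore wins by 163 hours"
```

Even if the restore took five times longer than planned, it would still come out days ahead of a week-long fsck.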
