Server Failure
Over the weekend one of my active servers went into a strange state and took all its virtual machines down with it. Normally that’s no big deal; it’s why I use Xen and DRBD. If the active machine dies, all data is current up to the failure and all virtual machines can start on the backup. The problem is that I don’t have any type of fencing. Fencing isolates a server so that it never tries to access any type of shared disk, claim virtual IPs that it should not use, serve bad data, etc. This is needed when a failure is not clean, and this failure was far from clean. SSH was dead to everything (other than the backup), mail would accept connections but not show a banner, web was dead but pingable, and the database was up just fine.
This meant I could not safely bring up the secondary as active without risking corrupted data and all the other badness described above. In this situation I chose to leave services down until physical access was available to safely fence the server.
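As a rough sketch of the rule I was following, here it is in Python. The names (PeerStatus, safe_to_promote) are made up for illustration and are not part of Xen, DRBD, or any cluster tool; it only shows the decision, not how the status would actually be gathered.

```python
from dataclasses import dataclass


@dataclass
class PeerStatus:
    """Hypothetical snapshot of what the backup can tell about the primary."""
    responds_to_ping: bool
    services_healthy: bool
    confirmed_fenced: bool  # e.g. power cut or switch port disabled


def safe_to_promote(peer: PeerStatus) -> bool:
    """Return True only when the backup can go active without risking split-brain."""
    if peer.confirmed_fenced:
        # The primary is isolated, so it cannot touch the shared disk
        # or claim the virtual IPs any more.
        return True
    # Without fencing, a half-dead primary may still be writing data or
    # holding IPs, no matter how broken it looks from the outside.
    return False


# Roughly the weekend's situation: pingable, services broken, no fencing.
weekend = PeerStatus(responds_to_ping=True, services_healthy=False, confirmed_fenced=False)
print(safe_to_promote(weekend))
```

Note that the ping and service checks do not change the answer on their own. An unclean failure looks unhealthy but says nothing about whether the machine is still writing, which is the whole reason fencing is needed.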
I think now I might do an RFQ for either a new switch on which I can down ports or a power system with switchable power so I can STONITH.
I think I made the right choice. Most of my customers are dairy based and they had proofs this last week, which is a big time of the year. It’s like earnings reports for Wall Street: a big deal. Those proofs were all tied up in a dead server.
Everything is back up now. You can read the ‘official’ notice at www.mlds-networks.com

