Well, yesterday was an eventful day, to say the least. At approximately 9am GMT, a cPanel bug on one of our servers corrupted its DNS zone files. Because our DNS runs in a cluster, it wasn’t long before the corrupted zone files were synchronised to every server, corrupting all zone files on all servers. The zone files contain the information that makes up your DNS settings.
By chance, Chris was able to catch the problem early on. By sheer luck, a customer was using dedicated IPs for his nameservers (and was therefore no longer on the cluster) on the very server where the bug first appeared. While Chris managed to fix a few hundred files, the other servers had already synchronised, and so the problem had spread across most of our systems.
Chris fought to keep his eyes open, but eventually he needed his sleep. He delegated the fix to our 24×7 staff, and the lucky folks on shift were charged with going through thousands of zone files, checking their integrity and repairing those that needed it.
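For the curious, here is a minimal sketch of what that kind of triage can look like, assuming BIND-style zone files under /var/named (a common cPanel layout) and the stock named-checkzone utility. The paths and file naming are assumptions for illustration; this is not the actual script our staff used.

```python
#!/usr/bin/env python3
"""Sketch: flag zone files that fail a syntax/integrity check."""
import subprocess
from pathlib import Path

ZONE_DIR = Path("/var/named")  # assumed location of the zone files

def zone_is_valid(zone_name: str, zone_file: Path) -> bool:
    """Return True if named-checkzone accepts the file."""
    result = subprocess.run(
        ["named-checkzone", zone_name, str(zone_file)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

corrupted = []
for zone_file in sorted(ZONE_DIR.glob("*.db")):
    zone_name = zone_file.stem  # e.g. example.com.db -> example.com
    if not zone_is_valid(zone_name, zone_file):
        corrupted.append(zone_file)

print(f"{len(corrupted)} zone file(s) need attention:")
for path in corrupted:
    print(f"  {path}")
```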
Whenever we thought we had just about finished, more corrupted files were found. Not to mention that the servers kept synchronising, which made the job much harder.
Chris woke up not long after and went straight back to work on the zone files. He, along with the rest of the team, did eventually manage to fix all the zone files by isolating one server from the cluster and working on it alone. The rebuilt zone files were then copied over to all the other servers, and the isolated server was brought back into the cluster.
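In outline, the propagation step might look something like the sketch below, which pushes a verified zone directory from the isolated server out to its peers. The host names and paths are hypothetical; this illustrates the isolate-repair-propagate idea rather than our actual tooling.

```python
#!/usr/bin/env python3
"""Sketch: push rebuilt zone files from an isolated server to peers."""
import subprocess

PEERS = ["ns2.example.com", "ns3.example.com"]  # hypothetical peer servers
ZONE_DIR = "/var/named/"  # trailing slash: copy contents, not the directory

for peer in PEERS:
    # rsync the verified zone files over SSH; --delete removes any
    # corrupted leftovers on the peer that no longer exist locally.
    subprocess.run(
        ["rsync", "-az", "--delete", ZONE_DIR, f"{peer}:{ZONE_DIR}"],
        check=True,
    )
    print(f"zone files pushed to {peer}")
```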
This was the first major outage we had experienced due to a problem with our own systems. Although we had no formal plan of action, the experience we had gained over the years dealing with minor problems had prepared us for such an event. Chris delegated tasks to other staff members and kept charge of the situation, while I handled things on the front line and continued to send out updates by email as I received them.
Despite the extreme workloads staff were under, we still managed to answer all support tickets in under an hour.
For the most part, customers were very understanding, as they know downtime has never really been a concern with us; we have maintained a steady 99.9% uptime throughout the years we have been around. However, we did get the odd client who decided they knew what they were talking about and demanded “answers” and an ETA. I did originally give an ETA based on the information I had received; however, once it became obvious that the issue was more complex than first anticipated, we stopped giving any ETA whatsoever.
One person in particular moaned about the lack of an ETA at first and requested a “ball park” figure, so we gave him one. He then asked for another ETA, but this time we told him we had none. For some reason he decided that we did have an ETA and demanded it; he failed to comprehend that “I have no ETA” really does mean we have no ETA. Afterwards, he complained about why we had given out an ETA in the first place.
But despite that, customers did adhere to our requests. Our forums served their purpose and allowed customers to communicate amongst themselves about the problem; some pointed out our faults, some praised us, and we appreciated both.
All that said, we did learn something from this experience, and we have already implemented measures to make sure this doesn’t happen again. All systems are fully functional, and things are back to normal.