Announcement - We had a problem, we admitted it, we solved it
By Mike (October 7th, 2020)
On the 16th of July 2020, one of our servers failed. This failure meant that a number of our customers were unable to access their data. Their field service staff were able to carry on working, but the office staff were not able to do anything.
We became aware that there was a problem at 7:35 am. The first customer report was at 7:56 am. We confirmed that the systems were not accessible. We tried a remote reboot (because switching it off and then on again does actually fix most IT issues). This did not solve the problem, so we switched the affected customers onto their secondary servers. By 8:05 am our customers were back up and running without any data loss.
By monitoring our servers we were aware of the problem before our customers reported it. By admitting we had a problem we could focus on resolving it. By taking ownership of the problem we minimised the stress and disruption to our customers.
What actually went wrong? It turned out that the RAID controller on that server had failed. This meant that the disks holding the customers' databases were no longer accessible. Fortunately, we had designed the cluster of servers to cope with this. We have mirrored disks within each server and full replication to a secondary server in a different location. This means we will not lose any data in the event of a failure. Our customers cannot afford to go back to last night's or last week's copy of their data. Can you?
If your current system provider does not admit when they have an issue, doesn't own the problem or, worse, loses your data, then contact us about providing you with an alternative service management solution.