In early spring this year (2024) there were a number of reports in the media about various retailers and food outlets having to shut down temporarily, while website and software issues were resolved. Sainsbury’s, for instance, experienced issues which led to the cancellation of online grocery deliveries and card payments, due to an overnight software update. During a similar period, Tesco resolved a technical issue that had affected a small proportion of its deliveries.
During March, some McDonald’s outlets in markets including Australia, New Zealand, Canada, Japan and Germany, as well as the UK, experienced an IT issue which left customers without their food. Greggs stores across the UK were hit by IT problems, which prevented them from accepting card payments. And, on top of that, Facebook went down for a few hours, which left many people with time on their hands.
Here at Lake Solutions, we’ve written before about having a recovery plan in place.
Post-pandemic, many businesses recognised the benefits of having a disaster recovery plan in place. But, if there is anything positive to be taken from Covid-19, it is the way in which the majority of companies have weathered the storm and been more resilient than they thought they could be.
Typically, a crisis is hard to plan for – we saw that with the pandemic – but a crisis can come in many guises. One tends to think of it being something like a fire, a flood or perhaps terrorism. But, an organisation should also consider the crisis implications of issues such as shareholder action, a product or service failure or a financial crisis.
Alongside keeping staff and customers safe, any crisis plan should also centre on safeguarding the reputation of the organisation. The mantra behind crisis PR is ‘tell it all, tell it fast and tell it true’. During a crisis, it’s important to have your social media channels monitored at all times, with the ability to post appropriate messaging and respond to questions from customers. We are sure the big chains turned to their crisis PR plans when their IT failed.
Today, many businesses rely on their website as a sales channel, so it is important that it is properly protected – not only from cyber attacks and security issues, but also from things such as power failures affecting servers or software updates going wrong (as happened recently at Sainsbury’s).
As is very often the case, it’s sensible not to keep all your eggs in one basket – but to share them out among a few baskets. If, as a business, you are looking to deploy change, such as a big software update, then it’s sensible to make sure you can prove that change works before deploying it out across all your outlets. Maybe have a phased roll-out of a new system or updated software. If it doesn’t work, what’s your roll-back plan? Are you able to roll it back?
One imagines that a large supermarket wouldn’t roll out new software without testing it first in a selection of stores, perhaps smaller ones. Then, as the initial tests go well, it would roll the software out in phases, perhaps segmenting the stores geographically or by size. If a problem appears part-way through this phased approach, it’s easier to roll back or rescue the situation.
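To make the phased idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any retailer actually deploys: the store names, the size measure and the `deploy`, `healthy` and `rollback` functions are all hypothetical stand-ins for whatever your own systems provide.

```python
from typing import Callable, Dict, List, Tuple


def plan_waves(stores: Dict[str, int], wave_sizes: List[int]) -> List[List[str]]:
    """Split stores into roll-out waves, smallest stores first.

    `stores` maps a store name to a size measure (hypothetical, e.g. number of tills).
    `wave_sizes` says how many stores go in each wave; any stores left over
    form one final wave.
    """
    ordered = sorted(stores, key=lambda name: (stores[name], name))
    waves: List[List[str]] = []
    start = 0
    for size in wave_sizes:
        wave = ordered[start:start + size]
        if wave:
            waves.append(wave)
        start += size
    if start < len(ordered):
        waves.append(ordered[start:])
    return waves


def phased_rollout(
    waves: List[List[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[bool, List[str]]:
    """Deploy one wave at a time and stop at the first wave that looks unhealthy.

    On failure, every store updated so far is rolled back (most recent first)
    and later waves are never touched. Returns (succeeded, updated_stores).
    """
    updated: List[str] = []
    for wave in waves:
        for store in wave:
            deploy(store)
            updated.append(store)
        if not all(healthy(store) for store in wave):
            for store in reversed(updated):
                rollback(store)
            return False, []
    return True, updated
```

The point of the design is that a fault only ever reaches the stores in the waves already attempted, and the roll-back plan is written down before the first store is updated.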
In a previous blog, we talked about the complexities of releasing new software code.
Traditionally, releasing new software code involved running two content management systems, with one being down while the other was upgraded, followed by the other. The deployment process was staggered and it could be complex. Then, one would always run the risk of pushing out this content only to find it broke the new environment.
Today, developers – particularly those working on consumer-facing applications or applications with critical uptime requirements – are much more likely to opt for blue-green deployment, as it is recognised as reducing downtime and risk.
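It works by running two identical hardware environments, referred to as blue and green, which are configured in exactly the same way. One serves live traffic while the new release is deployed to, and tested on, the other; traffic is then switched across. This means that there is never a live environment which is half upgraded, and it is seen as offering a cleaner and more reliable way of working. If there are any issues during the process, it is easier to roll back, because the previous version is still running in the other environment.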
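The switch-over logic can be sketched in a few lines of Python. This is a simplified model rather than a real deployment tool: the `BlueGreenRouter` class and the `smoke_test` callback are hypothetical names, and in practice the "switch" would be a load balancer or DNS change.

```python
from typing import Callable, Dict


class BlueGreenRouter:
    """Toy model of two identical environments with one serving live traffic."""

    def __init__(self, live_version: str) -> None:
        # Both environments start out on the same version; blue is live.
        self.versions: Dict[str, str] = {"blue": live_version, "green": live_version}
        self.live = "blue"

    @property
    def idle(self) -> str:
        """The environment that is not currently serving customers."""
        return "green" if self.live == "blue" else "blue"

    def release(self, new_version: str, smoke_test: Callable[[str, str], bool]) -> bool:
        """Upgrade the idle environment, test it, and only then switch traffic."""
        target = self.idle
        previous = self.versions[target]
        self.versions[target] = new_version  # the live environment is untouched
        if smoke_test(target, new_version):
            self.live = target  # customers now see the new version
            return True
        self.versions[target] = previous  # failed test: restore the idle environment
        return False

    def roll_back(self) -> None:
        """Point traffic back at the other environment, which holds the old release."""
        self.live = self.idle
```

Because the old release stays in place on the environment that has just been taken out of service, rolling back is a switch of traffic rather than a second deployment.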
Sometimes bad things do happen – we all know that – such as a fire at a data centre, a loss of power or a hardware failure. In which case, it’s worth having systems hosted across more than one data centre. It gives peace of mind to know that your website is also hosted in a second location. The majority of data centres have back-up power supplies, such as generators. In terms of the cables going into the building, you’ll often find that key power supplies are physically routed out of different sides of it. This means if one cable is cut, there are others that remain intact.
Establishing a disaster recovery (DR) programme for your website does cost money and it’s hard to see the benefits of spending that money if nothing bad ever happens – rather like paying out for any type of insurance.
If you have a disaster recovery programme in place and you feel like it’s sitting there doing nothing – balance that with the loss of revenue and reputation your business will experience, even if your website is down for a few hours. Also, don’t assume your DR site will work – test it regularly; every six months or so, switch over to the DR site and switch back again.
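A DR drill of this kind can be scripted so that it is run the same way each time. The Python sketch below is illustrative only: `switch_to` and `check_site` are hypothetical hooks for your own switch-over mechanism and health check, and the 182-day interval is simply our reading of "every six months or so".

```python
from datetime import date, timedelta
from typing import Callable, Dict

# Assumption: roughly six months between drills.
DRILL_INTERVAL = timedelta(days=182)


def drill_due(last_drill: date, today: date) -> bool:
    """Return True once the interval since the last drill has passed."""
    return today - last_drill >= DRILL_INTERVAL


def run_dr_drill(
    switch_to: Callable[[str], None],
    check_site: Callable[[str], bool],
) -> Dict[str, bool]:
    """Switch to the DR site, check it, then always switch back to primary.

    `switch_to` receives "dr" or "primary"; `check_site` returns True when
    the named site is serving customers correctly.
    """
    result = {"dr_ok": False, "primary_ok": False}
    switch_to("dr")
    try:
        result["dr_ok"] = check_site("dr")
    finally:
        # Switch back even if the check itself raised an error.
        switch_to("primary")
    result["primary_ok"] = check_site("primary")
    return result
```

A drill that reports `dr_ok` as false is still a useful result: it is far better to find that out on a quiet afternoon than during a real outage.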
Sometimes, though, websites don’t break because you’ve actively done something – like integrating new software code – something might ‘just happen’. In an earlier blog, we talked about testing your website – just keeping an eye on it before something happens.
Traditionally, making changes to a website could cause ramifications elsewhere. However, at Lake Solutions we utilise the latest development technologies and practices to minimise collateral impact. That said, there are arguments for carrying out both manual testing and automated testing on your website.
Manual testing has its advantages, as an actual person takes the place of a visitor to that website, making the journey as a typical customer might, with all the randomness of a human... Manual testing is useful for checking a new change and making sure everything works properly. Each manual test is arguably unique – as no human will act quite the same as another. However, humans can generally only do one thing at a time, while machines are better at multitasking.
If you are doing regression testing – checking a website on an ongoing basis to make sure it’s working OK, rather than checking a specific change – then it takes more and more time each time you do it, so it might be sensible to automate this part of the process.
Automated tests are useful if you want to compare data from one test to the next, as you’ll know that the test will have been the same without any human error. Generally, automated testing is also quicker and you can do it at any time – overnight if that makes sense. It is also good at finding glitches in the early stages of development, which could cost money if not dealt with when they appear.
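As a simple illustration of an automated check that runs identically every time, the Python sketch below loads a list of pages and confirms that each contains a phrase it should always show. It uses only the standard library; the URLs and phrases you pass in are your own, and `run_checks` and `fetch_page` are names we have made up for this example rather than part of any testing product.

```python
from typing import Callable, Dict, List
from urllib.request import urlopen


def fetch_page(url: str) -> str:
    """Download a page and return its body as text (a real network call)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def run_checks(
    expected: Dict[str, str],
    fetch: Callable[[str], str] = fetch_page,
) -> List[str]:
    """Check that every URL loads and contains its expected phrase.

    `expected` maps a URL to a phrase that should always appear on that page.
    Returns a list of failure messages; an empty list means every check passed.
    """
    failures: List[str] = []
    for url, phrase in expected.items():
        try:
            body = fetch(url)
        except Exception as error:  # timeouts, DNS failures, HTTP errors
            failures.append(f"{url}: could not load ({error})")
            continue
        if phrase not in body:
            failures.append(f"{url}: missing '{phrase}'")
    return failures
```

Scheduled to run overnight, a script like this gives the same result format every time, so one run can be compared directly with the next. It does not replace a person walking through the site, as it will only catch the problems it has been told to look for.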
If you’d like to talk further about disaster recovery and your website, call us at Lake Solutions on: 020 3397 3222.