I'm really proud of the Lijit team today. As part of our last round of financing, we purchased an entirely new production environment. This environment, which we internally named 15c, will allow us to scale into the future and will provide new stealth capabilities we have yet to announce. For the last week we have been putting the finishing touches on it and testing the system. Because it's a new "production" environment, we get the luxury of testing it in place rather than following the usual routine: push the code out, do the final production testing (in the middle of the night), "go live", and verify nothing is funky.
So, this weekend we decided to "wind the frog", as it were. We made a copy of the old DTC production database and moved it over to 15c on Saturday. We then started synchronizing the new production environment with the old one overnight. That laid the groundwork for Sunday, when the team met at 9:00am to make final changes, redirect DNS over to 15c, and light it up. We were careful to leave a back-out path just in case something bad happened. We never believe anything bad will happen, but it pays to leave the breadcrumbs so you can get back to happy land.
Well, this was one of those days. As soon as the team pointed production traffic at 15c… things went bad: the site slowed down and the database took a dig. Hmmm. We never saw any of this in the last week of testing. It's of course totally illogical, as our new environment is way more capable than the old one, which works, although it is a little stressed. The team quickly pointed traffic back to the DTC (the old production environment) and regrouped. A few more things were tried, and finally, around 6pm, the team called the ball and we went back to the old production environment. We captured the logs we briefly generated during the go-live attempts and decided to regroup tomorrow and go heads down on analyzing what is going on.
Obviously, it would have been cooler if everything had gone well, but I have learned it's much better to have a seasoned team that understands things never go exactly right. They blew a big chunk of their weekend, a large portion of it executing steps whose only purpose was to give them the ability to recover from something bad. But as a result, we live to fight another day, and our customers only saw about 5 minutes of funky behavior. No one panicked and everything was methodical; it just didn't happen today.
Good work, guys. You'll get it next time!