Then something came up, and due to dependencies in the FT code, it was pushed back 'til 10pm. Then, during testing, we discovered a critical error. Two, really, caused by the same problem in the FT code. So we rolled back the FT code and cut a new release. Then we realized we didn't need to wait until 10pm after all, so we rescheduled for 4pm. The release to prod ran at about 4:30, and at 5pm we discovered it was seriously borked. Production was not accessible.
Cue a certain amount of panicking, as people tried to figure out the error. Turns out that one piece of the FT code that we rolled back? Didn't get rolled back. Now nobody can log in. Okay, no problem, roll back the release. When this fails to roll back gracefully, cue more serious panic.
We start digging into code and db scripts, and discuss options. Environments were inconsistent, which is why we didn't discover this earlier. All the higher-ups in the Operations and QA departments appear out of the walls. DBAs are fetched. I drag a whiteboard over and enumerate our options. We discuss risks, timeframes, and rewards. Options 2 and 3 look best, and after some agonized examination of scripts and some test runs, we agree on Option 2.
Option 2 is implemented in about 30 minutes, and it goes surprisingly smoothly. Everything is up and running by 7pm, we all wipe our foreheads in relief, and head home...