
I am a senior DevOps engineer who took the production stack down for ~10 minutes today because of a bad code commit. I could use some encouragement! It’s a fierce world of competitive engineers and I wonder why my company doesn’t replace me. The mistake was missed by two other peer reviews... but that doesn’t stop me from feeling this way.

Have you crashed prod? Did your team support you or tear you down?

Comments
  • 8
    My co-worker took down the entire call center by deleting the wrong folder from the old prod servers at a Fortune 500 insurance company. Hundreds of agents were affected.

    Don’t feel bad, this stuff happens. We all laughed about it, and now no one deletes shit from the old prod servers.
  • 7
    Built a time bomb into a web store that wasn't expected to run for as long or as hard as it has.

    D-day arrived and hundreds of orders started duplicating in external systems.

    What a fun mess that was to deal with over a Christmas I will never forget.

    A quick fix was released about 3 hours later, and the mess took weeks to clean up.
    12 months later I was able to put in an actual fix to future-proof it for years to come.

    I still work here.

    Another time, I literally deleted the root directory of a website....
    No git back then, so I had to restore from a week-old backup and reapply some work.

    This was an internal ordering platform, basically took the business out of operation for a couple of hours.

    I've found you can fuck up hard. Just own that shit, get it dealt with, and no one will bat an eye.
  • 6
    So do you guys have a staging environment? If so, figure out why it didn't catch the bug. If not, then this is the perfect example of why you need one!
  • 2
    Needed to shotgun a DB server (shotgun - forced hard reset...)

    Well... The DB was the backbone for stock keeping, sales, support and a whole lot more. Everything stopped.

    Hm. I accidentally re-ran a statistics query while the older statistics query was still running. Due to table locking, I built a very effective hard lock.

    The bad thing was that the older query eventually executed successfully... which released the limiters that had been preventing further actions (stocktaking, ...).

    The DB was basically brought down by masses of queries unable to run against the locked tables - by the time the query was finally killed, it was too late.

    Too many queries, too many resources, too much of a shitfest... It wasn't fun cleaning up that mess. A sketch of one way to guard against a repeat is below.
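
    A minimal sketch, assuming PostgreSQL and psycopg2 (the connection string, lock key, and query are placeholders, not the actual setup): take an application-level advisory lock so the expensive statistics query can never be started twice concurrently.

        import psycopg2

        STATS_LOCK_KEY = 42  # arbitrary app-chosen lock id (placeholder)

        def run_stats_query():
            conn = psycopg2.connect("dbname=shop")  # placeholder connection string
            try:
                with conn.cursor() as cur:
                    # Non-blocking attempt to take the advisory lock.
                    cur.execute("SELECT pg_try_advisory_lock(%s)", (STATS_LOCK_KEY,))
                    if not cur.fetchone()[0]:
                        print("Statistics query already running; refusing to start a second one.")
                        return
                    try:
                        cur.execute("SELECT 1")  # placeholder for the real statistics query
                    finally:
                        cur.execute("SELECT pg_advisory_unlock(%s)", (STATS_LOCK_KEY,))
                conn.commit()
            finally:
                conn.close()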
  • 0
    @Wack the staging server wasn’t under load during the deployments. Strangely, this issue only happened when software was being deployed and a certain endpoint was being hit. I forgot to execute a certain lock function before updating the code.
  • 1
    Thanks everybody! It’s been nice to read the comments this morning.
  • 2
    @devphobe What about using a blue-green deployment strategy? The basic idea is to duplicate prod; one version is called "green", the other "blue". Assuming "green" is currently live, you apply the update to "blue", switch the router over to "blue", and you're done. This way you prevent such locks, plus you can already warm up an application cache before exposing it to real requests. A rough sketch of the switch-over step is below.
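
    A minimal sketch, assuming two copies of the app are already running and the router is a reverse proxy driven by a config file plus a reload command. Every name in it (the ports, the upstream file path, the "myproxy" reload command, the /healthz endpoint, the warm-up paths) is a hypothetical placeholder.

        import subprocess
        import urllib.request

        ENVS = {"blue": "127.0.0.1:8001", "green": "127.0.0.1:8002"}  # hypothetical ports
        UPSTREAM_FILE = "/etc/myproxy/upstream.conf"                  # hypothetical path

        def healthy(addr):
            # Basic readiness check against an assumed /healthz endpoint.
            try:
                with urllib.request.urlopen(f"http://{addr}/healthz", timeout=5) as resp:
                    return resp.status == 200
            except OSError:
                return False

        def warm_up(addr, paths=("/", "/products")):
            # Hit a few hot endpoints so the application cache is populated
            # before the router sends real traffic this way.
            for path in paths:
                try:
                    urllib.request.urlopen(f"http://{addr}{path}", timeout=10).read()
                except OSError:
                    pass

        def switch_to(colour):
            addr = ENVS[colour]
            if not healthy(addr):
                raise RuntimeError(f"{colour} failed its health check; not switching")
            warm_up(addr)
            with open(UPSTREAM_FILE, "w") as f:
                f.write(f"server {addr}\n")  # whatever format your router expects
            subprocess.run(["myproxy", "reload"], check=True)  # hypothetical reload command

        # Deploy the new version to the idle colour first, then e.g. switch_to("blue").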
  • 1
    Everyone makes mistakes. No big deal. The world is full of mistakes nobody even cares about.

    For the last few weeks I've been fixing errors in a five-year-old project that was handed over to me. Some of the errors have existed since the beginning of the project. Three devs maintained it before me, and in 4-5 weeks I've already fixed tons of errors that nobody saw, nobody cared about, or that magically appeared just after I started working on the project.
  • 1
    @Wack there are a lot of great strategies, including blue/green. Technically we have a form of “rolling” deploy that sorta fell instead of gently rolled. I was working on the deployment code itself, trying to get some speed back into the existing release process, when I introduced my bug.
  • 3
    Mistakes are made all the time by every developer; the only thing that matters is how you react to them.
  • 1
    At my previous company I crashed production several times, no big deal.

    At my current company, just a few days ago another team crashed their site for about 2 hours. The company lost around half a million dollars from that, and still nobody got fired or anything. Get used to it, because it happens and it will happen again. 🙂
  • 0
    @devphobe I don't know where you work, but 10 minutes doesn't sound very long to me, unless you're dealing with real-time data or vital infrastructure.

    Mistakes happen. Don't feel so bad about it 😉
    The most important thing is to take lessons from them. Not only you, but your company as well. There might be some tools or resources you could get to avoid or mitigate those problems in the future.
  • 1
    @react-guy I have an affinity for high-stress environments where outages are sometimes life-or-death scenarios. Medical and legal tech is a rough world!
  • 0
    Kinda interested in your DevOps stack, can you give some insight into how you do things there?