Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "post-mortem"
-
Big event. Massive traffic in production, so we were monitoring all night.
I was in a room with 2 devs of my team, a marketting girl, my boss and a designer... chilling.
Suddenly the production is down.
Boss: production is down, anyone can check?
Me: already on it
Dev1: it looks ok for me
Dev2: me too
Me: wait what? Impossible everything is down
Dev1: oh I refreshed the page it's not working
Me: don't stay on the page refreshing it like you are fucking monkeys. Give me useful intel or be quiet.
Market girl: is it working?
...
Guys is it working?
...
Hello?
Me: Not yet we are looking. Don't distract me.
Boss: client called us. They want it online now.
Dev1&2: he's looking
... 1 min later...
Boss: is it working?
Boss: is it working?
Boss: is it working?
Me: SHUT THE FUCK FOR FUCKING ONE SECOND. ALL OF YOU, OUT NOW. YOU ARE FUCKING MONKEYS WHO CAN'T DO SHIT. IF YOU CAN'T HELP JUST SHUT YOUR DAMN SHITHOLE. DEVS, LOOK WITH ME. MARKET GIRL PREPARE A FUCKING POST-MORTEM MAIL. BOSS GET THE CLIENT ON THE PHONE AND STALE. DO. YOUR. FUCKING. JOBS.
That's how I ended up screaming at everyone... the rest of the night went in complete silence and I fixed the issue 2min after the got quiet or busy.24 -
--- GitHub 24-hour outage post mortem ---
As many of you will remember; Github fell over earlier this month and cracked its head on the counter top on the way down. For more or less a full 24 hours the repo-wrangling behemoth had inconsistent data being presented to users, slow response times and failing requests during common user actions such as reporting issues and questioning your career choice in code reviews.
It's been revealed in a post-mortem of the incident (link at the end of the article) that DB replication was the root cause of the chaos after a failing 100G network link was being replaced during routine maintenance. I don't pretend to be a rockstar-ninja-wizard DBA but after speaking with colleagues who went a shade whiter when the term "replication" was used - It's hard to predict where a design decision will bite back and leave you untanging the web of lies and misinformation reported by the databases for weeks if not months after everything's gone a tad sideways.
When the link was yanked out of the east coast DC undergoing maintenance - Github's "Orchestrator" software did exactly what it was meant to do; It hit the "ohshi" button and failed over to another DC that wasn't reporting any issues. The hitch in the master plan was that when connectivity came back up at the east coast DC, Orchestrator was unable to (un)fail-over back to the east coast DC due to each cluster containing data the other didn't have.
At this point it's reasonable to assume that pants were turning funny colours - Monitoring systems across the board started squealing, firing off messages to engineers demanding they rouse from the land of nod and snap back to reality, that was a bit more "on-fire" than usual. A quick call to Orchestrator's API returned a result set that only contained database servers from the west coast - none of the east coast servers had responded.
Come 11pm UTC (about 10 minutes after the initial pant re-colouring) engineers realised they were well and truly backed into a corner, the site was flipped into "Yellow" status and internal mechanisms for deployments were locked out. 5 minutes later an Incident Co-ordinator was dragged from their lair by the status change and almost immediately flipped the site into "Red" status, a move i can only hope was accompanied by all the lights going red and klaxons sounding.
Even more engineers were roused from their slumber to help with the recovery effort, By this point hair was turning grey in real time - The fail-over DB cluster had been processing user data for nearly 40 minutes, every second that passed made the inevitable untangling process exponentially more difficult. Not long after this Github made the call to pause webhooks and Github Pages builds in an attempt to prevent further data loss, causing disruption to those of us using Github as a way of kicking off our deployment processes (myself included, I had to SSH in and run a git pull myself like some kind of savage).
Glossing over several more "And then things were still broken" sections of the post mortem; Clever engineers with their heads screwed on the right way successfully executed what i can only imagine was a large, complex and risky plan to untangle the mess and restore functionality. Github was picked up off the kitchen floor and promptly placed in a comfy chair with a sweet tea to recover. The enormous backlog of webhooks and Pages builds was caught up with and everything was more or less back to normal.
It goes to show that even the best laid plan rarely survives first contact with the enemy, In this case a failing 100G network link somewhere inside an east coast data center.
Link to the post mortem: https://blog.github.com/2018-10-30-...6 -
I have just concluded a post-mortem on one of my servers.
Cause of death: out of memory due to a tiny memory leak in a VPN service triggered by 66 different IPs brute-forcing the creds at the same time. Mostly from China, of course.
Dear bot writers: you made me put aside my spaghetti and write iptables rules. I hate iptables. And I love spaghetti. You should be ashamed of yourself! Did momma not teach you basic OpSec? Don't crash the target and never, ever, interrupt the sysadmin during dinner!6 -
I was on vacation when my employer’s new fiscal year started. My manager let me take vacation because it’s not like anything critical was going to happen. Well, joke was on us because we didn’t foresee the stupidity of others…
I had to update a few product codes in the website’s web config and deploy those changes. I was only going to be logged in for 30 minutes to complete that.
I get messaged by one of our database admins. He was doing testing and was unable to complete a payment on the website. That was strange. There was a change pushed by our offsite dev agency, but that was all frontend changes (just updating text) and wouldn’t affect payments.
We don’t want to enlist the dev agency for debugging work, especially when it’s not likely that it’s a code issue. But I was on vacation and I couldn’t stay online past the time I had budgeted for. So my employer enlists the dev agency for help. It’s going to be costly because the agency is in Lithuania, it was past their business hours, and it was emergency support.
Dev agency looks at error logs. There are Apple Pay errors, but that doesn’t explain why non Apple Pay transactions aren’t going through. They roll back my deployment and theirs, but no change. They tell my employer to contact our payment processor.
My manager and the Product Manager contact Payroll, who is the stakeholder for our payment gateways. Payroll contacts our payment gateway and finds out a service called Decision Manager was recently configured for our account. Decision Manager was declining all payments. Payroll was not the person who had Decision Manager installed and our account using this service was news to her.
Payroll works with our payment processor to get payments working again. The damage is pretty severe. Online payments were down for at least 12 hours. Our call center had logged reports from customers the night before.
At our post mortem, we had to find out who ok’d Decision Manager without telling anyone. Luckily, it was quick work. The first stakeholder up was for the Fundraising Dept. She said it wasn’t her or anyone on her team. Our VP of Analytics broke it to her that our payment processor gave us the name of the person who ok’d Decision Manager and it was someone on the Fundraising team. Fundraising then starts backtracking and says that oh yes she knew about it but transactions were still working after the Decision Manager had been configured. WTAF.
Everyone is dumbfounded by this. How could you make a big change to our payment processor and not tell anyone? How did our payment processor allow you to make this change when you’re not the account admin (you’re just a user)?
Our company head had to give an awkward speech about communication and how it’s important. The web team can’t figure out issues if you don’t tell us what you did. The company head was pissed because it was a shitty way to start off the new fiscal year. Our bill for the dev agency must have been over $1000 for debugging work that wasn’t helpful.
Amazingly, no one was fired.4 -
I've caught the efficiency bug.
I recently started a minimum wage job to get my life back in order after a failed 2 year project (post mortem: next time bring more cash for a longer runway)
I've noticed this thing I do at every job, where I see inefficiency and I think "how can I use technology to automate myself out of this job?"
My first ever application was in C++ for college (a BASIC interpreter) and it's been so long I've since forgotten the language.
But after a while every language starts to look like every other language, and you start to wonder if maybe the reason you never seriously went anywhere as a programmer was because you never really were cut out for it.
Code monkey, sure. Programmer? Dunno, maybe I just suffer from imposter syndrome.
So a few years back I worked at a retail chain. Nothing as big as walmart, but they have well over 10k store locations. They had two IBM handscanners per store, old grungy ugly things, and one of these machines would inevitably be broken, lost or in need of upgrade/replacement about once a year, per location. District manager, who I hit it off with, and made a point of building report with, told me they were paying something like $1500 a piece.
After a programming dry spell, I picked up 'coding' with MIT app inventor. Built a 'mostly complete' inventory management app over the course of a month, and waited for the right time.
The day of a big store audit, (and the day before a multi-regional meeting), I made sure I was in-store at the same time as my district manager, so he could 'stumble upon' me working, scanning in and pricing items into the app.
Naturally he asked about it, and I had the numbers, the print outs, and the app itself to show him. He seemed impressed by what amounted to a code monkeys 'non-code' solution for a problem they had.
Long story short, he does what I expected, runs it by the other regionals and middle executives at the meeting, and six months later they had invested in a full blown in house app, cutting IBM out of the mix I presume.
From what I understand they now use the app throughout the entire store chain.
So if you work at IBM, sorry, that contract you lost for handscanners at 10k+ stores? Yeah that was my fault (and MIT app inventor).
They say software is 'eating the world' but it really goes to show, for a lot of 'almost coders' and 'code monkeys' half our problem is dealing with setup and platform boilerplate. I think in the future that a lot of jobs are either going to be created or destroyed thanks to better 'low code' solutions, and it seems to be a big potential future market.
In the mean while I've realized, while working on side projects, that maybe I can do this after all, and taken up Kotlin. I want to do a couple of apps for efficiency and store tracking at my current employer to see if I'm capable and not just an mit app-inventor codemonkey after all.
I'm hoping, by demonstrating what I can do, I can use that as a springboard into an internal programming position at my current gig (which seems to be a company thats moving towards a more tech oriented approach to efficiency and management). Also watching money walk out the door due to inefficiency kinda pisses me off, and the thought of fixing those issues sounds really interesting. At the end of the day I just like learning new technologies, and maybe this is all just an excuse to pick up something new after spending so long on less serious work.
I still have a ways to go, but the prospect of working on B2B, and being able to offer technological solutions to common and recurring business needs excites the hell out of me..as cringy and over-repeated as that may sound.5 -
Read the following in Morgan Freeman’s voice.
Okay everyone sit on down and get ready for story time. There once was a workspace that was a pain in the ass to setup. It often would take an entire day even for the most experienced devs on the team...for it was a workspace perched atop a swamp of shit that would require a whole year to refactor into something that isn’t shit.
It was inherited, passed down, stepped in and scrapped from the boot soles of every programmer that ever touched it. It was an amalgam of old, new, and third party components with a class path a mile long and no package management because the company although physically in the present, somehow maintained a temporal presence in the past. And there was nothing that the team hated more than setting that workspace. In short it was an unholy mess that made Satan cry and Dennis Ritchie spin in his grave so much that the state of California attached magnets and a coil to his body and casket to generate electricity.
Then one day the untalented clowns known as App Group decided that our IDE should be owned and configured strictly through them. They took poor Eclipse and mounted so much silly shit to it that it resembled a riding lawn mower with a fax machine and a blender duct taped to it. Eventually as everything the company touched did, it simply turned into a broken, shitty mess that not even Jesus Titty Fucking Christ could bring back the dead.
And then, every month or so the IDE would break in such a grand way that every developer had to rebuild their workspace...the very same Lovecraftian monster disguised as a code base. It was just too much to bear for old Deus. He was all out of fucks and there wasn’t enough alcohol in the world to quiet his injured soul. So he stood on a chair, carved his name in a rafter and tied a noose to it, put it around his neck and finally kicked the chair out from under himself. I am told he even pooped his pants and the post mortem shit in the seat of his pants was still better than the codebase at work. I’m Morgan Freeman. -
Post Mortem analysis:
It was alright when I typed "SESSION_"
It also was OK when I typed "1"
But I was definitely already asleep while typing the remaining "2" and "3" characters
the line between "sleepy" and "asleep" is soooooooo thin!1 -
Some people may remember me posting about our rabbits a while ago.
Sadly they both passed away a few weeks back.
They were both a little over a year old and died within 4 days of each other, one on Wednesday morning and then one on the Saturday morning.
After about 1200 in emergency vets fees and another 500 for a post mortem on Spencer (who died on Saturday) we found out the vets had fucked up their vaccinations.
In the U.K. it is recommended that rabbits are vaccinated against 3 viruses Myxomatosis (Myxo), Rabbit Haemorrhagic Disease Variant 1 (RHD1), and Rabbit Haemorrhagic Disease Variant 2 (RHD2).
They got their vaccinations for Myxo and RHD1 in January, and went back two weeks later for their RHD2 vaccination.
Now it turns out, when they went back for the second vaccination they were incorrectly given the Myxo and RHD1 vaccinations again.
The lab results showed Spencer had RHD, but not which variant and it is safe to assume Frank also has RHD.
The vets were going to get another lab to test for the variant but decided not too (funny that they don’t want to confirm whether it was their fuck up that killed two otherwise perfectly healthy rabbits).
The wife and I are considering getting legal advise.
What fucks me off so much is that it wasn’t a situation where there were two possible courses of treatment, or they didn’t respond to treatment, it was just a human fucking up.
The practice manager also like to keep mentioning that vaccines aren’t 100% effective, and because they won’t test for the variant of RHD we will never know 100% whether their fuck up killed our rabbits.
I’m contemplating trying to get in touch with the lab and paying for the extra tests myself.
Due to the nature of the virus it also means we can’t get anymore rabbits for 3-4 months.3 -
I've seen far worse people doing what I'm doing, applying terrible practices and still being valued af.
Even if I do smth wrong after doing all the research and alternatives' analysis I know I'll do a proper post-mortem RCA, document it and learn from it, as a result I'll make a better choice to fix the problem and I'll know better next time.
I think I'm alright compared to them. So I don't wory about being an impostor.
Learning by good examples is a good approach. Learning by bad ones might be even better. The "good ones" are yet to fail and be replaced by better (or worse) ones. The "bad ones" are already failing and you can learn WHY you should not be doing it like that and HOW should you do instead to solve the problem.
Learning from good examples only works well if you know the back-story, all the WHYs and HOWs. People usually don't :) -
Hey guys.
I just released a demo for this game-sort-of-thing I'm working on. Would love to know your opinions/criticism.
https://cellv.itch.io/post-mortem
Playable on any device.21 -
The JavaScript community should invent the word "outd" to use instead of "outdated" just to make sure they give forewarning about frameworks and tools going out of fashion, because by the time they spell the full word it's already too late and a post-mortem announcement.
-
AHH!!! PM talk is melting my brain...nodes are...collapsing...
"We need to post-mortem our lessons learned and level set our expectations so we can define quick resolutions and set tollgate approvals, at a very high level."
# clear my head of beastly things
def cls():
print ('\n' * 666)
cls()1 -
Retrospective does not seem to work in practice as it does in theory.
Complain about what went wrong and what went right. Then at the next one, those issues still exist.
I might as well just have written in a diary to air my frustrations -
Post Mortem: Yarn Workspaces
So yarn workspaces are fucken sick, i love them. As long as you don't want to deploy your shit, because your api most certainly needs node_modules, your vue frontend doesn't... -
@Flux
I read https://devrant.com/rants/1845851/...
I was going to comment until I realized it was a post from 300 days ago.
Just want to say ask you how it went or give a post mortem. Also Congratulations. I hope you brought a long run way when you started almost a year ago.
Remember starting is 80% of finishing, and carrying through is the other 20%.
Success looks easy when all you see are the success stories. Steve jobs wasn't the most famous marketer, he was the most famous salesman, and what he sold was a dream.
Marketing is a buzzword, and a lot of companies try to use marketing as a replacement for sales people, but nothing really beats a good sales person or team. Thats the secret to marketing: forgo it as much as possible and work on sales and relationships instead. Awareness is nice, but money and sales are better.
This coming from a guy who had six businesses by the time he was twenty six and helped his family to start two other successful businesses.
Apparently I'm good at helping other people make money just not good at helping myself do it.
What I've learned is if you can get 1 customer you can get 10 and if you can get 10 you can get 100. And then keep going.
Good luck.3 -
Do you ever feel your job is too demanding compared to other software engineering jobs?
I've worked in two companies for now.
First company, Kotlin microservices and we had QAs, didn't have to write a lot of tech specs and no post mortem or on call at all (not yet atleast), it was just talk to PO, he tells the business requirement, we work together to make tickets, no legacy code so was easy to know what to do for tech, no monolith to handle or anything, much easier, just code and meetings.
Current job is meetings with PO telling you what he wants, have to write a full on tech spec and also know business requirements and product knowledge as the current PO doesn't know anything about how the products work, writing huge tech specs, communicating on requests sent my clients on slack, pretty much always firefighting, the system is so fragile and legacy, coding is actually less its mostly spending hours finding out how this shittt legacy flows work (no docs) , PO pretty much does fuck all, just wants meetings and wants us to do very very stupid tedious low impacts projects. This bundled with oncall and onpoint and the absolute sheer amount of incidents our team is involved in (on average we have 4 a week LOL, varying size but they're all very annoying) and the overtime oncall benefit is so bad too, if you do get paged out of hours, you just get that hour back during work hours. In other companies like friends, you get paid for the whole time you're oncall, whether you get paged or not. I can't go out anywhere on weekends or anywhere at all during on call in case I get paged, which happens a lot. Its a cluster of a mess. This bundled with manager stoll not wanting to promote me to IC3 despite all I've done so far.
My question is, is this more normal than I think it is? Is this just how crap our career can be? Mind you I'm in the UK so not getting those mind boggling US wages sadly either. Have US colleagues in same team doing same job but obviously getting more11