Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "non-production infrastructure"
-
I think I've shown in my past rants and comments that I'm pretty experienced. Looking back though, I was really fucking stupid. Since I haven't posted a rant yet on the weekly topics, I figure I would share this humbling little gem.
Way back in the ancient era known as 2009, I was working my first desk job as a "web designer". Apparently the owner of this company didn't know the difference between "designer", which I'm not, and "developer", which I am, nor the responsibilities of each role.
It was a shitty job paying $12/hour. It was such a nightmare to work at. I guess the silver lining is that this company now no longer exists as it was because of my mistake, but it was definitely a learning experience I hold in high regard even today. Okay, enough filler...
I was told to wipe the Dev server in order to start fresh and set up an entirely new distro of Linux. I was to swap out the drives with whatever was available from the non-production machines, set up the RAID 5 array and route it through the router and firewall, as we needed to bring this Dev server online to allow clients to monitor the work. I had no idea what any of this meant, but I was expected to learn it that day because the next day I would be commencing with the task.
Astonishingly, I managed to set up the server and everything worked great! I got a pat on the back and the boss offered me a 4 day weekend with pay to get some R&R. I decided to take the time to go camping. I let him know I would be out of town and possibly unreachable because of cell service, to which he said no problem.
Tuesday afternoon I walked into work and noticed two of the field techs messing with the Dev server I built. One was holding a drive while the other was holding a clipboard. I was immediately called into the boss's office.
He told me the drives on the production server failed during the weekend, resulting in the loss of the data. He then asked me where I got the drives from for the Dev server upgrade. I told him that they came from one of the inactive systems on the shelf. What he told me next through the deafening screams rendered me speechless.
I had gutted the drives from our backup server that was just set up the week prior. Every Friday at midnight, it would turn on through a remote power switch on a schedule, then the system would boot and proceed to copy over the production server's files into an archive for that night and shutdown when it completed. Well, that last Friday night/Saturday morning, the machine kicked on, but guess what didn't happen? The files weren't copied. Not only were they not copied, but the existing files that got backed up previously we're gone. Why? Because I wiped those drives when I put them into the Dev server.
I would up quitting because the conversation was very hostile and I couldn't deal with it. The next week, I was served with a suit for damages to this company. Long story short, the employer was found in the wrong from emails I saved of him giving me the task and not once stating that machine was excluded in the inactive machines I could salvage drives from. The company sued me because they were being sued by a client, whose entire company presence was hosted by us and we lost the data. In total just shy of 1TB of data was lost, all because of my mistake. The company filed for bankruptcy as a result of the lawsuit against them and someone bought the company name and location, putting my boss and its employees out of a job.
If there's one lesson I have learned that I take with the utmost respect to even this day, it's this: Know your infrastructure front to back before you change it, especially when it comes to data.8 -
C'mon people! Spread the word! "The cloud" is not "just someone elses computer", it's a completely different way to compute!
I'm so tired of the oversimplifications done trying to explain the consept. The massive amount of work, sweat and tears put into the orchestration, automation and abstraction layers to deliver truly elastic, scalable and self healing infrastructure, applications and services deserves a fuckload more respect than "just someone elses computer"!
Hosting and time-sharing have been with us almost as long as we have had computers (mainframes etc), but dismissing the effort of thousands upon thousands of devs and ops people to make systems robust and automated enough to literally being able to throw a wrench in the engine any time during production and not have the systems suffer is fucking insane!
The whole reason the term "cloud" is so fitting is not just because it was coined from the cloud-shape used in technical and non-technical drawings and illustrations symbolising the internet, but also because of the illusion of magic it gives the end-user not being able to see "whats inside the music box".19 -
So I work as a "Web Development Lead"
Which means I lead (frontend,backend and infrastructure teams)
Also I am in charge of infrastructure or devops or whatever you call it, which means I handle production issues, dev and staging environments,...etc
and I am a team of one, and today I asked for a day off because it's my wife's birthday
and suddenly everyone is blocked, everything is on fire, and the phone is not stopping ringing, I had to go out of the cinema theater to answer the non-stopping calls
I AM ASKING FOR A SINGLE DAY, A FUCKING DAY, EVEN IF SOMEONE IS BLOCKED SO WHAT IT'S NOT EVEN A DAY I ONLY NEED 6 HOURS
IS TO TOO MUCH FUCKING TO ASK4 -
One night, after one very stressful week of production code fixes (I was working on a game with some friends and I created the network infrastructure for P2P and database communication from scratch), I was at my gf's house. After we fell asleep, I stood up and screamed right at her something like "I fucking already told you how X works and how to communicate with Y. Learn to write code properly and after double checking yours then you shall ask me for a non-existing 'bug' fix. Learn how to properly write event based code and use polling you moron!". After that I turned to the other side and fell asleep immediately.
When I saw her the next morning sleeping in the couch, I could not understand why... Only after she described to me the whole incident I started laughing.
After that I just took two weeks off the project and after that period I never actually worked the same way (so hard) in my free time with them.1 -
developer makes a "missed-a-semicolon"-kind of mistake that brings your non-production infrastructure down.
manager goes crazy. rallies the whole team into a meeting to find "whom to hold accountable for this stupid mistake" ( read : whom should I blame? ).
spend 1-hour to investigate the problem. send out another developer to fix the problem.
... continue digging ...
( with every step in the software development lifecycle handbook; the only step missing was to pull the handbook itself out )
finds that the developer followed the development process well ( no hoops jumped ).
the error was missed during the code review because the reviewer didn't actually "review" the code, but reported that they had "reviewed and merged" the code
get asked why we're all spending time trying to fix a problem that occurred in a non-production environment. apparently, now it is about figuring out the root cause so that it doesn't happen in production.
we're ALL now staring at the SAME pull request. now the manager is suddenly more mad because the developer used brackets to indicate the pseudo-path where the change occurred.
"WHY WOULD YOU WASTE 30-SECONDS PUTTING ALL THOSE BRACES? YOU'RE ALREADY ON A BRANCH!"
PS : the reason I didn't quote any of the manager's words until the end was because they were screaming all along, so, I'd have to type in ALL CAPS-case. I'm a CAPS-case-hater by-default ( except for the singular use of "I" ( eye; indicating myself ) )
WTF? I mean, walk your temper off first ( I don't mean literally, right now; for now, consider it a figure of speech. I wish I could ask you to do it literally; but no, I'm not that much of a sadist just yet ). Then come back and decide what you actually want to be pissed about. Then think more; about whether you want to kill everyone else's productivity by rallying the entire team ( OK, I'm exaggerating, it's a small team of 4 people; excluding the manager ) to look at an issue that happened in a non-production environment.
At the end of the week, you're still going to come back and say we're behind schedule because we didn't get any work done.
Well, here's 4 hours of our time consumed away by you.
This manager also has a habit of saying, "getting on X's case". Even if it is a discussion ( and not a debate ). What is that supposed to mean? Did X commit such a grave crime that they need to be condemned to hell?
I miss my old organization where there was a strict no-blame policy. Their strategy was, "OK, we have an issue, let's fix it and move on."
I've gotten involved ( not caused it ) in even bigger issues ( like an almost-data-breach ) and nobody ever pointed a finger at another person.
Even though we all knew who caused the issue. Some even went beyond and defended the person. Like, "Them. No, that's not possible. They won't do such dumb mistakes. They're very thorough with their work."
No one even talked about the person behind their back either ( at least I wasn't involved in any such conversation ). Even later, after the whole issue had settled down. I don't think people brought it up later either ( though it was kind of a hush-hush need-to-know event )
Now I realize the other unsaid-advantage of the no-blame policy. You don't lose 4 hours of your so-called "quarantine productivity". We're already short on productivity. Please don't add anymore. 🙏11 -
I recently tried to apply the same data analytics rationale that I use at work to my personal life. This is not a rant, it is more like an data storytelling of an actual use case I would like some input on.
I set a goal - gotta thin up a bit and calm down my ticker - and got a (almost unreasonably expensive) field expert consultant to yell at me about it for a couple hours.
I unravel the metrics - there is like a million weight-related KPIs and most say nothing at all. I have never seen an non-infrastructure measurable subject that could not be resumed to 2-5 performance metrics. I got overall weight, how well my nine-years-old business suit fits me, heart rate, and day-after relative muscle pain (it will make sense soon).
Then its data-pipeline time. I bought a cheap weight scale and smartwatch, and every morning I input the data in an app. Yes, I try to put on the suit every morning. It still does not fit.
After establishing a baseline, I tried to fit different approaches. Doing equipment-free exercises, going to the gym, dieting. None was actually feasible in the long run, but trying different approaches does highlight the impacts and the handling profile of each method.
Looking at the now-gathered data, one thing was obvious - can't do dieting because it is not doable to have a shopping list and meals for me and another for the family.
Gym is also off the table - too much overhead. I spend more time on the trip there and back than actually there.
And home exercise equipment is either super crappy or very expensive. But it is also the most reasonable approach.
So it is solutions time. I got a nice exercise bycicle (not a peloton), an yoga mat (the wife already had that one) and an exercise program that uses only those two resources. Not as efficient without dieting, not as measurable and broad as the gym, but it fits my workflow. Deploy to production!
A few months pass and the dataset grows. The signal is subtle but has support - it works! The handling, however, needs improvement, since I cannot often enough get with the exercise program. Some mornings are just after some hard days.
I start thinking about what else I can improve in the program, but it is already pretty lean and full of compromises.
So I pull an engineer and start thinking about the support systems and draft profile. What else could be draining my willpower and morning time?
Chores. Getting the kids ready for school, firing up the moka pot, setting the off-brand roomba, folding the overnight-dried clothes, cooking breakfast, doing the dishes, cleaning the toilets. All part of my morning routine. It might benefit from some automation.
Last month I got that machine our elders call "wasteful" and "useless crap lazy entitled Americans invented because they feel oh-so-insulted for simply doing something by hand like everyone always did" - a "dish-washer".
Heh, I remember how hard was to convince my mother-in-law that an remote-controled electric garage door would not make she look like an spoiled brat.
Still to early to call, but I think that the dishwasher just saved me about 25 mins every morning. It might be enough to save willpower for me to do more exercise.
This is all so reflective of all data analytics cases really are out in the wild - the analytics phase seems so small compared to the gathering and practical problem-solving all around. And yet d.a. is what tells you that you are doing the wrong thing all along. Or on what you should work next.7 -
More and more, I am getting frustrated/depressed from the attitude of our customers who complain, moan and get angry about issues in their infrastructure, while at the same time, refusing to pay more so the issues could be mitigated.
Like, a client's angry with us today for having one of their non-production-critical databases inaccessible for... Hmm... About 8 hours now (So a whole workday).
Like... I get it, some of your employees couldn't work with it offline, but like... What the hell do we do? You keep data from as far back as several years ago in there, without partitioning, without exports, in a mix of innodb and myisam, so when the DB crashes, and its replication has to be reset from zero, reimporting all the data takes hours upon hours, and importing .sql files just takes time.
Or another client who got angry when their app fell out of the internet, cuz one of their myisam-based log tables crashed, and had to be repaired, with data spanning several years back, meaning it took hours to fix...
The more I work with these "basic" and "simple" infrastructure designs that is *not* redundant, or HA, the more I wonder -- How do the big names out there do it? How do you design systems with fault tolerance so a single DB table crash doesn't lead to the whole app getting inaccessible?
We have... One, exactly one, client, who uses MariaDB with Gallera, and that cluster is *amazing*, it just keeps chugging along, without a care in the world. But it cost them quite a lot, as they had to buy 3 DB servers, instead of 1...1