Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "production bug test"
-
I worked with a good dev at one of my previous jobs, but one of his faults was that he was a bit scattered and would sometimes forget things.
The story goes that one day we had this massive bug on our web app and we had a large portion of our dev team trying to figure it out. We thought we narrowed down the issue to a very specific part of the code, but something weird happened. No matter how often we looked at the piece of code where we all knew the problem had to be, no one could see any problem with it. And there want anything close to explaining how we could be seeing the issue we were in production.
We spent hours going through this. It was driving everyone crazy. All of a sudden, my co-worker (one referenced above) gasps “oh shit.” And we’re all like, what’s up? He proceeds to tell us that he thinks he might have been testing a line of code on one of our prod servers and left it in there by accident and never committed it into the actual codebase. Just to explain this - we had a great deploy process at this company but every so often a dev would need to test something quickly on a prod machine so we’d allow it as long as they did it and removed it quickly. It was meant for being for a select few tasks that required a prod server and was just going to be a single line to test something. Bad practice, but was fine because everyone had been extremely careful with it.
Until this guy came along. After he said he thought he might have left a line change in the code on a prod server, we had to manually go in to 12 web servers and check. Eventually, we found the one that had the change and finally, the issue at hand made sense. We never thought for a second that the committed code in the git repo that we were looking at would be inaccurate.
Needless to say, he was never allowed to touch code on a prod server ever again.8 -
My biggest dev blunder. I haven't told a single soul about this, until now.
👻👻👻👻👻👻
So, I was working as a full stack dev at a small consulting company. By this time I had about 3 years of experience and started to get pretty comfortable with my tools and the systems I worked with.
I was the person in charge of a system dealing with interactions between people in different roles. Some of this data could be sensitive in nature and users had a legal right to have data permanently removed from our system. In this case it meant remoting into the production database server and manually issuing DELETE statements against the db. Ugh.
As soon as my brain finishes processing the request to venture into that binary minefield and perform rocket surgery on that cursed database my sympathetic nervous system goes into high alert, palms sweaty. Mom's spaghetti.
Alright. Let's do this the safe way. I write the statements needed and do a test run on my machine. Works like a charm 😎
Time to get this over with. I remote into the server. I paste the code into Microsoft SQL Server Management Studio. I read through the code again and again and again. It's solid. I hit run.
....
Wait. I ran it?
....
With the IDs from my local run?
...
I stare at the confirmation message: "Nice job dude, you just deleted some stuff. Cool. See ya. - Your old pal SQL Server".
What did I just delete? What ramifications will this have? Am I sweating? My life is over. Fuck! Think, think, think.
You're a professional. Handle it like one, goddammit.
I think about doing a rollback but the server dudes are even more incompetent than me and we'd lose all the transactions that occurred after my little slip. No, that won't fly.
I do the only sensible thing: I run the statements again with the correct IDs, disconnect my remote session, and BOTTLE THAT SHIT UP FOREVER.
I tell no one. The next few days I await some kind of bug report or maybe a SWAT team. Days pass. Nothing. My anxiety slowly dissipates. That fateful day fades into oblivion and I feel confident my secret will die with me. Cool ¯\_(ツ)_/¯12 -
Worst dev team failure I've experienced?
One of several.
Around 2012, a team of devs were tasked to convert a ASPX service to WCF that had one responsibility, returning product data (description, price, availability, etc...simple stuff)
No complex searching, just pass the ID, you get the response.
I was the original developer of the ASPX service, which API was an XML request and returned an XML response. The 'powers-that-be' decided anything XML was evil and had to be purged from the planet. If this thought bubble popped up over your head "Wait a sec...doesn't WCF transmit everything via SOAP, which is XML?", yes, but in their minds SOAP wasn't XML. That's not the worst WTF of this story.
The team, 3 developers, 2 DBAs, network administrators, several web developers, worked on the conversion for about 9 months using the Waterfall method (3~5 months was mostly in meetings and very basic prototyping) and using a test-first approach (their own flavor of TDD). The 'go live' day was to occur at 3:00AM and mandatory that nearly the entire department be on-sight (including the department VP) and available to help troubleshoot any system issues.
3:00AM - Teams start their deployments
3:05AM - Thousands and thousands of errors from all kinds of sources (web exceptions, database exceptions, server exceptions, etc), site goes down, teams roll everything back.
3:30AM - The primary developer remembered he made a last minute change to a stored procedure parameter that hadn't been pushed to production, which caused a side-affect across several layers of their stack.
4:00AM - The developer found his bug, but the manager decided it would be better if everyone went home and get a fresh look at the problem at 8:00AM (yes, he expected everyone to be back in the office at 8:00AM).
About a month later, the team scheduled another 3:00AM deployment (VP was present again), confident that introducing mocking into their testing pipeline would fix any database related errors.
3:00AM - Team starts their deployments.
3:30AM - No major errors, things seem to be going well. High fives, cheers..manager tells everyone to head home.
3:35AM - Site crashes, like white page, no response from the servers kind of crash. Resetting IIS on the servers works, but only for around 10 minutes or so.
4:00AM - Team rolls back, manager is clearly pissed at this point, "Nobody is going fucking home until we figure this out!!"
6:00AM - Diagnostics found the WCF client was causing the server to run out of resources, with a mix of clogging up server bandwidth, and a sprinkle of N+1 scaling problem. Manager lets everyone go home, but be back in the office at 8:00AM to develop a plan so this *never* happens again.
About 2 months later, a 'real' development+integration environment (previously, any+all integration tests were on the developer's machine) and the team scheduled a 6:00AM deployment, but at a much, much smaller scale with just the 3 development team members.
Why? Because the manager 'froze' changes to the ASPX service, the web team still needed various enhancements, so they bypassed the service (not using the ASPX service at all) and wrote their own SQL scripts that hit the database directly and utilized AppFabric/Velocity caching to allow the site to scale. There were only a couple client application using the ASPX service that needed to be converted, so deploying at 6:00AM gave everyone a couple of hours before users got into the office. Service deployed, worked like a champ.
A week later the VP schedules a celebration for the successful migration to WCF. Pizza, cake, the works. The 3 team members received awards (and a envelope, which probably equaled some $$$) and the entire team received a custom Benchmade pocket knife to remember this project's success. Myself and several others just stared at each other, not knowing what to say.
Later, my manager pulls several of us into a conference room
Me: "What the hell? This is one of the biggest failures I've been apart of. We got rewarded for thousands and thousands of dollars of wasted time."
<others expressed the same and expletive sediments>
Mgr: "I know..I know...but that's the story we have to stick with. If the company realizes what a fucking mess this is, we could all be fired."
Me: "What?!! All of us?!"
Mgr: "Well, shit rolls downhill. Dept-Mgr-John is ready to fire anyone he felt could make him look bad, which is why I pulled you guys in here. The other sheep out there will go along with anything he says and more than happy to throw you under the bus. Keep your head down until this blows over. Say nothing."11 -
IF LIVES DEPEND ON A SYSTEM
1. Code review, collaboration, and knowledge sharing (each hour of code review saves 33 hours of maintenance)
2. TDD (40% — 80% reduction in production bug density)
3. Daily continuous integration (large code merges are a major source of bugs)
4. Minimize developer interruptions (an interrupted task takes twice as long and contains twice as many defects)
5. Linting (catches many typo and undefined variable bugs that static types could catch, as well as a host of stylistic issues that correlate with bug creation, such as accidentally assigning when you meant to compare)
6. Reduce complexity & improve modularity -- complex code is harder to understand, test, and maintain
-Eric Elliott12 -
A few days after deploying a big important Website into production, I wanted to copy the whole thing including DB back onto our test server for future testing/bug fixing if something comes up. (Last changes were done on production server before going live)
So I opened SSH, removed everything on the test sever aaaaand then I realized I was connected to production...
Took about an hour to get everything up and running again. We didn't tell the client and hoped it would not be noticed.2 -
So, a few years ago I was working at a small state government department. After we has suffered a major development infrastructure outage (another story), I was so outspoken about what a shitty job the infrastructure vendor was doing, the IT Director put me in charge of managing the environment and the vendor, even though I was actually a software architect.
Anyway, a year later, we get a new project manager, and she decides that she needs to bring in a new team of contract developers because she doesn't trust us incumbents.
They develop a new application, but won't use our test team, insisting that their "BA" can do the testing themselves.
Finally it goes into production.
And crashes on Day 1. And keeps crashing.
Its the infrastructure goes out the cry from her office, do something about it!
I check the logs, can find nothing wrong, just this application keeps crashing.
I and another dev ask for the source code so that we can see if we can help find their bug, but we are told in no uncertain terms that there is no bug, they don't need any help, and we must focus on fixing the hardware issue.
After a couple of days of this, she called a meeting, all the PMs, the whole of the other project team, and me and my mate. And she starts laying into us about how we are letting them all down.
We insist that they have a bug, they insist that they can't have a bug because "it's been tested".
This ends up in a shouting match when my mate lost his cool with her.
So, we went back to our desks, got the exe and the pdb files (yes, they had published debug info to production), and reverse engineered it back to C# source, and then started looking through it.
Around midnight, we spotted the bug.
We took it to them the next morning, and it was like "Oh". When we asked how they could have tested it, they said, ah, well, we didn't actually test that function as we didn't think it would be used much....
What happened after that?
Not a happy ending. Six months later the IT Director retires and she gets shoed in as the new IT Director and then starts a bullying campaign against the two of us until we quit.5 -
It's about a guy that knows better.
I was working as a subcontractor on a bigger system. We (subs) were not allowed to deploy code, we had to wait for contractor to deploy.
One day I got an email that my code is bugged and that my feature is not working on production. I checked it on test env, everything was fine. Then I checked if the code I wrote was deployed. It was not.
I send an email explaining that if they deployed my code it would be working. Then I got a response. There was a bug in my code.
Another email. I asked how would they know? Do they have a test on their environment that failed?
No. There is one guy that READ my code and he said it should not work, so he will not deploy it. He was not a programmer, he was a business consultant responsible for the documentation.
His issue was that I used a function that was not in a class. So if the function is not declared it's obvious it will not work. I had to explain to him in another email, that you can use object of another class inside your class and then call a function, that is not in your class. It was the last time this guy blocked my deploy.
TL;DR, I had to explain a non-dev how object composition works in order to have my code deployed. Took four emails.4 -
Me: We really need to improve our unit test coverage.
Team/Boss: <sarcasm> Haha yeah.
Production Bug: I'm doing something nasty to a client, because a dev broke something but no test coverage.
Boss: How could we have prevented this?!1 -
Most ignorant ask from a PM or client?
Migrated to SharePoint 2016 which included Reporting Services, and trying to fix a bug in the reporting services scheduler, I created a report (aka, copied an existing one) 'A Klingon Walks Into a Bar', so it would first in the list and distinct enough so the QA testers would (hopefully) leave it alone.
The PM for the project calls me.
PM: "What is this Klingon report? It looks like a copy of the daily inventory report"
Me: "It is. The reporting service job keeps crashing on certain reports that have daily execution schedules."
PM: "I need you to delete it"
Me: "What? Why? The report is on the dev sharepoint site. I named the report so it was unique and be at the top of the list so I can find it easily."
PM: "The name doesn't conform to our standards and it's confusing the testers."
Me: "The testers? You mean Dan, you, and Heather?"
PM: "Yes, smartass. Can you name the report something like daily inventory report 2, or something else?"
Me: "I could, but since this is in development, no. You've already proofed out the upgrade. You're waiting on me to fix this sharepoint bug. Why do you care what I do on this server? It's going away after the upgrade."
PM: "Yea, about that. We like having the server. It gives us a place to test reports. Would really appreciate it if you would rename or delete that report."
Me: "A test sharepoint reporting services server out of scope, so no, we're not keeping it."
PM: "Having a server just for us would be nice."
Me: "$10,000 nice? We're kinda fudging on the licensing now. If we're keeping it, we will be required to be in compliance. That's a server license, sharepoint license, sql server license, and the dedicated hardware. We talked about that, remember?"
PM: "Why is keeping that report so important to you? I don't want to explain to a VP what a Klingon is."
Me: "I'm not keeping the report or moving it to production. When I figure out the problem, I'll delete the report. OK?"
PM: "I would prefer you delete the report before a VP sees it."
Me: "Why would a VP be looking? They probably have better things to do."
PM: "Jeff wants to see our progress, I'll have to him the site, and he'll see the report."
Me: "OK? You tell Jeff it's a report I'm working on, I'll explain what a Klingon is, Jeff will call me a nerd, and we all move on."
PM: "I'm not comfortable with this upgrade."
Me: "What does that mean?"
PM: "I asked for something simple and I can't be responsible for the consequences. I'll be documenting this situation as a 'no-go' for deployment"
Me: "Oookaayyy?"
I figured out the bug, deleted the 'Klingon' report, and the PM couldn't do anything to delay the deployment.4 -
This was some time ago. A Legendary bug appeared. It worked in the dev environment, but not in the test and production environment.
It had been a week since I was working on the issue. I couldn't pinpoint the problem. We CANNOT change the code that was already there, so we needed to override the code that was written. As I was going at it, something happened.
---
Manager: "Hey, it's working now. What did you do?"
Me: *Very confused because I know I was nowhere close to finding the real source of the problem* Oh, it is? Let me check.
Also me: *Goes and check on the test and prod environment and indeed, it's already working*
Also me to the power of three: *Contemplates on life, the meaning of it, of why I am here, who's going to throw out the trash later, asking myself whether my buddies and I will be drinking tonight, only to realize that I am still on the phone with my manager*
Me again: "Oh wow, it's working."
Manager: "Great job. What were the changes in the code?"
Me: "All I did was put console logs and pushed the changes to test and prod if they were producing the same log results."
Manager: "So there were no changes whatsoever, is that what you mean?"
Me: "Yep. I've no idea why it just suddenly worked."
Manager: "Well, as long as it's working! Just remove those logs and deploy them again to the test and prod environment and add 'Test and prod fix' to the commit comment."
Me: "But what if the problem comes up again? I mean technically we haven't resolved the issue. The only change I made were like 20 lines of console logs! "
Manager: "It's working, isn't it? If it becomes a problem, we'll work it out later."
---
I did as I was told, and Lo and Behold, the problem never occurred again.
Was the system playing a joke on me? The system probably felt sorry for me and thought, "Look at this poor fucker, having such a hard time on a problem he can't even comprehend. That idiotic programmer had so many sleepless nights and yet still couldn't find the solution. Guess I gotta do my job and fix it for him. I'm the only one doing the work around here. Pathetic Homo sapiens!"
Don't get me wrong, I'm glad that it's over but..
What the fuck happened?5 -
In my three years experience so far I can honestly say that 100% of the developers I've worked with are narrow sighted with regards to how they develop.
As in, they lack the capacity to anticipate multiple scenarios.
They code with one unique scenario in mind and their work ends up not passing tests or generates bugs in production.
Not to say I'm the best at foreseeing every possible scenario, but I at least TRY to anticipate and test my code as much as possible to identify problems and edge cases.
I usually take much more time to complete tasks than my colleagues, but my work usually passes tests and comes back bug free. Whereas my colleagues get applauded for completing tasks quickly but end up spending lots of time fixing up after themselves when tests fail or bugs appear.
Probably more time wasted than if they had done the job correctly from the start. Yet they're considered to be effecient devs because they work "fast".
Frustrating...7 -
Discovered pro tip of my life :
Never trust your code
Achievements unlocked :
Successfully running C++ GPU accelerated offscreen rendering engine with texture loading code having faulty validation bug over a year on production for more than 1.5M daily Android active users without any issues.
History : Recently I was writing a new rendering engineering that uses our GPU pipeline engine.. and our prototype android app benchmark test always fails with black rendering frame detection assertion.
Practice:
Spend more than a month to debug a GPU pipeline system based on directed acyclic graph based rendering algorithm.
New abilities added :
Able to debug OpenGL ES code on Android using print statement placed in source code using binary search.
But why?
I was aware of the issue over a month and just ignored it thinking it's a driver bug in my android device.. but when the api was used by one of Android dev, he reported the same issue. In the same day at night 2:59AM ....
Satan came to me and told me that " ok listen man, here is what I am gonna do with you today, your new code will be going production in a week, and the renderer will give you just one black frame after random time, and after today 3AM, your code will not show GL Errors if you debug or trace. Buhahahaha ahhaha haahha..... Puffff"
And he was gone..
Thanks satan for not killing me.. I will not trust stable production code anymore enevn though every line is documented and peer reviewed. -
Yet another day at work:
My job is to write test libraries for web services and test others code. Yes I know to code, and have a niche in software testing.
Sometimes developers (whose code I find bugs in) get so defensive and scream in emails and meetings if I point out an issue in their code.
Today, when I pointed a bug in his repo, a developer questioned me in an email asking if I even understood his code, and as a tester I shouldn’t look at his code and only blackbox test it.
I wish I can educate the defensive developer that sometimes, it’s okay to make mistakes and be corrected. That’s how we deliver services that doesn’t suck in production.10 -
If you don't know, there are 2 types of bug fixes:
Hot Fix - Patch files directly on the production
Quick Fix - Deploy fix on production and then test it4 -
Bug I had to fix today: some elements in our React app were being swapped with other elements.
We had `<foo>bar</foo>` on the component but on the html `foo` was being swapped with some other element in our app. It's contents ("bar" in the example) were being left in place, though, so we were getting `<baz>bar</baz>`.
This would only happen when running on production mode. On development everything was fine.
Also, everything seemed fine on the React dev tools. `foo` was where it was supposed to be, but on the html it was somewhere else.
Weirdest shit I ever saw when using React. I found a way to go around it and applied that fix, but I'm still trying to track this down to the source.
The worst part was waiting for fucking webpack to finish the production bundle on every fucking change I wanted to test. I didn't miss the change-save-compile-test flow at all.
What a shit day.4 -
Ever have a bug that *only* occurs in your production environment? How do you test potential fixes? 😜5
-
Aren't you, software engineer, ashamed of being employed by Apple? How can you work for a company that lives and shit on the heads of millions of fellow developers like a giant tech leech?
Assuming you can find a sounding excuse for yourself, pretending its market's fault and not your shitty greed that lets you work for a company with incredibly malicious product, sales, marketing and support policies, how can you not feel your coders-pride being melted under BILLIONS of complains for whatever shitty product you have delivered for them?
Be it a web service that runs on 1980 servers with still the same stack (cough cough itunesconnect, membercenter, bug tracker, etc etc etc etc) incompatible with vast majority of modern browsers around (google at least sticks a "beta" close to it for a few years, it could work for a few decades for you);
be it your historical incapacity to build web UI;
be it the complete lack of any resemblance of valid documentation and lets not even mention manuals (oh you say that the "status" variable is "the status of the object"? no shit sherlock, thank you and no, a wwdc video is not a manual, i don't wanna hear 3 hours of bullshit to know that stupid workaround to a stupid uikit api you designed) for any API you have developed;
be it the predatory tactics on smaller companies (yeah its capitalism baby, whatever) and bending 90 degrees with giants like Amazon;
be it the closeness (christ, even your bugtracker is closed and we had to come up with openradar to share problems that you would anyway ignore for decades);
be it a desktop ui api that is so old and unmaintained and so shitty, but so shitty, that you made that cancer of electron a de facto standard for mainstream software on macos;
be it a IDE that i am disgusted to even name, xcrap, that has literally millions of complains for the same millions of issues you dont even care to answer to or even less try to justify;
be it that you dont disclose your long term plans and then pretend us to production-test and workaround-fix your shitty non-production ready useless new OS features;
be it that a nervous breakdown on a stupid little guy on the other side of the planet that happens to have paid to you dozens of thousands of euros (in mandatory licences and hardware) to actually let you take an indecent cut out of his revenues cos there is no other choice in a monopoly regime, matter zero to you;
Assuming all of these and much more:
How can you sleep at night with all the screams of the devs you are exploiting whispering in you mind? Are all the money your earn worth?
** As someone already told you elsewhere, HAVE SOME FUCKING PRIDE, shitty people AND WRITE THE FUCKING DOCS AND FIX THE FUCKING BUGS you lazy motherfuckers, your are paid more than 99.99% of people on earth, move your fucking greasy little fingers on that fucking keyboard. **
PT2: why the fuck did you remove the ESC key from your shitty keyboards you fuckshits? is it cos autocomplete is slower than me searching the correct name of a function on stackoverflow and hence ESC key is useless? at least your hardware colleagues had the decency of admitting their error and rolling back some of the uncountable "questionable "hardware design choices (cough cough ...magic mouse... cough golden charging cables not compatible with your own devices.. cough )?12 -
A couple of years ago I was working on a fairly large system with a complex (by necessity) access control architecture.
As is usually the case with those projects, it's awkward for developers to repro bugs that have to do with a user's accesses in production when we are not allowed to replicate production data in test, let alone locally.
We had a bug where I ended up making myself a new row in the production database for a thing I could have access to without affecting real data to repro it safely. I identified the bug so I could repro it in dev/test and removed the row and ensured everything worked normally, whew scary.
Have you ever walked into the office one day, and everyone is hunched over in a semicircle around one person's workstation, before one turns around to look at you and says - after a pause - "... ltlian?.."
Turns out I had basically "poisoned the well" with my dummy entity in a way where production now threw 500 for everyone BUT me who had transitive access to this post-non-entity. Due to the scope of the system, it had taken about a day for this to gradually propagate in terms of caching and eventual consistencies; new entities coming in was expected, but not that they disappear.
Luckily I had a decent track record for this to be a one-off. I sometimes think about how I would explain testing in prod and making it faceplant before going home for the day, other than "I assumed it would be fine". I would fire me.3 -
most memorable bug I fixed
Line 1:
- throw new Error(‘test’)
+ // throw new Error(‘test’)
yes, this was committed code in production4 -
Ok I'm officially losing my fucking mind!
I've been trying to solve a connection bug that only occurs in production which is cool if THIS FUCKING APP DIDN'T TAKE 30 MINUTES TO DEPLOY!!!
Been busy for 4 hours and I've only been able to test 4 minor fixes. -
Damn, nespresso! As if your fucking webshop wasn't bad enough, don't bug me (haha) with useless mails AND JUST TEST YOUR SHIT BEFORE PRODUCTION!!! Seriously, thousands of customers pay lots and lots of money for mediocre coffee in aluminum capsules, so JUST HAVE THE DECENCY TO INVEST SOME BUCKS INTO THE CUSTOMER FACING APPS!!!6
-
If I had a nickel every time the unit tests failed not because something was wrong in the code, but because someone had messed up the unit test I'd be able to retire early.
I just spent the better part of 10 hours hunting down a bug in some production code only for the test to be wrong because the person who wrote it had mocked the http response incorrectly.
Nothing I did to "fix" the code worked, because nothing was wrong with it...4 -
We are 3-4 days away from deployment to production. We are still bug fixing. But one coworkers decided this is the time to make a fuss about the way everything is set up. He doesn't like the dev database. So he knocks it over.. and while so doing it, he doesn't inform the team. And when I ask if something else is gonna knock over? No answer! (And something broke down too..)
Now we have issues to test our bugfixes. The whole thing took me half a day finding out and made me distracted with frustration, and not just for me. Most bugs could've been done in that half a day!
I so wanna punch the guy xD but no, I gotta save face, pfff!2 -
To my dev manager: hold on... you are annoyed that one of our high profile customers found a bug in our software after upgrading to production because they didn’t test that feature in staging?? Did I hear that right? Yeah, let’s blame our customers for bugs that we introduced and that we clearly didn’t test during QA... that’s what we’re all about over here...8
-
This is just a sweet little harmless block of code, why should I unit test it?
3 months later...
Bug in production : This is why. -
When the CTO/CEO of your "startup" is always AFK and it takes weeks to get anything approved by them (or even secure a meeting with them) and they have almost-exclusive access to production and the admin account for all third party services.
Want to create a new messaging channel? Too bad! What about a new repository for that cool idea you had, or that new microservice you're expected to build. Expect to be blocked for at least a week.
When they also hold themselves solely responsible for security and operations, they've built their own proprietary framework that handles all the authentication, database models and microservice communications.
Speaking of which, there's more than six microservices per developer!
Oh there's a bug or limitation in the framework? Too bad. It's a black box that nobody else in the company can touch. Good luck with the two week lead time on getting anything changed there. Oh and there's no dedicated issue tracker. Have you heard of email?
When the systems and processes in place were designed for "consistency" and "scalability" in mind you can be certain that everything is consistently broken at scale. Each microservice offers:
1. Anemic & non-idempotent CRUD APIs (Can't believe it's not a Database Table™) because the consumer should do all the work.
2. Race Conditions, because transactions are "not portable" (but not to worry, all the code is written as if it were running single threaded on a single machine).
3. Fault Intolerance, just a single failure in a chain of layered microservice calls will leave the requested operation in a partially applied and corrupted state. Ger ready for manual intervention.
4. Completely Redundant Documentation, our web documentation is automatically generated and is always of the form //[FieldName] of the [ObjectName].
5. Happy Path Support, only the intended use cases and fields work, we added a bunch of others because YouAreGoingToNeedIt™ but it won't work when you do need it. The only record of this happy path is the code itself.
Consider this, you're been building a new microservice, you've carefully followed all the unwritten highly specific technical implementation standards enforced by the CTO/CEO (that your aware of). You've decided to write some unit tests, well um.. didn't you know? There's nothing scalable and consistent about running the system locally! That's not built-in to the framework. So just use curl to test your service whilst it is deployed or connected to the development environment. Then you can open a PR and once it has been approved it will be included in the next full deployment (at least a week later).
Most new 'services' feel like the are about one to five days of writing straightforward code followed by weeks to months of integration hell, testing and blocked dependencies.
When confronted/advised about these issues the response from the CTO/CEO
varies:
(A) "yes but it's an edge case, the cloud is highly available and reliable, our software doesn't crash frequently".
(B) "yes, that's why I'm thinking about adding [idempotency] to the framework to address that when I'm not so busy" two weeks go by...
(C) "yes, but we are still doing better than all of our competitors".
(D) "oh, but you can just [highly specific sequence of undocumented steps, that probably won't work when you try it].
(E) "yes, let's setup a meeting to go through this in more detail" *doesn't show up to the meeting*.
(F) "oh, but our customers are really happy with our level of [Documentation]".
Sometimes it can feel like a bit of a cult, as all of the project managers (and some of the developers) see the CTO/CEO as a sort of 'programming god' because they are never blocked on anything they work on, they're able to bypass all the limitations and obstacles they've placed in front of the 'ordinary' developers.
There's been several instances where the CTO/CEO will suddenly make widespread changes to the codebase (to enforce some 'standard') without having to go through the same review process as everybody else, these changes will usually break something like the automatic build process or something in the dev environment and its up to the developers to pick up the pieces. I think developers find it intimidating to identify issues in the CTO/CEO's code because it's implicitly defined due to their status as the "gold standard".
It's certainly frustrating but I hope this story serves as a bit of a foil to those who wish they had a more technical CTO/CEO in their organisation. Does anybody else have a similar experience or is this situation an absolute one of a kind?2 -
==============
Getting Feedback Rant!
=============
When "this is simpler" feedback results in a function of 500 lines of code.
When I get "don't do X" in the feedback. Thank you very much. What do you want me to do instead?
Unclear feedback.
When the feedback giver changes his mind after I applied the changes!
When applying the feedback introduces a bug.
Simply opinionated feedback that is not enforced by any tool or backed up by any facts.
Please find something better to do in life.
Unactionable feedback.
"Consider X"
I will not consider thank you very much.
"Verify this works"
Duh..
When the feedback giver knows something that you don't.
I know this is a legit case.. still annoying.
"I disagree with the feature"
Go argue with the PM, not relevant to me, thanks!
=====================
GIVING FEEDBACK RANT
=====================
I rewrote the system. Please review it.
No need to review, just approve.
I will change this as part of the next ticket.
I would like to keep it the way it is.
lazy ass..
You can't test this.
It's impossible to test this.
No need to test this.
There's no point to test this.
I'll test this on production.
Not sure why this is working..
Please document this..
Because documentation is like a thing, you know.
Oh, this code is not related to this PR, I just don't want to open a new branch for such a small change. ignore it.
Ignore this.
This will be meaningful in my next change. -
As I'm on a research/algorithm improvement project at work I'm working pretty much independently. As such I've set up an automated test framework and writing tests for any piece of code I touch.
Today as I was fixing a bug in production area I was demoing my tests to CTO and principal design engineer. They come from a hardware background and have pushed back against automated tests in the past but they were interested in what I was doing.
I WILL DRAG THEM KICKING AND FUCKING SCREAMING INTO THE WORLD OF AUTOMATED TESTS.1 -
In last episode of "How SystemD screwed me over", we talked about Systemd's PrivateTMP and how it stopped me from generating SSL certificates.
In today's episode - SystemD vs CGroups!
Mister Pottering and his team apparently felt that CGroups are underused (As they can be quite difficult to set up), and so decided to integrate them into SystemD by default. As well as to provide a friendlier interface to control their values.
One can read about these interactions in the manual page "systemd.resource-control"
All is cool so far. So what happened to me today?
Imagine you did a major system release upgrade of a production server, previously tested on a standalone server. This upgrade doesn't only upgrade the distribution however, it also includes the switch from SysVInit to SystemD. Still, everything went smooth before, nothing to worry now then, right? Wrong.
The test server was never properly stress-tested. This would prove to be an issue.
When the upgrade finishes, it is 4 AM. I am happy to go to bed at last. At 6 AM, however, I am woken up again as the server's webservices are unavailable, and the machine is under 100% CPU load. Weird, I check htop and see that Apache now eats up all 32 virtual cores. So I restart it, casting it off to some weird bug or something as the load returns to normal.
2 hours later, however, the same situation occurs. This time, I scour all the logs I can, and find something weird - Many mentions that Apache couldn't create a worker thread? That's weird.
Several hours of research and tinkering later, I found out the following:
1 - By default, all processes of a system that runs SystemD are part of several CGroups. One of these CGroups is the PID CGroup, meant to stop a runaway process from exhausting all PIDs/TIDs of a system.
This limit is, by default, set to a certain amount of the total available PIDs. If a process exhausts this limit, it can no longer perform operations like fork().
So now, I know the how and why, but how should I solve this? The sanest option would be to get a rough estimate of just how many threads the Apache webserver might need. This option, though, is harder, than apparent. I cannot just take the MaxRequestsWorkers number... The instance has roughly double the amount of threads already. The cause being, as I found out, the HTTP/2 module, which spawns additional threads that do not count towards this limit. So I have no idea what limit to set.
Or I could... Disable the limit for just the webserver via the TasksAccounting switch. I thought this would work. And it did seem to... Until I ran out of TIDs again - Although systemctl status apache2.service no longer reported the number of tasks or a task limit of the process, the PID CGroup stayed set to the previous limit. Later I found out that I can only really disable the Task Accounting for all the units of a given slice and its parents.
This, though, systemctl somewhat didn't make apparent (And I skimmed the manual, that part was my fault)
So... The only remaining option I had was to... Just set the limit to infinite. And that worked, at last.
It took me several hours to debug this issue. And I once again feel like uninstalling systemd again, in favor of sysvinit.
What did I learn? RTFM, carefully, everything is important, it is not enough to read *half* the paragraph of a given configuration option...
Oh, and apache + http/2 = huge TID sink. -
A customer asks for a change request or a bug fix and it results in creating a ticket for that.
It's the process and how it works in most places but after you finish with the task and fix the same customer who provided you with the requirements will request that you share the steps on how to test the fix or the feature.
I'm not speaking about the data preparation or required configuration. I mean a step-by-step instruction on how the tester/QA will test it.
It's driving me mad!! So a way to counterplay this stupid requests, I provide the happy path and what to expect. And in case, they stumbled on a bug later in production, I can easily say "It was approved by your testing team and that's a new requirement ;)"2 -
fix bug
test production
can't reproduce bug
support tickets and people still bitching they have the bug
i think the shit's been merged and deployed long enough it should've rolled over considering the timestamps on follow up cases4 -
Confession: a very important feature of the website I'm developping wasn't working for a certain time. The boss wasn't aware because he doesn't go on the site, and I only found out last week because I needed to implement a new feature that used the previous one. Problem: the bug was only on production, not on local (and of course we don't have test server).
I took advantage of the absence of my boss today to clear the situation by making all of my tests on prod. I hope no customer tried to pass a command today, but it's finally repaired. I am both proud and shameful.3 -
After three months of development, my first contribution to the client is going live on their servers in less than 12 hours. And let me say, I shall never again be doing that much programming in one go, because the last week and a half has been a nightmare... Where to begin...
So last Monday, my code passed to our testing servers, for QA to review and give its seal of approval. But the server was acting up and wouldn't let us do much, giving us tons of timeouts and other errors, so we reported it to the sysadmin and had to put off the testing.
Now that's all fine and dandy, but last Wednesday we had to prepare the release for 4 days of regression testing on our staging servers, which meant that by Wednesday night the code had to be greenlight by QA. Tuesday the sysadmin was unable to check the problem on our testing servers, so we had to wait to Wednesday.
Wednesday comes along, I'm patching a couple things I saw, and around lunch time we deploy to the testing servers. I launch our fancy new Postman tests which pass in local, and I get a bunch of errors. Partially my codes fault, partially the testing env manipulating server responses and systems failing.
Fifteen minutes before I leave work on the day we have to leave everything ready to pass to staging, I find another bug, which is not really something I can ignore. My typing skills go to work as I'm hammering line after line of code out, trying to get it finished so we can deploy and test when I get home. Done just in time to catch the bus home...
So I get home. Run the tests. Still a couple failures due to the bug I tried to resolve. We ask for an extension till the following morning, thus delaying our deployment to staging. Eight hours later, at 1AM, after working a full 8 hours before, I push my code and leave it ready for deployment the following morning. Finally, everything works and we can get our code up to staging. Tests had to be modified to accommodate the shitty testing environment, but I'm happy that we're finally done there.
Staging server shits itself for half a day, so we end up doing regression tests a full day late, without a change in date for our upload to production (yay...).
We get to staging, I run my tests, all green, all working, so happy. I keep on working on other stuff, and the day that we were slated to upload to production, my coworkers find that throughout the development (which included a huge migration), code was removed which should not have. Team panics. Everyone is reviewing my commits (over a hundred commits) trying to see what we're missing that is required (especially legal requirements). Upload to production is delayed one day because of this. Ended up being one class missing, and a couple lines of code, which is my bad (but seriously, not bad considering I'm a Junior who was handed this project as his first task at his first job).
I swear to God, from here on out, one feature per branch and merge request. Never again shall I let this happen. I don't even know why it was allowed to happen, it breaks our branch policies. But ohel... I will now personally oppose crap like this too...
Now if you'll excuse me... I'm going to be highly unproductive and rest, because I might start balding otherwise after these weeks... -
Why? As a senior, you won't give some time to review my code, will let me merge my code to a branch, then blame us when it will produce the bug in production? why? 😐 Won't even arrange a code review/knowledge sharing session so that juniors can learn at least something. Even you won't encourage us write test cases. If seniors don't follow, are the juniors to blame? 🙂3
-
When you spend hours in a messy codebase to fix a bug properly and add an integration spec to cover that specific case.
And even you do a round of testing on staging + providing screenshots, there is always someone on the team that will write in your PR, "It works, I tested the change on my machine".
I understand that some people are skeptical but to the point of not trusting integration tests + screeshots/recordings then please test it on staging or production next time because if it works on your machine doesn't mean it will work there ;)2 -
Since I started my routine of checking bug logs every morning, I've had 2 instances where a website vulnerability scanner was run against a production website and generated over 2,000 Coldfusion errors.
At the time, I was super nervous about the apparent hack attempt, and hyped that the attackers never actually got in. It's nice to know that despite the various errors indicating vulnerable / breakable code, they were ultimately unsuccessful. I know now that a determined attacker could probably have wrecked our production websites. Since then I've made a ton of security-related updates and I'm actually thankful for the script kiddie getting my attention with that scan.
PS. We're now building a website for a local security company who is going to work with us to pen test the site when it's finished! Gulp.4 -
Just after feature launch, major bug on production and now I am getting yelled at by my lead as the issue happens to be with the PR i was responsible for reviewing yesterday. Somehow a logic error got past my review. But considering how large the project is it wasn't possible for me to test out every possible scenario myself. They should have had QA handle that. Also, that was my first code review. I can't understand why my boss has such unrealistic expectations. Bugs are expected at this stage. I feel like he just puts too much pressure on me for no other ther reason other than to just trigger my imposter syndrome. That way, I feel like a bad developer even though I am working my ass off. And he gets to avoid giving me a raise. Cant believe I rejected multiple offers to stay at this company. I don't even know why am I still working for this company anymore.4
-
When you have a customer that is a pain and you have to do a new contract since months but they are no replying but at same time there is a bug in a plugin they are using.
They are not updating their plugins in production but only after a test in staging.
In production there aren't write permission from web server side, so only they have access.
And the plugin has a 0-day. -
A young new dev was working on his first ticket, about a bug during parsing of an uploaded excel file. Our issue was that if the file contained an empty line, all remaining rows were ignored. So the task included extending our tests to cover this case. After 2 weeks (!), his merge request comes in. His idea (without ever asking for help) was to parse the whole file (in some cases huge) in the production code a second time, just to count the rows (!!) and save the count in a public static int field, which was verified in his new test.2
-
One nightmarish project that was doomed from the beginning, had me as the sole developer. I could hardly sleep when we began testing on a separate test system, but with (nearly) all the config stored in shared memory and copied from the production system, I dreaded, half awake, that the production server data base connection was still configured in the test system and that it was shooting all it's test data repeatedly to prod.
Finally drove to company in middle of the night at 4 o'clock. Checked everything was OK, tried to sleep 3 hours before the start of the work day.
This system also had the most hideous memory corruption in some shared memory that was used across several processes and should have been thoroughly protected by a mutex, but somehow, sometimes this crucial map, that was used to speed up the access to all the customer data just contained garbage.
Still haunts me to that day. (Like xkcd's unresolved tension of a non-matching parenthesis - an unresolved bug.