What an absolute fucking disaster of a day. Strap in, folks; it's time for a bumpy ride!
I got a whole hour of work done today: the first hour of the morning, because I went in a bit early. Then people started complaining about Jenkins jobs failing on that one Jenkins server our team has been wanting to decommission for two years, but management won't let us force people onto the new servers. It's a single server with over four thousand projects, some of which run massive data processing jobs that last DAYS. The server was originally set up by people who have since quit, of course, and left it behind for my team to adopt with zero documentation.
Anyway, the 500GB disk is 100% full. The memory (all 64GB of it) is fully consumed by stuck jobs. We can't track down large old files to delete because du chokes on the workspace folder, which has thousands of subfolders, with no RAM to spare. So we decide to basically take a hacksaw to it and delete the workspace for every job not currently in progress. This of course fucked up some really poorly-designed pipelines that relied on workspaces persisting between jobs, so we had to deal with complaints about that as well.
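For anyone facing a saner version of this problem, the usual first step is to have du summarize only the top level of the workspace directory and sort that, rather than letting it walk everything in one go. A rough sketch, assuming GNU du/sort and the default Jenkins home path (which may well not match that server):

# summarize just the top-level workspace folders, biggest last (sizes in KB)
du -x --max-depth=1 /var/lib/jenkins/workspace | sort -n | tail -20
# list workspaces nothing has touched in 90+ days as deletion candidates
find /var/lib/jenkins/workspace -maxdepth 1 -type d -mtime +90 -print

On a box with zero free RAM even that wasn't really an option, hence the hacksaw.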
So we get the Jenkins server up and running again just in time for AWS to have a major incident affecting EC2 instance provisioning in our primary region. People keep bugging me to fix it, I keep telling them that it's Amazon's problem to solve, they wait a few minutes and ask me to fix it again. Emails flying back and forth until that was done.
Lunch time already. But the fun isn't over yet!
I get back to my desk to find out that new hires, and anyone who got a new Mac laptop recently, can't even install our toolchain, because management has started handing out M1 Macs without telling us and all our tools are compiled solely for x86_64. It took some troubleshooting just to figure out what the problem was, because the only error people got from Homebrew was that the formula was empty, when it clearly wasn't.
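If you're stuck in the same spot, the usual stopgap (not something we've actually rolled out yet, just a sketch) is to run the existing x86_64 toolchain under Rosetta 2 until a native build exists:

# confirm what the machine is; an M1 reports arm64
uname -m
# install Rosetta 2 so x86_64 binaries can run at all
softwareupdate --install-rosetta --agree-to-license
# install an x86_64 Homebrew (it lands in /usr/local) and use it under Rosetta
arch -x86_64 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
arch -x86_64 /usr/local/bin/brew install <whatever-formula>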
After figuring out that problem (but not fully solving it yet), one team starts complaining to us about a GitHub problem, because we manage the GitHub org. Except it's not a GitHub problem, and I already knew this because they are a Problem Team that uses some technical authoring software with Git integration while having only the barest understanding of what Git actually does. Turns out it's a Git problem. An update for Git was pushed out recently that patches a big, bad vulnerability, and the way it was patched causes problems for them because they're using Git wrong (multiple users accessing the same local repo on a Samba share). It's a huge vulnerability, so my entire conversation with them went sort of like:
"Please don't."
"We have to."
"Fine, here's a workaround, this will allow arbitrary code execution by anyone with physical or virtual access to this computer that you have sitting in an unlocked office somewhere."
"How do I run a Git command I don't use Git."
So, with that dealt with, I start taking a look at our toolchain, trying to figure out whether I can easily just cross-compile it to arm64 for the M1 MacBooks or whether it will be a more involved fix. And I find all kinds of horrendous shit left behind by the people who wrote the tools and who, naturally, quit over a year ago and left them for us to adopt. I'm talking entire functions, in a tool used by hundreds of people, that were put in as a joke; poorly documented functions I am still trying to puzzle out; exactly zero comments in the code; and abbreviated function names like "gars", "snh", and "jgajawwawstai".
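Before deciding between a quick cross-compile and a full rework, the first sanity check is just seeing which architectures the current binaries contain and whether clang will happily emit a universal build. Roughly, with a hypothetical tool name and assuming the tools are plain C built on a Mac:

# which architectures does the shipped binary actually contain?
lipo -archs /usr/local/bin/ourtool
# build a universal (fat) binary in one go
clang -arch x86_64 -arch arm64 -o ourtool ourtool.c
# or build each slice separately and glue them together
lipo -create ourtool_x86_64 ourtool_arm64 -output ourtool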
While I'm looking into that, the person on our team responsible for incident communication finally gets the AWS EC2 provisioning issue reported to IT Operations, who then send out an alert to affected users that should have gone out hours earlier.
Meanwhile, according to the health dashboard in AWS, the issue had already been resolved three hours before the communication went out, and the ticket remains open at this moment, as far as I know.
Did a bunch more of what I call cowboy coding today (coding in vi on production). Gather 'round kiddies, uncle Logan's got a story fer ya…
First things first, disclaimer: I'm no sysadmin. I respect sysadmins and the work they do, but I'm the first to admit my strengths definitely lie more in writing programs rather than running servers.
Anyhow, I recently inherited someone else's codebase (the story of my professional career, but I digress) and let me tell you, this thing has amateur hour written all over it. It's written in PHP and JavaScript by a self-taught programmer who apparently discovered procedural programming, decided there was nothing left to learn, and stopped there (no disrespect to self-taught programmers).
I could rant for days about the various problems this codebase has, but today I have a very specific story to tell. A story about errors and logs.
And it all started when I noticed the disk space on our server was gradually decreasing.
So today I logged onto our API server (Ubuntu running Apache/PHP) and did a df -h to check the disk space, and was surprised to see that it had noticeably decreased since the last time I'd checked when everything was running smoothly. But seeing as this server does not store any persistent customer data (we have a separate db server) and purely hosts the stateless API, it should NOT be consuming disk space over time at all.
The only thing I could think of was the logs, but the logs were very quiet, just the odd benign message that was fully expected. Just to be sure I did an ls -Sh to check the size of the logs, and while some of them were a little big, nothing over a few megs. Nothing to account for gigabytes of disk space gradually disappearing.
What could it be? I wondered.
cd ../..
du . | sort --sort=numeric
What's this? 2671132 K in some log folder buried in the API source code? I cd into it and it turns out there are separate PHP log files in there, split up by customer, so that each customer of ours (we have 120) has their own respective error log! (Why??)
Armed with this newfound piece of (still rather unbelievable) evidence I perform a mad scramble to search the codebase for where this extra logging is happening and sure enough I find a custom PHP error handler that is capturing (most) errors and redirecting them to these individualized log files.
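I won't paste the real code, but the shape of it was roughly this kind of handler (a reconstruction, not the actual source; the customer lookup and paths are made up):

<?php
// Anti-pattern: swallow every PHP error into a per-customer log file
set_error_handler(function ($severity, $message, $file, $line) {
    $customer = $_SESSION['customer_id'] ?? 'unknown';    // hypothetical lookup
    $logFile  = __DIR__ . "/logs/{$customer}_error.log";  // one file per customer (why??)
    error_log(date('c') . " [$severity] $message in $file:$line\n", 3, $logFile);
    return true; // true suppresses PHP's normal handling, so nothing reaches the main error_log
});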
Conveniently enough, not ALL errors were being absorbed though, so I still knew the main error_log was working (and any time I explicitly error_logged it would go there, so I was none the wiser that this other error-catching was even happening).
Needless to say I removed the code as quickly as I found it, tail -f'd the error_log and to my dismay it was being absolutely flooded with syntax errors, runtime PHP exceptions, warnings galore, and all sorts of other things.
My jaw almost hit the floor. I've been with this company for 6 months and had no idea these errors were even happening!
The sad thing was how easy all the errors ended up being to fix. Most of them were "undefined index" errors that could have been completely avoided with a simple isset() check, but instead ended up throwing an exception, nullifying any code that came after it.
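For the non-PHP folks, the fix really is that trivial (hypothetical key names):

<?php
$params = ['name' => 'Logan'];               // 'email' was never set
$email  = $params['email'];                  // "undefined index" notice that, in this codebase, killed the rest of the request
$email  = isset($params['email']) ? $params['email'] : null;  // the simple guard
$email  = $params['email'] ?? null;          // same thing with PHP 7+ null coalescing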
Anyway kids, the moral of the story is don't split up your log files. It makes absolutely no sense and can end up obscuring easily fixable bugs for half a year or more!
Happy coding.
this is not a rant.. more like the opposite of a rant.. I want to share my overwhelming joy.. that is.. finally receiving my preview for AWS EFS!!!!! I must have applied for this about 8 times! And only today do I get my prayers answered! Today is going to be a good day! :D