Search - "replication"
-
--- GitHub 24-hour outage post mortem ---
As many of you will remember, Github fell over earlier this month and cracked its head on the countertop on the way down. For more or less a full 24 hours the repo-wrangling behemoth presented inconsistent data to users, with slow response times and failing requests during common user actions such as reporting issues and questioning your career choice in code reviews.
It's been revealed in a post-mortem of the incident (link at the end of the article) that DB replication was the root cause of the chaos, after a failing 100G network link was replaced during routine maintenance. I don't pretend to be a rockstar-ninja-wizard DBA, but after speaking with colleagues who went a shade whiter when the term "replication" was used - it's hard to predict where a design decision will bite back and leave you untangling the web of lies and misinformation reported by the databases for weeks if not months after everything's gone a tad sideways.
When the link was yanked out of the east coast DC undergoing maintenance, Github's "Orchestrator" software did exactly what it was meant to do: it hit the "ohshi" button and failed over to another DC that wasn't reporting any issues. The hitch in the master plan was that when connectivity came back up at the east coast DC, Orchestrator was unable to fail back to the east coast DC because each cluster contained data the other didn't have.
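The post mortem doesn't spell out the exact mechanics, but "each cluster contained data the other didn't have" is easy to picture with GTID-based MySQL replication: two gtid_executed sets where neither contains the other. A hedged sketch of how you'd check for that - hostnames and credentials below are made up, and this is not GitHub's actual tooling:

```python
# Hedged sketch, not GitHub's tooling. Assumes GTID-based MySQL replication;
# hostnames and credentials are made up.
import mysql.connector

def gtid_executed(host):
    conn = mysql.connector.connect(host=host, user="repl_check", password="secret")
    try:
        cur = conn.cursor()
        cur.execute("SELECT @@GLOBAL.gtid_executed")
        return cur.fetchone()[0]
    finally:
        conn.close()

east = gtid_executed("db-east.example.internal")
west = gtid_executed("db-west.example.internal")

# GTID_SUBTRACT(a, b) = transactions in set a that set b has never seen.
conn = mysql.connector.connect(host="db-west.example.internal",
                               user="repl_check", password="secret")
cur = conn.cursor()
cur.execute("SELECT GTID_SUBTRACT(%s, %s), GTID_SUBTRACT(%s, %s)",
            (east, west, west, east))
only_east, only_west = cur.fetchone()
conn.close()

if only_east and only_west:
    # Both sides hold writes the other is missing: no clean fail-back,
    # someone gets to reconcile data by hand.
    print("Split brain - reconcile manually before pointing replication anywhere.")
```

If only one side had extra transactions you could fail back cleanly; with both sides holding unique writes, the untangling begins.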
At this point it's reasonable to assume that pants were turning funny colours - monitoring systems across the board started squealing, firing off messages to engineers demanding they rouse from the land of nod and snap back to a reality that was a bit more "on fire" than usual. A quick call to Orchestrator's API returned a result set that only contained database servers from the west coast - none of the east coast servers had responded.
Come 11pm UTC (about 10 minutes after the initial pant re-colouring) engineers realised they were well and truly backed into a corner; the site was flipped into "Yellow" status and internal mechanisms for deployments were locked out. Five minutes later an Incident Co-ordinator was dragged from their lair by the status change and almost immediately flipped the site into "Red" status, a move I can only hope was accompanied by all the lights going red and klaxons sounding.
Even more engineers were roused from their slumber to help with the recovery effort. By this point hair was turning grey in real time - the fail-over DB cluster had been processing user data for nearly 40 minutes, and every second that passed made the inevitable untangling process exponentially more difficult. Not long after this Github made the call to pause webhooks and Github Pages builds in an attempt to prevent further data loss, causing disruption to those of us using Github as a way of kicking off our deployment processes (myself included, I had to SSH in and run a git pull myself like some kind of savage).
Glossing over several more "And then things were still broken" sections of the post mortem: clever engineers with their heads screwed on the right way successfully executed what I can only imagine was a large, complex and risky plan to untangle the mess and restore functionality. Github was picked up off the kitchen floor and promptly placed in a comfy chair with a sweet tea to recover. The enormous backlog of webhooks and Pages builds was caught up with and everything was more or less back to normal.
It goes to show that even the best laid plan rarely survives first contact with the enemy - in this case, a failing 100G network link somewhere inside an east coast data center.
Link to the post mortem: https://blog.github.com/2018-10-30-...
Yesterday Gitlab had severe issues.
Somehow their database replication had split and ALL of the traffic went over one server.
I absolutely love how transparent they were about this issue and that they shared exactly what they were doing.
A lot of engineering fads go in circles.
Architecture in the 80s: Mainframe and clients.
Architecture in the 90s: Software systems connected by an ESB.
Architecture in the 2000s: Big central service and everyone connects to it for everything
Architecture in the 2010s: Decentralized microservices that communicate with queues.
Current: RabbitMQ and Kafka.
... Can't we just go back to the 90s?
I hate fads.
I hate when I have to get some data, and it's scattered on 20 different servers, and to load a fucking account page, a convoluted network of 40 apps has to be activated, some in PHP, others in JS, others in Java, that are developed by different teams, connected to different tiny ass DBs, all on huge clusters of tiny ass virtual machines that get 30% load at peak hours, 90% of which comes from serializing and parsing messages. 40 people maintaining this nightmare, that could've been just 7 people making a small monolithic system that easily handles this workload on a 4-core server with 32GB of RAM.
Triple it, put it behind a load balancer, add proper DB replication (use fucking CockroachDB if you really want survivability), and you've got zero downtime at a fraction of the cost.
Just because something's cool now doesn't mean that everybody has to blindly follow it, for fuck's sake!
Same rant goes for functional vs OOP and all that crap. Going blindly with any of these is just a stupid fad, and the main reason why companies need refactoring of legacy code.
manager: we had great feedback last week, real users were testing our app! However, we have noticed a lot of issues regarding database performance and data replication...
me: oh, that's great news!! How many users? Like hundreds?
manager: no, 6 users so far
Best code performance incr. I made?
Many, many years ago our scaling strategy was to throw hardware at performance problems. Hardware consisted of a dedicated web server and a backing SQL Server box, so each site instance had two servers (and data replication processes in place).
Two servers turned into 4, 4 into 8, 8 into around 16 (don't remember exactly what we ended up with). With Windows Server and SQL Server licenses getting into the hundreds of thousands of dollars, the 'powers-that-be' were becoming very concerned with our IT budget. With our IT-VP and other web mgrs being hardware-centric, they simply shrugged and told the company that's just the way it is.
Taking it upon myself, I started looking into utilizing web services, caching data (Microsoft's Velocity at the time), and a service that returned product data, the bottleneck for most of the performance issues. Description, price, simple stuff. Testing the scaling in our dev environment, a single web server and a single backing SQL Server, the service was able to handle 10x the traffic with much better performance.
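The pattern itself was nothing exotic - basically cache-aside. Velocity is long gone, so here's a hedged, stripped-down sketch with an in-process dict standing in for the cache and a made-up function standing in for the expensive SQL call:

```python
# Hedged sketch of the cache-aside idea; product_db_lookup is a placeholder
# for the real "hit SQL Server for description/price" query.
import time

_cache = {}          # product_id -> (expires_at, payload)
TTL_SECONDS = 300    # product data changes rarely, so a short TTL is plenty

def product_db_lookup(product_id):
    # Placeholder for the expensive database round trip.
    return {"id": product_id, "description": "widget", "price": 9.99}

def get_product(product_id):
    entry = _cache.get(product_id)
    if entry and entry[0] > time.time():
        return entry[1]                       # cache hit: no DB round trip at all
    payload = product_db_lookup(product_id)   # cache miss: exactly one DB round trip
    _cache[product_id] = (time.time() + TTL_SECONDS, payload)
    return payload
```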
Since the majority of the IT mgmt were hardware centric, they blew off the results, saying my tests were contrived and my solution wouldn't work in 'the real world'. Not 100% wrong - I had no idea what would happen when real traffic hit the site.
With our other hardware guys concerned the web hardware budget was tearing into everything else, they helped convince the 'powers-that-be' to give my idea a shot.
Fast forward a couple of months (lots of web code changes); early one morning we started slowly turning on the new framework (3 load balanced web service servers, 3 web servers, one SQL server). 5 minutes... no issues, 10 minutes... no issues, an hour... everything is looking great. Then (A is a network admin)...
A: "Umm...guys...hardly any of the other web servers are being hit. The new servers are handling almost 100% of the traffic."
VP: "That can't be right. Something must be wrong with the load balancers. Rollback!"
A:"No, everything is fine. Load balancer is working and the performance spikes are coming from the old servers, not the new ones. Wow!, this is awesome!"
<Web manager 'Stacey'>
Stacey: "We probably still need to rollback. We'll need to do a full analysis to why the performance improved and apply it the current hardware setup."
A: "Page load times are now under 100 milliseconds from almost 3 seconds. Lets not rollback and see what happens."
Stacey:"I don't know, customers aren't used to such fast load times. They'll think something is wrong and go to a competitor. Rollback."
VP: "Agreed. We don't why this so fast. We'll need to replicate what is going on to the current architecture. Good try guys."
<later that day>
VP: "We've received hundreds of emails complementing us on the web site performance this morning and upset that the site suddenly slowed down again. CEO got wind of these emails and instructed us to move forward with the new framework."
After full implementation, we were able to scale back to only a few web servers and a single SQL server, saving an initial $300,000 with potential future savings of over $500,000. Budget analysis, factoring in other considerations over the next 7 years, showed this would save the company over a million dollars.
At the semi-annual company wide meeting, our VP made a speech.
VP: "I'd like to thank everyone for this hard fought journey to get our web site up to industry standards for the benefit of our customers and stakeholders. Most of all, I'd like to thank Stacey for all her effort in designing and implementation of the scaling solution. Great job Stacy!"
<hands her a blank white envelope, hmmm...wonder what was in it?>
A few devs who sat in front of me turn around, network guys to the right, all looking at me with puzzled expressions, one mouthing "WTF?"
Got a few Jira tickets reassigned to me because the dev who was supposed to work on them got stuck on another project. It's fine, that happens.
I open the tickets. No descriptions for all of them. No screenshots for those reported as bugs, nor any replication steps. No attached test cases or, well, ANY useful information.
I talk to our BA, he says that all the information I need is in OTHER tickets on ANOTHER BOARD that business manages but I DON'T HAVE ACCESS TO. Honestly, these shitfucks could've just done a simple copy/paste. But nooooo...
So I reassign all the tickets back to their original reporters (business testers) with comments requesting more information.
It's been a week. Now I have no idea what to put in my time sheet.
Ok. FUCK MySQL Workbench.
Most of our products are built on MySQL and we've just had enough of the tools that we are using for our MySQL databases...
We decided to make our own tool :)
If it goes well, we plan to open source it. Would you guys be interested in it?
We planned the following features:
1. Schema editing
2. Schema versioning
3. Update/downgrade script generation to move easily between schema versions (see the sketch after this list)
4. Manual/auto sync
5. Might include our own replication solution too...
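To make feature 3 concrete, here's a hedged sketch of the idea - nothing from the actual tool, just an illustrative up/down SQL pair per schema version and a helper that emits the script to move between any two versions:

```python
# Hedged sketch of feature 3, not the actual tool: explicit up/down statements
# per schema version, plus a helper that builds the migration script.
MIGRATIONS = {
    2: ("ALTER TABLE users ADD COLUMN last_login DATETIME NULL",
        "ALTER TABLE users DROP COLUMN last_login"),
    3: ("CREATE INDEX idx_orders_created ON orders (created_at)",
        "DROP INDEX idx_orders_created ON orders"),
}

def script(current, target):
    if target >= current:
        # Upgrade: apply the "up" statements in ascending order.
        return [MIGRATIONS[v][0] for v in range(current + 1, target + 1)]
    # Downgrade: apply the "down" statements in descending order.
    return [MIGRATIONS[v][1] for v in range(current, target, -1)]

print(";\n".join(script(1, 3)) + ";")  # upgrade script: version 1 -> 3
print(";\n".join(script(3, 1)) + ";")  # downgrade script: version 3 -> 1
```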
What do you think?
From Gitlab: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."
-
Daghhhhhhhh Kafka.
Set it up, seems to work fine.
Oh no...! Take a broker down, then messages go missing - hmm, that's not right. Fine, I'll just look into... Ah, bad replication factor, my fault. So then it's all fixed! Woop. Wait, no. Some messages are still going missing occasionally. Oh, only set to "at most once" delivery. My bad, fix that, and... now everything is out of order. Oh, ok, partitions set up wrongly. Wtf, now the whole thing stalls when there's a network blip until a restart. Right, ok, looks like commits have to receive acks in the library I'm using before continuing. Switch to a library that uses CommitWithoutReply. Brilliant....
Apart from the fact that said library seems to have commits failing all over the place because it keeps trying to commit during a rebalance 🙄😒😤
The frustrating thing is I KNOW for a fact that Kafka is a fault tolerant, resilient, horizontally scalable thing capable of handling stupid amounts more than I'm throwing at it without missing a beat. But damn, configuring it and checking you've configured it sanely is a royal, monumental PITA.
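For reference, most of the pain above boils down to a handful of settings. A hedged sketch using the confluent-kafka Python client - broker addresses, topic name and replica counts are placeholders, not my actual setup:

```python
# Hedged sketch with the confluent-kafka Python client; all names are placeholders.
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

BROKERS = "kafka-1:9092,kafka-2:9092,kafka-3:9092"

# Topic: replication.factor > 1 so one dead broker doesn't lose data, and
# min.insync.replicas = 2 so "acks=all" really means "on at least two brokers".
admin = AdminClient({"bootstrap.servers": BROKERS})
admin.create_topics([NewTopic("orders", num_partitions=6, replication_factor=3,
                              config={"min.insync.replicas": "2"})])

# Producer: acks=all plus idempotence moves delivery away from "at most once",
# at the cost of a little latency.
producer = Producer({
    "bootstrap.servers": BROKERS,
    "acks": "all",
    "enable.idempotence": True,
})

def report(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")

# Keying by entity id keeps all events for that entity in one partition,
# which is what actually preserves ordering.
producer.produce("orders", key="order-42", value=b"created", callback=report)
producer.flush()
```

And even then you still have to sanity-check min.insync.replicas against the replication factor, or acks=all quietly guarantees less than you think.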
If they had followed my suggestion and gone straight to debugging the server issues, they would have solved it in week 1 and everyone would have thought the migration had a minor performance hiccup. In fact, we have already done such a thing at least twice before and nobody batted an eye.
Instead they self-labelled the migration a failure on first error, setting the stage for apologizing to the client, and put themselves on the spot for a whole staging/production signoff, replication/backup workflow, almost a blue-green "seamless" deployment reminiscent of DigitalOcean.
Well they're not DigitalOcean, and anyone who has spent any time understanding users knows they will not participate in "new system" tests long enough to find or report issues.
So of course the migration stretched out to almost three months, up until the whole reason for the migration - the rapidly escalating risk of the old provider disappearing - hit like a freight train, and now they have to go through the problem of debugging the server like I told them to in week 1. Only this time they've set the client mindset against it, lost any chance of reverting, have been at grave risk of data loss, and are under pressure to debug other people's code in real time.
This is why I don't trust devs to do ops. A dev's first solution to any problem is to throw tech at it.
Someone created a 0-followers private Twitter account and posted something to try out the new views count feature.
It racked up dozens of views in a couple of hours.
HOW?!?
Source: https://twitter.com/briggityboppity...
It looks like a funny data reverse-engineering exercise, so let's try and figure out what is going on.
Hypothesis 1) it is the OP's own views.
Reasonable, but unlikely if what OP says about not checking it for hours is true.
H2) It's some background job in OP's device that is refreshing OP's own latest tweets, so even without human interaction technically H1 is true. It would be some really shoddy engineering to count eye-less page views, but that's also what managers would demand.
H3) it's some internal automated Twitter function like backup, replication, indexing and word count.
See H2, it would be even dumber to count that as page views.
H4) it's some internal human reviewing for a keyword that could be associated with porn (in this case, "butts"). Really? dozens of humans to review a no-impact single post? They would have to employ hundreds of thousands of reviewers.
H5) it's some page-loading shit, like thousands of similar tweets get stored in the same index hash page and end up counting as a view in all of them every time someone loads the index page. It would be like counting every hit in the namenode as a hit in every data asset in its Hadoop partition, or every hit in a storage block as a hit in each of its files.
Duuuumb and kinda like H3.
H6) page views are just a fraud to scam investors. Maybe it's a "most Blockchain transactions are fake" situation, maybe it's a "views get more engagement if you don't think a lot about it" situation, maybe it's a "we don't use the metric system to count page views" situation.
All of them are very dumb.
Other hypotheses or opinions?
TLDR: I need advice on reasonable salary expectations for sysadmin work in the rural United States.
I need some community advice. I’m the sysadmin at a small (35 employee) credit card processing company. I began as an intern and have now become their full time sysadmin/networking specialist. Since I was hired in January I have:
-Migrated their 2007 Exchange server to Office 365
-Upgraded their ailing Windows server 2003 based architecture to 2012R2
-Licensed their unlicensed VMware ESXi servers (which they had already paid for license keys for!!!) and then upgraded them to 6.5 while preventing downtime on hosted VMs using tricky transfers and deployments (without vMotion!)
-Deployed a vCenter server to manage said ESXi servers easier
-Fixed a three month gap in their backups by implementing Veeam, and verifying its functionality
-Migrated a ‘no downtime’ fileserver to a new hypervisor host, implemented a ‘hot standby’ server as a backup kept up to date by the minute with DFS replication.
-Replaced failing hard drives in a RAID array underlying their one ‘business critical’ fileserver, which had no backups for 3 months at that time
-Reorganized Active Directory and Group Policy deployment from a nightmare spiderweb of OUs and duplicate policies
-Documented the entire old network and now the new one as I’ve been upgrading this
-Audited the developers' AWS instances and removed redundant machines, optimized load balancing on front-end Nginx servers, joined developer-run Fedora workstations to the AD domain and implemented centralized syslog monitoring on them.
-Performed network scans and rewrote firewall exceptions to tighten security
There’s more, but you get the idea. I’ve now been tasked with taking point on an upcoming PCI audit which will be my first.
I’m being paid $16/hr US, with marginal health benefits. This is roughly $32,000 a year, before taxes.
I have two years previous work experience managing a third party Apple repair facility (SimplyMac) and every Apple certification for warranty repair and software troubleshooting. I have a two year degree in general sciences, with about 4 years of college credit (Two years of a physics education and two years of computer science after I switched focus) I’m actively pursuing a CCNA and MCSA server 2016 with exams paid for and scheduled.
I’m going into a salary negotiation in two months. What is a reasonable salary to request, from your perspective, for someone in my position?
Thanks in advance!
For the fucking millionth time!!!
Backup != master-slave replication, you dumb fuck...
What the fuck is so hard to understand after countless explanations using fucking drawings and shit?
Wtf dude...
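For the next person who needs the drawing: a hedged illustration (paths, hostnames and credentials are made up) of why a replica is not a backup - the replica replays everything, including your mistakes, while only a point-in-time dump lets you rewind:

```python
# Hedged illustration; paths, hostnames and credentials are made up.
import datetime
import subprocess

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
dump_file = f"/backups/all-databases-{stamp}.sql"

# THIS is a backup: a consistent point-in-time snapshot you can restore later.
subprocess.run(
    ["mysqldump", "--single-transaction", "--all-databases",
     "--host", "db-master.example.internal", f"--result-file={dump_file}"],
    check=True,
)

# A master-slave replica is NOT a backup: the slave happily executes
#   DROP TABLE customers;
# a heartbeat after the master does, and then it's gone on both.
```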
Completely fucked up replication of MySQL servers.
Remote: 2 different database servers
--> made sense.
Except the misconfiguration. Or better: No configuration at all.
So how do you solve the massive delays and make everything even more crazy?
2 remote servers - 2 read-only slaves for reading data remotely (master-slave)
2 local (internal) servers.
Remote - local master-master.
Unfucking this clusterfuck was a real nightmare.
It had to be done at night, cause everything needed to be ripped apart.
And the servers were the backend of a warehouse with supply chain and multiple selling channels (Amazon, eBay etcetera).
So. It had to run the next day at 05:00 so the incoming orders could be packaged/prepared for shipping.
That was fun. Not.
And the clusterfuck died spectacularly on my first work day - the old DBA was gone (fired....)
:)
F**k companies whose apps use MySQL/MariaDB tables with the MEMORY storage engine.
Seriously.
That engine *sucks* to work with as an admin. It's such a huge pain in the ass having to always dump the whole DB instead of taking a snapshot.
And if the replica restarts... Poof. Replication breaks. Cuz all the memory tables are suddenly empty!
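If you've just inherited one of these setups, at least find out how bad it is. A hedged sketch (connection details are placeholders) that lists everything living only in RAM:

```python
# Hedged sketch; connection details are placeholders. Lists every MEMORY table so
# you know exactly what will come back empty after a restart and break replication.
import mysql.connector

conn = mysql.connector.connect(host="replica.example.internal",
                               user="admin", password="secret")
cur = conn.cursor()
cur.execute(
    "SELECT table_schema, table_name "
    "FROM information_schema.tables WHERE engine = 'MEMORY'"
)
for schema, table in cur.fetchall():
    print(f"{schema}.{table} lives only in RAM - gone on restart")
conn.close()
```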
Fml. Fmfl. Ugh.
Let's start a discussion about "decentralized". EveRyOne caN hOsT hiS oWn ServEr. Do you mean the freaking internet in general? By definition, the internet is decentralized. "Decentralization has a protocol we all use to stay in sync". That already existed; it's called IP, TCP and UDP. The decentralization protocols sit on top of those, making it only more limiting. Good, many nodes in sync. Yeah, replicating SQL servers have existed for a long time.
People who 'invented' decentralized just did not realize how the internet works. Adding a network on top of a network ends up in a smaller network, making it more centralized. "Decentralized" stuff has nothing to add. Just some word for a replication protocol or smth.
I'm too sober to fall for this shit.
When your IT VP starts speaking blasphemy:
"Team,
We all know what’s going on with the API. Next week we may see 6x order volumes.
We need to do everything possible to minimize the load on our prod database server.
Here are some guidelines we’re implementing immediately:
· I'm revoking most direct production SQL access (even read-only). You should be running analysis queries and data pulls out of the replication server anyway.
· No User Management activities are allowed between 9AM and 9PM EST. If you’re going to run a large amount of updates, please coordinate with a DBA to have someone monitoring.
· No checklist setup/maintenance activities are allowed at all. If this causes business impact please let me know.
· If you see anything in [App Name] that's running long, kill it and get a DBA involved.
Please keep the communication level high and stay vigilant in protecting our prod environment!"
RIP most of what I do at work.
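To be fair, the "run analysis queries out of the replication server" guideline is the one genuinely sensible piece of that email - it's just read/write splitting. A hedged sketch (hostnames and credentials are placeholders):

```python
# Hedged sketch of read/write splitting; hostnames and credentials are placeholders.
import mysql.connector

primary = mysql.connector.connect(host="db-primary.example.internal",
                                  user="app", password="secret", database="shop")
replica = mysql.connector.connect(host="db-replica.example.internal",
                                  user="report", password="secret", database="shop")

# Writes still have to hit the primary...
cur = primary.cursor()
cur.execute("INSERT INTO orders (sku, qty) VALUES (%s, %s)", ("ABC-1", 3))
primary.commit()

# ...but the heavy analysis pull runs against the replica, where a long scan
# can't slow order processing down (it may lag the primary by a few seconds).
rcur = replica.cursor()
rcur.execute("SELECT sku, SUM(qty) FROM orders GROUP BY sku")
rows = rcur.fetchall()
```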
Somehow further back on this ticket than when I started. How does that even happen?
I went from "okay, I can replicate the issue and I'm trying to fix it" to "the data I need to replicate it won't even load". WTF?
I don't know
Senior doesn't know
Our boss doesn't know
3 separate QAs don't know
Boss of QA doesn't know
More and more, I am getting frustrated/depressed by the attitude of our customers who complain, moan and get angry about issues in their infrastructure while, at the same time, refusing to pay more so the issues could be mitigated.
Like, a client's angry with us today for having one of their non-production-critical databases inaccessible for... hmm... about 8 hours now (so a whole workday).
Like... I get it, some of your employees couldn't work while it was offline, but like... What the hell do we do? You keep data from as far back as several years ago in there, without partitioning, without exports, in a mix of InnoDB and MyISAM, so when the DB crashes and its replication has to be reset from zero, reimporting all the data takes hours upon hours, and importing .sql files just takes time.
Or another client who got angry when their app fell out of the internet, cuz one of their myisam-based log tables crashed, and had to be repaired, with data spanning several years back, meaning it took hours to fix...
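The MyISAM part at least has a known fix: it's not crash-safe, and converting those log tables to InnoDB removes the hours-long repair scenario. A hedged sketch that just generates the ALTER statements - hostnames and credentials are made up, and large tables will take a while to convert, so run them in a maintenance window:

```python
# Hedged sketch; hostnames and credentials are made up. Generates the ALTERs that
# move crash-prone MyISAM tables to InnoDB, skipping the system schemas.
import mysql.connector

conn = mysql.connector.connect(host="client-db.example.internal",
                               user="admin", password="secret")
cur = conn.cursor()
cur.execute(
    "SELECT table_schema, table_name FROM information_schema.tables "
    "WHERE engine = 'MyISAM' AND table_schema NOT IN "
    "('mysql', 'information_schema', 'performance_schema')"
)
for schema, table in cur.fetchall():
    print(f"ALTER TABLE `{schema}`.`{table}` ENGINE=InnoDB;")
conn.close()
```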
The more I work with these "basic" and "simple" infrastructure designs that are *not* redundant or HA, the more I wonder - how do the big names out there do it? How do you design systems with fault tolerance so a single DB table crash doesn't lead to the whole app becoming inaccessible?
We have... One, exactly one, client who uses MariaDB with Galera, and that cluster is *amazing* - it just keeps chugging along without a care in the world. But it cost them quite a lot, as they had to buy 3 DB servers instead of 1...
So, this new company I joined - my first task was to revamp the codebase. I have ended up cleaning up the garbage so much that I have lost all my touch with innovation. I am just doing regular and tedious feature replication. I hope I finish this soon and get into the real things!! Building something from scratch.
-
I think one of the great things about this app is how we handle bug reporting. Any time someone posts about a bug, people independently do extensive tests to determine the extent of the bug (or at least what devices are affected), and people nearly always give detailed replication steps. I think this is a great feature of this community
-
Enabled a MySQL optimization once, corrupted the whole live DB on a Sunday while I was an hour away from the nearest internet connection.....
Luckily we had a live replication server so not much was lost (though uploading it over a 14 Mbit connection takes a long time for 30GB).
TL;DR I have to bump a Redis cluster from t3.medium to m6g.large just to get enough network bandwidth even though I have no need of the extra memory.
Debugged an interesting issue today.
I am adding ElastiCache to a project to reduce the strain on the single-node Postgres DB.
Deployed a Redis replication group with 2 shards, with multi-AZ replication for resilience.
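For anyone setting up something similar, this is roughly what that deployment looks like via boto3 - a hedged sketch where the ids, node type, region and counts are placeholders; check the ElastiCache API docs before copying any of it:

```python
# Hedged sketch via boto3; ids, node type, region and counts are placeholders.
import boto3

elasticache = boto3.client("elasticache", region_name="eu-west-1")

elasticache.create_replication_group(
    ReplicationGroupId="app-cache",
    ReplicationGroupDescription="Read cache in front of the single-node Postgres",
    Engine="redis",
    CacheNodeType="cache.m6g.large",   # bumped from t3.medium purely for network headroom
    NumNodeGroups=2,                   # 2 shards
    ReplicasPerNodeGroup=1,            # 1 replica per shard, spread across AZs
    AutomaticFailoverEnabled=True,
    MultiAZEnabled=True,
)
```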
Everything was going well. We aren't caching that much atm, so it was barely using 100 MB of memory.
Suddenly, when our US region comes online, latency skyrockets and the logs are full of Jedis timeout errors.
Still no issue with memory or node CPU.
The cause? Arbitrary network bandwidth throttling by AWS. The app currently processes about 3,000 requests per second, so we were exceeding Amazon's random-ass allowances, which aren't documented anywhere.
"[Elasticache isn't a managed service because] You are still choosing instance size, setting up replication and clustering more or less without their help"
From their website:
"Amazon offers a fully managed Redis service, Amazon ElastiCache for Redis"
Just because it's configurable doesn't mean it's not managed.
The reason for replication of studies and efforts that are not exceedingly time-intensive is the same as the reason we all do the same problems on a calculus exam:
you have to learn somehow.
For those learning MongoDB and struggling to find resources on sharding/replication, this video tutorial from Vemara Hub on YouTube is fantastic and his blog also has it in article form. This is where mongo shines.
Video tutorial: https://youtube.com/watch/...
Article: https://csrepo.blogspot.com/2019/...
All credit to Rajesh Nair.
Has anyone figured out how to make a free replication database in SQL... I don't want to pay $200 a month for a backup of a 50GB database
-
Am I the only one whose hands start shaking when about to send "CHANGE MASTER TO" on a dev server?
Happened to me yesterday: replication got stuck after a relay log file got corrupted when the database segfaulted under my hands.
I could check and recheck the positions I was about to reset it to a billion times and I was still nervous!
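For posterity, the hand-shaking sequence itself - a hedged sketch with placeholder hosts, credentials and binlog coordinates; double-check the coordinates (SHOW MASTER STATUS on the master, plus wherever the replica actually stopped) before running anything like this:

```python
# Hedged sketch; hosts, credentials and binlog coordinates are placeholders.
import mysql.connector

replica = mysql.connector.connect(host="dev-replica.example.internal",
                                  user="repl_admin", password="secret")
cur = replica.cursor()

cur.execute("STOP SLAVE")
# RESET SLAVE throws away the (corrupted) relay logs; the replica will re-fetch
# everything from the master starting at whatever coordinates we hand it next.
cur.execute("RESET SLAVE")
cur.execute(
    "CHANGE MASTER TO"
    " MASTER_HOST='dev-master.example.internal',"
    " MASTER_USER='repl',"
    " MASTER_PASSWORD='secret',"
    " MASTER_LOG_FILE='mysql-bin.000123',"
    " MASTER_LOG_POS=4567"
)
cur.execute("START SLAVE")
```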