
All these super expensive, fancy enterprise tools: CloudWatch, AppDynamics, Grafana, Splunk and whatnot. Spent a month trying to figure out why the fuck the app wasn't performing well.

Took 1 day with tcpdump, awk and GNU utils to figure it out.

Should anyone need a tcpdump analyzer -- try my awk script. It shows the response time of each network call w/o impacting app performance :)
https://gist.github.com/netikras/...
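
Not the gist itself, just a stripped-down sketch of the idea (the port number and tcpdump field layout are assumptions -- adjust to your capture):

    # Capture on the box first; tcpdump doesn't touch the app itself.
    # Port 8080 is a placeholder for the real backend port:
    #   tcpdump -tt -nn -i any 'tcp port 8080' > dump.txt
    #
    # With -tt -nn: $1 = epoch timestamp, $3 = src addr.port,
    # $5 = dst addr.port with a trailing ':'.
    awk '
        # packet towards the backend: remember when it left
        $5 ~ /\.8080:$/ { start[$3] = $1 }
        # first packet coming back on the same socket: print the delta
        # (any packet back counts, so treat the numbers as rough)
        $3 ~ /\.8080$/ {
            peer = substr($5, 1, length($5) - 1)   # strip trailing ":"
            if (peer in start) {
                printf "%.6f s  %s\n", $1 - start[peer], peer
                delete start[peer]
            }
        }
    ' dump.txt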

Comments
  • 1
    You actually need to provide data to those tools, e.g. Grafana.

    It just displays what you feed it.
    They are not for profiling.
  • 1
    @KDSBest CW and Grafana are not. AppD, among other things, is.
  • 0
    @netikras never heard of it, but I don't do Java
  • 1
    @KDSBest neither of them is Java-related
  • 1
    If I need deep analysis or have to do some stuff with log lines, good old GNU tools are my friends too. Showing response times should be easy to visualise with the tools you mentioned, though.
  • 1
    @hjk101 they should. And they are. Problem is, RTs in Splunk are not logged [if enabled, logging slows down the app significantly due to the high request rate], and AppD's granularity is 1 minute, which is no good when I need to find a problem that occurs every 1 min. Grafana has the granularity, but cannot monitor the faulty component's traffic because of how it's routed internally.
  • 1
    Btw, if interested - the problem was an AWS Redis. For some weird reason, one of the 9 Redis nodes slowed down RTs every 60 seconds. The other nodes in the cluster worked OK. Raised an AWS SR - I need explanations.
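
    If you want to confirm a bad node from the client side, something along these lines works (hostnames are placeholders for the actual cluster endpoints):

        # Probe every node long enough to span the 60-second cycle;
        # --latency-history prints min/avg/max samples every ~15 s:
        for node in redis-node-{1..9}.example.internal; do
            echo "== $node =="
            timeout 65 redis-cli -h "$node" -p 6379 --latency-history
        done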
  • 1
    @netikras sounds like you're missing some observability. Automate it, get it into CW or Grafana, and start alerting on it.
  • 1
    @lungdart AWS support said these slowdowns are due to QUITs being sent more often. It never occurred to me that QUITs would have such an effect...
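
    A rough way to see the QUITs on the wire, btw -- RESP sends command names in plain ASCII, so a dumb grep is enough (default Redis port assumed):

        # Count QUIT frames hitting this node over one minute.
        # Matches any payload containing "QUIT", so treat it as an estimate:
        timeout 60 tcpdump -nn -A 'tcp port 6379' 2>/dev/null | grep -c QUIT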
  • 1
    @netikras but only on a single instance? That seems strange.
  • 1
    @lungdart I know. The SR is still open.

    As for the lack of metrics - it's a huge corp and a large app, so granularity is not gonna change that easily
  • 1
    @lungdart alerting is possible, but for alerting to be effective I either need crazy metric granularity or to know what I'm hunting for. Right now I have neither
  • 1
    @netikras you could monitor QUITs. If you have TLS termination or a service mesh, you should be able to filter on that.

    Send that metric to CloudWatch and alert if you see 10 in a 5-minute window. Maybe even log the client sending the requests.

    Maybe your load balancer is pinning a single client that uses QUITs to one node. Or maybe a node has been compromised...
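
    Roughly like this (namespace, metric and alarm names are made up for the example, and $quit_count / $sns_topic_arn are assumed to come from your own collection script):

        # Ship the QUIT count as a custom metric...
        aws cloudwatch put-metric-data \
            --namespace "Custom/Redis" \
            --metric-name QuitCount \
            --value "$quit_count" \
            --unit Count

        # ...and alarm on 10+ QUITs summed over a 5-minute window:
        aws cloudwatch put-metric-alarm \
            --alarm-name redis-quit-spike \
            --namespace "Custom/Redis" \
            --metric-name QuitCount \
            --statistic Sum \
            --period 300 \
            --evaluation-periods 1 \
            --threshold 10 \
            --comparison-operator GreaterThanOrEqualToThreshold \
            --alarm-actions "$sns_topic_arn"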
  • 0
    @lungdart yeah, thing is, the investigation is still in progress :) perhaps there's something else I need to monitor as well, besides QUITs ;)

    It's too early for that.