
All these super expensive, fancy enterprise tools: CloudWatch, AppDynamics, Grafana, Splunk and whatnot. Spent a month trying to figure out why the fuck the app wasn't performing well.

Took 1 day with tcpdump, awk and GNU utils to figure it out.

Should anyone need a tcpdump analyzer -- try my awk script. It shows the response time of each network call w/o impacting app performance :)
https://gist.github.com/netikras/...
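
Not the gist itself, just a stripped-down sketch of the idea (the port number and tcpdump field layout are assumptions -- adjust to your capture):

    # Capture on the box first; tcpdump doesn't touch the app itself.
    # Port 8080 is a placeholder for the real backend port:
    #   tcpdump -tt -nn -i any 'tcp port 8080' > dump.txt
    #
    # With -tt -nn: $1 = epoch timestamp, $3 = src addr.port,
    # $5 = dst addr.port with a trailing ':'.
    awk '
        # packet towards the backend: remember when it left
        $5 ~ /\.8080:$/ { start[$3] = $1 }
        # first packet coming back on the same socket: print the delta
        # (any packet back counts, so treat the numbers as rough)
        $3 ~ /\.8080$/ {
            peer = substr($5, 1, length($5) - 1)   # strip trailing ":"
            if (peer in start) {
                printf "%.6f s  %s\n", $1 - start[peer], peer
                delete start[peer]
            }
        }
    ' dump.txt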

Comments
  • 1
    You actually need to provide data to those tools, e.g. Grafana.

    It just displays what you feed it.
    They are not for profiling.
  • 1
    @KDSBest CW and Grafana are not. AppD, among other things, is.
  • 0
    @netikras never heard of it, but I don't do Java
  • 1
    @KDSBest neither of them is Java-related
  • 1
    If I need deep analysis or have to do some stuff with log lines, good old GNU tools are my friends too. Showing response times should be easy to visualise with the tools you mentioned, though.
  • 1
    @hjk101 they should. And they are. Problem is, RTs in Splunk are not logged [if enabled, logging slows down the app significantly due to the high request rate], and AppD's granularity is 1 minute, which is no good when I need to find a problem that occurs every 1 min. Grafana has the granularity, but cannot monitor the faulty component's traffic because of how it's routed internally.
  • 1
    Btw, if interested - the problem was an AWS Redis. For some weird reason, one of the 9 Redis nodes slowed down RTs every 60 seconds. The other nodes in the cluster worked OK. Raised an AWS SR - I need explanations.
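
    If you want to confirm a bad node from the client side, something along these lines works (hostnames are placeholders for the actual cluster endpoints):

        # Probe every node long enough to span the 60-second cycle;
        # --latency-history prints min/avg/max samples every ~15 s:
        for node in redis-node-{1..9}.example.internal; do
            echo "== $node =="
            timeout 65 redis-cli -h "$node" -p 6379 --latency-history
        done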
  • 1
    @netikras sounds like you're missing some observability. Automate it, get it into CW or Grafana, and start alerting on it.
  • 1
    @lungdart AWS support said these slowdowns are due to QUITs being sent more often. It never occurred to me that QUITs would have such an effect...
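
    A rough way to see the QUITs on the wire, btw -- RESP sends command names in plain ASCII, so a dumb grep is enough (default Redis port assumed):

        # Count QUIT frames hitting this node over one minute.
        # Matches any payload containing "QUIT", so treat it as an estimate:
        timeout 60 tcpdump -nn -A 'tcp port 6379' 2>/dev/null | grep -c QUIT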
  • 1
    @netikras but only on a single instance? That seems strange.
  • 1
    @lungdart I know. The SR is still open.

    As for the lack of metrics - it's a huge corp and a large app, so granularity is not gonna change that easily
  • 1
    @lungdart alerting is possible, but for alerting to be effective I either need crazy metric granularity or to know what I'm hunting for. Right now I have neither
  • 1
    @netikras you could monitor QUITs. If you have TLS termination or a service mesh, you should be able to filter on that.

    Send that metric to CloudWatch and alert if you see 10 in a 5-minute window. Maybe even log the client sending the requests.

    Maybe your load balancer is pinning a single client that uses QUITs to one node. Or maybe a node has been compromised...
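
    Roughly like this (namespace, metric and alarm names are made up for the example, and $quit_count / $sns_topic_arn are assumed to come from your own collection script):

        # Ship the QUIT count as a custom metric...
        aws cloudwatch put-metric-data \
            --namespace "Custom/Redis" \
            --metric-name QuitCount \
            --value "$quit_count" \
            --unit Count

        # ...and alarm on 10+ QUITs summed over a 5-minute window:
        aws cloudwatch put-metric-alarm \
            --alarm-name redis-quit-spike \
            --namespace "Custom/Redis" \
            --metric-name QuitCount \
            --statistic Sum \
            --period 300 \
            --evaluation-periods 1 \
            --threshold 10 \
            --comparison-operator GreaterThanOrEqualToThreshold \
            --alarm-actions "$sns_topic_arn"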
  • 0
    @lungdart yeah, thing is, the investigation is still in progress :) perhaps there's something else I need to monitor as well, besides QUITs ;)

    It's too early for that.