2
Earu
2y

Quick question for you all: How do you deal with a problem in production that you cannot fix, even over an extended period of time (say 2 months)?

For context, I feel like I’m losing my sanity here: we’ve had this problem on our production API since the beginning of March this year. I’ve done a lot of testing and contacted various teams at my company to find any candidate cause that would explain the bug, but none panned out. Needless to say, I’ve spent a considerable amount of time searching the internet for others with the same or a similar problem. We’ve even opened a ticket with the cloud host to see if they had more details about the problem, without success. So how do you deal with that?

Comments
  • 0
    If it's hard to recreate, it can take time.

    I have had some elusive problems that took far longer than 2 months.

    It's very much a matter of how disruptive the error is for customers. If it's just a small annoyance, or only rarely affects a few customers, it will often be hard to find patterns.

    My most elusive error involved values in a database that became out of sync.

    But once we started to identify the problem, we realized it was actually some fixup code that triggered it, and that by then the data was already well outside its intended values.

    But the real cause we never found since it only happened a few times per year spread over a handful of our thousands of customers.

    And as far as we could determine, values had been deviating for weeks or months before ending up at an illegal value.

    Before that everything still worked.

    So the original error could be months back in time by the time we got notice of it and we no longer had the history available for a full trace.

    And doing full traces for all objects for all customers was not possible without serious redesign.
  • 2
    Log the shit out of it then read all of the logs.
  • 1
    Well.. I do already log everything 😅
  • 0
    It's kind of difficult to judge without knowing what the thing is and why it takes so long to fix. Maybe deleting it and rewriting from scratch is an option?
  • 0
    @Earu log even harder 😂. Maybe not just the path and return codes of called functions, but also the values of the parameters passed to those functions, etc.

    Get desperate and try desperate things; it usually helps. And don’t expect ppl to help at this point (2 months), it’s just you who can fix it 🤷🏻‍♂️
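
The suggestion above, logging parameter values and not just the call path and return codes, could be sketched roughly like this in Python. This is only an illustration: the thread never says what language or stack the API uses, and `log_calls`, the `"trace"` logger name, and `apply_discount` are all made-up names.

```python
import functools
import logging

# Hypothetical logger name; use whatever the service already has.
logger = logging.getLogger("trace")


def log_calls(func):
    """Log the arguments, return value, and any exception of every call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # logger.exception logs at ERROR level and attaches the traceback.
            logger.exception("fail %s args=%r kwargs=%r", func.__name__, args, kwargs)
            raise
        logger.debug("done %s -> %r", func.__name__, result)
        return result

    return wrapper


@log_calls
def apply_discount(price, percent):
    # Hypothetical business function, here only to show the decorator in use.
    return price * (1 - percent / 100)
```

Nothing is printed until logging is configured, for example with `logging.basicConfig(level=logging.DEBUG)`. Parameter values can contain personal data or secrets, so a decorator like this is probably best limited to the functions under suspicion.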