Sunday, September 11, 2022

A painful lesson I learned while putting out a fire

Imagine you have a service that is effectively down: in your production environment it returns 500s for every request. Yet the SLOs look fine, no alert has fired, and everything seems to be normal.

What do you do?

At first I didn't know what to do, but after the initial shock I had the - not that imaginative - idea to restart the service. For our production environment we use Shipit as a helper for these kinds of situations. As a long-time software engineer I am familiar with a lot of tools, but this was the first time I had looked at the Shipit user interface. So I went back to my roots, where we restarted services with a new deployment. That was arguably not the best decision, I admit, yet there I was, trying to put out a fire. The new deployment went through! Everything should be back up and running, smooth like a sailboat! Right. Right? Right??

Well, that should've been the case, but when I went back to check on the service, it still showed the same error. Bummer.

What to do now?

Let's check logs.

OK, logs show practically nothing. I mean, nothing for the past 10 minutes.

That's not good.

How will I know what's happening inside the container?

OK, calm down. This is a great challenge that you can solve. Embrace it.

Cool, let's look at the last thing the container logged!

Maximum call stack size exceeded.

That's not good. What could fail so badly that the service stops reporting?

Let's check the stack trace.

Something-something.. loggerInstance .. something-something.. winston..

OK, so something's wrong with logging?

Ouch.

So we don't know anything about the inner parts of the service, because the _logging_ has issues.

In the meantime I tried to redeploy again, with the same result: 500s after the deployment.

This is not looking good.

BUT! Teammates are there for you when you need them the most! Sergei taught me that Shipit has a "restart" feature!

OK, let's do that.

It works.

WHAT.

Damn.

So when we deployed, Shipit didn't actually change anything? Weird.

Let's check the Kubernetes pods.

They are old. They definitely didn't restart when I redeployed.
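
The "they are old" check can be made concrete by comparing each pod's start time with the time of the deployment. Here is a minimal sketch, assuming the pod list comes from `kubectl get pods -o json`, which reports `status.startTime` as an RFC 3339 timestamp. The function name, the sample pod names, and the timestamps are all made up for illustration:

```python
import json
from datetime import datetime, timezone


def pods_started_before(pod_list_json: str, deployed_at: datetime) -> list[str]:
    """Return names of pods whose start time predates the deployment."""
    stale = []
    for pod in json.loads(pod_list_json)["items"]:
        start_time = pod.get("status", {}).get("startTime")
        if start_time is None:
            continue  # no start time reported yet, nothing to compare
        # fromisoformat() before Python 3.11 does not accept a trailing "Z".
        started = datetime.fromisoformat(start_time.replace("Z", "+00:00"))
        if started < deployed_at:
            stale.append(pod["metadata"]["name"])
    return stale


# Hypothetical sample data shaped like kubectl's JSON output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "api-old"}, "status": {"startTime": "2022-09-01T08:00:00Z"}},
        {"metadata": {"name": "api-new"}, "status": {"startTime": "2022-09-11T12:30:00Z"}},
    ]
})
DEPLOYED_AT = datetime(2022, 9, 11, 12, 0, tzinfo=timezone.utc)
```

Any pod name this returns was already running before the deployment, which is a strong hint that the deployment did not replace it.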

TL;DR

Shipit has a "restart application" command that can be used to restart your application (isn't that weird?). This can resolve a production issue such as an OOM in cases where a redeployment does not do the trick.

Disclaimer: this might be specific to the current project, as Shipit can be customized heavily. Nevertheless, it was a good lesson: redeploying the same commit might not do the same thing as restarting the application.
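
A likely explanation, at least on plain Kubernetes: a Deployment only rolls out new pods when its pod template (`.spec.template`) changes, so applying the same commit again changes nothing. `kubectl rollout restart` gets around this by stamping a `kubectl.kubernetes.io/restartedAt` annotation onto the template. Whether Shipit's restart does the same thing in our setup is my assumption, not something I verified. The toy model below only illustrates the idea; the template contents and helper names are invented:

```python
import copy
from datetime import datetime, timezone

RESTART_ANNOTATION = "kubectl.kubernetes.io/restartedAt"


def triggers_rollout(current_template: dict, new_template: dict) -> bool:
    """Toy model: a rollout happens only when the pod template differs."""
    return current_template != new_template


def with_restart_stamp(template: dict, now: datetime) -> dict:
    """Return a copy of the template carrying a restart timestamp annotation."""
    stamped = copy.deepcopy(template)
    annotations = stamped.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[RESTART_ANNOTATION] = now.isoformat()
    return stamped


# Hypothetical pod template; the image name and tag are invented.
running = {
    "metadata": {"labels": {"app": "api"}},
    "spec": {"containers": [{"name": "api", "image": "registry.example/api:abc123"}]},
}

# Redeploying the same commit produces an identical template.
same_commit = copy.deepcopy(running)
# A restart changes only the annotation, which is enough to differ.
restarted = with_restart_stamp(running, datetime(2022, 9, 11, 12, 0, tzinfo=timezone.utc))
```

In this model `triggers_rollout(running, same_commit)` is false while `triggers_rollout(running, restarted)` is true, which matches what I saw: the redeploy left the old pods alone and the restart replaced them.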
