Add configuration to prevent docker-swarm from killing TCP connections to database #426
Comments
OK, my theory is that the Docker Swarm networking stack is killing the open TCP connections between the Zulip server and the postgres/memcached servers. We had a similar, much more fatal issue with RabbitMQ that was fixed last year (b312001). The symptom is the same as the service being restarted: the connections are killed, which Zulip will re-establish in each process when it discovers this (and send an error email), resulting in this random distribution of error emails. Googling suggests that other products have indeed had that sort of problem with Docker Swarm's aggressive killing of TCP connections. https://success.docker.com/article/ipvs-connection-timeout-issue seems to be their knowledge base article on the topic. @stratosgear can you try playing with the diagnostic steps described in that article to see if they suggest this is what's happening? Based on that doc, it looks like Docker Swarm itself doesn't support configuring its networking behavior of killing idle TCP connections :(. For memcached, lericson/pylibmc#199, https://sendapatch.se/projects/pylibmc/behaviors.html, and https://pypi.org/project/pylibmc/1.3.0/ suggest they have an undocumented option to set the keepalive settings. |
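For readers unfamiliar with TCP keepalive, here is a minimal sketch of the general technique using Python's standard `socket` module. It is not Zulip's or pylibmc's actual code; the helper name `enable_keepalive` and the timing values are assumptions for illustration. The only requirement is that the idle time before the first probe be shorter than whatever idle timeout the network layer applies (the Docker article linked above describes that timeout).

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, count=3):
    """Turn on TCP keepalive probes for an existing socket.

    idle, interval and count are illustrative values (assumptions), not
    settings taken from Zulip, pylibmc, or Docker.
    """
    # Ask the kernel to send keepalive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs below are Linux-specific, so guard on availability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        # Seconds between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
sock.close()
```

With probes flowing on an otherwise idle connection, an intermediate layer that drops idle connections should see traffic often enough to keep its entry alive. Client libraries usually expose their own options for this rather than a raw socket.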
Wow, this is getting way too low level for me. As I understand it, the connection drops and then re-establishes. I have never seen any issues on the actual running instance of Zulip, so this is not catastrophic. So I would call this a self-healing, auto-recovering issue. Is there any way I can turn off these messages, though, to keep them from cluttering my inbox? |
Just another bit of "trivia" worth mentioning... Overnight, when no human users are active on our private Zulip instance, no error messages are emitted. But as soon as people start using it in the morning, and throughout the day, the messages resume. Does this fit with how you think this error arises...? |
Yes, because what's happening is the network connections between Zulip processes and the database/cache are dropped by Docker Swarm when idle, and when the first activity happens, the Zulip processes try to use the connection, get an error, and then retry (successfully). @andersk FYI; I wonder if it's worth trying to restructure the way we handle errors accessing the database to avoid sending an exception email when it only happens once and we succeed on retry? |
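The "only report if the retry also fails" idea can be sketched as follows. This is a hypothetical illustration, not Zulip's actual error handling: `call_with_one_retry`, `FlakyConnection`, and the logger name are all made up for the example.

```python
import logging

logger = logging.getLogger("retry_sketch")  # hypothetical logger name


def call_with_one_retry(operation, reconnect, retriable=(ConnectionError,)):
    """Run operation(); on a connection error, reconnect and try once more.

    The first failure is logged quietly. Only a failure of the retry
    propagates to the caller, where it would trigger an error report.
    """
    try:
        return operation()
    except retriable as exc:
        logger.info("Connection lost (%s); reconnecting and retrying once", exc)
        reconnect()
        return operation()


class FlakyConnection:
    """Stand-in for a connection that was dropped while idle."""

    def __init__(self):
        self.alive = False
        self.reconnects = 0

    def reconnect(self):
        self.alive = True
        self.reconnects += 1

    def query(self):
        if not self.alive:
            raise ConnectionResetError("connection reset by peer")
        return "ok"


conn = FlakyConnection()
result = call_with_one_retry(conn.query, conn.reconnect)
```

Here the first `query()` raises, the wrapper reconnects, and the second attempt succeeds, so no exception reaches the caller. A real implementation would also need to decide which operations are safe to repeat.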
I am not sure these are only timeouts... Look at the frequency of these messages: From 11:17 to 11:20 there is a pair of identical error messages, and then they repeat immediately at 11:20 again. Are you keeping the connections open, or do you expect them to remain open for more than three minutes? I am not sure I follow why a timeout issue would cause this, or what kind of connections you expect to always be open/connected. I have never seen any similar connection problems with any other stack I have worked with. Hopefully this timestamped error log might help a bit! |
@timabbott Is there a way to turn off these emails please, please, please....? I have been deleting 150-200 emails every day for the last 2-3 months. Everything seems to be working fine in zulip, but these insistent messages are driving me crazy! Thanks!!! |
@stratosgear yeah, with that frequency it's clear that your memcached cache is just nonfunctional. Zulip is designed to work even if the cache is down, but it's definitely going to be a performance issue (And the Python memcached module has this terrible setup where the exception is different every time, so our rate-limiting of a given error email doesn't work with it). The issue is almost certainly that the memcached authentication that Zulip sets up is broken, and your server cannot connect to memcached. I think zulip/zulip#14925 may help. |
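To illustrate why differing exception text defeats per-error rate limiting, and one possible workaround, here is a rough sketch. It is not Zulip's actual rate limiter; `ErrorEmailLimiter` and its parameters are hypothetical, and it assumes the variation between messages lies in embedded numbers or addresses.

```python
import re
import time


class ErrorEmailLimiter:
    """Suppress repeated error emails that share a normalized fingerprint."""

    def __init__(self, min_interval_seconds=600, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock
        self.last_sent = {}

    @staticmethod
    def fingerprint(exc):
        # Replace hex addresses and numbers with a placeholder so that
        # messages differing only in those details map to the same key.
        text = re.sub(r"0x[0-9a-fA-F]+|\d+", "#", str(exc))
        return f"{type(exc).__name__}:{text}"

    def should_send(self, exc):
        key = self.fingerprint(exc)
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # An equivalent error was reported recently.
        self.last_sent[key] = now
        return True


limiter = ErrorEmailLimiter()
first = limiter.should_send(RuntimeError("error 31 on connection 0x7fa3"))
second = limiter.should_send(RuntimeError("error 47 on connection 0x1b2c"))
```

Keying on the raw message would treat these two errors as distinct and send both emails; keying on the normalized fingerprint sends only the first.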
Yes, I look forward to upgrading to something better. But my request was to find a way to stop these messages until that time. Is there no way to temporarily stop the messages until the next upgrade...? |
Well, you can always change |
That seems to have done the trick, thanks. Looking forward to the next release! |
Just want to bump this ticket. P.S. That Zulip instance is not used much, which I guess is why I don't get too many messages, but I'm afraid that when it gets a normal load I will have tons of the same errors. |
I'm continuing with the investigation of what could be the reason for the errors. I restarted Zulip on 2020-10-15 and it was working ok till yesterday. The backend error log is quite the same, with @timabbott maybe you have any new ideas?
|
@Cybernisk I think your issue is unrelated to what @stratosgear originally reported. If I had to guess, you have some operational issue with your server (memory corruption, OOM kills due to insufficient memory, etc.) that resulted in postgres processes crashing. I'll also add that you should upgrade to current Zulip; we've made a lot of improvements that you're currently not benefitting from. |
I think the original issue with memcached authentication reported here was fixed, and we've also added https://zulip.readthedocs.io/en/latest/production/troubleshooting.html#restrict-unattended-upgrades to document why one might get a bunch of error emails after a postgres upgrade (one per process). However, I think we may still have some follow-up to do related to the possibility that Docker Swarm's networking stack kills postgres connections in the original report: #426. @stratosgear can you confirm whether you're still seeing that symptom of postgres restarting daily in Docker Swarm? |
Unfortunately, I will not be able to test this for you :( Due to company policies, we were forced to switch to Teams, and I hate every minute of using it!!! I look forward to using Zulip in the future again though, as I believe it is definitely a superior product. Keep up the good job! |
It looks like the remaining problem was a Docker Swarm default configuration problem; a potential solution is suggested here: vapor/postgres-kit#164 (comment). I'll transfer this to the docker-zulip repository. |
On the latest 2.1.3 (but also on 2.1.1, and 2.1.2) I am frequently (multiple times per day, 10-20?) getting this error emailed to me:
The installed version appears to work correctly though.
In other issues I have posted here, I was advised that this is caused by restarted services, but that is not something I do: the Docker Swarm stack is not restarted, and I see the services having an uptime of more than one day (during which I have received these error messages).
I am running a single-node Docker Swarm with a docker-compose file identical to the one described in the docker-zulip repo.
For validation, here it is, with the secrets redacted:
Some hopefully useful info:
This is a "sister" issue to zulip/zulip#14456, which I also opened (but with another error message).
Anything else I can provide to help solve this...?