Leader stopped sending appendEntries to the follower, but heartbeat was working #612
Comments
Hi, thanks for the detailed description. We've not seen this occur before with any of our log stores. I mention that only as context: this is not a common or known way for this library to fail, which could mean you hit a rare bug or that your log store behaves a little differently from the ones other people are using. Some more questions that might help:
I'm not sure how to answer the other questions without the full story here. Are you able to share the logs from the leader and follower (filtered to just the raft library output, with timestamps) so we can help understand? This does sound like a potential bug. The one way I can think of that it could happen is if you have a network connectivity issue on the replication TCP connection: the replication loop could in theory get stuck waiting for a response indefinitely if it receives just enough packets to keep the connection alive but never a full response. This makes sense because the standard |
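To make the suspected failure mode concrete, here is a minimal Go sketch of a read that would block forever on a silent peer unless a read deadline is set. It is not taken from the raft library: `readWithDeadline`, `isTimeout`, and `demoTimeout` are hypothetical helper names, and `net.Pipe` only stands in for a TCP connection whose peer never sends a full response.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// readWithDeadline (hypothetical helper) reads one chunk from conn and
// gives up after timeout instead of blocking indefinitely.
func readWithDeadline(conn net.Conn, timeout time.Duration) (int, error) {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return 0, err
	}
	buf := make([]byte, 64)
	return conn.Read(buf)
}

// isTimeout reports whether err is a network timeout.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

// demoTimeout simulates a peer that keeps the connection open but never
// answers. Without the deadline, the Read call would never return.
func demoTimeout() bool {
	leaderSide, followerSide := net.Pipe()
	defer leaderSide.Close()
	defer followerSide.Close()

	_, err := readWithDeadline(leaderSide, 50*time.Millisecond)
	return isTimeout(err)
}

func main() {
	fmt.Println("read timed out:", demoTimeout())
}
```

If a replication RPC were issued without any such deadline, a stalled response could leave that one goroutine blocked while unrelated traffic on other connections keeps flowing. Whether that is what happened here is still only a guess until the logs are checked.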
Hi @banks, Thank you for the reply. To answer your question, we currently use
Leader Logs:
Follower Logs:
Meanwhile, I'll also check further from |
Hi, Thanks for the great library! I really appreciate the efforts you guys have taken to build this project.
Issue Description:
We recently encountered an issue where the leader stopped sending appendEntries to one follower but could still send heartbeats to it.
The screenshot below, taken from the metrics, shows the situation. Only one follower node was affected, and all other nodes were fine.
Let me describe the issue in more detail, and share my findings so far on the issue.
The above timeline shows the events that occurred at the time of the issue. During that period, I didn't get any warning or error from the leader.
The following are the stats on the follower node at 9.45:29.236
I'm referring to the code below,
I have the following questions.
When the "failed to append to logs" error occurred on the follower node, on the leader
atomic.StoreUint64(&s.nextIndex, max(min(s.nextIndex-1, resp.LastLog+1), 1))
should set s.nextIndex to 151153297 and try to replicate again from Index=151153297. But in this case the "appendEntries rejected, sending older logs" error didn't occur on the leader node. Does that mean the leader received resp.Success=true when an error occurred?
Does the leader consider that it has replicated logs up to Index 151153341 on the follower node? I ask because, according to the logs on the follower, it is looking for previousLog at index=151153341.
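To spell out the arithmetic behind that expectation, here is a small Go sketch of the quoted update. The `followerState` type, the `backOff` method, and the `minU64`/`maxU64` helpers are hypothetical names made up for illustration. The inputs are assumptions as well: s.nextIndex=151153342 (inferred from the follower looking for previousLog at 151153341) and resp.LastLog=151153296 (the value that would yield 151153297).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// followerState is a hypothetical stand-in for the leader's per-follower
// replication state; only nextIndex matters for this example.
type followerState struct {
	nextIndex uint64
}

func minU64(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

func maxU64(a, b uint64) uint64 {
	if a > b {
		return a
	}
	return b
}

// backOff applies the update quoted above after a rejected AppendEntries:
// step back by one, or jump to just past the follower's last log index,
// whichever is smaller, and never go below index 1.
func (s *followerState) backOff(respLastLog uint64) uint64 {
	cur := atomic.LoadUint64(&s.nextIndex)
	next := maxU64(minU64(cur-1, respLastLog+1), 1)
	atomic.StoreUint64(&s.nextIndex, next)
	return next
}

func main() {
	// Assumed values, not taken from the logs.
	s := &followerState{nextIndex: 151153342}
	fmt.Println("nextIndex after rejection:", s.backOff(151153296))
}
```

With those assumed inputs, min(151153341, 151153297) is 151153297, so the leader would retry from Index=151153297, but only if it actually received a response with Success=false.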
I don't understand why replication from the leader to the follower stopped. I don't see "removed peer, stopping replication" in the logs on the leader node.
Note: when a leadership transfer happened again, the new leader was able to restart the replication.