Not able to measure the time it takes for a log to go from being dispatched to being committed and applied to the FSM #510
Comments
Just adding this for completeness if others come across this. Overall I like the suggestions here though I have a few questions:
I'm not quite clear how this is different from the existing commitTime. Is it not including the write to disk on the leader? That would still include the time taken to write to disk on the followers, so it's maybe a little hard to explain the value of that 🤔. We already have timing for the
Names are hard... but it seems confusing to have two timers
My goal when writing this was to be able to pinpoint the source of slowness in raft. By adding
Yes, I agree,
Thanks Dhia. Happy to approve with that changed name. Happy to take a look at the follow-up PR for Consul's telemetry page where we explain the new metrics too!
Currently raft has multiple metrics that measure the time it takes for a log to go through different stages:
- `raft.commitTime`: measures the time it takes for a log to be written to disk and replicated
- `raft.fsm.enqueue`: measures the time it takes for a log to be enqueued to the queue of logs that are to be applied to the FSM
- `raft.fsm.apply`: measures the time it takes for a log to be applied to the FSM
Those metrics measure different stages of the log in raft separately, but they don't account for the time that a log spends in the FSM queue waiting to be applied.
Also, the replication time is aggregated with the disk write time, which makes it harder to deduce the cause of a high `raft.commitTime`. That said, a high `raft.commitTime` in conjunction with a high `raft.LastContact` can be safely attributed to the replication taking too long.
I suggest adding 3 metrics:
- `raft.replicationTime`: measures the time it takes to replicate the log to enough nodes
- `raft.applyTime`: measures the time from a log being dispatched until after it is applied to the FSM
- `raft.fsm.queueWait`: measures the time from a log being enqueued to the FSM queue until it is picked up to be applied to the FSM