I set up a long-running load test against linkerd-tcp using slow_cooker, with the following parameters: 3 nginx backends serving static content and 3 slow_cooker load testers with the following command-lines:
slow_cooker -host "perf-cluster" -qps 100 -concurrency 100 -interval 1m -reportLatenciesCSV linkerd-tcp.csv http://proxy-test-4d:7474
slow_cooker -host "perf-cluster" -qps 1 -concurrency 100 -interval 1s -hashValue 9745850253931700627 -hashSampleRate 0.1 http://proxy-test-4d:7474
slow_cooker -host perf-cluster -qps 1 -concurrency 100 -interval 1m -hashValue 9745850253931700627 -hashSampleRate 0.1 -compress http://proxy-test-4d:7474
So this is about (10k + 200) qps, with the latter two slow_cookers checking for invalid response bodies.
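For context, this is my assumption about how those flags add up (treating slow_cooker's -qps as per connection, so the effective rate is roughly qps × concurrency per instance), which is where the (10k + 200) figure comes from:

fn main() {
    // First slow_cooker: -qps 100 across -concurrency 100 connections.
    let heavy = 100 * 100; // ~10,000 rps
    // The other two: -qps 1 across -concurrency 100 connections each.
    let checkers = 2 * 100; // ~200 rps
    println!("approximate offered load: {} rps", heavy + checkers);
}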
After about 20 hours, linkerd-tcp stopped accepting new connections and stopped processing requests from existing connections. slow_cooker was reporting the following errors:
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:58016->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57899->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57902->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57906->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57853->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57912->10.240.0.21:7474: read: connection reset by peer
2017-03-30T16:51:02Z 0/0/55 6000 0% 1m0s 0 [ 0 0 0 0 ] 0 0
Get http://proxy-test-4d:7474: dial tcp 10.240.0.21:7474: getsockopt: connection timed out
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57907->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57908->10.240.0.21:7474: read: connection reset by peer
Interestingly, the admin port was still processing requests, and I set up a watch to look at its output every few seconds. Here's an example report. The process had a few more threads than I remember from the previous day.
3 of them were locked on the same mutex. Here's some output from rust-gdb. Unfortunately the process segfaulted before I captured backtraces from all the threads, but I do have some output I can share:
Attaching to process 13505
[New LWP 13506]
[New LWP 13507]
[New LWP 13508]
[New LWP 13509]
[New LWP 13510]
[New LWP 13511]
[New LWP 13512]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f0ee2f716a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f0ee1fff700 (LWP 13507))]
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
at ../nptl/pthread_mutex_lock.c:80
#2 0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3 0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4 0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5 0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6 std::sys_common::thread::start_thread ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7 std::sys::imp::thread::{{impl}}::new::thread_start ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8 0x00007f0ee344e6ca in start_thread (arg=0x7f0ee1fff700) at pthread_create.c:333
#9 0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f0ee1bfd700 (LWP 13509))]
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
at ../nptl/pthread_mutex_lock.c:80
#2 0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3 0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4 0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5 0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6 std::sys_common::thread::start_thread ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7 std::sys::imp::thread::{{impl}}::new::thread_start ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8 0x00007f0ee344e6ca in start_thread (arg=0x7f0ee1bfd700) at pthread_create.c:333
#9 0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f0ee19fc700 (LWP 13510))]
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 in ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
at ../nptl/pthread_mutex_lock.c:80
#2 0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3 0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4 0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5 0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6 std::sys_common::thread::start_thread ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7 std::sys::imp::thread::{{impl}}::new::thread_start ()
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8 0x00007f0ee344e6ca in start_thread (arg=0x7f0ee19fc700) at pthread_create.c:333
#9 0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
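All three backtraces are parked in __lll_lock_wait under pthread_mutex_lock, on the same mutex address (0x7f0ee200e150). I haven't traced which lock in linkerd-tcp that is, but the symptom matches a std::sync::Mutex whose owner never releases it. A minimal, self-contained sketch of that state (illustration only, not the linkerd-tcp code):

use std::sync::{Arc, Mutex};
use std::{mem, thread, time::Duration};

fn main() {
    let m = Arc::new(Mutex::new(0u32));

    // One thread acquires the lock and never releases it (the guard is leaked).
    {
        let m = Arc::clone(&m);
        thread::spawn(move || {
            let guard = m.lock().unwrap();
            mem::forget(guard); // the mutex now stays locked forever
        });
    }
    thread::sleep(Duration::from_millis(100));

    // Any other thread calling lock() now parks in pthread_mutex_lock /
    // __lll_lock_wait, which is what threads 3, 5, and 6 look like above.
    match m.try_lock() {
        Ok(_) => println!("unexpectedly acquired the lock"),
        Err(_) => println!("mutex still held; a blocking lock() here would hang"),
    }
}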
Some netstat output. I noticed that the Recv-Q for the listener has bytes in it. That's unexpected.
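For a socket in the LISTEN state, my understanding is that Recv-Q counts established connections sitting in the accept queue, so a non-zero value there fits an accept loop that has stopped calling accept(). A tiny standalone sketch of that symptom (not linkerd-tcp code):

use std::net::{TcpListener, TcpStream};
use std::{thread, time::Duration};

fn main() -> std::io::Result<()> {
    // A listener that never calls accept(), simulating a wedged event loop.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Clients still complete the TCP handshake; the kernel queues them.
    let mut clients = Vec::new();
    for _ in 0..5 {
        clients.push(TcpStream::connect(addr)?);
    }

    // While this sleeps, `ss -ltn` / `netstat -tln` shows a non-zero Recv-Q
    // for the listening socket: connections waiting to be accept()ed.
    thread::sleep(Duration::from_secs(30));
    drop(listener);
    Ok(())
}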
Running ltrace, there was movement on the process and something suspicious stood out:
read(4, 0x7fffd9e9dfb0, 128) = -1 EAGAIN (Resource temporarily unavailable)
lsof showed the following FD connected to 4:
linkerd-t 13505 stevej 4r FIFO 0,10 0t0 89104520 pipe
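FD 4 being a pipe makes that read look like the reactor draining a non-blocking wakeup/self-pipe that happens to be empty; EAGAIN there is expected behavior on its own rather than an error. A small sketch reproducing just that return value (assumes the libc crate, Linux):

use std::io;

fn main() -> io::Result<()> {
    let mut fds = [0i32; 2];
    // Create a non-blocking pipe, similar to the wakeup pipes event loops use.
    if unsafe { libc::pipe2(fds.as_mut_ptr(), libc::O_NONBLOCK) } != 0 {
        return Err(io::Error::last_os_error());
    }

    let mut buf = [0u8; 128];
    // Nothing has been written, so the read fails with EAGAIN, the same
    // shape ltrace reported: read(fd, buf, 128) = -1 EAGAIN.
    let n = unsafe { libc::read(fds[0], buf.as_mut_ptr() as *mut _, buf.len()) };
    assert_eq!(n, -1);
    assert_eq!(io::Error::last_os_error().raw_os_error(), Some(libc::EAGAIN));
    println!("empty non-blocking pipe: read() returned EAGAIN as expected");
    Ok(())
}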
Unfortunately, the process segfaulted after more ltracing, and most of my output was lost to a small tmux scrollback buffer.
Here's the segfault I did capture. Note that the activity here is the admin port writing out Prometheus stats to my curl request in a watch loop.
I will put together a tool to gather full output from rust-gdb, lsof, and some strace and ltrace samples for the next time this pops up.
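A rough sketch of what that tool might look like (a hypothetical helper, not an existing utility; strace/ltrace sampling would be additional Command invocations wrapped in a timeout):

use std::fs;
use std::process::Command;

// Hypothetical capture helper: grab full thread backtraces and open FDs from a
// running process and write them to files named after the pid.
fn capture(pid: u32) -> std::io::Result<()> {
    let pid_s = pid.to_string();

    // Non-interactive rust-gdb: backtraces from every thread.
    let gdb = Command::new("rust-gdb")
        .args(["-batch", "-ex", "thread apply all bt", "-p", pid_s.as_str()])
        .output()?;
    fs::write(format!("gdb-{}.txt", pid), &gdb.stdout)?;

    // Open file descriptors.
    let lsof = Command::new("lsof")
        .args(["-p", pid_s.as_str()])
        .output()?;
    fs::write(format!("lsof-{}.txt", pid), &lsof.stdout)?;

    Ok(())
}

fn main() -> std::io::Result<()> {
    let pid: u32 = std::env::args()
        .nth(1)
        .expect("usage: capture <pid>")
        .parse()
        .expect("pid must be a number");
    capture(pid)
}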