linkerd-tcp process hangs in mutex lock after long load test. #14

stevej opened this issue Mar 30, 2017 · 2 comments

stevej commented Mar 30, 2017

I set up a long-running load test against linkerd-tcp with slow_cooker: 3 nginx backends serving static content and 3 slow_cooker load testers with the following command lines:

slow_cooker -host "perf-cluster" -qps 100 -concurrency 100 -interval 1m -reportLatenciesCSV linkerd-tcp.csv http://proxy-test-4d:7474
slow_cooker -host "perf-cluster" -qps 1 -concurrency 100 -interval 1s -hashValue 9745850253931700627 -hashSampleRate 0.1 http://proxy-test-4d:7474
slow_cooker -host perf-cluster -qps 1 -concurrency 100 -interval 1m -hashValue 9745850253931700627 -hashSampleRate 0.1 -compress http://proxy-test-4d:7474

So this is roughly 10,200 qps in total (100 qps × 100 concurrency ≈ 10,000 from the first slow_cooker, plus about 100 each from the other two), with the second two slow_cookers also checking for invalid response bodies.

After about 20 hours, linkerd-tcp stopped accepting new connections and stopped processing requests on existing connections. slow_cooker was reporting errors like the following:

Get http://proxy-test-4d:7474: read tcp 10.240.0.4:58016->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57899->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57902->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57906->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57853->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57912->10.240.0.21:7474: read: connection reset by peer
2017-03-30T16:51:02Z      0/0/55 6000   0% 1m0s   0 [  0   0   0    0 ]    0      0
Get http://proxy-test-4d:7474: dial tcp 10.240.0.21:7474: getsockopt: connection timed out
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57907->10.240.0.21:7474: read: connection reset by peer
Get http://proxy-test-4d:7474: read tcp 10.240.0.4:57908->10.240.0.21:7474: read: connection reset by peer

Interestingly, the admin port was still serving requests, so I set up a watch to look at its output every few seconds. Here's an example report:

$ curl http://localhost:9989/metrics
metrics_count 12
success_count{proxy="default"} 79470054
connects{proxy="default", srv="0.0.0.0:7474"} 79470070
bytes_total{direction="tx", proxy="default"} 2382121311667
namerd_success_count{service="namerd"} 61943
conns_active{proxy="default"} 5
endpoints_retired{proxy="default"} 0
endpoints_ready{proxy="default"} 0
conns_established{proxy="default"} 0
conns_pending{proxy="default"} 0
endpoints_unready{proxy="default"} 3
poll_time_us{stat="count", proxy="default"} 6
poll_time_us{stat="mean", proxy="default"} 17.833333333333332
poll_time_us{stat="min", proxy="default"} 6
poll_time_us{stat="max", proxy="default"} 44
poll_time_us{stat="stddev", proxy="default"} 14.217555658019732
poll_time_us{stat="p50", proxy="default"} 6
poll_time_us{stat="p90", proxy="default"} 6
poll_time_us{stat="p95", proxy="default"} 6
poll_time_us{stat="p99", proxy="default"} 6
poll_time_us{stat="p999", proxy="default"} 6
poll_time_us{stat="p9999", proxy="default"} 6
namerd_request_latency_ms{stat="count", service="namerd"} 2
namerd_request_latency_ms{stat="mean", service="namerd"} 2
namerd_request_latency_ms{stat="min", service="namerd"} 2
namerd_request_latency_ms{stat="max", service="namerd"} 2
namerd_request_latency_ms{stat="stddev", service="namerd"} 0
namerd_request_latency_ms{stat="p50", service="namerd"} 2
namerd_request_latency_ms{stat="p90", service="namerd"} 2
namerd_request_latency_ms{stat="p95", service="namerd"} 2
namerd_request_latency_ms{stat="p99", service="namerd"} 2
namerd_request_latency_ms{stat="p999", service="namerd"} 2
namerd_request_latency_ms{stat="p9999", service="namerd"} 2

The process had a few more threads than I remembered from the previous day:

stevej   13505 79.0  0.8  71776 33100 pts/2    Sl+  Mar27 4006:56 ./linkerd-tcp-1490585634 example.yml
stevej   13505 56.2  0.8  71776 33100 pts/2    Sl+  Mar27 2854:06 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:01 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:01 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:01 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:01 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:07 ./linkerd-tcp-1490585634 example.yml
stevej   13505  0.0  0.8  71776 33100 pts/2    Sl+  Mar27   0:03 ./linkerd-tcp-1490585634 example.yml

Three of them were blocked on the same mutex. Here's some output from rust-gdb. Unfortunately, the process segfaulted before I could capture backtraces from all the threads, but I do have some output I can share:

Attaching to process 13505
[New LWP 13506]
[New LWP 13507]
[New LWP 13508]
[New LWP 13509]
[New LWP 13510]
[New LWP 13511]
[New LWP 13512]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f0ee2f716a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.

(gdb) thread 3
[Switching to thread 3 (Thread 0x7f0ee1fff700 (LWP 13507))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3  0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4  0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5  0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6  std::sys_common::thread::start_thread ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7  std::sys::imp::thread::{{impl}}::new::thread_start ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8  0x00007f0ee344e6ca in start_thread (arg=0x7f0ee1fff700) at pthread_create.c:333
#9  0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f0ee1bfd700 (LWP 13509))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3  0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4  0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5  0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6  std::sys_common::thread::start_thread ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7  std::sys::imp::thread::{{impl}}::new::thread_start ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8  0x00007f0ee344e6ca in start_thread (arg=0x7f0ee1bfd700) at pthread_create.c:333
#9  0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f0ee19fc700 (LWP 13510))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     in ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f0ee3450ebd in __GI___pthread_mutex_lock (mutex=0x7f0ee200e150)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x000056016c9a0389 in std::panicking::try::do_call::hb458ba233f0b55f7 ()
#3  0x000056016c9c2f7b in panic_unwind::__rust_maybe_catch_panic ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/lib.rs:98
#4  0x000056016c9a4893 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hc7b55c280fa43c1f ()
#5  0x000056016c9bad95 in alloc::boxed::{{impl}}::call_once<(),()> ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/liballoc/boxed.rs:624
#6  std::sys_common::thread::start_thread ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/thread.rs:21
#7  std::sys::imp::thread::{{impl}}::new::thread_start ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/thread.rs:84
#8  0x00007f0ee344e6ca in start_thread (arg=0x7f0ee19fc700) at pthread_create.c:333
#9  0x00007f0ee2f710af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105

Some netstat output. I noticed that the Recv-Q for the listener is non-zero, which is unexpected: for a socket in LISTEN state that column counts connections sitting in the accept queue, so it looks like the proxy has stopped calling accept().

sudo netstat -anp |head -n 2 && sudo netstat -anp |grep linkerd
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp      129      0 0.0.0.0:7474            0.0.0.0:*               LISTEN      13505/./linkerd-tcp
tcp        0      0 127.0.0.1:9989          0.0.0.0:*               LISTEN      13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48796        CLOSE_WAIT  13505/./linkerd-tcp
tcp        0      0 10.240.0.21:7474        10.240.0.14:60982       ESTABLISHED 13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48788        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48790        CLOSE_WAIT  13505/./linkerd-tcp
tcp        0      0 10.240.0.21:7474        10.240.0.14:34318       ESTABLISHED 13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48791        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48793        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48800        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48795        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48785        CLOSE_WAIT  13505/./linkerd-tcp
tcp        0      0 127.0.0.1:48284         127.0.0.1:4180          ESTABLISHED 13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48798        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48789        CLOSE_WAIT  13505/./linkerd-tcp
tcp       93      0 10.240.0.21:7474        10.240.0.4:48799        CLOSE_WAIT  13505/./linkerd-tcp
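
ss reports the same accept-queue state a bit more explicitly (for a LISTEN socket, Recv-Q is the current number of connections waiting to be accepted and Send-Q is the configured backlog); a filter like this narrows it to the proxy listener:

sudo ss -ltnp 'sport = :7474'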

Under ltrace there was still some movement in the process, and something suspicious stood out:
read(4, 0x7fffd9e9dfb0, 128) = -1 EAGAIN (Resource temporarily unavailable)
lsof showed the following for FD 4:
linkerd-t 13505 stevej 4r FIFO 0,10 0t0 89104520 pipe

Unfortunately, the process segfaulted after more ltracing, and most of my output was lost to a small tmux scrollback buffer.

Here's the segfault I did capture. Note that the activity here is the admin port writing out Prometheus stats in response to my curl request in the watch loop:

memcpy(0xe25fdc40, "\240\241\242\243\244\245\246\247\250\251\252\253", 12) = 0xe25fdc40
memcpy(0x7fffd9e9d192, "\377", 1)                                  = 0x7fffd9e9d192
memcpy(0xe25fdc4c, "\347", 1)                                      = 0xe25fdc4c
memcpy(0x7fffd9e9d193, "default", 7)                               = 0x7fffd9e9d193
memcpy(0xe25fdc4d, "\220\221\222\223\224", 5 <unfinished ...>
memcpy(0x7fffd9e9d19a, "\377", 1 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc4d
<... memcpy resumed> )                                             = 0x7fffd9e9d19a
memcpy(0xe25fdc52, "\347", 1 <unfinished ...>
memcpy(0x7fffd9e9d180, "conns_active", 12 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc52
<... memcpy resumed> )                                             = 0x7fffd9e9d180
memcpy(0xe25fdc53, "\230\231\232\233\234\235\236", 7 <unfinished ...>
memcpy(0x7fffd9e9d18c, "\377", 1 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc53
<... memcpy resumed> )                                             = 0x7fffd9e9d18c
memcpy(0xe25fdc5a, "\347", 1 <unfinished ...>
memcpy(0x7fffd9e9d18d, "proxy", 5 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc5a
<... memcpy resumed> )                                             = 0x7fffd9e9d18d
memcpy(0xe25fdc5b, "\b\t\n\v\f", 5 <unfinished ...>
memcpy(0x7fffd9e9d192, "\377", 1 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc5b
<... memcpy resumed> )                                             = 0x7fffd9e9d192
memcpy(0xe25fdc40, "\r\016\017", 3 <unfinished ...>
memcpy(0x7fffd9e9d193, "default", 7 <unfinished ...>
<... memcpy resumed> )                                             = 0xe25fdc40
<... memcpy resumed> )                                             = 0x7fffd9e9d193
memcpy(0xe25fdc43, "\020\021\022\023\024\025\026\027", 8)          = 0xe25fdc43
unexpected breakpoint at 0x56016c897de8
log2(8, 31, 0xe2121000, 1 <no return ...>
--- SIGSEGV (Segmentation fault) ---
<... log2 resumed> )                                               = 2
sigaction(SIGSEGV, { 0, <>, 0, 0 } <unfinished ...>
floor(0, 0xfffffc01, 0, 0)                                         = 2
pthread_mutex_lock(0xe2912288, 0xe2912280, 52, 0)                  = 0
<... sigaction resumed> , nil)                                     = 0
pthread_mutex_unlock(0xe2912288, 0xe20058e8, 0xe2914080, 0x6da8)   = 0
pthread_mutex_lock(0xe2912288, 0xe214bd00, 0xa0000, 0xe2912288)    = 0
--- SIGSEGV (Segmentation fault) ---
enable_breakpoint pid=13506, addr=0x56016c9ceca2, symbol=(null): No such process
pthread_mutex_unlock(Segmentation fault (core dumped)

I will put together a tool to gather full output from rust-gdb, lsof, and some strace and ltrace samples for the next time this pops up.
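
As a rough sketch, that tool could be a small shell script along these lines (the output layout and the 30-second trace windows are placeholders, not anything that exists yet):

#!/bin/sh
# debug-snapshot.sh <pid>: capture thread backtraces, open fds, socket state,
# and short strace/ltrace samples from a wedged linkerd-tcp process.
PID=$1
OUT="debug-$PID-$(date +%s)"
mkdir -p "$OUT"

# Full backtraces of every thread, non-interactively.
rust-gdb -p "$PID" -batch -ex 'set pagination off' -ex 'thread apply all bt' > "$OUT/gdb-backtraces.txt" 2>&1

# Open files, pipes, and sockets, plus per-socket queue state.
lsof -nP -p "$PID" > "$OUT/lsof.txt"
ss -tanp > "$OUT/ss.txt"

# Short syscall and library-call samples; 30 seconds is a guess at a useful window.
# ltrace goes last, since the process happened to segfault while being ltraced this time.
timeout 30 strace -f -tt -p "$PID" -o "$OUT/strace.txt"
timeout 30 ltrace -f -tt -p "$PID" -o "$OUT/ltrace.txt"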

stevej self-assigned this Mar 30, 2017

stevej commented Mar 30, 2017

Looking at a healthy linkerd-tcp, I also see 3 threads all waiting on the same mutex, so that by itself must not be what is causing this issue.


stevej commented Mar 30, 2017

Attached is a text file of rust-gdb backtraces for all threads on a healthy linkerd-tcp.

healthy_linkerd_gdb.txt
