Coredump on using AMDGPU #696
Comments
What OS are you on? Is this an official build of ROCm?
|
I ran into the same problem on Arch Linux. Previously my setup worked, but I think it stopped working after a ROCm update. When I tried libtree, I noticed libmiopen was not actually installed. Maybe the ROCm packages were split up and a dependency is missing. Installing miopen did not fix the issue but gives this libtree.
|
Hi, @laochailan. Can you try moving:
global libMIOpen_path = get_library(lib_prefix * "MIOpen"; rocm_path)
before line:
global libhsaruntime = if Sys.islinux()
    get_library("libhsa-runtime64"; rocm_path, ext="so.1")
else
    ""
end
in |
Also on Arch and also having the same issue. Moving the libMIOpen_path line doesn't seem to fix it. |
Update: moving the discovery of all libraries (rocblas, rocfft, rocsolver, etc.) before the hsaruntime one does the trick. Not sure what changed. I don't know what effect this might have on other platforms, but if you don't think it affects anything, I can submit a PR.
julia: /usr/src/debug/hip-runtime/clr-rocm-6.2.2/hipamd/src/hip_code_object.cpp:1152: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
[323653] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
unknown function (ip: 0x7e1f4f62d3f4)
gsignal at /usr/bin/../lib/libc.so.6 (unknown line)
abort at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x7e1f4f5bb3de)
__assert_fail at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x7e1ef6a50954)
unknown function (ip: 0x7e1e766ec8a8)
unknown function (ip: 0x7e1f4f79e5b6)
unknown function (ip: 0x7e1f4f79e6ac)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7e1f4f7a54fb)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7e1f4f7a5903)
unknown function (ip: 0x7e1f4f626f13)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7e1f4f79b678)
unknown function (ip: 0x7e1f4f6269f2)
dlopen at /usr/bin/../lib/libc.so.6 (unknown line)
ijl_load_dynamic_library at /cache/build/builder-demeter6-6/julialang/julia-master/src/dlload.c:365
jl_get_library_ at /cache/build/builder-demeter6-6/julialang/julia-master/src/runtime_ccall.cpp:45 [inlined]
jl_get_library_ at /cache/build/builder-demeter6-6/julialang/julia-master/src/runtime_ccall.cpp:29
ijl_lazy_load_and_lookup at /cache/build/builder-demeter6-6/julialang/julia-master/src/runtime_ccall.cpp:73
macro expansion at /home/fra/.julia/packages/AMDGPU/yqCEl/src/utils.jl:134 [inlined]
rocblas_create_handle at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/librocblas.jl:230
macro expansion at /home/fra/.julia/packages/AMDGPU/yqCEl/src/utils.jl:134 [inlined]
create_handle at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/rocBLAS.jl:36 [inlined]
#14 at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:103 [inlined]
#5 at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:29
lock at ./lock.jl:232
check_cache at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:27 [inlined]
pop! at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:48 [inlined]
new_state at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:102
#18 at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:115 [inlined]
get! at ./dict.jl:458
library_state at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:115
lib_state at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/rocBLAS.jl:48 [inlined]
gemm! at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/wrappers.jl:562 [inlined]
generic_matmatmul! at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/highlevel.jl:178
generic_matmatmul! at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/highlevel.jl:148 [inlined]
_mul! at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:287 [inlined]
mul! at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:285 [inlined]
mul! at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:253 [inlined]
* at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:124
unknown function (ip: 0x7e1f42f27da6)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
eval_user_input at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:245
repl_backend_loop at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:342
#start_repl_backend#59 at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:327
start_repl_backend at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:324
#run_repl#72 at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:483
run_repl at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:469
jfptr_run_repl_10088 at /usr/share/julia/compiled/v1.11/REPL/u0gqU_GYsA8.so (unknown line)
#1139 at ./client.jl:446
jfptr_YY.1139_14649 at /usr/share/julia/compiled/v1.11/REPL/u0gqU_GYsA8.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-demeter6-6/julialang/julia-master/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1055 [inlined]
invokelatest at ./essentials.jl:1052 [inlined]
run_main_repl at ./client.jl:430
repl_main at ./client.jl:567 [inlined]
_start at ./client.jl:541
jfptr__start_72144.1 at /usr/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
true_main at /cache/build/builder-demeter6-6/julialang/julia-master/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-demeter6-6/julialang/julia-master/src/jlapi.c:1059
main at julia (unknown line)
unknown function (ip: 0x7e1f4f5bce07)
__libc_start_main at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 6981943 (Pool: 6981676; Big: 267); GC: 9
zsh: IOT instruction (core dumped) julia
It seems to be the rocblas call that is giving issues. If I do elementwise multiplication it works. However, upon calling exit(), I get a segfault. Definitely something fishy going on. |
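One way to check the load-order observation above outside of AMDGPU.jl is to open the libraries by hand in a fresh Julia session. The following is only a minimal sketch: the helper `probe_load_order` is hypothetical, and the library names are assumptions based on the ones mentioned in this thread (they may differ on your install). If the assertion seen in the backtrace is triggered by load order, the crash would be expected during one of the `dlopen` calls rather than as a catchable Julia error.

```julia
using Libdl

# Try to dlopen each library in the given order and report the outcome.
# Returns a vector of `name => loaded` pairs.
function probe_load_order(names::Vector{String})
    results = Pair{String,Bool}[]
    for name in names
        # throw_error=false makes dlopen return `nothing` instead of throwing.
        handle = Libdl.dlopen(name; throw_error=false)
        if handle === nothing
            println("could not load ", name)
        else
            println("loaded ", name, " from ", Libdl.dlpath(handle))
        end
        push!(results, name => (handle !== nothing))
    end
    return results
end

# Assumed library names; adjust them to what your ROCm install provides.
probe_load_order(["librocblas", "libhsa-runtime64.so.1"])
```

Running it once with this order and once with the two names swapped (each time in a new process) would show whether the order alone makes a difference on a given system.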
Whatever it is, it got solved by downgrading ROCm to 6.0.2. I don't know if this is something Arch-specific. |
I have the exact same error as @ffrancesco94, also on Arch, and downgrading to 6.0.2 didn't help. |
Also same problem here (happens immediately on |
Are these segfaults reproducible with C code, e.g. by creating a rocBLAS handle and doing a matmul? |
I tried to run the
It does seem that the error stems from the blas call but it's not outright segfaulting. |
Can you try running these scripts with |
Hi! Sorry for the late reply. If I do that, I get a segfault right away (however, my GPU is from an older series, so I'm not sure that env variable applies).
EDIT: If I enforce
EDIT 2: I forgot to mention that I have the feeling that this issue is Arch-specific. I have been using a cluster recently with ROCm 6.2.x and I can load AMDGPU.jl without any hiccups. |
Hoping somebody who understands HIP/ROCm better than me can help me understand what's going on here.
Using the version you get when you use "add AMDGPU", I get a core dump instantly.
By going into src/discovery/discovery.jl and moving
up to the top (it needs to come before libhsa gets loaded; move it one line lower and the core dumps return):
the core dumps stop and everything seems to work normally.
Anybody have any ideas? Thanks for any help.
commit:
AMDGPU.versioninfo()
GDB backtrace:
rocminfo: