Improve cpu info for non-x86 architectures #9
Comments
Yes, that comment was a little out of date. You're right. I'll have to figure out how to do feature detection. Would also be good to know if 32 floating point registers are the norm. That would help with a lot of operations like BLAS. |
I think all AArch64/arm64 with SIMD will support Float64, so you'd be okay to make that particular assumption based on the architecture. AArch32 is more complicated since it covers many more versions of ARM. Based on the ARMv8 spec, I think some AArch32 processors can run SIMD operations on regular registers as well as SIMD registers, which complicates things further. I'm pretty sure Raspberry Pis have been AArch32 for a long time. I'm not sure if there are new ones that are AArch64, but the vast majority will be AArch32. That said, AArch64 is going to be a huge new user base, since Amazon is heavily advertising their new Graviton2 processors, and Apple is switching to it for seemingly all of their products. Not sure about the number of registers. The ARMv8-A spec is here: https://documentation-service.arm.com/static/5f20515cbb903e39c84dc459 but that's only one spec. For feature detection on Linux, the kernel provides |
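A minimal sketch of the architecture-based assumption described above. `assume_simd_float64` is a hypothetical helper name, not something any package provides; `Sys.ARCH` is the standard Julia constant holding the host architecture as a `Symbol`. Treating everything other than AArch64 and x86_64 as unsupported is an assumption made here to stay conservative.

```julia
# Hypothetical helper: decide from the architecture alone whether to assume
# SIMD Float64 support. The architecture is an argument so it can be tested
# without running on that hardware.
function assume_simd_float64(arch::Symbol = Sys.ARCH)
    if arch === :aarch64
        return true   # per the discussion: AArch64 SIMD is assumed to cover Float64
    elseif arch === :x86_64
        return true   # SSE2 (with Float64 vectors) is part of the x86_64 baseline
    else
        return false  # assumption: AArch32 and others stay off until detected
    end
end
```

Calling `assume_simd_float64()` with no argument checks the host; actual feature detection would still be needed to handle AArch32 properly.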
There's also this for feature detection:

using Libdl
llvmlib = Libdl.dlopen(only(filter(lib->occursin(r"LLVM\b", basename(lib)), Libdl.dllist())))
gethostcpufeatures = Libdl.dlsym(llvmlib, :LLVMGetHostCPUFeatures)
features_cstring = ccall(gethostcpufeatures, Cstring, ())
features = filter(ext -> (m = match(r"\d", ext); isnothing(m) ? true : m.offset != 2 ), split(unsafe_string(features_cstring), ','))

But from what I recall of seeing discussions on Julia issues/PRs, this is going to be incomplete with ARM. I'll still have to see how to translate |
I think you can find cache info from the same kernel function. Not sure about the register information. @PallHaraldsson answered a related question on Quora, so he might have some suggestions. I'll note that the ARMv8 spec linked above does tell you exactly how many registers of what kind it has, so worst-case scenario you could hardcode the numbers by architecture. |
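Hardcoding by architecture might look something like the sketch below. `simd_register_count` and its `avx512` keyword are hypothetical names for illustration. The AArch64 and x86_64 counts are the architectural vector register counts; the fallback of 8 is an assumption chosen to err on the low side, since overshooting the register count is what leads to spills.

```julia
# Hypothetical lookup of the number of architectural SIMD registers.
function simd_register_count(arch::Symbol = Sys.ARCH; avx512::Bool = false)
    if arch === :aarch64
        return 32                 # V0-V31, 128 bits each
    elseif arch === :x86_64
        return avx512 ? 32 : 16   # 32 vector registers with AVX-512, otherwise 16
    else
        return 8                  # assumption: conservative fallback for unknown targets
    end
end
```

The `avx512` flag would have to come from real feature detection; here it is simply passed in by the caller.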
I think hard coding the number of registers would be worth it. For cross platform cache information, perhaps we should depend on Hwloc.jl? Although I'm not opposed to making OS-specific performance optimizations (already using faster SIMD special functions on Linux). |
That's a good idea! Looks like it needs some fixes first though: JuliaParallel/Hwloc.jl#31 |
I believe I am struggling with the same issue as others have reported: errors compiling LoopVectorization on a Jetson Nano.
Digging deeper,
Subsequently (from topology.jl, line 8)
I'd be happy to help fix this! |
I shouldn't assume that. Lots of non-x86 CPUs only have 2 levels of cache, including the M1. |
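One way to avoid assuming a fixed number of cache levels is to ask the OS what is there. The sketch below reads the Linux sysfs cache directory and returns whatever levels it finds. `parse_cache_size` and `cache_sizes` are hypothetical helper names, the default path is the usual Linux sysfs location, and some boards may not populate it at all, in which case the result is simply empty.

```julia
# Parse sysfs cache size strings such as "32K" or "2048K" into bytes.
# Returns `nothing` if the string is not in the expected form.
function parse_cache_size(s::AbstractString)
    m = match(r"^(\d+)([KMG]?)$", strip(s))
    m === nothing && return nothing
    n = parse(Int, m.captures[1])
    unit = m.captures[2]
    return unit == "K" ? n << 10 : unit == "M" ? n << 20 : unit == "G" ? n << 30 : n
end

# Collect level => size in bytes for the data and unified caches under `dir`.
# Instruction caches are skipped. Returns an empty Dict if `dir` is missing.
function cache_sizes(dir::AbstractString = "/sys/devices/system/cpu/cpu0/cache")
    sizes = Dict{Int,Int}()
    isdir(dir) || return sizes
    for entry in readdir(dir)
        startswith(entry, "index") || continue
        p = joinpath(dir, entry)
        all(isfile, joinpath.(p, ("level", "type", "size"))) || continue
        strip(read(joinpath(p, "type"), String)) == "Instruction" && continue
        level = tryparse(Int, strip(read(joinpath(p, "level"), String)))
        bytes = parse_cache_size(read(joinpath(p, "size"), String))
        (level === nothing || bytes === nothing) && continue
        sizes[level] = bytes
    end
    return sizes
end
```

With this, the number of levels is `isempty(sizes) ? 0 : maximum(keys(sizes))` rather than a hardcoded 3, and a missing level shows up as an absent key instead of a wrong value.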
You're welcome to make a PR! Or, for |
Currently, it uses a generic build script.
This script assumes:
If any of these are violated, dependent libraries (e.g., LoopVectorization) are likely to produce suboptimal code. If these numbers undershoot, that would just mean some performance is left on the table, but it's likely to perform reasonably well.
If these numbers overshoot, performance consequences could be dire. Register spills galore.
I believe some ARM CPUs do not have SIMD Float64, so perhaps this should be handled somehow.
Ideally, we'd use a library like CpuId.jl to query hardware info, like we do for AMD and Intel.