DSP Performance Comparison Numpy vs. Cython vs. Numba vs. Pythran vs. Julia (jochenschroeder.com)
3 points by cycomanic on Jan 18, 2021 | hide | past | favorite | 18 comments


That Julia benchmark is really quite unfortunate:

1. It treats all arrays as row-major, while Julia arrays are column-major. That is pretty unfair.

2. It creates lots of unnecessary copies, as slices in Julia do not return views (but you can easily make them do that).

3. It creates an output array with an abstract element type, `zeros(Complex, pols, L)`, which can seriously hurt performance.

4. Benchmarking should preferably be done with BenchmarkTools.jl, though I'm not sure if that makes a big difference here.
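For context on point 2: NumPy basic slices are views by default, so the Python baseline never pays for these copies, while in Julia you have to opt in with `@views`. A small NumPy check (illustrative only, not code from the post):

```python
import numpy as np

a = np.zeros((2, 10), dtype=np.complex128)
s = a[:, 2:5]           # basic slicing returns a view, not a copy
s[0, 0] = 1.0 + 2.0j    # writing through the view mutates the parent array
print(a[0, 2])          # (1+2j)
print(s.base is a)      # True: s shares a's memory
```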

It's ok to be a novice user of a new language, but when comparing and publishing benchmarks like this, I really think the author should solicit some feedback to avoid obvious rookie mistakes. I found no way of offering any feedback on that blog, however.


Fair enough; as I said, my experience with Julia is limited.

I did ask for feedback (there is a link to the repository under code, but I saw that I accidentally removed it in one push to the repository, so it might have not been there when you looked).

I've just checked your suggestions: I get a 2x improvement by going to zeros(Complex{Float64}, pols, L), and a bit more by changing the calculations to column-major, for an overall improvement of about 60x compared to numpy (using @views actually resulted in a slowdown). I will update the post, thanks.


That's good. I still cannot see the repository, though. I may be looking in the wrong place.

It's no problem being new to a language, I just found the conclusions a bit too 'conclusive', so to speak, in that case.

Views should definitely not slow this down, so there may be something off. Could you perhaps share some dummy input arrays? Can I generate them with rand, perhaps, if you can tell me the sizes?


Just FYI I've just updated the post with changes according to your comments. I was actually incorrect about views causing a slow-down (that was from a quick and dirty test on my laptop), they result in some speed-up (at least on my desktop).

I appreciate your feedback, and I did not mean to be overly critical of Julia, I find it an interesting language. I might have sounded so critical because I was actually expecting a bigger speed-up out of the box. I have read quite a few opinions that you get the speed of C with the convenience of Python, which probably set my expectations a bit too high.

I actually had similar "disappointment" the first time I used Cython, because using it and putting some type annotations in the function definitions did not speed things up at all. Goes to show you need to know what you're doing if you want performance.


I submitted an issue that gives a further >4x speedup with just simple straightforward code (no simd, fastmath or threads).

It's important to make everything a view, and to remember the dots in the right places, otherwise this is quite straightforward.

I believe you can get significant _further_ speedups with better simd vectorization and threading. I made this as simple as possible, to be similar to the python/cython code.


The repository is here https://gitlab.com/Jochen/jochen.gitlab.io

(you can find it under code in the navigation bar). The post is a jupyter notebook which can be found under content/blog

Regarding dummy arrays, you can just generate random arrays of complex values; that should normally not cause issues (although obviously the filter does not converge to anything). The size I used for the demo is (2, 200000), i.e. 2 polarisations and 100,000 symbols, 2 times oversampled.
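A minimal NumPy sketch of such a dummy input (the variable name `E` and the layout are assumptions based on this thread, not the post's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)
pols, nsymbols, oversampling = 2, 100_000, 2  # 2 polarisations, 2x oversampled
L = nsymbols * oversampling                   # 200_000 samples per polarisation

# Random complex samples: the adaptive filter won't converge on pure noise,
# but that is fine for benchmarking purposes.
E = rng.standard_normal((pols, L)) + 1j * rng.standard_normal((pols, L))
print(E.shape, E.dtype)  # (2, 200000) complex128
```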


I had the impression that the inputs were one 2D and one 3D array. Are they both complex? It also seemed like they did not have the same sizes or dimensionalities.


Ah yes, sorry, the filter array (wxy) is a 3D array of shape (2, 2, 21), initialised to 0 with wxy[0,0,21//2] = 1 and wxy[1,1,21//2] = 1, which corresponds to a perfect impulse response function. wxy should also be complex (at least in this implementation of the adaptive filter).
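In NumPy terms, the described initialisation would look like this (ntaps = 21 taken from the comment above; the variable names follow the thread):

```python
import numpy as np

ntaps = 21
# 2x2 MIMO filter with 21 complex taps per element, initialised so that
# the diagonal elements are unit impulses (a perfect impulse response).
wxy = np.zeros((2, 2, ntaps), dtype=np.complex128)
wxy[0, 0, ntaps // 2] = 1
wxy[1, 1, ntaps // 2] = 1
```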


Which version of Julia did you use? @view didn't actually speed up code until 1.5, where views became stack-allocated.


Well, view could speed up code, but it also had some overhead, so creating lots of views came with a cost that would sometimes outweigh the benefit.

But, yeah, version info would be interesting.


Interesting! But would there really be no perf improvements before 1.5, e.g. with memcpy?


I commented on this, right next to your question: "Well, view could speed up code, but it also had some overhead, so creating lots of views came with a cost that would sometimes outweigh the benefit."

So yes, there were indeed perf improvements previously; they were simply not as lightweight as they are now.


This is Julia 1.5.2.


Are the Cython and Pythran codes running in parallel? To do that with Julia: https://docs.julialang.org/en/v1/manual/multi-threading/

Or https://github.com/JuliaFolds/FLoops.jl


Also, the first run of "@time" includes compilation. So for an accurate runtime, you'd want to run it twice and use the second timing.


Doesn't the run before using @time take care of that?


Yes it does. There are still issues with benchmarking in global scope, though, which is why it is better to use BenchmarkTools.jl


Sorry, missed that.



