This document is the fifth of a set of notes, focusing on writing efficient Julia code. The notes are not meant to be particularly complete in terms of useful functions (Google and LLMs can now provide that quite well), but rather to introduce the language and consider key programming concepts in the context of Julia.
Given that, the document heavily relies on demos, with interpretation in some cases left to the reader.
Being able to time code is critical for understanding and improving efficiency.
With Julia, we need to pay particular attention to the effect of just-in-time (JIT) compilation on timing. The first time a function is called with a specific set of argument types, Julia will compile the method that is invoked. We generally don’t want to time the compilation, only the run time, assuming the function will be run repeatedly with a given set of argument types.
@time is a macro that will time some code. However, it’s better to use @btime from BenchmarkTools as that will run the code multiple times and will make sure not to count the compilation time.
function myexp!(x)
    for i in 1:length(x)
        x[i] = exp(x[i])
    end
end
n = Int(1e7)
y = rand(n);
@time myexp!(y) ## Compilation time included.
0.074654 seconds (2.62 k allocations: 178.000 KiB, 10.70% compilation time)
y = rand(n);
@time myexp!(y) ## Compilation time not included.
0.066447 seconds
using BenchmarkTools
y = rand(n);
@btime myexp!(y)
58.960 ms (0 allocations: 0 bytes)
How long does that loop take in R or Python? What about a vectorized solution in R or Python?
We can time a block of code, but I’m not sure what Julia does in terms of JIT for code that is not in functions. You may discover more in working on the fourth problem of PS2.
@btime begin
    y = 3
    z = 7
end
1.299 ns (0 allocations: 0 bytes)
7
Profiling involves timing each step in a set of code. One can use the Profile module to do this in Julia.
One thing to keep in mind when profiling is whether the timing for nested function calls is included in the timing of the function that makes the nested function calls.
using Profile
function ols_slow(y::Vector{<:Number}, X::Matrix{<:Number})
    xtx = X'X;
    xty = X'y;
    xtxinverse = inv(xtx); ## This is an inefficient approach.
    return xtxinverse * xty
end
n = Int(1e4)
p = 2000
y = randn(n);
X = randn((n,p));
## Run once to avoid profiling JIT compilation.
coefs = ols_slow(y, X);
Directly interpreting the Profile output can be difficult. In this case, if we ran the following code, we’d see very long, hard-to-interpret information.
@profile coefs = ols_slow(y, X)
Profile.print()
Instead, let’s try a visualization. There are other Julia packages for visualizing profiler output, and some might be better than this. (I tried ProfileView and liked StatProfilerHTML better.)
using ProfileView
@profview ols_slow(y, X)
using StatProfilerHTML
@profilehtml ols_slow(y, X)
@profilehtml produces [this output](statprof/index.html), which can in some ways be hard to interpret, but the color-coded division between `inv` and the matrix multiplications gives us an idea of where time is being spent. That output might not show up fully in the links - you might need to run the code above yourself.
In R (also with numpy arrays in Python), it’s a bad idea to iteratively increase the size of an object, such as doing this:
n <- 5000
x <- 1
for(i in 2:n)
    x <- c(x, i)
Python lists handle this much better by allocating increasingly large additional amounts of memory as the object grows when using .append().
Let’s consider this in Julia.
function fun_prealloc(n)
    x = zeros(n);
    for i in 1:n
        x[i] = i;
    end
    return x
end
function fun_grow(n)
    x = Float64[];
    for i in 1:n
        push!(x, i);
    end
    return x
end
using BenchmarkTools
n = 100000000
@btime x1 = fun_prealloc(n);
334.486 ms (2 allocations: 762.94 MiB)
@btime x2 = fun_grow(n);
1.718 s (23 allocations: 1019.60 MiB)
That indicates that it’s better to pre-allocate memory in Julia, but the time does not seem to grow as order of \(n^2\) as it does in R or with numpy arrays. So that suggests Julia is growing the array in a smart fashion.
We can verify that by looking at the memory allocation information returned by @btime.
For fun_prealloc, we see an allocation of ~800 MB, consistent with allocating an array of 100 million 8-byte floats. (It turns out the “second” allocation occurs because we are running @btime in the global scope.)
For fun_grow, we see 23 allocations totaling ~1 GB, consistent with Julia growing the array in a smart fashion, but with some additional memory allocation.
If the array were reallocated each time it grew by one, we’d allocate and copy \(1+2+\cdots+n = n (n+1)/2\) numbers in total over the course of the computation (but not all at once), which would take a lot of time.
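If the eventual length is known but one still wants to build the array with push!, one can reserve capacity up front with sizehint!. Here is a minimal sketch (not from the timing comparison above; fun_grow_hinted is just an illustrative name, and timings will vary by machine):
using BenchmarkTools
function fun_grow_hinted(n)
    x = Float64[]
    sizehint!(x, n)   ## ask Julia to reserve capacity for n elements up front
    for i in 1:n
        push!(x, i)
    end
    return x
end
@btime x3 = fun_grow_hinted(100000000);
With the hint, push! should no longer need to reallocate and copy as the array grows.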
As we’ve seen, the vectorized versions of functions have a dot after the function name (or before an operator).
x = ["spam", 2.0, 5, [10, 20]]
length(x)
4
length.(x)
4-element Vector{Int64}:
4
1
1
2
map(length, x)
4-element Vector{Int64}:
4
1
1
2
x = [2.1, 3.1, 5.3, 7.9]
x .+ 10
4-element Vector{Float64}:
12.1
13.1
15.3
17.9
x + x
4-element Vector{Float64}:
4.2
6.2
10.6
15.8
x .> 5.0
4-element BitVector:
0
0
1
1
x .== 3.1
4-element BitVector:
0
1
0
0
Unlike in Python or R, it shouldn’t matter for efficiency if you use a vectorized function or write a loop, because with Julia’s just-in-time compilation, the compiled code should be similar. (This assumes your code is inside a function.) So the main appeal of vectorization is code clarity and ease of writing the code.
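As a rough check (a sketch reusing myexp! and y from above; exact timings will vary by machine), one can compare the in-place loop to a vectorized broadcast. Note that exp.(y) allocates a new output array while myexp! modifies y in place, so the allocation counts will differ even if the per-element work is similar:
using BenchmarkTools
y = rand(Int(1e7));
@btime myexp!(y);    ## loop inside a function, operating in place
@btime z = exp.(y);  ## vectorized (broadcast) version, allocates the output array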
We can automatically use the dot vectorization with functions we write:
function plus3(x)
    return x + 3
end
plus3.(x)
4-element Vector{Float64}:
5.1
6.1
8.3
10.9
This invokes broadcast(plus3, args...).
Broadcasting will happen over multiple arguments if more than one argument is an array.
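For example, here is a small sketch (with a made-up function f, not from the notes) in which broadcasting recycles a scalar argument and combines a vector with a 1×2 matrix into a 3×2 result:
f(a, b) = a + 10*b        ## hypothetical two-argument function
f.([1, 2, 3], 100)        ## scalar second argument: returns a 3-element Vector
f.([1, 2, 3], [10 20])    ## 3-vector and 1×2 matrix: broadcast to a 3×2 Matrix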
Consider the difference between the following vectorized calls:
x = randn(5)
σ = 10;
y1 = x .+ σ .* randn.()
y2 = x .+ σ .* randn()
print((y1 - x) / σ)
print((y2 - x) / σ)
[0.14677616340308833, 0.581444839867881, -0.3787768358297307, 1.236223662587814, 0.3476772389317324][-0.2981459541428111, -0.2981459541428111, -0.2981459541428111, -0.2981459541428111, -0.2981459541428111]
That’s perhaps a bit surprising, given that one might think that because the multiplication is done first, σ .* randn.() would produce a scalar, as it does if you just run σ .* randn.() on its own.
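What happens is that the entire dotted expression is fused into a single elementwise operation, so randn.() is evaluated once per element of x, whereas randn() is evaluated once before broadcasting begins. Roughly speaking (a sketch of the semantics, not the actual lowered code), the two lines behave like:
## y1 = x .+ σ .* randn.(): a fresh draw for each element of x.
y1 = [xi + σ * randn() for xi in x]
## y2 = x .+ σ .* randn(): a single draw, reused for every element.
r = randn()
y2 = [xi + σ * r for xi in x]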
If one runs a vectorized calculation that involves multiple steps in a language like R or Python, there are some inefficiencies.
Consider this computation:
x = tan(x) + 3*sin(x)
If run as vectorized code in a language like R or Python, it’s much faster than using a loop, but it does have some downsides:

- Temporary arrays need to be created for the intermediate results (tan(x), sin(x), 3*sin(x)). (We can consider what the abstract syntax tree would be for that calculation.)
- Multiple passes are made over x to calculate tan(x), sin(x), etc. (For example, in R or Python/numpy, multiple for loops would get run in the underlying C code.)

In contrast, running via a for loop (in R or Python or Julia) avoids the temporary arrays and involves a single loop:
for i in 1:length(x)
    x[i] = tan(x[i]) + 3*sin(x[i])
end
Thankfully, Julia “fuses” the loops of vectorized code automatically when one uses the dot syntax for vectorization, so one shouldn’t suffer from the downsides of vectorization. One could of course use a loop in Julia, and it should be fast, but it’s more code to write and harder to read.
Let’s look at memory allocation when putting the code into a function:
function mymath(x)
    return tan(x) + 3*sin(x)
end
function mymathloop(x)
    for i in 1:length(x)
        x[i] = tan(x[i]) + 3*sin(x[i])
    end
    return x
end
n = 100000000;
x = rand(n);
@btime y = mymath.(x);
2.407 s (3 allocations: 762.94 MiB)
@btime y = mymathloop(x);
3.115 s (0 allocations: 0 bytes)
Note that it appears only ~800 MB (762.94 MiB; 1 MB ≈ 0.95 MiB) are allocated (for the output) in the (presumably) fused operation, rather than multiples of 800 MB for the various temporary arrays that one might otherwise expect to be created.
And in the loop, there is no allocation. We might expect some allocation of scalars, but those are probably handled differently than allocating memory for arrays off the heap. I’ve seen some information on how Julia handles allocation of space for immutable objects (including scalars and strings), but I haven’t had a chance to absorb that.
We can do addition or subtraction of two arrays, or multiplication/division of an array by a scalar, without the “dot” vectorization. However, as seen from the additional memory allocation here, the loop fusion is not done.
function mymath2(x)
    return 3*x + x/7
end
@btime y = mymath2(x);
1.078 s (6 allocations: 2.24 GiB)
In contrast, here we see only the allocation for the output object.
@btime y = mymath2.(x);
454.045 ms (3 allocations: 762.94 MiB)
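One way to get full fusion without writing every dot by hand is the @. macro, which adds a dot to every call and operator in the expression. A minimal sketch (mymath2_dotted is just an illustrative name; same x as above):
using BenchmarkTools
function mymath2_dotted(x)
    return @. 3*x + x/7   ## expands to 3 .* x .+ x ./ 7, so the operations fuse
end
@btime y = mymath2_dotted(x);
The allocation should then be just the single output array, as in the mymath2.(x) call above.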
Julia stores the values in a matrix contiguously column by column (and analogously for higher-dimensional arrays).
We should therefore access matrix elements within a column rather than within a row. Why is that?
When a value is retrieved from main memory into the CPU cache, a block of values will be retrieved, and those will generally include the values in the same column but (for large enough arrays) not all the values in the same row. If subsequent operations work on values from that column, the values won’t need to be moved into the cache. (This is called a “cache hit”).
Let’s first see if it makes a difference when using Julia’s built-in sum function, which can do the reduction operation on various dimensions of the array.
using Random
using BenchmarkTools
nr = 800000;
nc = 100;
A = randn(nr, nc); # long matrix
tA = randn(nc, nr); # wide matrix
function sum_by_column(X)
    return sum(X, dims=1)
end
function sum_by_row(X)
    return sum(X, dims=2)
end
@btime tmp = sum_by_column(A);
38.943 ms (1 allocation: 896 bytes)
@btime tmp = sum_by_row(tA);
42.923 ms (5 allocations: 976 bytes)
There’s little difference.
Are we wrong about how the cache works? Probably not; rather it’s probably that Julia’s sum() is set up to take advantage of how the cache works by being careful about the order of operations used to sum the rows or columns.
How could you program the for loops involved in row-wise summation to be efficient when a matrix is stored column-major, given how caching works? If you retrieve the data by column, how do you get the row sums? (One cache-friendly approach is sketched below, after the timing comparison.)
In contrast, if we manually loop over rows or columns, we do see a big (almost order-of-magnitude) difference.
@btime tmp = [sum(A[:,col]) for col in 1:size(A,2)];
133.829 ms (405 allocations: 610.36 MiB)
@btime tmp = [sum(A[row,:]) for row in 1:size(A,1)];
779.155 ms (4798474 allocations: 750.71 MiB)
So while one lesson is to code with the cache in mind, another is to use built-in functions that are probably written for efficiency.
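As a sketch of one answer to the question above (not the actual implementation of Julia's sum), the row sums can be accumulated while traversing the matrix column by column, so that the inner loop walks down contiguous memory:
function rowsums_colmajor(X)
    out = zeros(eltype(X), size(X, 1))
    for col in 1:size(X, 2)       ## outer loop over columns
        for row in 1:size(X, 1)   ## inner loop walks down one contiguous column
            out[row] += X[row, col]
        end
    end
    return out
end
@btime tmp = rowsums_colmajor(A);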
In your own work, can you think of an algorithm and associated data structures where one has to retrieve a lot of data and one would want to think about cache hits and misses? In general, the idea is that if you retrieve a value, you should try to make use of the nearby values at the same time, rather than retrieving those nearby values later on in the computation.
If we are storing an array of all the same type of values, these can be stored contiguously. That’s not the case with abstract types.
For example, values of the abstract type Real can vary in size.
a = Real[]
sizeof(a)
push!(a, 3.5)
sizeof(a)
push!(a, Int16(2))
sizeof(a[2])
sizeof(a)
16
And we see that having an array of Reals is bad for performance. As part of this, notice the additional allocation.
using LinearAlgebra
n = 100;
A = rand(n, n);
@btime tmp = A'A; # Equivalent to A' * A or transpose(A) * A.
32.629 μs (3 allocations: 78.19 KiB)
100×100 Matrix{Float64}:
36.9452 24.6027 29.9873 28.0731 … 28.1847 28.5245 24.4188 24.118
24.6027 27.4659 26.1276 23.3942 23.4948 22.6057 21.3434 22.1072
29.9873 26.1276 40.6052 29.5876 29.6133 31.3339 26.9341 30.7159
28.0731 23.3942 29.5876 35.9484 26.1913 28.8578 25.5692 25.8774
27.7808 23.7795 31.4558 29.0977 28.4308 30.2666 26.0004 28.154
23.5452 19.4437 26.9979 24.5491 … 23.7654 26.7213 22.5085 23.4483
24.2351 20.9448 27.1727 24.9626 25.5764 27.1748 22.9908 25.1908
24.2352 21.1566 27.9748 26.2707 24.5957 27.2691 21.3511 27.2015
28.0796 25.104 30.0434 27.1167 27.8067 28.7391 22.4229 28.0245
25.212 21.6682 27.5374 25.7883 24.3316 27.0613 22.3679 25.011
24.8338 20.8862 27.0238 24.971 … 24.3629 25.4161 22.7082 27.2442
28.1029 24.5713 31.3381 29.325 29.0279 31.3595 25.7613 32.1622
27.7427 23.8843 32.116 28.6105 26.8459 30.8597 24.8898 28.3011
⋮ ⋱
28.2413 22.4823 31.8238 28.0718 28.5363 29.8544 25.617 30.8102
25.9031 22.3158 30.2684 26.1889 26.8291 29.3049 24.9818 27.6305
24.4902 21.7182 27.3646 25.1295 … 24.8507 26.6667 23.323 27.5324
26.8904 22.4576 29.1816 26.7035 26.9178 29.1504 23.4181 27.3258
25.3568 20.0535 25.8843 24.9667 24.4746 26.4892 22.1566 25.2715
23.7338 19.7943 25.0737 23.2836 22.768 24.5572 21.7913 24.1243
24.5237 20.9264 28.9895 25.7668 26.1901 27.8939 23.8366 26.1041
25.6805 22.8274 27.4589 24.8699 … 25.0053 26.1065 22.3841 25.4258
28.1847 23.4948 29.6133 26.1913 33.458 29.1317 24.1902 28.0532
28.5245 22.6057 31.3339 28.8578 29.1317 38.251 25.3375 28.9003
24.4188 21.3434 26.9341 25.5692 24.1902 25.3375 30.1251 24.9156
24.118 22.1072 30.7159 25.8774 28.0532 28.9003 24.9156 38.3942
rA = convert(Array{Real}, A);
@btime tmp = rA'rA;
41.719 ms (2030004 allocations: 31.05 MiB)
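A sketch of the fix (assuming the values really are all Float64): converting back to a concrete element type, or avoiding the abstract element type in the first place, should restore the fast path.
A2 = convert(Matrix{Float64}, rA);  ## concrete element type again
@btime tmp = A2'A2;                 ## should be comparable to the original A'A timing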
If we have code that needs to retrieve a lot of values from a data structure, it’s worth knowing the situations in which we can expect that lookup to be fast.
Lookup in arrays is fast (\(O(1)\); i.e., not varying with the size of the array) because of the “random access” aspect of RAM (random access memory).
n=Int(1e7);
x = randn(n);
ind = Int(n/2);
@btime x[ind];
19.324 ns (1 allocation: 16 bytes)
y = rand(10);
@btime y[5];
19.481 ns (1 allocation: 16 bytes)
Next, lookup in a Julia dictionary is also fast (\(O(1)\)) because dictionaries use hashing (like Python dictionaries and R environments).
function makedict(n)
    d = Dict{String,Int}()
    for i in 1:n
        push!(d, string(i) => i)
    end
    return d
end
## Make a large dictionary, with keys equal to strings representing integers.
d = makedict(n);
indstring = string(ind);
@btime d[indstring];
40.196 ns (1 allocation: 16 bytes)
Finally, let’s consider tuples. Lookup by index is quite slow, which is surprising as I was expecting it to be similar to lookup in the array, as I think the tuple in this case has values stored contiguously.
xt = Tuple(x);
@btime xt[ind];
49.652 ms (1 allocation: 16 bytes)
For named tuples, I’m not sure how realistic this is, since it would probably be a pain to create a large named tuple. But we see that lookup by name is slow, even though we are using a smaller tuple than the array and dictionary above.
## Set up a named tuple (this is very slow for large array, so use a subset).
dsub = makedict(100000);
xsub = x[1:100000];
names = Symbol.('x' .* keys(dsub)); # For this construction of tuple, the keys need to be symbols.
xtnamed = (;zip(names, xsub)...);
@btime xtnamed.x50000;
60.522 μs (1 allocation: 16 bytes)
Note that while all the individual operations above are fast in absolute terms for a single lookup, for simple operations, we generally want them to be really fast (e.g., order of nanoseconds) because we’ll generally be doing a lot of such operations for any sizeable overall computation.
The Julia manual has an extensive section on performance.
We won’t dive too deeply into all the complexity, but here are a few key tips, which mainly relate to writing in a way that is aware of the JIT compilation that will happen: