Notes 5: Efficiency

Author

Chris Paciorek

Published

February 4, 2025

Introduction

This document is the fifth of a set of notes, focusing on writing efficient Julia code. The notes are not meant to be particularly complete in terms of useful functions (Google and LLMs can now provide that quite well), but rather to introduce the language and consider key programming concepts in the context of Julia.

Given that, the document heavily relies on demos, with interpretation in some cases left to the reader.

Timing

Being able to time code is critical for understanding and improving efficiency.

Compilation time

With Julia, we need to pay particular attention to the effect of just-in-time (JIT) compilation on timing. The first time a function is called with a specific set of argument types, Julia will compile the method that is invoked. We generally don’t want to time the compilation, only the run time, assuming the function will be run repeatedly with a given set of argument types.

@time is a macro that will time some code. However, it’s better to use @btime from BenchmarkTools as that will run the code multiple times and will make sure not to count the compilation time.

function myexp!(x)
  for i in 1:length(x)
    x[i] = exp(x[i])
  end
end

n = Int(1e7)
y = rand(n);
@time myexp!(y) ## Compilation time included.
  0.074654 seconds (2.62 k allocations: 178.000 KiB, 10.70% compilation time)
y = rand(n);
@time myexp!(y) ## Compilation time not included.
  0.066447 seconds
using BenchmarkTools

y = rand(n);
@btime myexp!(y) 
  58.960 ms (0 allocations: 0 bytes)
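
One detail worth knowing: when benchmarking an expression that uses global variables, the BenchmarkTools documentation recommends interpolating them with $ so the timing isn’t inflated by access to untyped globals. A minimal sketch:

y = rand(n);
@btime myexp!($y);  ## $y interpolates the value, avoiding global-variable overhead.
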
Exercise

How long does that loop take in R or Python? What about a vectorized solution in R or Python?

We can time a block of code, but I’m not sure what Julia does in terms of JIT for code that is not in functions. You may discover more in working on the fourth problem of PS2.

@btime begin
y = 3
z = 7
end
  1.299 ns (0 allocations: 0 bytes)
7
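
Relatedly, Base provides @elapsed, which returns the elapsed time in seconds as a value one can store, and @timed, which returns a named tuple including the result, the time, and allocation information. A quick sketch (note that both time a single run, so compilation time is included on a first call):

t = @elapsed myexp!(y)   ## Elapsed seconds, as a Float64.
res = @timed myexp!(y)   ## Named tuple; res.time holds the elapsed seconds.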

Profiling

Profiling involves measuring how much time is spent in each step of a piece of code. One can use the Profile module to do this in Julia.

One thing to keep in mind when profiling is whether the timing for nested function calls is included in the timing of the function that makes the nested function calls.

using Profile

function ols_slow(y::Vector{<:Number}, X::Matrix{<:Number})
    xtx = X'X; 
    xty = X'y;
    xtxinverse = inv(xtx);  ## This is an inefficient approach.
    return xtxinverse * xty
end

n = Int(1e4)
p = 2000
y = randn(n);
X = randn((n,p));

## Run once to avoid profiling JIT compilation.
coefs = ols_slow(y, X);

Directly interpreting the Profile output can be difficult. In this case, if we ran the following code, we’d see a very long, hard-to-interpret listing.

@profile coefs = ols_slow(y, X)
Profile.print()  

Instead, let’s try a visualization. There are several Julia packages for visualizing profiler output, and some may work better than others. (I tried ProfileView and liked StatProfilerHTML better.)

using ProfileView
@profview ols_slow(y, X)

using StatProfilerHTML
@profilehtml ols_slow(y, X)

@profilehtml produces [this output](statprof/index.html), which can in some ways be hard to interpret, but the color-coded division between inv and the matrix multiplications gives us an idea of where time is being spent. That output might not show up fully in the links - you might need to run the code above yourself.
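
Given that the profile points at inv, a natural fix (separate from the profiling demo itself) is to solve the linear system rather than forming the inverse explicitly. Here is a minimal sketch; the name ols_fast is mine:

function ols_fast(y::Vector{<:Number}, X::Matrix{<:Number})
    ## Solving the system via a factorization avoids forming the inverse.
    return (X'X) \ (X'y)
end

coefs_fast = ols_fast(y, X);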

Pre-allocation

In R (also with numpy arrays in Python), it’s a bad idea to iteratively increase the size of an object, such as doing this:

n <- 5000
x <- 1
for(i in 2:n)
  x <- c(x, i)

Python lists handle this much better: when using .append(), Python allocates increasingly large additional chunks of memory as the object grows.

Let’s consider this in Julia.

function fun_prealloc(n) 
  x = zeros(n);
  for i in 1:n
    x[i] = i;
  end
  return x
end

function fun_grow(n) 
  x = Float64[];
  for i in 1:n
    push!(x, i);
  end
  return x
end

using BenchmarkTools

n = 100000000
@btime x1 = fun_prealloc(n);
  334.486 ms (2 allocations: 762.94 MiB)
@btime x2 = fun_grow(n);
  1.718 s (23 allocations: 1019.60 MiB)

That indicates that it’s better to pre-allocate memory in Julia, but the time does not seem to grow as \(O(n^2)\) as it does in R or with numpy arrays. So that suggests Julia is growing the array in a smart fashion.

We can verify that by looking at the memory allocation information returned by @btime.

For fun_prealloc, we see an allocation of ~800 MB, consistent with allocating an array of 100 million 8-byte floats. (It turns out the “second” allocation occurs because we are running @btime in the global scope.)

For fun_grow, we see 23 allocations totaling ~1 GB, consistent with Julia growing the array in a smart fashion but with some additional memory allocation.

If the array were reallocated each time it grew by one, we’d allocate and copy \(1+2+\cdots+n = n (n+1)/2\) numbers in total over the course of the computation (but not all at once), which would take a lot of time.
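
If you know (or can bound) the final length but still want to build the array with push!, Base’s sizehint! reserves capacity up front, avoiding the intermediate reallocations. A small sketch modifying fun_grow (the name fun_grow_hinted is mine):

function fun_grow_hinted(n) 
  x = Float64[];
  sizehint!(x, n);  ## Reserve capacity for n elements up front.
  for i in 1:n
    push!(x, i);
  end
  return x
end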

Vectorization

As we’ve seen, the vectorized versions of functions have a dot after the function name (or before an operator).

x = ["spam", 2.0, 5, [10, 20]]
length(x)
4
length.(x)
4-element Vector{Int64}:
 4
 1
 1
 2
map(length, x)
4-element Vector{Int64}:
 4
 1
 1
 2
x = [2.1, 3.1, 5.3, 7.9]
x .+ 10
4-element Vector{Float64}:
 12.1
 13.1
 15.3
 17.9
x + x
4-element Vector{Float64}:
  4.2
  6.2
 10.6
 15.8
x .> 5.0
4-element BitVector:
 0
 0
 1
 1
x .== 3.1
4-element BitVector:
 0
 1
 0
 0

Unlike in Python or R, it shouldn’t matter for efficiency if you use a vectorized function or write a loop, because with Julia’s just-in-time compilation, the compiled code should be similar. (This assumes your code is inside a function.) So the main appeal of vectorization is code clarity and ease of writing the code.

We can automatically use the dot vectorization with functions we write:

function plus3(x)
  return x + 3
end

plus3.(x)
4-element Vector{Float64}:
  5.1
  6.1
  8.3
 10.9

This invokes broadcast(plus3, args...).

Broadcasting will happen over multiple arguments if more than one argument is an array.
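
For instance (a few illustrative cases of my own):

[1, 2, 3] .+ [10, 20, 30]  ## Element by element: [11, 22, 33].
[1, 2, 3] .+ 10            ## The scalar is recycled: [11, 12, 13].
[1 2; 3 4] .+ [10, 20]     ## The vector is added to each column of the matrix.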

Consider the difference between the following vectorized calls:

x = randn(5)
σ = 10;
y1 = x .+ σ .* randn.()
y2 = x .+ σ .* randn()
println((y1 - x) / σ)
println((y2 - x) / σ)
[0.14677616340308833, 0.581444839867881, -0.3787768358297307, 1.236223662587814, 0.3476772389317324]
[-0.2981459541428111, -0.2981459541428111, -0.2981459541428111, -0.2981459541428111, -0.2981459541428111]

That’s perhaps a bit surprising: one might think that because the multiplication is done first, σ .* randn.() would produce a scalar, as it does if you just run σ .* randn.() on its own. But the entire dotted expression is fused into a single elementwise operation, so randn.() is evaluated once for each element of x.

Loop fusion

If one runs a vectorized calculation that involves multiple steps in a language like R or Python, there are some inefficiencies.

Consider this computation:

x = tan(x) + 3*sin(x)

If run as vectorized code in a language like R or Python, it’s much faster than using a loop, but it does have some downsides.

  • First, it will use additional memory (temporary arrays will be created to store tan(x), sin(x), 3*sin(x)). (We can consider what the abstract syntax tree would be for that calculation.)
  • Second, multiple for loops will have to get executed when the vectorized code is run, looping over the elements of x to calculate tan(x), sin(x), etc. (For example in R or Python/numpy, multiple for loops would get run in the underlying C code.)

In contrast, running via a for loop (in R or Python or Julia) avoids the temporary arrays and involves a single loop:

for i in 1:length(x)
    x[i] = tan(x[i]) + 3*sin(x[i])
end

Thankfully, Julia “fuses” the loops of vectorized code automatically when one uses the dot syntax for vectorization, so one shouldn’t suffer from the downsides of vectorization. One could of course use a loop in Julia, and it should be fast, but it’s more code to write and harder to read.
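
To make that concrete, here are fused dotted forms of the computation above (each compiles to a single loop):

y = tan.(x) .+ 3 .* sin.(x)   ## Fused; allocates only the output array.
x .= tan.(x) .+ 3 .* sin.(x)  ## Fused and in place: .= writes into x.
@. x = tan(x) + 3*sin(x)      ## The @. macro adds all the dots for you.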

Memory allocation with loop fusion

Let’s look at memory allocation when putting the code into a function:

function mymath(x)
   return tan(x) + 3*sin(x)
end

function mymathloop(x)  ## Modifies x in place.
  for i in 1:length(x)
    x[i] = tan(x[i]) + 3*sin(x[i])
  end
  return x
end 

n = 100000000;
x = rand(n);

@btime y = mymath.(x);
  2.407 s (3 allocations: 762.94 MiB)
@btime y = mymathloop(x);
  3.115 s (0 allocations: 0 bytes)

Note that it appears only ~800 MB (762.94 MiB; 1 MB ≈ 0.95 MiB) are allocated (for the output) in the (presumably) fused operation, rather than multiples of 800 MB for the various temporary arrays that one might expect to be created.

And in the loop, there is no allocation. We might expect some allocation for scalars, but those are probably handled differently than allocating memory for arrays on the heap. I’ve seen some information on how Julia handles allocation of space for immutable objects (including scalars and strings), but I haven’t had a chance to absorb it.

Cases without loop fusion

We can do addition or subtraction of two arrays, or multiplication/division of an array by a scalar, without the “dot” vectorization. However, as seen from the additional memory allocation here, the loops are not fused.

function mymath2(x)
   return 3*x+x/7
end

@btime y = mymath2(x);
  1.078 s (6 allocations: 2.24 GiB)

In contrast, here we see only the allocation for the output object.

@btime y = mymath2.(x);
  454.045 ms (3 allocations: 762.94 MiB)

Cache-aware programming and array storage

Julia stores the values in a matrix contiguously column by column (and analogously for higher-dimensional arrays).

We should therefore access matrix elements within a column rather than within a row. Why is that?

Memory access and the cache

When a value is retrieved from main memory into the CPU cache, a block of values will be retrieved, and those will generally include the values in the same column but (for large enough arrays) not all the values in the same row. If subsequent operations work on values from that column, the values won’t need to be moved into the cache. (This is called a “cache hit”).
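
As an illustration of coding with the cache in mind, here’s a sketch (function names are mine) of summing all the elements of a matrix with the two possible loop orderings; given Julia’s column-major layout, the first version accesses memory contiguously while the second jumps across memory:

function sum_colmajor(X)
    total = 0.0
    for j in axes(X, 2), i in axes(X, 1)  ## Inner loop moves down a column.
        total += X[i, j]
    end
    return total
end

function sum_rowmajor(X)
    total = 0.0
    for i in axes(X, 1), j in axes(X, 2)  ## Inner loop jumps across a row.
        total += X[i, j]
    end
    return total
end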

Let’s first see if it makes a difference when using Julia’s built-in sum function, which can do the reduction operation on various dimensions of the array.

using Random
using BenchmarkTools

nr = 800000;
nc = 100;
A = randn(nr, nc);    # long matrix
tA = randn(nc, nr);   # wide matrix

function sum_by_column(X)
    return sum(X, dims=1) 
end

function sum_by_row(X)
    return sum(X, dims=2)  
end

@btime tmp = sum_by_column(A);
  38.943 ms (1 allocation: 896 bytes)
@btime tmp = sum_by_row(tA);
  42.923 ms (5 allocations: 976 bytes)

There’s little difference.

Are we wrong about how the cache works? Probably not; rather it’s probably that Julia’s sum() is set up to take advantage of how the cache works by being careful about the order of operations used to sum the rows or columns.

Exercise

How could you program the for loops involved in row-wise summation to be efficient when a matrix is stored column-major, given how caching works? If you retrieve the data by column, how do you get the row sums?

In contrast, if we manually loop over rows or columns, we do see a big (almost order-of-magnitude) difference.

@btime tmp = [sum(A[:,col]) for col in 1:size(A,2)];
  133.829 ms (405 allocations: 610.36 MiB)
@btime tmp = [sum(A[row,:]) for row in 1:size(A,1)];
  779.155 ms (4798474 allocations: 750.71 MiB)
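
As an aside, much of the allocation in the comprehensions above comes from the slices A[:,col] and A[row,:], each of which copies its elements into a new array. Using views avoids the copies (though it doesn’t change the cache behavior):

@btime tmp = [sum(@view(A[:, col])) for col in 1:size(A, 2)];
@btime tmp = [sum(@view(A[row, :])) for row in 1:size(A, 1)];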

So while one lesson is to code with the cache in mind, another is to use built-in functions that are probably written for efficiency.

Exercise

In your own work, can you think of an algorithm and associated data structures where one has to retrieve a lot of data and one would want to think about cache hits and misses? In general the idea is that if you retrieve a value, try to make use of the nearby values at that same time, rather than retrieving the nearby values later on in the computation.

Store values contiguously in memory

If we are storing an array of all the same type of values, these can be stored contiguously. That’s not the case with abstract types.
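
One can check whether values of a type can be stored inline with isbitstype:

isbitstype(Float64)  ## true: concrete and fixed-size, so stored contiguously.
isbitstype(Real)     ## false: abstract, so elements are stored as pointers.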

For example, here Real values can vary in size.

a = Real[]
sizeof(a)           ## 0: no elements yet.
push!(a, 3.5)
sizeof(a)           ## 8: one 8-byte pointer to a boxed Float64.
push!(a, Int16(2))
sizeof(a[2])        ## 2: the Int16 itself is two bytes.
sizeof(a)           ## 16: two 8-byte pointers, not 8 + 2.
16

And we see that having an array of Reals is bad for performance; as part of this, notice the additional allocations in the second benchmark below.

using LinearAlgebra
n = 100;
A = rand(n, n);
@btime tmp = A'A;  # Equivalent to A' * A or transpose(A) * A.
  32.629 μs (3 allocations: 78.19 KiB)
100×100 Matrix{Float64}:
 36.9452  24.6027  29.9873  28.0731  …  28.1847  28.5245  24.4188  24.118
 24.6027  27.4659  26.1276  23.3942     23.4948  22.6057  21.3434  22.1072
 29.9873  26.1276  40.6052  29.5876     29.6133  31.3339  26.9341  30.7159
 28.0731  23.3942  29.5876  35.9484     26.1913  28.8578  25.5692  25.8774
 27.7808  23.7795  31.4558  29.0977     28.4308  30.2666  26.0004  28.154
 23.5452  19.4437  26.9979  24.5491  …  23.7654  26.7213  22.5085  23.4483
 24.2351  20.9448  27.1727  24.9626     25.5764  27.1748  22.9908  25.1908
 24.2352  21.1566  27.9748  26.2707     24.5957  27.2691  21.3511  27.2015
 28.0796  25.104   30.0434  27.1167     27.8067  28.7391  22.4229  28.0245
 25.212   21.6682  27.5374  25.7883     24.3316  27.0613  22.3679  25.011
 24.8338  20.8862  27.0238  24.971   …  24.3629  25.4161  22.7082  27.2442
 28.1029  24.5713  31.3381  29.325      29.0279  31.3595  25.7613  32.1622
 27.7427  23.8843  32.116   28.6105     26.8459  30.8597  24.8898  28.3011
  ⋮                                  ⋱                             
 28.2413  22.4823  31.8238  28.0718     28.5363  29.8544  25.617   30.8102
 25.9031  22.3158  30.2684  26.1889     26.8291  29.3049  24.9818  27.6305
 24.4902  21.7182  27.3646  25.1295  …  24.8507  26.6667  23.323   27.5324
 26.8904  22.4576  29.1816  26.7035     26.9178  29.1504  23.4181  27.3258
 25.3568  20.0535  25.8843  24.9667     24.4746  26.4892  22.1566  25.2715
 23.7338  19.7943  25.0737  23.2836     22.768   24.5572  21.7913  24.1243
 24.5237  20.9264  28.9895  25.7668     26.1901  27.8939  23.8366  26.1041
 25.6805  22.8274  27.4589  24.8699  …  25.0053  26.1065  22.3841  25.4258
 28.1847  23.4948  29.6133  26.1913     33.458   29.1317  24.1902  28.0532
 28.5245  22.6057  31.3339  28.8578     29.1317  38.251   25.3375  28.9003
 24.4188  21.3434  26.9341  25.5692     24.1902  25.3375  30.1251  24.9156
 24.118   22.1072  30.7159  25.8774     28.0532  28.9003  24.9156  38.3942
rA = convert(Array{Real}, A);
@btime tmp = rA'rA;
  41.719 ms (2030004 allocations: 31.05 MiB)

Lookup speed

If we have code that needs to retrieve a lot of values from a data structure, it’s worth knowing the situations in which we can expect that lookup to be fast.

Lookup in arrays is fast (\(O(1)\); i.e., not varying with the size of the array) because of the “random access” aspect of RAM (random access memory).

n=Int(1e7);

x = randn(n);
ind = Int(n/2);
@btime x[ind];
  19.324 ns (1 allocation: 16 bytes)
y = rand(10);
@btime y[5];
  19.481 ns (1 allocation: 16 bytes)

Next, lookup in a Julia dictionary is also fast (\(O(1)\)) because Julia dictionaries use hashing (like Python dictionaries and R environments).

function makedict(n)
  d=Dict{String,Int}()
  for i in 1:n
    push!(d, string(i) => i)
  end
  return d
end

## Make a large dictionary, with keys equal to strings representing integers.
d = makedict(n);
indstring = string(ind);
@btime d[indstring]; 
  40.196 ns (1 allocation: 16 bytes)

Finally, let’s consider tuples. Lookup by index is quite slow, which is surprising; I was expecting it to be similar to lookup in an array, since I think the tuple in this case has its values stored contiguously.

xt = Tuple(x);
@btime xt[ind];  
  49.652 ms (1 allocation: 16 bytes)

For named tuples, I’m not sure how realistic this is, since it would probably be a pain to create a large named tuple. But we see that lookup by name is slow, even though we are using a smaller tuple than the array and dictionary above.

## Set up a named tuple (this is very slow for a large array, so use a subset).
dsub = makedict(100000);
xsub = x[1:100000];
names = Symbol.('x' .* keys(dsub));  # For this construction of tuple, the keys need to be symbols.
xtnamed = (;zip(names, xsub)...); 
@btime xtnamed.x50000;
  60.522 μs (1 allocation: 16 bytes)
Developing a perspective on speed

Note that while all the individual operations above are fast in absolute terms for a single lookup, for simple operations we generally want speeds on the order of nanoseconds, because any sizeable overall computation will involve a huge number of such operations.

Performance tips

The Julia manual has an extensive section on performance.

We won’t dive too deeply into all the complexity, but here are a few key tips, which mainly relate to writing in a way that is aware of the JIT compilation that will happen:

  • Code for which performance is important should be inside a function, as this allows for JIT compilation.
  • Avoid use of global variables that don’t have a declared type, as code using them is hard to optimize since their type could change.
  • The use of immutable objects can improve performance.
  • Have functions always return the same type and avoid changing (or unknown) variable types within a function (see the sketch below).
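
To illustrate the last tip, here is a small sketch of my own. In the first version, the type of total changes from Int to Float64 during the loop, forcing the compiler to handle both types; the second version is type-stable:

function sum_unstable(x)
  total = 0                ## An Int...
  for v in x
    total += v             ## ...that becomes a Float64 if x holds floats.
  end
  return total
end

function sum_stable(x)
  total = zero(eltype(x))  ## Initialize with the element type of x.
  for v in x
    total += v
  end
  return total
end

Running @code_warntype sum_unstable(rand(100)) highlights the resulting Union{Float64, Int64} type instability.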