Notes 9: Numerics

Author

Chris Paciorek

Published

February 20, 2025

Introduction

This document is the ninth of a set of notes, focusing on numerical issues, random number generation, and linear algebra. The notes are not meant to be particularly complete in terms of useful functions (Google and LLMs can now provide that quite well), but rather to introduce the language and consider key programming concepts in the context of Julia.

Given that, the document heavily relies on demos, with interpretation in some cases left to the reader.

Random number generation

The seed

As usual, it’s best to set the random number seed.

using Random
Random.seed!(1234);      # Seed number 1234
println(rand(3))
[0.32597672886359486, 0.5490511363155669, 0.21858665481883066]
println(rand(5))
[0.8942454282009883, 0.35311164439921205, 0.39425536741585077, 0.9531246272848422, 0.7955469475347194]
Random.seed!(1234);     # Re-seed with same number - will give the same sequence of random numbers
println(rand(2))
[0.32597672886359486, 0.5490511363155669]
println(rand(6))
[0.21858665481883066, 0.8942454282009883, 0.35311164439921205, 0.39425536741585077, 0.9531246272848422, 0.7955469475347194]

rand has a variety of methods.

rand(1:10, 3)
3-element Vector{Int64}:
 5
 8
 6
rand(['a','b','c'], 5)
5-element Vector{Char}:
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
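For reference, a few relatives of rand in Base and the Random module (a quick sketch):

```julia
using Random
Random.seed!(42)

randn(2)             # Standard normal draws.
rand(Float32, 3)     # Uniform draws of a specified element type.
rand(Bool, 4)        # Fair coin flips.
randperm(5)          # A random permutation of 1:5.
```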

Generators

The manual discusses the available generators.

The default RNG is Xoshiro256++, but there are details related to random number streams when working in parallel that I have not absorbed yet.

Random.default_rng();
Random.seed!(1234)
println(rand(3))
[0.32597672886359486, 0.5490511363155669, 0.21858665481883066]
rng = Random.Xoshiro(1234)
println(rand(rng, 3))
[0.32597672886359486, 0.5490511363155669, 0.21858665481883066]

The default used to be the Mersenne Twister (still the default in R and formerly the default in Python/numpy).

rng = Random.MersenneTwister(1234);
println(rand(rng, 3))
[0.5908446386657102, 0.7667970365022592, 0.5662374165061859]
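For parallel work, one common pattern (a sketch, not the full story on Julia's per-task RNG streams) is to pass explicit RNG objects so each stream is independent and reproducible:

```julia
using Random

rng1 = Xoshiro(1)    # One explicit stream...
rng2 = Xoshiro(2)    # ...and another, independent of the first.
x = rand(rng1, 3)
y = rand(rng2, 3)    # Unaffected by draws from rng1.
rand(Xoshiro(1), 3) == x   # true: re-seeding reproduces the stream.
```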

Distributions

As with distributions in SciPy in Python, we first define the distribution of interest and then carry out distributional operations with it.

using Distributions
beta_dist = Beta(2, 5)
beta_samples = rand(beta_dist, 10)
10-element Vector{Float64}:
 0.4542940512832609
 0.27004146470644036
 0.3870672630509891
 0.21116529046475746
 0.5041249425689261
 0.2784321400052276
 0.6829933994934323
 0.1295297689806119
 0.43795698919772685
 0.39395847238977155
using Plots
grid = 0:.01:1
plot(grid, pdf.(beta_dist, grid))
pdf.(beta_dist, beta_samples[1:5])
5-element Vector{Float64}:
 1.2086291289041646
 2.300088115277408
 1.6389245754521693
 2.4529451430316644
 0.9144258514892248
logpdf.(beta_dist, beta_samples[1:5])
5-element Vector{Float64}:
  0.1894867660152757
  0.8329474331914342
  0.4940402800647119
  0.8972894018105007
 -0.08945889536877027
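Beyond pdf and logpdf, the Distributions.jl interface provides cdf, quantile, and moments; for example:

```julia
using Distributions

beta_dist = Beta(2, 5)
cdf(beta_dist, 0.5)          # P(X ≤ 0.5).
quantile(beta_dist, 0.5)     # The median.
mean(beta_dist)              # 2 / (2 + 5) ≈ 0.286.
var(beta_dist)
```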

Floating point issues

Integer and floating point types

64-bit integers can represent \(-2^{63}\ldots2^{63}-1\). 32-bit integers can represent \(-2^{31}\ldots2^{31}-1\).

Similarly for 16 and 128 bit integers.

Values outside that range overflow.

xi::Int64 = 2^62
4611686018427387904
xi::Int64 = 2^70  # Overflows
0
xi::Int64 = 2^63  # Just overflows.
-9223372036854775808
yi::Int128 = 2^63  # Hmmm.
-9223372036854775808
yi::Int128 = Int128(2)^63
9223372036854775808
Int64(yi)
LoadError: InexactError: trunc(Int64, 9223372036854775808)
InexactError: trunc(Int64, 9223372036854775808)

Stacktrace:
 [1] throw_inexacterror(f::Symbol, ::Type{Int64}, val::Int128)
   @ Core ./boot.jl:634
 [2] checked_trunc_sint
   @ ./boot.jl:656 [inlined]
 [3] toInt64
   @ ./boot.jl:705 [inlined]
 [4] Int64(x::Int128)
   @ Core ./boot.jl:784
 [5] top-level scope
   @ In[19]:1
xi::Int64 = 2^63 - 1  # Not really a mystery: 2^63 wraps to typemin(Int64), and subtracting 1 wraps again to typemax(Int64).
9223372036854775807
x = parse(BigInt, "1234567890123456789012345678901234567890")
1234567890123456789012345678901234567890
x+1
1234567890123456789012345678901234567891

There are 16, 32, and 64 bit floating point numbers, as well as a BigFloat type.

More on this a bit later.

Floating point precision

Let’s consider how much precision we have with real-valued numbers because of limited floating point precision.

For 64-bit floats, 53 bits are used for precision, which translates to approximately 16 digits of accuracy in base 10, regardless of the magnitude of the number. How many digits of accuracy do we have with 32-bit and 16-bit floats?

using Printf
@sprintf("%.20f", Float64(1/3))
"0.33333333333333331483"
@sprintf("%.20f", Float32(1/3))
"0.33333334326744079590"
@sprintf("%.20f", Float16(1/3))
"0.33325195312500000000"
BigFloat(1/3)  # Hmmm.
0.333333333333333314829616256247390992939472198486328125
BigFloat(1) / BigFloat(3)
0.3333333333333333333333333333333333333333333333333333333333333333333333333333348
BigFloat("0.3")
0.3000000000000000000000000000000000000000000000000000000000000000000000000000017
BigFloat("0.3", precision=500)  
0.30000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006

Computation with BigFloats will be slow, so you wouldn’t want to do matrix operations with a matrix of them.
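One way to quantify these precision levels is machine epsilon, the spacing from 1.0 to the next representable value of each type:

```julia
eps(Float64)    # 2^-52 ≈ 2.2e-16, about 16 decimal digits.
eps(Float32)    # 2^-23 ≈ 1.2e-7, about 7 decimal digits.
eps(Float16)    # 2^-10 ≈ 9.8e-4, about 3 decimal digits.
```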

Floating point details

With Float64, any number is stored as a base 2 number of the form:

\[(-1)^{S}\times1.d\times2^{e-1023}=(-1)^{S}\times1.d_{1}d_{2}\ldots d_{52}\times2^{e-1023}\] where the computer uses base 2, \(b=2\), (so \(d_{i}\in\{0,1\}\)) because base-2 arithmetic is faster than base-10 arithmetic. The leading 1 normalizes the number; i.e., it ensures there is a unique representation for a given computer number. This avoids representing a number in multiple ways, e.g., \(1=1.0\times2^{0}=0.1\times2^{1}=0.01\times2^{2}\). For a double, we have 8 bytes = 64 bits. Consider our representation as (\(S,d,e\)) where \(S\) is the sign. The leading 1 is the hidden bit and doesn't need to be stored because it is always present. \(e\) is represented using 11 bits (\(2^{11}=2048\)), and the subtraction of 1023 takes the place of having a sign bit for the exponent. (Note that in our discussion we'll just think of \(e\) in terms of its base 10 representation, although it is of course represented in base 2.) This leaves \(p=52=64-1-11\) bits for \(d\).

The representations for floating point numbers with more or fewer bits than 64 are similar in structure but with a different split of bits between the magnitude and the precision.

53 bits of precision corresponds to about 16 digits in base 10. And the 11 bits for the magnitude determine when over/underflow occurs. With 32-bit floats, we have less precision and we over/underflow more easily.
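We can inspect the (S, e, d) fields directly with bitstring: the first bit is the sign, the next 11 the exponent, and the last 52 the mantissa.

```julia
bits = bitstring(1.0)          # 64 characters of 0s and 1s.
sign_bit = bits[1]             # '0' for positive.
e_bits   = bits[2:12]          # The 11 exponent bits.
d_bits   = bits[13:64]         # The 52 mantissa bits (all zero for 1.0).
# For 1.0 = (-1)^0 × 1.0 × 2^(e-1023), the stored e is 1023.
parse(Int, e_bits; base = 2)   # 1023
```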

Overflow

Integer numbers can be represented exactly by Float64 up to \(2^{53}\).

function pri(x)
  @sprintf("%.20i", x)
end

pri(2.0^52)
"4503599627370496"
pri(2.0^52 + 1)
"4503599627370497"
pri(2.0^53)
"9007199254740992"
pri(2.0^53 + 1)
"9007199254740992"
pri(2.0^53 + 2)
"9007199254740994"
pri(2.0^63)
"9223372036854775808"
pri(2.0^70)     # No overflow here unlike Int64.
"1180591620717411303424"
pri(12345678123456781234.0)  # Not exact.
"12345678123456780288"

Float64 overflow does not occur until ~\(2^{1024} \approx 1.8\times10^{308}\).

function prf(x)
  @sprintf("%.20f", x)
end

prf(2.0^1022)
"44942328371557897693232629769725618340449424473557664318357520289433168951375240783177119330601884005280028469967848339414697442203604155623211857659868531094441973356216371319075554900311523529863270738021251442209537670585615720368478277635206809290837627671146574559986811484619929076208839082406056034304.00000000000000000000"
prf(2.0^1023)
"89884656743115795386465259539451236680898848947115328636715040578866337902750481566354238661203768010560056939935696678829394884407208311246423715319737062188883946712432742638151109800623047059726541476042502884419075341171231440736956555270413618581675255342293149119973622969239858152417678164812112068608.00000000000000000000"
prf(10.0^308)
"100000000000000001097906362944045541740492309677311846336810682903157585404911491537163328978494688899061249669721172515611590283743140088328307009198146046031271664502933027185697489699588559043338384466165001178426897626212945177628091195786707458122783970171784415105291802893207873272974885715430223118336.00000000000000000000"
prf(2.0^1024)
"Inf"

But Int64s that big do overflow.

pri(10^307)
"00000000000000000000"
pri(10^310)
"00000000000000000000"

Implications for comparisons and calculations

prf(1-2/3)
"0.33333333333333337034"
prf(1/3)
"0.33333333333333331483"
prf(2/3-1/3)
"0.33333333333333331483"
prf(0.3 - 0.2)
"0.09999999999999997780"
prf(0.1)
"0.10000000000000000555"
0.3 - 0.2 == 0.1
false
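Given this, floating point results should generally be compared with isapprox (or the infix ≈), which allows a tolerance, rather than with ==:

```julia
isapprox(0.3 - 0.2, 0.1)                 # true
0.3 - 0.2 ≈ 0.1                          # Same check, infix form.
isapprox(0.3 - 0.2, 0.1; atol = 1e-12)   # With an explicit absolute tolerance.
```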

Here’s an example of catastrophic cancellation:

prf(123456781234.56)
"123456781234.55999755859375000000"
prf(123456781234.00)
"123456781234.00000000000000000000"
prf(123456781234.56 - 123456781234.00)
"0.55999755859375000000"

And here the precision is that of the larger magnitude number:

prf(1.0 + 1e-8)
"1.00000000999999993923"
prf(1.0 + 1e-17)
"1.00000000000000000000"
1.0 + 1e-17 == 1.0
true

Avoid multiplying/dividing many numbers

How large does \(n\) need to be to see underflow here? In other cases (e.g., when the density values exceed one) the product could instead overflow.

using Distributions
n = 1000;   # Try varying `n`.
normd = Normal(0, 1)
samples = rand(normd, n);
prod(pdf.(normd, samples))   # Likelihood/density: underflows for large `n`.
sum(logpdf.(normd, samples)) # Log-likelihood/log-density: numerically stable.
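As a rough back-of-the-envelope check: for a standard normal, \(E[\log\phi(X)]=-\frac{1}{2}\log(2\pi)-\frac{1}{2}\approx-1.42\), so the product of \(n\) densities is about \(e^{-1.42n}\) and underflows once that drops below the smallest normal Float64:

```julia
elog = -0.5 * log(2π) - 0.5                   # E[log ϕ(X)] ≈ -1.42.
n_underflow = log(floatmin(Float64)) / elog   # Roughly 500 draws suffice.
```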

Linear algebra

We’ll see that linear algebra operations heavily exploit Julia’s multiple dispatch system, calling the most appropriate method for different kinds of input matrices.

Arithmetic

A = rand(3, 3);
B = rand(3, 3);
A + B;   # Element-wise addition
A .+ B;  # Element-wise addition
A * B;   # Matrix multiplication
A .* B;  # Element-wise multiplication
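When these operations happen repeatedly (e.g., in a loop), allocations can be avoided with the in-place mul! from LinearAlgebra; a minimal sketch:

```julia
using LinearAlgebra

A = rand(3, 3);
B = rand(3, 3);
C = similar(A);     # Preallocate the output once.
mul!(C, A, B);      # C = A * B without allocating a new matrix.
C ≈ A * B           # true
```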

Solving systems of equations / inversion

using LinearAlgebra

A = rand(3, 3);

AtA = A'A;    # A' * A
b = rand(3);
AtA \ b       # Solve system of equations (AtA^{-1} b)
3-element Vector{Float64}:
 -1.3254878978759739
  5.5034206079487324
  0.30748299868104345
inv(AtA) * b  # Not as efficient
3-element Vector{Float64}:
 -1.3254878978759748
  5.503420607948734
  0.30748299868104345
eigvals(AtA)
3-element Vector{Float64}:
 0.03992762277861558
 0.22181922851410624
 2.0138442890779333
det(AtA)      # Be careful of over/underflow!
0.017836043877867285
logdet(AtA)
-4.026533932146638
tr(A)
1.5834783240781707
chol = cholesky(AtA)
Cholesky{Float64, Matrix{Float64}}
U factor:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 0.957996  0.309353  0.873741
  ⋅        0.213918  0.168095
  ⋅         ⋅        0.651687

We have \((U^{\top}U)^{-1}b=U^{-1}(U^{-\top}b)\), which in Julia is implemented internally.

chol \ b                # Automatically exploits triangularity.
3-element Vector{Float64}:
 -1.3254878978759743
  5.503420607948734
  0.30748299868104323
chol.U \ (chol.L \ b)   # Manual equivalent solution.
3-element Vector{Float64}:
 -1.325487897875974
  5.503420607948734
  0.3074829986810431
typeof(chol)            # Special type of object
Cholesky{Float64, Matrix{Float64}}
typeof(chol.U)          # Special kind of matrix.
UpperTriangular{Float64, Matrix{Float64}}

This works in Julia but is having problems via Quarto, so I’ll just paste in some timings.

using BenchmarkTools

n = 5000;
A = rand(n, n);
AtA = A'A;
b = rand(n);
@btime AtA \ b;       # Solve system of equations (AtA^{-1} b) via Gaussian elimination (LU)
# 633.046 ms (6 allocations: 190.81 MiB)
@btime inv(AtA) * b;  # Not as efficient
# 2.538 s (8 allocations: 193.25 MiB)
@btime cholesky(AtA) \ b;  # Best if matrix is positive definite.
# 546.957 ms (5 allocations: 190.77 MiB)
@btime chol = cholesky(AtA); chol.U \ (chol.L \ b);  # As good but verbose.
# 520.021 ms (3 allocations: 190.73 MiB)

In principle, the Cholesky approach should involve \(n^3/6\) calculations and the Gaussian elimination \(n^3/3\), but we don’t see a two-fold difference here in practice.

Spectral (eigen) decomposition

Γ = eigvecs(AtA);
λ = eigvals(AtA);
result = Γ * diagm(λ) * Γ';

result[1:3,1:3]
3×3 Matrix{Float64}:
 0.917756  0.296359  0.83704
 0.296359  0.14146   0.306253
 0.83704   0.306253  1.21638
AtA[1:3,1:3]
3×3 Matrix{Float64}:
 0.917756  0.296359  0.83704
 0.296359  0.14146   0.306253
 0.83704   0.306253  1.21638
result == AtA
false
isapprox(result, AtA)
true
result ≈ AtA  # \approx TAB
true

Don’t forget floating point issues

n=500;
A = rand(n, n);
det(A'A)     
Inf
logdet(A'A)
1364.3425695000474

Here’s a matrix that is positive definite mathematically, with all real, positive eigenvalues. On a computer, though, it’s not numerically positive definite, and therefore not numerically invertible/full rank.

xs = 0:99
# Compute distance matrix.
dists = abs.(xs .- xs')  # Using broadcasting with ' (an "outer" operation).
# Create correlation matrix.
corr_matrix = exp.(-(dists/10).^2)
# Compute eigenvalues and look at the 20 smallest (eigvals returns them in ascending order).
eigvals(corr_matrix)[1:20] 
20-element Vector{Float64}:
 -5.050107193681261e-15
 -3.975288045003424e-15
 -3.0343039148117795e-15
 -2.3309913036821875e-15
 -1.5093549661154687e-15
 -1.3511621460923302e-15
 -1.2848549714671506e-15
 -1.0472866233731733e-15
 -8.644575073124488e-16
 -8.364136467422076e-16
 -7.989065965702987e-16
 -6.829697931404996e-16
 -6.091008178138255e-16
 -6.030845253775599e-16
 -5.700968077566563e-16
 -5.37906121536584e-16
 -4.615826595950015e-16
 -4.4365869197860045e-16
 -4.0894133652379844e-16
 -3.580464402267892e-16
chol = cholesky(corr_matrix);
LoadError: PosDefException: matrix is not positive definite; Cholesky factorization failed.
PosDefException: matrix is not positive definite; Cholesky factorization failed.

Stacktrace:
 [1] checkpositivedefinite
   @ /system/linux/julia-1.10.4/share/julia/stdlib/v1.10/LinearAlgebra/src/factorization.jl:67 [inlined]
 [2] cholesky!(A::Hermitian{Float64, Matrix{Float64}}, ::NoPivot; check::Bool)
   @ LinearAlgebra /system/linux/julia-1.10.4/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:269
 [3] cholesky!
   @ /system/linux/julia-1.10.4/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:267 [inlined]
 [4] cholesky!(A::Matrix{Float64}, ::NoPivot; check::Bool)
   @ LinearAlgebra /system/linux/julia-1.10.4/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:301
 [5] cholesky! (repeats 2 times)
   @ /system/linux/julia-1.10.4/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:295 [inlined]
 [6] cholesky
   @ /system/linux/julia-1.10.4/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:401 [inlined]
 [7] cholesky(A::Matrix{Float64})
   @ LinearAlgebra /system/linux/julia-1.10.4/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:401
 [8] top-level scope
   @ In[75]:1

One should be able to use an approximate pivoted Cholesky that sets diagonal elements to zero corresponding to the rank deficiency. I’m having trouble seeing how to do that.
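One candidate (a sketch; RowMaximum pivoting requires Julia 1.8+, and I haven't verified this matches the approximate factorization described above) is the pivoted Cholesky with check=false, which reports a numerical rank:

```julia
using LinearAlgebra

xs = 0:99
dists = abs.(xs .- xs')
corr_matrix = exp.(-(dists / 10) .^ 2)

# Pivoted Cholesky; `check = false` suppresses the PosDefException.
pchol = cholesky(corr_matrix, RowMaximum(); check = false)
pchol.rank    # Numerical rank, well below the full dimension of 100.
```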

This does seem to work to solve the system of equations. We’d have to investigate to know what Julia is doing behind the scenes, but it’s probably using a pivoted LU decomposition.

b = rand(100);
out = corr_matrix \ b;

Smart factorization

n = 500
A = rand(n, n);
typeof(factorize(A'A))
Cholesky{Float64, Matrix{Float64}}
typeof(factorize(A))           # I wouldn't have guessed that A is invertible!
LU{Float64, Matrix{Float64}, Vector{Int64}}
typeof(factorize(A[:,1:10]))   
QRPivoted{Float64, Matrix{Float64}, Vector{Float64}, Vector{Int64}}

The orthogonal matrices generated by certain factorizations can most efficiently be worked with by having them treated as “matrix-backed, function-based linear operators”.

QRresult = qr(A[:,1:10]);
typeof(QRresult.Q)
LinearAlgebra.QRCompactWYQ{Float64, Matrix{Float64}, Matrix{Float64}}
QRresult.Q * rand(10);

Use known structure!

using BenchmarkTools

n = 5000
A = rand(n, n);
AtA = A'A;
b = rand(n);
chol = cholesky(AtA);
typeof(chol.U)
@btime chol.U \ b;
  6.782 ms (3 allocations: 39.12 KiB)
U_dense = Matrix(chol.U);
@btime U_dense \ b;
  15.069 ms (2 allocations: 39.11 KiB)
@btime logdet(chol.U);
  104.026 μs (2 allocations: 32 bytes)
@btime logdet(U_dense);
  835.477 ms (5 allocations: 190.77 MiB)
@btime sum(log.(diag(chol.U)));
  99.164 μs (7 allocations: 78.27 KiB)
using SparseArrays

n = 5000;
A = Matrix(1.0I, n, n);
A[1,3] = 7.7;
A[5,9] = 2.3;
sA = sparse(A);

sizeof(A)
200000000
sizeof(sA)  # Not helpful given the pointers involved.
40
# sA. TAB  # What are the components of `sA`?
sizeof(sA.colptr) + sizeof(sA.nzval) + sizeof(sA.rowval)
120040
b = rand(n);
@btime A * b;
  10.192 ms (2 allocations: 39.11 KiB)
@btime sA * b;
  10.830 μs (2 allocations: 39.11 KiB)
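Rather than filling a dense matrix and converting, one can build the sparse matrix directly from (row, column, value) triplets, avoiding the O(n²) dense intermediate (a sketch of the same matrix as above):

```julia
using SparseArrays

n = 5000
# The identity plus the two off-diagonal entries, built directly.
rows = [collect(1:n); 1; 5]
cols = [collect(1:n); 3; 9]
vals = [ones(n); 7.7; 2.3]
sA = sparse(rows, cols, vals, n, n)
nnz(sA)   # 5002 stored entries.
```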

Optimization

A good place to start for a variety of standard optimization algorithms is Optim.jl.

The main function is optimize(), and you provide the optimization method you want to use as an argument. For example, with the famous banana-shaped Rosenbrock test function:

using Optim

rosenbrock(x) =  (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

result = optimize(rosenbrock, zeros(2), BFGS())
 * Status: success

 * Candidate solution
    Final objective value:     5.471433e-17

 * Found with
    Algorithm:     BFGS

 * Convergence measures
    |x - x'|               = 3.47e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 3.47e-07 ≰ 0.0e+00
    |f(x) - f(x')|         = 6.59e-14 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.20e+03 ≰ 0.0e+00
    |g(x)|                 = 2.33e-09 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    16
    f(x) calls:    53
    ∇f(x) calls:   53
Optim.minimizer(result)
2-element Vector{Float64}:
 0.9999999926033423
 0.9999999852005355
Optim.minimum(result)
5.471432670590216e-17
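Convergence can often be improved by supplying derivative information; Optim can compute gradients itself via forward-mode automatic differentiation (a sketch using the autodiff keyword):

```julia
using Optim

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# Let Optim compute the gradient by forward-mode autodiff.
result = optimize(rosenbrock, zeros(2), BFGS(); autodiff = :forward)
Optim.minimizer(result)   # ≈ [1.0, 1.0]
```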

Some of the optimizers include:

  • Derivative-free
    • Nelder-Mead
    • Simulated annealing
    • Particle swarm
  • Gradient-based
    • Adam and AdaMax
    • Conjugate gradient
    • Gradient-descent
    • BFGS and LBFGS (limited-memory BFGS)
  • Hessian-based
    • Newton
    • Newton with trust region (recommended by the package developers)
    • Interior point Newton

This collection is somewhat similar to the optimizers available with optim in R and with scipy.optimize in Python.