Prettier code is always better. Actually, comprehensible code is always better; that's why I document like a crazy person with no short-term memory (and the latter is partially true). Sadly, some of the lessons to be learned in this set of articles will result in incomprehensible, unappealing code. The only compliment that might be paid is that it is a bit shorter and, in some cases, faster. Most of my m-scripting experience was acquired using the latest versions of MATLAB. However, those were the oh-so-happy university days, so most of the outputs provided here were produced using Octave.
Memory Access Patterns
Although this is certainly not unique to MATLAB, matrices are stored in memory in a column-wise (column-major) fashion. Namely, given an MxN matrix, A, the 1st element in the 1st column can be accessed both as A(1,1) and A(1), while the 1st element in the 2nd column can be accessed both as A(1,2) and A(M+1), respectively. This logic extends to "multi-dimensional" matrices (more correctly referred to as matrix arrays), namely matrices with more than 2 dimensions. Given that A is MxNxP, the (m, n, p)-th element can be accessed both as A(m,n,p) and as A(m + (n-1)*M + (p-1)*M*N).
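As a quick sanity check, the equivalence between subscript and linear indexing can be verified directly in Octave or MATLAB (a minimal sketch; the sizes here are arbitrary):

```matlab
% Column-major linear indexing: element (m,n,p) of an MxNxP array
% sits at linear index m + (n-1)*M + (p-1)*M*N.
M = 4; N = 3; P = 2;
A = rand([M N P]);
m = 2; n = 3; p = 2;
idx = m + (n-1)*M + (p-1)*M*N;
% Both accesses refer to the same stored value
assert(A(m,n,p) == A(idx));
% The built-in sub2ind computes the same linear index
assert(idx == sub2ind(size(A), m, n, p));
```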
While many of you probably already knew this, it is actually more important than just knowing how to pull off crazy indexing tricks. It implies that in many cases it will be more efficient to perform matrix reduction operations, i.e. min, max, sum, prod, etc., in a column-wise fashion. Again, for some of you this is common knowledge. Well, I hope I will be able to entertain you along this article, or maybe one of the next. If not, I always find short films of kittens mildly entertaining. Let's start ticking!
Any junior MATLAB profiler is well acquainted with the tic and toc function pair. Truth be told, I've been carrying around for years a small C++ library which has only these two functions. Luckily, Octave has these implemented as well. Let's try this simple script.
clc
clear
close all

Ms = [1000 2500 5000 10000 25000 50000 100000];
N = 100;
nIters = 100;

t_col = zeros(size(Ms));
t_row = t_col;

for mIdx = 1:numel(Ms)
    M = Ms(mIdx);
    A = rand([M N]);
    for i = 1:nIters
        tic
        B = sum(A,1);
        t_col(mIdx) = t_col(mIdx) + toc;
    end
    A = A.';
    for i = 1:nIters
        tic
        B = sum(A,2);
        t_row(mIdx) = t_row(mIdx) + toc;
    end
end

% Average the timings to diminish effects of "first run"
t_col = t_col/nIters;
t_row = t_row/nIters;

figure;
semilogx(Ms,t_col*1000,'-b');
hold on;
semilogx(Ms,t_row*1000,'-r');
hold off;
xlabel('Number of rows','fontsize',17);
ylabel('t [msec]','fontsize',17);
set(gca,'fontsize',15)
hdl = legend({'Column reduction','Row reduction'});
set(hdl,'fontsize',15);
An initial run in Octave on my PC shows a confusing result:

It seems that the column reduction is only slightly faster. Well, one explanation could be that my RAM access speeds are really fast. I mean really fast! So fast that the time difference of repeatedly accessing memory zones which are far apart is negligible. I find this unlikely, so perhaps the Octave implementation transposes the matrix in the second case prior to the reduction (this could be done by detecting the larger of the two dimensions). This is also unlikely, as transposing a matrix is time-consuming on its own. Then there is no choice but to assume that the Octave implementation is inefficient in both cases.
For now, I still don't have a valid MATLAB installation. I am open to collaboration in this matter! Feel free to contact me by e-mail or via the contact box on this website.
Now, let's get real!
When is this even useful? Well, as an example, let's consider optimization problems. Every optimization scheme (discrete or continuous) has a cost function, which is comprised of all of the optimization goals. A commonly used cost is the squared error:

f_i = sum_{n=1}^{N} (w_n(i) - g_n)^2

The cost function f_i depicts the total cost of the i-th iteration, which is comprised of N goals. The squared error is summed between the obtained weights (marked with w) in the i-th iteration and the goals (marked with g) of this optimization problem. Usually, the optimization scheme is designed to bring f to a minimum. In some optimization schemes, several candidate weight sets are suggested in each iteration. To conserve calculation time, it is possible to arrange the weights along the columns, so that the reduction will be performed faster. While this may seem negligible, note that in some cases a very large number of goals (weights) is set in the cost function, hence this can conserve valuable time.
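As an illustration (the variable names and sizes here are mine, purely hypothetical), suppose each of K candidate weight sets has N entries. Storing the candidates as the columns of an NxK matrix lets the squared-error cost of all of them be computed with a single column-wise reduction:

```matlab
% Hypothetical setup: K candidate weight sets, N goals each.
N = 10000;           % number of goals (weights) in the cost function
K = 50;              % number of candidate weight sets per iteration
W = rand([N K]);     % each column is one candidate weight set
g = rand([N 1]);     % the optimization goals
% Squared-error cost of every candidate, reduced column-wise.
% (Implicit expansion; on older MATLAB use bsxfun(@minus, W, g).)
f = sum((W - g).^2, 1);   % 1xK vector of costs
[fBest, kBest] = min(f);  % pick the best candidate
```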
It is important to know when this method is less efficient, as well. The best example of this, in my opinion, is batch matrix multiplication. Given an MxNxL matrix array, A, and an NxKxL one, B, a batch matrix multiplication can be performed, for example, with this script:
function C = batchMul(A,B)
    % Arrange the arrays so that the sum of products is
    % performed along the first dimension (column-wise).
    A = permute(A,[2 1 4 3]);        % N x M x 1 x L
    B = permute(B,[1 4 2 3]);        % N x 1 x K x L
    % Replicate the dimensions such that the "resulting"
    % matrix now resides in the 2nd and 3rd dimensions.
    A = repmat(A,[1 1 size(B,3)]);   % N x M x K x L
    B = repmat(B,[1 size(A,2)]);     % N x M x K x L
    % Element-wise product, then a column-wise reduction
    C = sum(A.*B,1);
    C = permute(C,[2 3 4 1]);        % M x K x L
end
Surprisingly, this column-wise reduction version (with no for loops) will run significantly slower on both MATLAB and Octave than a naive loop over the third dimension. Here is my example of such a loop-based implementation:
function C = batchMul_forloop(A,B)
    C = zeros([size(A,1) size(B,2) size(A,3)]);
    for l = 1:size(B,3)
        C(:,:,l) = A(:,:,l)*B(:,:,l);
    end
end
Test it for yourself! The no-loop version runs much slower than the one with the for loop. There is a way to speed up the no-loop implementation, but it will be discussed in future posts.
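A quick way to try this on your own machine (assuming both functions above are on the path; the sizes are my arbitrary choice) is to time the two implementations on the same random batch and confirm that they agree:

```matlab
M = 32; N = 32; K = 32; L = 50;
A = rand([M N L]);
B = rand([N K L]);
tic; C1 = batchMul(A,B);          t_noloop = toc;
tic; C2 = batchMul_forloop(A,B);  t_loop   = toc;
% The two implementations should agree up to round-off
assert(max(abs(C1(:) - C2(:))) < 1e-10);
fprintf('no-loop: %.4f s, loop: %.4f s\n', t_noloop, t_loop);
```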
Take care!