
I have some pretty massive data files (256 channels, on the order of 75-100 million samples each, i.e. ~40-50 GB per file) in int16 format. They are written as flat binary, so the structure is something like: CH1S1, CH2S1, CH3S1 ... CH256S1, CH1S2, CH2S2, ...

I need to read in each channel separately, filter and offset-correct it, then save. My current bottleneck is loading each channel, which takes about 7-8 minutes... scale that up 256 times and I'm looking at nearly 30 hours just to load the data! I am trying to use fread intelligently, skipping bytes as I read each channel; I have the following code in a loop over all 256 channels:

offset = i - 1;                     % 0-based channel index into the interleaved stream
fseek(fid,offset*2,'bof');          % seek to channel i's first int16 sample
dat = fread(fid,[1,nSampsTotal],'*int16',(nChan-1)*2);  % skip the other 255 channels between reads

From what I've read, this is typically the fastest way to load parts of a large binary file, but is the file simply too large to do this any faster?

I'm not loading that much data: the test file I'm working with is 37 GB, and for one of the 256 channels I'm only loading 149 MB for the entire trace... maybe the 'skip' functionality of fread is suboptimal?

System details: MATLAB R2017a, Windows 7 64-bit, 32 GB RAM

Chris
    If you can afford to load the full file in one go, do that. Otherwise, try loading large chunks and use indexing to keep the parts you need, and see if that is any faster. – Cris Luengo Aug 20 '18 at 20:39
    Q1) Why do you *"need to read each channel in separately"*? Q2) Why don't you either store your data in a more sensible format, or temporarily split it using a high-performance language, filter with R, then reassemble it with a high-performance language? – Mark Setchell Aug 20 '18 at 20:55
  • @Cris: Loading it in one go would be great, but I don't have enough RAM (32GB). Loading large chunks isn't great, because of edge artifacts from filtering (unless I pad the loaded chunks and recombine, which I could do but wanted to avoid). Mostly, I'm confused why the 'skip' functionality of fread is so slow... everything I read tells me that skipping bytes is the fastest way to load portions of binary files in Matlab – Chris Aug 20 '18 at 21:25
  • @MarkSetchell: Reading separately is not a necessity, but ideal for filtering as mentioned above. For storage, this is the format of the raw data from an electrophysiology rig. What high performance language would you suggest? C? Most of my experience is in Matlab and Python, but I could give it a go... – Chris Aug 20 '18 at 21:34
    I use a Mac with an NVME SSD drive that sustains 2GB/s, so will read a 50GB file in 25s and write it in a further 30-40s. A conventional drive would probably run around 10-15x slower, so you could expect to split the files in 15 mins and to recombine in the same time afterwards. Not sure how long your processing takes per channel, but presumably you could trivially parallelise it across 4-8 threads if the per channel data was separate - as I am suggesting. – Mark Setchell Aug 20 '18 at 21:46
  • I was suggesting you read a chunk (say 10Gb) without skipping, then use MATLAB indexing to extract your channel, read the next chunk, etc. Then assemble the chunks. Compare that execution time to your current method, see if there's any difference. – Cris Luengo Aug 20 '18 at 21:52
  • @CrisLuengo That sounds good. I think the secret is to read sequentially and not skip around. If you skip to each of 100m samples of each of 256 channels that is an awful lot of system calls and latencies are cumulative. – Mark Setchell Aug 20 '18 at 21:58
  • @CrisLuengo Ah, I see what you're saying. Maybe I can write each channel to a temporary file and then load that for analysis. And yeah, it also sounds like it would be better to ditch HDDs for this analysis – Chris Aug 21 '18 at 14:04

1 Answer


@CrisLuengo's idea was much faster: essentially, chunk the data, load each chunk sequentially, then split it out to separate channel files to save RAM.

Here is code for just the loading part, which is fast (under a minute):

% fake raw data
disp('building... ');
nChan = 256;
nSampsTotal = 10e6;
% generate int16 directly: rand() gives doubles in [0,1], which would round
% to 0/1 when written as int16, and costs 8 bytes per sample in RAM
tic; DATA = randi([-2^15, 2^15-1],nChan,nSampsTotal,'int16'); toc;
fid = fopen('rawData.dat','w');
disp('writing flat binary file... ');
% DATA(:) walks down the columns, giving the channel-interleaved order
% CH1S1, CH2S1, ..., CH256S1, CH1S2, ...
tic; fwrite(fid,DATA(:),'int16'); toc;
fclose(fid);

% chunk size (samples per channel) and number of chunks
chunkSize = 1e6;
nChunksTotal = ceil(nSampsTotal/chunkSize);


%% load by chunks
t1 = tic;
fid = fopen('rawData.dat','r');
for chunkCnt = 1:nChunksTotal
    tic;
    fprintf('Chunk %02d/%02d: loading... ',chunkCnt,nChunksTotal);
    % one large sequential read: nChan x chunkSize int16 values, no skipping
    dat = fread(fid,[nChan,chunkSize],'*int16');
    toc;
end
t = toc(t1); fprintf('Total time: %4.2f secs.\n\n\n',t);
% Total time: 55.07 secs.
fclose(fid);
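The splitting step (writing each chunk's rows out to per-channel files) isn't timed above; a minimal sketch of it follows. The output file names (channel%03d.dat) are my own choice, not from the original post:

```matlab
%% split each chunk out to per-channel files (sketch)
fid = fopen('rawData.dat','r');
chFids = zeros(1,nChan);
for ch = 1:nChan
    chFids(ch) = fopen(sprintf('channel%03d.dat',ch),'w');
end
for chunkCnt = 1:nChunksTotal
    % sequential read, same as the fast loop above
    dat = fread(fid,[nChan,chunkSize],'*int16');
    for ch = 1:nChan
        % append this chunk's slice of channel ch to its own file
        fwrite(chFids(ch),dat(ch,:),'int16');
    end
end
fclose(fid);
for ch = 1:nChan
    fclose(chFids(ch));
end
```

Each channelNNN.dat can then be loaded whole (~149 MB per channel for the 37 GB file) for filtering, with no seeking and no edge artifacts from chunked filtering.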

On the other hand, loading channel by channel by skipping through the file takes about 20x longer, a little under 19 minutes:

%% load by channels (slow)
t1 = tic;
fid = fopen('rawData.dat','r');
for i = 1:nChan
    tic;
    fprintf('Channel %03d/%03d: loading... ',i,nChan);
    offset = i-1;
    fseek(fid,offset*2,'bof');      % seek to channel i's first sample
    % read every nChan-th int16, skipping the other channels' bytes
    dat = fread(fid,[1,nSampsTotal],'*int16',(nChan-1)*2);
    toc;
end
t = toc(t1); fprintf('Total time: %4.2f secs.\n\n\n',t);
% Total time: 1133.48 secs.
fclose(fid);
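Worth noting as an alternative (not part of the original answer): MATLAB's memmapfile can map the interleaved file as a nChan x nSampsTotal matrix and pull out one channel by indexing, letting the OS page data in rather than issuing one skipping read per sample. A sketch, reusing nChan and nSampsTotal from the script above; the channel number 5 is arbitrary:

```matlab
% memory-map the flat binary file as a single int16 matrix named 'dat'
m = memmapfile('rawData.dat', ...
    'Format',{'int16',[nChan,nSampsTotal],'dat'});
ch = m.Data.dat(5,:);   % one channel as a 1 x nSampsTotal int16 row
```

Whether this beats the explicit chunked read depends on the OS page cache and the drive, so it is worth benchmarking against the chunked version.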

I'd also like to thank OCDER on the MATLAB forums for their help: link

Chris