
I have a Fortran 90 code that spends (by far) most of its time on I/O, because very large data files (1 GB and up) need to be read. Smaller, but still large, data files with the results of the calculations need to be written. By comparison, the fast Fourier transforms and other calculations are done in no time. I have parallelized (OpenMP) some of these calculations, but the overall gain in performance is minimal given the I/O bottleneck.

My strategy at the moment is to read the whole file at once:

open(unit=10, file="data", status="old")

do i=1,verylargenumber
  read(10,*) var1(i), var2(i), var3(i)
end do

close(10)

and then perform operations on var1, etc. My question is whether there is a suitable strategy, using (preferably) OpenMP, that would allow me to speed up the reading process, especially considering (if it makes any difference) that the data files are quite large.

I have the possibility to run these calculations on Lustre file systems, which in principle offer advantages for parallel I/O, although a general solution for regular file systems would be appreciated.

My intuition is that there is no way around this issue, but I wanted to check for sure.

Miguel
    Reading (or writing) the same file from two threads simultaneously is very likely to lead to contention over access to the single I/O hardware channel between RAM and disk surface. (Unless you have a parallel-at-hardware-level disk system, that is.) In general, your current approach of reading (and writing) large files in one go is the best one. You might get better performance by carefully matching input/output buffer sizes to the blocks of memory you want to process, but that takes you well outside Fortran. There are other tricks too, but again, extra-Fortran-ic. – High Performance Mark Dec 08 '15 at 13:12
  • @HighPerformanceMark Thanks for the input. It is possible for me to run these calculations on Lustre file systems, which (as far as I know, I'm certainly no expert here) is what you call a "parallel-at-hardware-level disk system". Do you think that would make things better? In general though, the typical user would run this on a normal machine. – Miguel Dec 08 '15 at 13:21
  • Well, yes, Lustre is just one of those types of file system which *may* provide faster reading and writing for parallel programs. I'm unable to provide assistance using OpenMP on Lustre, but I think it is material information for anyone else who happens across your question so edit the question. Don't rely on people seeing material in comments. – High Performance Mark Dec 08 '15 at 13:23
  •
    Do the files have to be human readable? If not, you will get much faster performance with unformatted files (sometimes called "binary"). Use form='unformatted' in the open statement. Much of the runtime is probably spent in the conversions between the character and internal representations of the numbers. – M. S. B. Dec 08 '15 at 13:38
  • @M.S.B. Thanks, this might actually help. The files would usually be provided in typical formats for molecular dynamics simulations, which are sometimes ASCII, sometimes binary. I guess I can write a small interface that allows the user to employ binary files and improve performance. – Miguel Dec 08 '15 at 13:51
  • You could also split your file in multiple independent files. Now, each thread can read a different file. If you are able put your files on different physical hard drives, you will gain from parallelization. – Anthony Scemama Dec 08 '15 at 14:21
  • I'm surprised nobody has mentioned that if you have access to separate *machines* (or clusters/nodes/etc), you can vastly improve your I/O with MPI. That being said, it will require a complete re-write of many parts of your application, but molecular dynamics simulations can often be made embarrassingly parallel (or close to it), so it will improve your computation time as well. Using MPI across threads within one machine however is not likely to improve much at all. – NoseKnowsAll Dec 08 '15 at 17:21
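To illustrate M. S. B.'s unformatted-file suggestion from the comments above, a one-off converter along these lines (file names are illustrative) could turn the ASCII input into an unformatted file that later runs can read without any character-to-binary conversion:

```fortran
program ascii_to_unformatted
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: v1, v2, v3
  integer :: ios

  open(unit=10, file="data",     status="old")                  ! original ASCII file
  open(unit=11, file="data.bin", form="unformatted", status="replace")
  do
     read(10, *, iostat=ios) v1, v2, v3
     if (ios /= 0) exit
     write(11) v1, v2, v3   ! stored in native binary representation
  end do
  close(10)
  close(11)
end program ascii_to_unformatted
```

Subsequent runs would then open "data.bin" with form='unformatted' and skip the formatted parsing entirely. (For even better performance the triples could be buffered and written in larger records, as discussed in the answer below this is where much of the remaining overhead lives.)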

1 Answer


I'm not a Fortran guru, but it looks like you are reading the values from the file in very small chunks (three values at a time, at most a few dozen bytes). Reading the file in large chunks (multi-MB at a time) is going to provide a significant improvement in performance, since you will be reducing the number of underlying read() system calls (and corresponding locking overhead) by many orders of magnitude.

If your large files are written in Lustre with multiple stripes (e.g. in a directory with lfs setstripe -c 8 -S 4M <dir> to set a default stripe count of 8 with a stripe size of 4MB for all new files in that directory) then this may improve the aggregate read performance - assuming that you are reading only a single file at one time, and you are not limited by the client network bandwidth. If your program is running on multiple nodes and/or threads concurrently, and each of those threads is itself reading its own file, then you will already have parallelism above the file level. Even reading from a single file can do quite well (if the reads are large) because the Lustre client will do readahead in the background.
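For reference, a minimal striping setup might look like the following (the directory path is illustrative, and this requires a Lustre client with the lfs tool available):

```shell
# New files created in this directory will be striped across 8 OSTs,
# 4 MB per stripe
mkdir -p /lustre/scratch/mydata
lfs setstripe -c 8 -S 4M /lustre/scratch/mydata

# Verify the layout that new files in the directory will inherit
lfs getstripe /lustre/scratch/mydata
```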

If you have multiple compute threads that are each working on different chunks of the file at one time (e.g. 4MB chunks) then you could read each of the 4MB chunks from a different thread, which may improve performance since you will have more IO requests in flight. However, there is still a limit on how fast a single client can read files over the network. Reading from a multi-striped file from multiple clients concurrently will allow you to aggregate the network and disk bandwidth from multiple clients and servers, which is where Lustre does best.
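A hedged sketch of the multi-threaded chunk reading described above, using OpenMP and stream-access positioning (pos=) so each thread reads its own 4 MB region of a hypothetical "data.bin". Note that connecting the same file to several units within one program is technically processor-dependent, though common compilers permit it for read-only access:

```fortran
program parallel_chunk_read
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer(kind=8), parameter :: chunk_bytes = 4*1024*1024   ! 4 MB per chunk
  integer, parameter :: chunk_elems = 512*1024              ! 8-byte reals per chunk
  integer(kind=8) :: filesize, nchunks, ichunk
  real(dp), allocatable :: buffer(:)
  integer :: unit, ios

  inquire(file="data.bin", size=filesize)
  nchunks = (filesize + chunk_bytes - 1) / chunk_bytes

  !$omp parallel private(unit, buffer, ios, ichunk)
  allocate(buffer(chunk_elems))
  ! Each thread opens its own unit so the reads do not serialize on one handle
  unit = 100 + omp_get_thread_num()
  open(unit=unit, file="data.bin", access="stream", form="unformatted", &
       status="old", action="read")
  !$omp do schedule(dynamic)
  do ichunk = 0, nchunks - 1
     read(unit, pos=ichunk*chunk_bytes + 1, iostat=ios) buffer
     ! ... process this chunk (handle a short final chunk separately) ...
  end do
  !$omp end do
  close(unit)
  deallocate(buffer)
  !$omp end parallel
end program parallel_chunk_read
```

With striping as described above, consecutive 4 MB chunks land on different OSTs, so concurrent requests from different threads can hit different servers.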

LustreOne