
I use TensorFlow for deep learning work, but I was interested in some of the features of Julia for ML. In TensorFlow there is a clear standard that protocol buffers -- specifically the TFRecord format -- are the best way to load sizable datasets to the GPUs for model training. I have been reading the Flux and Knet documentation, as well as other forum posts, looking for any particular recommendation on the most efficient data format, but I have not found one.

My question is, is there a recommended data format for the Julia ML libraries to facilitate training? In other words, are there any clear dataset formats that I should avoid because of bad performance?

Now, I know that there is a Protobuf.jl library, so users can still use protocol buffers. I was planning to use protocol buffers for now, since I can then share the same data format between TensorFlow and Julia. However, I also found this interesting Reddit post where a user skips protocol buffers and just uses plain Julia vectors.

https://www.reddit.com/r/MachineLearning/comments/994dl7/d_hows_julia_language_mit_for_ml/

I get that the Julia ML libraries are likely agnostic about the data storage format: whatever format the data is stored in, it gets decoded into some sort of vector or matrix anyway, so in principle I can use whatever format I like. But I wanted to make sure I did not miss anything in the documentation about problems or low performance caused by the wrong storage format.

logankilpatrick
krishnab

1 Answer


For in-memory use, just use arrays and vectors. They're just big contiguous lumps of memory with some metadata; you can't really do better than that.
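As a minimal sketch of what that looks like in practice (the sizes and label range here are illustrative, not from any particular dataset):

```julia
# A dataset for training can simply be a Float32 matrix, features × observations.
X = rand(Float32, 28 * 28, 60_000)  # e.g. flattened 28×28 images
y = rand(1:10, 60_000)              # integer class labels

# A minibatch is just a slice (or a view, to avoid copying).
batch_x = @view X[:, 1:128]
batch_y = @view y[1:128]
```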

For serializing to another Julia process, Julia will handle that for you via the stdlib Serialization module.
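For instance, a round trip through an in-memory buffer (the same `serialize`/`deserialize` calls work on any IO, such as a pipe or socket to another process):

```julia
using Serialization

X = rand(Float32, 3, 4)
buf = IOBuffer()
serialize(buf, X)    # write the array in Julia's native serialization format
seekstart(buf)
X2 = deserialize(buf)  # X2 is equal to X
```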

For serializing to disk, either just use Serialization.serialize (possibly compressed), or, if you might need to read the data from another program, or expect to change Julia versions before you're done with the data, use BSON.jl or Feather.jl.
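A quick sketch of the stdlib route (the filename and `.jls` extension are just conventions for this example):

```julia
using Serialization

X = rand(Float32, 100, 10)

# Write to disk with the stdlib serializer.
open("data.jls", "w") do io
    serialize(io, X)
end

# Read it back; open(f, path) calls f on the handle and closes it.
X2 = open(deserialize, "data.jls")
```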

In the near future, JLSO.jl will be a good option for replacing Serialization.
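For completeness, a rough sketch of how JLSO.jl is used, based on its README at the time; the exact API may differ by version, so treat this as an assumption rather than a reference:

```julia
using JLSO  # third-party package; add with: pkg> add JLSO

# Save one or more named values; JLSO also records environment metadata.
JLSO.save("data.jlso", :X => rand(Float32, 100, 10))

# Loading returns a Dict keyed by the symbols you saved.
d = JLSO.load("data.jlso")
X = d[:X]
```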

cmc
  • 973
  • 8
  • 17
  • Thanks for the insights. I was wondering about this for training on medium-sized datasets that were serialized. Cool. BSON.jl seems similar to protocol buffers, which are familiar. I had not seen the Serialization module or JLSO.jl before; I will have to check them out. I guess I will need to write some simple tests and benchmark them for decoding speed. – krishnab Jul 18 '19 at 02:08
  • 1
    stdlib Serialization + compression is the fastest, afaik. JLSO.jl just uses Serialization but keeps some metadata in BSON about which packages you have installed (because the stdlib serialization format is notionally unstable across versions -- in reality I suspect arrays of floats will be safe). BSON is unlike protocol buffers in that BSON is unsuitable for in-memory computation. Because I crave karma and to make this answer more visible to others, please upvote and accept the answer if it's a suitable answer to your initial query. – cmc Jul 18 '19 at 14:21
  • Knet's batch is a tuple of arrays. – Kermit Jul 03 '21 at 23:50