
I need to take 100,000 images from a directory and put them all in one big dictionary where the keys are the IDs of the pictures and the values are the numpy arrays of the images' pixels. Creating this dict takes 19 GB of my RAM, and I have 24 GB in total. Then I need to order the dictionary by key and, at the end, take only the values of this ordered dictionary and save them as one big numpy array. I need this big numpy array because I want to send it to sklearn's train_test_split function and split the whole dataset into train and test sets with respect to their labels. I found this question, How to sort a LARGE dictionary, where they have the same problem of running out of RAM at the step where, after creating the 19 GB dictionary, I try to sort it, and people there suggest using a database.

import os

import numpy as np
from numpy.lib.format import open_memmap
# assuming Keras' image utilities (load_img / img_to_array)
from keras.preprocessing import image
from keras.preprocessing.image import load_img

def save_all_images_as_one_numpy_array():
    data_dict = {}
    for img in os.listdir('images'):
        id_img = img.split('_')[1]
        loadimg = load_img(os.path.join('images', img))
        x = image.img_to_array(loadimg)
        data_dict[id_img] = x

    # Sort by ID, stack the values into one array and copy it into a memmap file
    data_dict = np.stack([v for k, v in sorted(data_dict.items(), key=lambda x: int(x[0]))])
    mmamfile = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+', shape=data_dict.shape)
    mmamfile[:] = data_dict[:]


def load_numpy_array_with_images():
    # Re-open the memmap read-only so the data stays on disk
    a = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='r')
    return a

When using np.stack I am stacking each numpy array into a new array, and this is where I run out of RAM. I can't afford to buy more RAM. I thought I could use Redis in a Docker container, but I don't understand why or how using a database would solve my problem.

mitevva_t
  • I'm not sure why you want to put Reddit into Docker if it's already available at https://www.reddit.com (you mean Redis, don't you?) – ForceBru Jan 04 '19 at 09:45
  • @ForceBru yes, I mean Redis, I will edit that. Thanks – mitevva_t Jan 04 '19 at 09:46
  • 1
  • @mitevva_t 19GB is a *small* database for 2018. Big databases run in the TB range these days. 100K records on the other hand is a tiny amount of records. In your case though I don't see why you should load the images *at all*. All you care about are the keys/paths. You don't even need a dictionary, since you don't need random key-based access. Load the paths in an array, sort the array then iterate over the paths and copy them one by one into the target file. – Panagiotis Kanavos Jan 04 '19 at 09:55
  • 1
  • @mitevva_t I suspect you could do the same with a one-line shell script that orders filenames then copies them to a target. – Panagiotis Kanavos Jan 04 '19 at 09:57
  • @PanagiotisKanavos thanks a lot! That makes sense. Just one follow-up question: when I save them one by one into the numpy file I will have many numpy arrays, but I need one big numpy array of shape (100000, width, height, channels)? – mitevva_t Jan 04 '19 at 10:02

1 Answer


The reason using a DB helps is that the DB library stores data on the hard disk rather than in memory. If you look at the documentation for the library the linked answer suggests, you'll see that the first argument is a filename, showing that the hard disk is used:
https://docs.python.org/2/library/bsddb.html#bsddb.hashopen
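
As an aside (not part of the original answer), the same disk-backed idea can be sketched with the standard library's shelve module instead of bsddb or Redis. The file name and the assumption that the IDs are numeric are illustrative only; values are pickled to disk, so only one image array has to sit in memory at a time.

import os
import shelve

from keras.preprocessing.image import load_img, img_to_array

# Open a dictionary-like store that keeps its values on disk
db = shelve.open('images_on_disk.db')

for img in os.listdir('images'):
    id_img = img.split('_')[1]   # hypothetical: second '_'-separated field is the numeric ID
    db[id_img] = img_to_array(load_img(os.path.join('images', img)))

# Walk the keys in sorted order; each value is read back from disk on demand
for key in sorted(db.keys(), key=int):
    arr = db[key]
    # ... copy arr into the output file here, one image at a time ...

db.close()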

However, the linked question is about sorting by value, not by key. Sorting by key will be much less memory intensive, although you'll likely still have memory issues when training your model. I'd suggest trying something along the lines of:

import os

import numpy as np
from keras.preprocessing import image
from keras.preprocessing.image import load_img

# Get the list of file names
imgs = os.listdir('images')

# Create a mapping of ID to file name
# This will allow us to sort the IDs then load the files in order
img_ids = {int(img.split('_')[1]): img for img in imgs}

# Get the list of file names sorted by ID
sorted_imgs = [v for k, v in sorted(img_ids.items(), key=lambda x: x[0])]

# Define a function for loading a named image
# (renamed so it doesn't shadow Keras' load_img and recurse into itself)
def load_img_array(img):
    loadimg = load_img(os.path.join('images', img))
    return image.img_to_array(loadimg)

# Iterate through the sorted file names and stack the results
data_dict = np.stack([load_img_array(img) for img in sorted_imgs])
IronFarm
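
One follow-up note, not from the original thread: the np.stack call in the answer still builds the whole array in RAM, which is exactly where the question ran out of memory. Following the suggestion in the comments, you could instead pre-allocate the memmap from the question and write the sorted images into it one at a time; the file name 'trythismmapfile.npy' and the assumption that every image has the same shape are carried over from the question.

import numpy as np
from numpy.lib.format import open_memmap

# Reuses sorted_imgs and load_img_array from the answer above.
# Peek at the first image to learn the per-image (height, width, channels) shape.
first = load_img_array(sorted_imgs[0])

# Pre-allocate the on-disk array; only one image is ever held in RAM at a time
data = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+',
                   shape=(len(sorted_imgs),) + first.shape)

for i, img in enumerate(sorted_imgs):
    data[i] = load_img_array(img)   # each image is written straight through to the file

data.flush()

The load_numpy_array_with_images function from the question can then re-open the file with mode='r' without pulling the whole array back into memory.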