6

I'm developing a Python package, and the implementation of the Python package depends on some research result. I saved my research result in a Python dictionary. I have 2 questions:

  1. How do I save this dictionary as internal data in my package?
  2. When writing functions in the package, how do I load this internal data?

I looked into this, but still couldn't get an idea of how to save package internal data from scratch. It also doesn't show how to load the saved internal data. Is there anything like devtools::use_data in R?

user9721758
  • Save it to a file, load it in a module, import the module. Dictionaries are pretty easy to serialize as json - https://docs.python.org/3/library/json.html. Or you could [pickle](https://docs.python.org/3/library/pickle.html#module-pickle) it. https://docs.python.org/3/library/persistence.html#data-persistence. – wwii Sep 01 '19 at 20:38
  • I guess I should have asked if the `research result` is dynamic or static? Does the `result` change during package/module execution? Is the `result` determined once when execution begins then used from then on? Does the `result` never change and you just need to load it when execution begins? – wwii Sep 01 '19 at 20:45
  • You linked to the version 2.6 docs. If possible you should switch to Python 3.7+. – wwii Sep 01 '19 at 20:46
  • @wwii Thanks for your response; the link has been changed. Yes, the `research result` is fixed. The `result` doesn't change during package/module execution. I just need to load it when execution begins. – user9721758 Sep 01 '19 at 21:07
  • So you are asking how to load it at the beginning of execution and make it available? You are NOT asking how to package it for distribution? – wwii Sep 01 '19 at 21:27
  • sorry about all the questions - but - is this what you are asking? [Can I use __init__.py to define global variables?](https://stackoverflow.com/questions/1383239/can-i-use-init-py-to-define-global-variables) – wwii Sep 01 '19 at 21:58

2 Answers

7

This is what I generally do for a standard Python 3 distribution with pip (it is somewhat similar to how R distributes package data).

  1. Within your code directory, create a folder for the data; let's call it "my_data". You can put anything you want here: csv, json, pickle... Be aware, though, that a pickle may fail to load in Python versions other than the one used to create it. Pickle also has security concerns (unpickling untrusted data can execute arbitrary code), so if you are going to distribute the package, choose another format.

Then, if your package is called, for instance, "my_data_pack", you will have this folder structure:

.
├── my_data_pack
│   ├── __init__.py
│   └── my_data
│       └── data_file.txt
└── setup.py
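
For the first question (how to create the data file in the first place), here is a minimal sketch. It assumes the research result contains only JSON-serializable values (strings, numbers, lists, nested dicts). The function name `save_research_result` and the file name `research_result.json` are hypothetical, chosen for illustration; the example in this answer uses `data_file.txt`.

```python
import json
from pathlib import Path


def save_research_result(result: dict, package_dir: str) -> Path:
    """Write `result` as JSON into <package_dir>/my_data and return the file path."""
    data_dir = Path(package_dir) / "my_data"
    # Create the data folder if it does not exist yet.
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / "research_result.json"  # hypothetical file name
    with open(target, "w", encoding="utf-8") as fout:
        # sort_keys gives a stable file, which keeps version-control diffs small.
        json.dump(result, fout, indent=2, sort_keys=True)
    return target
```

You would run this once from your development checkout, for example `save_research_result(my_dict, "my_data_pack")`, and then commit the generated file alongside the code.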

  2. Include these lines in your setup.py file:
from setuptools import setup, find_packages

setup(
    name='my_data_pack',
    packages=find_packages(),
    package_data={'my_data_pack': ['my_data/*']}
)

This causes the data to be included in the tar.gz distribution file when building for pip. Depending on your package structure, you may need to change the line to something like package_data={'mypkg': ['my_data/*.dat']}, as indicated in the link you mention.

  3. The final, trickier step is making the modules in your package find the dataset once the package is installed. The idea is to first locate the data file in the directory where the package is installed and then load the data into your module. To locate the data file you can use os or pkg_resources.

To use os include these lines in your __init__.py file (or in any other submodule you are using):

import os

location = os.path.dirname(os.path.realpath(__file__))
my_file = os.path.join(location, 'my_data', 'data_file.txt')

with open(my_file) as fin:
    my_data_object = fin.readlines()

or these if you prefer to use pkg_resources:

import pkg_resources

my_file = pkg_resources.resource_filename('my_data_pack', 'my_data/data_file.txt')

with open(my_file) as fin:
    my_data_object = fin.readlines()

Change the readlines section to read your own data format. That is all you need for the package code.
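
Since the question is about a dictionary, here is a sketch of what the loading step might look like with JSON in place of `readlines`. It follows the `os` approach above; the helper name `load_package_json` and the file name `research_result.json` are assumptions for illustration.

```python
import json
import os


def load_package_json(module_file, filename):
    """Load my_data/<filename> located next to `module_file` and parse it as JSON."""
    # Directory that contains the module (the installed package directory).
    location = os.path.dirname(os.path.realpath(module_file))
    path = os.path.join(location, "my_data", filename)
    with open(path, encoding="utf-8") as fin:
        return json.load(fin)


# In my_data_pack/__init__.py this could be used as:
# my_data_object = load_package_json(__file__, "research_result.json")
```

Passing `__file__` in as an argument keeps the helper reusable from any submodule of the package.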

  4. To build the distribution I run:
python3 setup.py sdist

This will create a new directory called "dist" with a tar.gz file in it. Then you can install your package with

pip3 install dist/my_data_pack-0.0.0.tar.gz
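
If the data turns out to be missing after installation, it can help to check whether the files made it into the archive at all. The following is a small sketch using the standard `tarfile` module; the function name `list_sdist_data_files` is hypothetical, and it simply assumes the data lives in a folder named "my_data" as above.

```python
import tarfile


def list_sdist_data_files(archive_path, data_dir="my_data"):
    """Return the archive member names that sit inside a folder named `data_dir`."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Member names look like "my_data_pack-0.0.0/my_data_pack/my_data/data_file.txt".
        return [name for name in archive.getnames() if f"/{data_dir}/" in name]


# Example call against the archive built above:
# print(list_sdist_data_files("dist/my_data_pack-0.0.0.tar.gz"))
```

An empty list suggests that the `package_data` pattern in setup.py does not match the files.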

To access the data in your python session you will do:

import my_data_pack
print(my_data_pack.my_data_object)

In the old R times (before devtools :) you would use the system.file function with the package option to find the location of your installed library and then load the data... something similar to Python's os.path.realpath.

dmontaner
  • Is `3.` done in `__init__.py`? I was thinking that you would load the datafile in `__init__.py` then it would be available as a package level object. – wwii Sep 01 '19 at 21:51
  • Yes you can put it in `__init__.py` or in any other submodule of the package. I made it more clear above. – dmontaner Sep 01 '19 at 23:53
  • [`load_file = lambda filename: pkgutil.get_data('my_data_pack', 'my_data/' + filename)`](https://docs.python.org/3/library/pkgutil.html#pkgutil.get_data) e.g.: `data = my_data_pack.load_file('data_file.txt')` – jfs Oct 11 '20 at 05:25
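
The `pkgutil.get_data` suggestion in the last comment returns the file contents as bytes, so a JSON file needs to be decoded and parsed afterwards. The sketch below builds a throwaway package named `demo_data_pack` in a temporary directory purely so that it runs without installing anything; that name and the file `research_result.json` are assumptions. In a real package only the last two statements matter, with 'my_data_pack' as the package name.

```python
import importlib
import json
import pkgutil
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk so the sketch is runnable on its own.
tmp_root = Path(tempfile.mkdtemp())
pkg_dir = tmp_root / "demo_data_pack"  # hypothetical package name
(pkg_dir / "my_data").mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "my_data" / "research_result.json").write_text(json.dumps({"alpha": 0.05}))

# Make the temporary package importable.
sys.path.insert(0, str(tmp_root))
importlib.invalidate_caches()

# get_data returns bytes; decode and parse them to get the dictionary back.
raw = pkgutil.get_data("demo_data_pack", "my_data/research_result.json")
research_result = json.loads(raw.decode("utf-8"))
```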
2

Python 3.4 added the pathlib module to the standard library, which makes working with file and directory locations more elegant.

To obtain the directory in which your package is installed, you can include this in your __init__.py:

from pathlib import Path
PACKAGEDIR = Path(__file__).parent.absolute()

To obtain the path of a file inside the package directory, you can construct a path as follows:

my_file = PACKAGEDIR / 'my_data' / 'data_file.txt'
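
Building on that path, here is a sketch of loading a JSON data file so that it is parsed only once, however many functions in the package ask for it. The helper name `load_data_file` and the file name `research_result.json` are assumptions for illustration, and the "my_data" folder follows the layout used in the other answer.

```python
import json
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_data_file(package_dir: Path, filename: str) -> dict:
    """Parse my_data/<filename> under `package_dir`; repeated calls reuse the result."""
    path = Path(package_dir) / "my_data" / filename
    return json.loads(path.read_text(encoding="utf-8"))


# In my_data_pack/__init__.py this could be used as:
# PACKAGEDIR = Path(__file__).parent.absolute()
# my_data_object = load_data_file(PACKAGEDIR, "research_result.json")
```

Because of the cache, every caller gets the same dictionary object, so it is best treated as read-only.
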
GeertHub