This is what I generally do for a standard Python 3 distribution with pip (it loosely mirrors how R packages distribute data).
- Within your code directory create a folder for the data, let's call it "my_data".
Here you can put anything you want: csv, json, pickle...
but be aware that pickle may have issues when loaded into Python versions other than the one used to create it.
There are also security concerns with pickle (loading a pickle file can execute arbitrary code), so if you are going to distribute the package, choose another format.
Then, if your package is called, for instance, "my_data_pack", you will have this folder structure:
.
├── my_data_pack
│   ├── __init__.py
│   └── my_data
│       └── data_file.txt
└── setup.py
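As a small illustration of choosing a format other than pickle, here is a sketch that writes and reads the data as JSON with the standard library. The helper names `write_data_file` and `read_data_file` and the file name `data_file.json` are made up for this example; they are not part of the layout above.

```python
import json
from pathlib import Path


def write_data_file(records, data_dir, filename="data_file.json"):
    """Serialize `records` to JSON inside `data_dir` and return the file path."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # create my_data/ if it is missing
    target = data_dir / filename
    with open(target, "w", encoding="utf-8") as fout:
        json.dump(records, fout, indent=2)
    return target


def read_data_file(path):
    """Load the JSON file back into plain Python objects."""
    with open(path, encoding="utf-8") as fin:
        return json.load(fin)
```

You would call something like `write_data_file(my_records, "my_data_pack/my_data")` once, before building the package. JSON only covers basic types (dicts, lists, strings, numbers), so this is a trade-off rather than a drop-in replacement for pickle.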
- Include these lines in the setup function of your setup.py file:
from setuptools import setup, find_packages
setup(
    name='my_data_pack',
    packages=find_packages(),
    package_data={'my_data_pack': ['my_data/*']}
)
This ensures the data is included in the tar.gz distribution file when building for pip.
Depending on your package structure you may need to change the line to something like package_data={'mypkg': ['my_data/*.dat']},
as indicated in the link you mention.
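If you are unsure which files a pattern such as 'my_data/*' or 'my_data/*.dat' will pick up, you can preview it before building. The sketch below assumes the patterns are expanded relative to the package directory in the same way the standard `glob` module expands them, which is my understanding of setuptools rather than something guaranteed here. `preview_package_data` is a hypothetical helper, not a setuptools function.

```python
import glob
import os


def preview_package_data(package_dir, patterns):
    """Return the files under `package_dir` matched by package_data-style patterns."""
    matched = []
    for pattern in patterns:
        # glob.escape protects special characters in the directory name itself
        for path in glob.glob(os.path.join(glob.escape(package_dir), pattern)):
            if os.path.isfile(path):  # skip directories, only files are data
                matched.append(os.path.relpath(path, package_dir))
    return sorted(set(matched))
```

For example, `preview_package_data("my_data_pack", ["my_data/*"])` lists the files directly inside my_data. Under this assumption a single `*` does not descend into subfolders, so nested data would need its own pattern.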
- The final and tricky part is making the modules in your package find the dataset once it is installed.
The idea is first to locate the data file in the directory where the package is installed and then to load the data into your module.
To locate the data file you can use os or pkg_resources.
To use os, include these lines in your __init__.py file (or in any other submodule you are using):
import os

# Directory where this package is installed
location = os.path.dirname(os.path.realpath(__file__))
my_file = os.path.join(location, 'my_data', 'data_file.txt')
with open(my_file) as fin:
    my_data_object = fin.readlines()
or these if you prefer to use pkg_resources:
import pkg_resources

# Resolve the data file relative to the installed package
my_file = pkg_resources.resource_filename('my_data_pack', 'my_data/data_file.txt')
with open(my_file) as fin:
    my_data_object = fin.readlines()
Change the readlines section to read your own data format. That is all you need for the package code.
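To illustrate what swapping out the readlines part might look like, here is a sketch that picks a parser from the file extension. The function name `load_data` and the choice of supported extensions are my own assumptions; adapt them to whatever format you actually ship.

```python
import csv
import json
import os


def load_data(path):
    """Load a data file, choosing the parser from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".json":
        with open(path, encoding="utf-8") as fin:
            return json.load(fin)
    if ext == ".csv":
        # newline="" is what the csv module documentation recommends
        with open(path, newline="", encoding="utf-8") as fin:
            return list(csv.DictReader(fin))  # one dict per row, keyed by header
    # Anything else falls back to the plain-text behaviour used above
    with open(path, encoding="utf-8") as fin:
        return fin.readlines()
```

In __init__.py you would then write something like `my_data_object = load_data(my_file)` in place of the with block.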
- To build the library distribution I run:
python3 setup.py sdist
This will create a new directory called "dist" with a tar.gz file in it.
Then you can install your package with
pip3 install dist/my_data_pack-0.0.0.tar.gz
To access the data in your Python session, run:
import my_data_pack
print(my_data_pack.my_data_object)
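If the import fails because the data file cannot be found, one thing worth checking is whether the file made it into the archive at all. The sketch below lists the archive members that sit inside a my_data folder, using only the standard `tarfile` module. `data_files_in_sdist` is a hypothetical helper name, and the archive path in the usage note simply reuses the file name from the install command above.

```python
import tarfile


def data_files_in_sdist(sdist_path, data_dir="my_data"):
    """List archive members that sit inside a `data_dir` folder of the sdist."""
    with tarfile.open(sdist_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            # tar member names use "/" separators regardless of platform
            if member.isfile() and data_dir in member.name.split("/")
        )
```

Calling `data_files_in_sdist("dist/my_data_pack-0.0.0.tar.gz")` should return a non-empty list if package_data was picked up; an empty list suggests the pattern in setup.py did not match anything.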
In the old R times (before devtools :) you would use the system.file function with the option package to find the location of your installed library and then load the data... something similar to the Python os.path.realpath.
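On the Python side there is also a newer standard-library counterpart to the os and pkg_resources approaches: `importlib.resources`. The setuptools documentation now discourages pkg_resources in favour of it. The sketch below assumes Python 3.9 or later (where `importlib.resources.files` exists); the helper name `read_packaged_lines` is made up for illustration.

```python
from importlib import resources  # resources.files() needs Python 3.9+


def read_packaged_lines(package, *path_parts):
    """Read a text file shipped inside an installed package, line by line."""
    resource = resources.files(package)
    for part in path_parts:
        # Walk down one path component at a time
        resource = resource.joinpath(part)
    with resource.open("r", encoding="utf-8") as fin:
        return fin.readlines()
```

Inside the package this would be used as `my_data_object = read_packaged_lines('my_data_pack', 'my_data', 'data_file.txt')`. Unlike the os approach it does not rely on `__file__`, at the cost of requiring a more recent Python.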