2

I'm trying to change a list of excel files to csvs before loading them into a pandas dataframe, but I'm unsure how I can convert them in my script. Csvkit and xlsx2csv seem to work for doing it from the command line, but when I try to start a subprocess like so

for filename in sorted_files:
file = subprocess.Popen("in2csv filename", stdout=subprocess.PIPE)
print file.stdout
dataframe = pd.read_csv(file)

I'm getting the error

IOError: Expected file path name or file-like object, got type schema must not be null when format is "fixed"

Is it possible to get the output from the subprocess and pipe that to a dataframe? Any help greatly appreciated!

boson
  • 886
  • 1
  • 12
  • 25

1 Answers1

0

Although it has been so long since the question was made, I had the same issue and this is the way it was implemented inside a python script:
Could only execute Xlsx2csv with sheetid parameter. In order to get sheet names and ids, get_sheet_details was used.
csvfrmxlsx creates csv files for each sheet in csv folder under parent directory.

import pandas as pd
from pathlib import Path


def get_sheet_details(filename):
    import xmltodict
    import shutil
    import zipfile
    sheets = []
    # Make a temporary directory with the file name
    directory_to_extract_to = (filename.with_suffix(''))
    directory_to_extract_to.mkdir(parents=True, exist_ok=True)
    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(filename, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()
    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = directory_to_extract_to / 'xl' / 'workbook.xml'
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['@sheetId'],  # can be sheetId for some versions
                'name': sheet['@name']  # can be name
            }
            sheets.append(sheet_details)
    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets


def csvfrmxlsx(xlsxfl, df):  # create csv files in csv folder on parent directory
from xlsx2csv import Xlsx2csv
(xlsxfl.parent / 'csv').mkdir(parents=True, exist_ok=True)
for index, row in df.iterrows():  
    shnum = row['id']
    shnph = xlsxfl.parent / 'csv' / Path(row['name'] + '.csv')  # path for converted csv file
    Xlsx2csv(str(xlsxfl), outputencoding="utf-8").convert(str(shnph), sheetid=int(shnum))  
return


pthfnc = 'c:/xlsx/'
wrkfl = 'my.xlsx'
xls_file = Path(pthfnc + wrkfl)
sheetsdic = get_sheet_details(xls_file)  # dictionary with sheet names and ids without opening xlsx file
df = pd.DataFrame.from_dict(sheetsdic)
csvfrmxlsx(xls_file, df)  # df with sheets to be converted
GERMAN RODRIGUEZ
  • 397
  • 1
  • 4
  • 9