0

I want to find list of columns present in more than 500 csv files. Read only columns & put in dataframe or list and later I want to export to csv.

Learnings
  • 2,780
  • 9
  • 35
  • 55

1 Answers1

2

pandas can easily extract only certain columns and write out a new CSV file for you:

import pandas as pd

data = pd.read_csv('input.csv', usecols=[1, 3], index_col=1)
data.to_csv('output.csv')

If you wanted to extract the same two columns from multiple files into a single output, you could do the following:

import pandas as pd
import glob

combined_data = pd.DataFrame()

for csv_file in glob.glob('*.csv'):
    data = pd.read_csv(csv_file, usecols=[1, 3], index_col=1, skiprows=1, header=None, names=['Col1', 'Col2'])
    combined_data = combined_data.append(data)

combined_data.to_csv('output.csv')

This might need tweaking depending on whether or not your files have headers and the same column names.


To extract the headers from all of the files (without loading the whole file), you could simply use the CSV library as follows:

import csv
import glob

with open('headers.csv', 'w', newline='') as f_headers:
    csv_headers = csv.writer(f_headers)

    for csv_file in glob.glob('input*.csv'):
        with open(csv_file, 'r', newline='') as f_csv:
            csv_headers.writerow([csv_file] + next(csv.reader(f_csv)))

This would also put the filename as the first entry on each row.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97