2

I have hundreds of .xls and .xlsx files with some sort of order but quite messy.

  • There are sometimes empty cells inbetween, those are irrelevant and should be ignored.
  • If there is a bold cell followed by a bold cell the first bold cell can be ignored.
  • If there is a bold cell followed by one or more empty cells and then a new bold cell the first bold cell + the empty cell can be ignored.
  • If there is a cell with "Logistic Data" or "Estimated logistic data:" or some text with lower("logistic data") any bold cell under it should be ignored.

Can I achieve the end result without going crazy by using Python/Pandas?

Example 1 (Filename: "example1 - headphone.xls"): ** = bold, doesn't seem to work in code blocks

**Features**
**Transmitter**
2.4GHZ
5GHZ
Power LED
- 1 x 3.5mm mic jack

**Receiver**
diameter
impedance
Mic Jack

**Logistic data:**
**Gift Box Measurement**
20'
40'

Example 2 ("example2 - keyboard.xls"):

**BSCI / SA8000**
BSCI / SA8000 certified
Certificate validity

**Upcost**

**Features**

Display (LCD, or LED)
Sync
**Estimated logistic data:**
Qty
Carton
20'

End result should be an .xlsx file with a drop-down menu for the values in each column like:

|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| Article   | Transmitter          | Receiver         | Logistic data:   | BSCI / SA8000           | Features                |
|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| headphone | 2.4GHZ               | diameter         | 20'              |                         |                         | 
|           | 5GHZ                 | impedance        | 40'              |                         |                         |
|           | Power LED            | Mic Jack         |                  |                         |                         |                        
|           | - 1 x 3.5mm mic jack |                  |                  |                         |                         |                        
|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| keyboard  |                      |                  | Qty              | BSCI / SA8000 certified | Display (LCD, or LED)   |  
|           |                      |                  | Carton           | Certificate validity    | Sync                    | 
|           |                      |                  | 20'              |
Vega
  • 2,661
  • 5
  • 24
  • 49

2 Answers2

1

For detecting the styles, you may use an external package like styleframe. Repeat Step 1 and 2 for each example file.

  1. Read the example file, identify the indices where the style is Bold
from styleframe import StyleFrame
sf = StyleFrame.read_excel('Example-1.xlsx', read_style=True, use_openpyxl_styles=False, headers=None)
indices=[]
for i in range(0, len(sf)):
    for val in sf.iloc[i]:
        if(val.style.bold):
            indices.append(i)
  1. Find the values between the indices.
df=pd.read_excel("Example-1.xlsx", headers=None)
df=df.astype(str)
columns=[]
values=[]
for i in range(0,len(indices)):
    print(i)
    columns.append(df.iloc[indices[i]].values[0])
    if(i+1<len(indices)):
        values.append(list(df.iloc[indices[i]+1:indices[i+1]].values))
    else:
        if(indices[i]+1<len(df)):
            values.append(list(df.iloc[indices[i]+1:].values))
        else:
            values.append([])

values=list(map(lambda z: " ".join([x[0] for x in z]), values))
temp_dict=dict(zip(columns, values))
  1. The following code creates the final dataframe as required -
final_dict=[]
final_dict.append(temp_dict)
final_df=pd.DataFrame.from_dict(final_dict)

Example File must contain an additional Header for lesser ambiguity.

Sample Input

Anant Kumar
  • 611
  • 5
  • 20
  • Thanks for the answer but StyleFrame is using openpyxl which does not support the .xls format? – Vega Aug 12 '20 at 13:43
  • Is it fine with you to convert 'xls' file to 'xlsx'? If it is fine, you may refer the following link to do so - https://stackoverflow.com/questions/9918646/how-to-convert-xls-to-xlsx – Anant Kumar Aug 12 '20 at 15:47
  • Yes, that's fine. But I tried your code with a test file where I made some cells in the first column bold and some non-bold and get the wrong cells (either bold+non bold ones or only some of the non-bold ones) with your code? Seems like StyleFrame can't handle .xlsx styles? – Vega Aug 13 '20 at 08:59
  • It should support xlsx styles, can you attach the code you used? – AsafSH Sep 26 '20 at 11:03
0

Only works for .xlsx files

from openpyxl import load_workbook

path = "test.xlsx"
book = load_workbook(path)
sheet = book.worksheets[0] # get first Excel sheet of test.xlsx

for cells in range(1, 201): # check first 200 cells
        cell = sheet.cell(cells, 1) # iterate over cells in Column 1 = A -> A1, A2, A3, ...
        if cell.value != None and cell.font.b == 1: # ignore empty cells and get bold cells
        ... "do stuff with cell" ...
Vega
  • 2,661
  • 5
  • 24
  • 49