Read Excel files and arrange rows as columns (bold) with row values under bold rows?

Question

I have hundreds of .xls and .xlsx files with some sort of order but quite messy.

There are sometimes empty cells inbetween, those are irrelevant and should be ignored.
If there is a bold cell followed by a bold cell the first bold cell can be ignored.
If there is a bold cell followed by one or more empty cells and then a new bold cell the first bold cell + the empty cell can be ignored.
If there is a cell with "Logistic Data" or "Estimated logistic data:" or some text with lower("logistic data") any bold cell under it should be ignored.

Can I achieve the end result without going crazy by using Python/Pandas?

Example 1 (Filename: "example1 - headphone.xls"): ** = bold, doesn't seem to work in code blocks

**Features**
**Transmitter**
2.4GHZ
5GHZ
Power LED
- 1 x 3.5mm mic jack

**Receiver**
diameter
impedance
Mic Jack

**Logistic data:**
**Gift Box Measurement**
20'
40'

Example 2 ("example2 - keyboard.xls"):

**BSCI / SA8000**
BSCI / SA8000 certified
Certificate validity

**Upcost**

**Features**

Display (LCD, or LED)
Sync
**Estimated logistic data:**
Qty
Carton
20'

End result should be an .xlsx file with a drop-down menu for the values in each column like:

|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| Article   | Transmitter          | Receiver         | Logistic data:   | BSCI / SA8000           | Features                |
|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| headphone | 2.4GHZ               | diameter         | 20'              |                         |                         | 
|           | 5GHZ                 | impedance        | 40'              |                         |                         |
|           | Power LED            | Mic Jack         |                  |                         |                         |                        
|           | - 1 x 3.5mm mic jack |                  |                  |                         |                         |                        
|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| keyboard  |                      |                  | Qty              | BSCI / SA8000 certified | Display (LCD, or LED)   |  
|           |                      |                  | Carton           | Certificate validity    | Sync                    | 
|           |                      |                  | 20'              |

score 1 · Answer 1 · answered Aug 12 '20 at 12:14

For detecting the styles, you may use an external package like styleframe. Repeat Step 1 and 2 for each example file.

Read the example file, identify the indices where the style is Bold

from styleframe import StyleFrame
sf = StyleFrame.read_excel('Example-1.xlsx', read_style=True, use_openpyxl_styles=False, headers=None)
indices=[]
for i in range(0, len(sf)):
    for val in sf.iloc[i]:
        if(val.style.bold):
            indices.append(i)

Find the values between the indices.

df=pd.read_excel("Example-1.xlsx", headers=None)
df=df.astype(str)
columns=[]
values=[]
for i in range(0,len(indices)):
    print(i)
    columns.append(df.iloc[indices[i]].values[0])
    if(i+1<len(indices)):
        values.append(list(df.iloc[indices[i]+1:indices[i+1]].values))
    else:
        if(indices[i]+1<len(df)):
            values.append(list(df.iloc[indices[i]+1:].values))
        else:
            values.append([])

values=list(map(lambda z: " ".join([x[0] for x in z]), values))
temp_dict=dict(zip(columns, values))

The following code creates the final dataframe as required -

final_dict=[]
final_dict.append(temp_dict)
final_df=pd.DataFrame.from_dict(final_dict)

Example File must contain an additional Header for lesser ambiguity.

Thanks for the answer but StyleFrame is using openpyxl which does not support the .xls format? — Vega, Aug 12 '20 at 13:43
Is it fine with you to convert 'xls' file to 'xlsx'? If it is fine, you may refer the following link to do so - https://stackoverflow.com/questions/9918646/how-to-convert-xls-to-xlsx — Anant Kumar, Aug 12 '20 at 15:47
Yes, that's fine. But I tried your code with a test file where I made some cells in the first column bold and some non-bold and get the wrong cells (either bold+non bold ones or only some of the non-bold ones) with your code? Seems like StyleFrame can't handle .xlsx styles? — Vega, Aug 13 '20 at 08:59
It should support xlsx styles, can you attach the code you used? — AsafSH, Sep 26 '20 at 11:03

Vega · Accepted Answer · 2020-12-14T16:47:59.120

Only works for .xlsx files

from openpyxl import load_workbook

path = "test.xlsx"
book = load_workbook(path)
sheet = book.worksheets[0] # get first Excel sheet of test.xlsx

for cells in range(1, 201): # check first 200 cells
        cell = sheet.cell(cells, 1) # iterate over cells in Column 1 = A -> A1, A2, A3, ...
        if cell.value != None and cell.font.b == 1: # ignore empty cells and get bold cells
        ... "do stuff with cell" ...

Read Excel files and arrange rows as columns (bold) with row values under bold rows?

2 Answers2