I have hundreds of .xls and .xlsx files with some sort of order but quite messy.
- There are sometimes empty cells inbetween, those are irrelevant and should be ignored.
- If there is a bold cell followed by a bold cell the first bold cell can be ignored.
- If there is a bold cell followed by one or more empty cells and then a new bold cell the first bold cell + the empty cell can be ignored.
- If there is a cell with "Logistic Data" or "Estimated logistic data:" or some text with lower("logistic data") any bold cell under it should be ignored.
Can I achieve the end result without going crazy by using Python/Pandas?
Example 1 (Filename: "example1 - headphone.xls"): ** = bold, doesn't seem to work in code blocks
**Features**
**Transmitter**
2.4GHZ
5GHZ
Power LED
- 1 x 3.5mm mic jack
**Receiver**
diameter
impedance
Mic Jack
**Logistic data:**
**Gift Box Measurement**
20'
40'
Example 2 ("example2 - keyboard.xls"):
**BSCI / SA8000**
BSCI / SA8000 certified
Certificate validity
**Upcost**
**Features**
Display (LCD, or LED)
Sync
**Estimated logistic data:**
Qty
Carton
20'
End result should be an .xlsx file with a drop-down menu for the values in each column like:
|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| Article | Transmitter | Receiver | Logistic data: | BSCI / SA8000 | Features |
|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| headphone | 2.4GHZ | diameter | 20' | | |
| | 5GHZ | impedance | 40' | | |
| | Power LED | Mic Jack | | | |
| | - 1 x 3.5mm mic jack | | | | |
|-----------|----------------------|------------------|------------------|-------------------------|-------------------------|
| keyboard | | | Qty | BSCI / SA8000 certified | Display (LCD, or LED) |
| | | | Carton | Certificate validity | Sync |
| | | | 20' |