0

I am trying parse this portion of XML and would like it to run in iterative mode on it's own by determining how many times it needs to run.

Also, the line items may or may not have all the values for each column and i am trying to fill these blanks with None, if particular tag/text doesnt exist for any of those, so to map it to right column later in csv conversion

My XML to be parsed (highlighted in bold, invoice line items):

<Invoice>
    <DocumentSource>Supplier</DocumentSource>
    <DocumentType>Invoice</DocumentType>
    .
    .
    .

    <InvoiceLineItems>
        <LineItem>
            <InvoiceLineNum>1</InvoiceLineNum>
            <POLineNum>1</POLineNum>
            <Quantity>2</Quantity>
            <UOM>EA</UOM>
            <UnitPrice>50.00</UnitPrice>
            <LineAmount>100.00</LineAmount>
            <SalesTaxPercent>9.75</SalesTaxPercent>
            <SupplierPartNum />
            <ShortDescription>Marley &amp; Me</ShortDescription>
            <LongDescription>Marley &amp; Me</LongDescription>
            <DeliveryChargeCode/>
        </LineItem>
        <LineItem>
            <InvoiceLineNum>2</InvoiceLineNum>
            <LineAmount>-10.00</LineAmount>
        </LineItem>
    </InvoiceLineItems>
</Invoice> 

Output I am expecting need to look something like ...

What I have right now is pretty basic and looks like below:

# Counting Line Items under an Invoice Line Items
for inv_line_items in root.findall('InvoiceLineItems'):
    countX = sum([1 for entry in inv_line_items.getiterator('LineItem')])
    print(countX)

invoice_ln1 = []
invoice_ln2 = []

for i in range(0, countX):
    for z in root[18][i]:
        if i == 0:
            #invoice_hdr0.append(z.text)
            if z.tag == 'InvoiceLineNum':
                invoice_ln1.append(z.text)
            
            if z.tag == 'POLineNum':
                invoice_ln1.append(z.text)
                
            if z.tag == 'Quantity':
                invoice_ln1.append(z.text)
                
            if z.tag == 'UOM':
                invoice_ln1.append(z.text)
                
            if z.tag == 'Unit_Price':
                invoice_ln1.append(z.text)
                
            if z.tag == 'LineAmount':
                invoice_ln1.append(z.text)
                
            if z.tag == 'SalesTaxPercent':
                invoice_ln1.append(z.text)
                
            if z.tag == 'SupplierPartNum':
                invoice_ln1.append(z.text)
                
            if z.tag == 'ShortDescription':
                invoice_ln1.append(z.text)
                
            if z.tag == 'LongDescription':
                invoice_ln1.append(z.text)
                
            if z.tag == 'DeliveryChargeCode':
                invoice_ln1.append(z.text)
                
            print(invoice_ln1)

        else:
            #invoice_hdr1.append(z.text)
            if z.tag == 'InvoiceLineNum':
                invoice_ln2.append(z.text)

            if z.tag == 'POLineNum':
                invoice_ln2.append(z.text)

            if z.tag == 'Quantity':
                invoice_ln2.append(z.text)

            if z.tag == 'UOM':
                invoice_ln2.append(z.text)

            if z.tag == 'Unit_Price':
                invoice_ln2.append(z.text)

            if z.tag == 'LineAmount':
                invoice_ln2.append(z.text)

            if z.tag == 'SalesTaxPercent':
                invoice_ln2.append(z.text)

            if z.tag == 'SupplierPartNum':
                invoice_ln2.append(z.text)

            if z.tag == 'ShortDescription':
                invoice_ln2.append(z.text)
                
            if z.tag == 'LongDescription':
                invoice_ln2.append(z.text)

            if z.tag == 'DeliveryChargeCode':
                invoice_ln2.append(z.text)

            print(invoice_ln2)

1 Answers1

0

It's a little late, but let's try this:

items = """[your xml above]"""

import lxml.html
import pandas as pd

categories = ["invoicelinenum", "polinenum","quantity","uom","unitprice","lineamount","salestaxpercent","supplierpartnum","shortdescription",
"longdescription","deliverychargecode]"]

columns = ['ILI Line Num','ILI PO Line',
          'ILI QTY', 'ILI UOM','ILI Unit Price','ILI Line Amt','ILI Sales Tax %',
           'ILI Supply','ShortDesc','LongDesc','ChargeCode']

doc = lxml.html.fromstring(items)
invoices = doc.xpath('//InvoiceLineItems/LineItem'.lower())

def dict_to_list(d, keys):
    return [d.get(key, None) for key in keys]
#credit: https://stackoverflow.com/a/58192327/9448090

all_inv = []
fin_dicts=[]
fin_list = []

for invoice in invoices:    
    items = []
    for item in invoice:
        item_dict = {}
        item_dict[item.tag]= item.text
        items.append(item_dict)
    all_inv.append(items)

for inv in all_inv:
    temp_dict={}
    for d in inv:
        temp_dict.update(d)
    fin_dicts.append(temp_dict)

for dict in fin_dicts:
    fin_list.append(dict_to_list(dict, categories))

df = pd.DataFrame(fin_list,columns=columns)
df

And this should give you the table you were looking for.

Jack Fleeting
  • 24,385
  • 6
  • 23
  • 45