How to normalize complex JSON structure with 4 levels of nested arrays?

Question

As I'm fairly new to python I've tried various ways based on answers found here but can't manage to normalize my json file. As I checked in Postman it has 4 levels of nested arrays. For suppliers I want to expand all the levels of data.The problem I have is with the score_card subtree for which I want to pull out all risk_groups , then name,risk_score and all riskstogether with name,risk_score and indicators with the deepest level containing again name and risk_score. and I'm not sure whether it is possible in one go based on ambiguous names between levels. Below I'm sharing a reproducible example:

data = {"suppliers": [
            {
                "id": 1260391,
                "name": "2712270 MANITOBA LTD",
                "duns": "rm-071-7291",
                "erp_number": "15189067;15189067",
                "material_group_ids": [
            176069
                ],
                "business_unit_ids": [
            13728
        ],
                "responsible_user_ids": [
            37761,
            37760,
            37759,
            37758,
            37757,
            37756,
            36520,
            37587,
            36494,
            22060,
            37742,
            36446,
            36289
        ],
                "address": {
                "address1": "BOX 550 NE 16-20-26",
                "address2": None,
                "zip_code": "R0J 1W0",
                "city": "RUSSELL",
                "state": None,
                "country_code": "ca",
                "latitude": 50.7731176,
                "longitude": -101.2862461
        },
                "score_card": {
                "risk_score": 26.13,
                "risk_groups": [
                    {
                    "name": "Viability",
                    "risk_score": 43.33,
                    "risks": [
                        {
                            "name": "Financial stability supplier",
                            "risk_score": None,
                            "indicators": [
                                {
                                    "name": "Acquisitions",
                                    "risk_score": None
                                }
                            ]
                        }
                    ]
                },
            ]
        }
    }
 ]
       }

And here is how it should look:

expected = [[1260391,'2712270 MANITOBA LTD','rm-071-7291',176069,13728,[37761,37760,        
37759,37758,37757,37756,36520,37587,36494,22060,37742,36446,36289],
'BOX 550 NE 16-20-26','None','R0J 1W0','RUSSELL','None','ca',50.7731176,
-101.2862461,26.13,'Viability',43.33,'Financial stability supplier','None',
'Acquisitions','None']]

df = pd.DataFrame(expected,columns=['id','name','duns','material_groups_ids',
'business_unit_ids','responsible_user_ids',
'address.address1','address.address2','address.zip_code',
'address.city','address.state','address.country_code','address.latitude',
'address.longitude','risk_score','risk_groups.name',
'risk_groups.risk_score','risk_groups.risks.name',
'risk_groups.risks.risk_score','risk_groups.risks.indicators.name',
'risk_groups.risks.indicators.risk_score'])

What I tried since last 3 days is to use json_normalize:

example = pd.json_normalize(data['suppliers'],
record_path=['score_card','risk_groups','risks',
'indicators'],meta = ['id','name','duns','erp_number','material_group_ids',
'business_unit_ids','responsible_user_ids','address',['score_card','risk_score']],record_prefix='_')

but when I specify the fields for record_path which seems to be a vital param here it always return this prompt:

TypeError: {'name': 'Acquisitions', 'risk_score': None} has non list value Acquisitions for path name. Must be list or null.

What I want to get a a table with columns as presented below:

So it seems that python doesn't know how to treat most granular level which contains null values.

I've tried approach from here: Use json_normalize to normalize json with nested arrays but unsuccessfully. I would like to get the same view which I was able to 'unpack' in Power BI but this way of having the data is not satisfying me. All risk indicators are the components of risks which are part of risk groups as You can see on attached pictures. Going to the most granular level it goes this way: Risk groups -> Risks -> Indicators so I would like to unpack it all.

Any help more than desperately needed. Kindly thank You for any support here.

Try adding `record_prefix='_'` to the parameters of `json_normalize` — sagi, Feb 02 '23 at 11:53
Thanks @sagi, at least the code did not broke however it did not return any nested fields, right now I've added `name` from the deepest level to pull out and it gave me an error saying: " TypeError: {'name': 'Acquisitions', 'risk_score': None} has non list value Acquisitions for path name. Must be list or null." So should I do kind of transformation ? — Rafał Pietrak, Feb 02 '23 at 12:02
It should still return nested fields but their names won't be the same. Try debugging the result and see which fields it holds. — sagi, Feb 02 '23 at 12:03
Your json indenting is all over the place, so I can't tell what belongs where without scrolling all over the place. Please fix the indentation. Also, [mre] suggest ***minimal*** examples. Do you really need such a long json object to demonstrate this? Same comment for your expect results. Please do not just dump long streams of data on a single line. — MatBailie, Feb 07 '23 at 09:22
Thanks @MatBailie for You comment, I've corrected the indentation. Regarding minimal reproducible example I've provided the minimal portion of that API content in order to recreate my issue, as this is related with the deepest level of that API example so I couldn't limit it more. — Rafał Pietrak, Feb 07 '23 at 10:53
Why did you add `'name'` to `record_path` param? `score_card.risk_groups.risks.indicators` is a list of records; `score_card.risk_groups.risks.indicators.name` isn't. But there are bugs ([pandas-dev/pandas#34465](https://github.com/pandas-dev/pandas/issues/34465), [pandas-dev/pandas#37782](https://github.com/pandas-dev/pandas/issues/37782)) that are blockers for extracting `address.*` and `responsible_user_ids`. — aaron, Feb 11 '23 at 00:53
Thanks @aaron, really appreciate that. I've corrected the code. But can I try any workaround to extract non list elements, especially address details ? — Rafał Pietrak, Feb 14 '23 at 14:39

How to normalize complex JSON structure with 4 levels of nested arrays?

0 Answers0