When a Pandas DataFrame is subset, does it need to be reindexed?

Question

I am making a multi-plot using a Pandas dataframe and matplotlib. However, when I alter the dataframe and remove one of the items I get an error:

ValueError: cannot reindex from a duplicate axis

My initial code is the following and it plots great, but I have an extra group (plot) that I don't need:

branchGroups = allData['BranchGroupings'].unique()

fig2 = plt.figure(figsize = (15,15))

for i, group in enumerate(branchGroups):
    ax = plt.subplot(3,3,i+1)
    # boolean mask selecting the rows of the current group
    idx = allData['BranchGroupings'] == group

    kmf.fit(T[idx], C[idx], label=group)

    kmf.plot(ax=ax, legend=False)
    plt.title(group)
    plt.xlabel('Timeline in Months')
    plt.xlim(0,150)


fig2.tight_layout()
fig2.suptitle('Cumulative Hazard Function of Employee Groups', size = 16)
fig2.subplots_adjust(top=0.88, hspace = .4)
plt.show()

In branchGroups there are 7 items when I print them out:

['BranchMgr', 'Banker', 'Service', 'MDOandRSM', 'SBRMandBBRM','FC', 'DE']

The code above makes all seven plots nicely (one plot for each group), but I don't need the 'DE' grouping.

So I dropped the 'DE' rows with the following:

#remove the DE from the data set
noDE = allData[allData.BranchGroupings != 'DE']

This drops the 'DE' from the categories and reduces the number of rows. I do a head(), and it looks great; a new data frame.

Then I modified the plotting code to draw the 6 remaining groups from the reduced data frame noDE. The code is the same apart from the new data frame reference noDE and a few name changes to prevent overwriting (fig3 rather than fig2, idxx rather than idx):

Groups = noDE['BranchGroupings'].unique()  #new data frame noDE

fig3 = plt.figure(figsize = (15,15))

for i,Groups in enumerate(Groups):
    ax = plt.subplot(3,2,i+1)
    idxx = noDE['BranchGroupings'] == Groups   #new idxx rather than idx

    kmf.fit(T[idxx], C[idxx], label=Groups)

    kmf.plot(ax=ax, legend=False)
    plt.title(Groups)
    plt.xlabel('Timeline in Months')
    plt.xlim(0,150)
    if i == 0:
        plt.ylabel('Frac Employed After $n$ Months')
    if i == 3:
        plt.ylabel('Frac Employed After $n$ Months')

fig3.tight_layout()
fig3.suptitle('Survivability of Branch Employees', size = 16)
fig3.subplots_adjust(top=0.88, hspace = .4)
plt.show()

Except, I get the error mentioned above

cannot reindex from a duplicate axis

and the traceback shows that it is associated with the line below:

kmf.fit(T[idxx], C[idxx], label=Groups)

Most likely due to the re-assignment line above it:

idxx = noDE['BranchGroupings'] == Groups

Do I need to reset/drop or do something to the new data frame noDE to reset this?

Update - this has been solved; I am not sure how 'pythonic' it is, but it works:

Okay, after more research on this, it seems that when slicing a dataframe, the slice inherits the index labels of the original, duplicates included. I found this out from another post here.

Initially, performing the following:

noDE.index.is_unique returns False
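
A minimal sketch of how a filtered frame can end up with a non-unique index. The frames part_a and part_b and their values are made up for illustration; the assumption is that allData was assembled from several pieces (for example with pd.concat), each carrying its own 0-based index:

```python
import pandas as pd

# Hypothetical pieces, each with its own default index 0, 1
part_a = pd.DataFrame({"BranchGroupings": ["Banker", "DE"], "Duration": [12, 7]})
part_b = pd.DataFrame({"BranchGroupings": ["Service", "Banker"], "Duration": [30, 45]})

# concat keeps the original labels, so the combined index is 0, 1, 0, 1
all_data = pd.concat([part_a, part_b])

# Boolean filtering removes rows but leaves the surviving labels untouched
no_de = all_data[all_data["BranchGroupings"] != "DE"]

index_labels = no_de.index.tolist()
is_unique = no_de.index.is_unique
print(index_labels, is_unique)
```

Filtering only removes rows; it never renumbers them, so any duplicate labels in the parent frame carry over to the subset.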

To make a clean slice, the following steps are needed:

#create the slice using the .copy
noDE = allData[['ProdCat', 'Duration', 'Observed', 'BranchGroupings']].copy()

#remove the DE from the data set
noDE = noDE.loc[noDE['BranchGroupings'] != 'DE'] #use .loc for cleaner slice

#reset the index so that it is unique
noDE['index'] = np.arange(len(noDE))
noDE = noDE.set_index('index')

Now noDE.index.is_unique returns True and the error is gone.
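
For what it's worth, the two index lines above appear to do the same job as a single reset_index(drop=True) call, which discards the old labels and numbers the rows 0 to n-1. The sketch below uses made-up data, and it assumes T and C are the 'Duration' and 'Observed' columns (the question never shows how they are built). Taking them from the same filtered frame as the mask keeps all three on one index:

```python
import pandas as pd

# Hypothetical stand-in for allData, with duplicate index labels
all_data = pd.DataFrame(
    {
        "BranchGroupings": ["Banker", "DE", "Service", "Banker"],
        "Duration": [12, 7, 30, 45],
        "Observed": [True, False, True, True],
    },
    index=[0, 1, 0, 1],
)

# Filter out 'DE', then replace the inherited labels with 0..n-1
no_de = all_data[all_data["BranchGroupings"] != "DE"].reset_index(drop=True)

# Assumption: T and C come from the same frame that produces the mask
T = no_de["Duration"]
C = no_de["Observed"]

subset_sizes = {}
for group in no_de["BranchGroupings"].unique():
    mask = no_de["BranchGroupings"] == group   # shares its index with T and C
    subset_sizes[group] = (len(T[mask]), len(C[mask]))

print(no_de.index.is_unique, subset_sizes)
```

With the mask, T and C all built from no_de, the boolean selection needs no realignment between differently indexed objects.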
