Improvement in pandas dataframe conversion in Python

Question

I have a pandas dataframe in the following form:

            id2_cond1  id2_cond2  id2_cond3  id2_cond4
id2_cond1   1.000000   0.819689  -0.753702  -0.617213
id2_cond2   0.819689   1.000000  -0.554437  -0.295122
id2_cond3  -0.753702  -0.554437   1.000000   0.939336
id2_cond4  -0.617213  -0.295122   0.939336   1.000000

What I want to do is to convert the dataframe into the following form:

      cond1_cond2 cond1_cond3 cond1_cond4 cond2_cond3 cond2_cond4 cond3_cond4
id2    0.8196886  -0.7537023  -0.6172134   -0.554437  -0.2951216   0.9393364

I can do this properly using the following script:

df_tmp = pd.DataFrame(index=[identifier], columns=cols)
counter = 0
# Walk the upper triangle row by row (.ix has been removed from pandas; .iloc is positional)
for x in range(len(df)):
    for y in range(x + 1, len(df)):
        df_tmp.iloc[0, counter] = df.iloc[x, y]
        counter += 1
print(df_tmp)

The problem with this approach is that I have to predefine the columns and I have to know the order.

cols = ["cond1_cond2", "cond1_cond3", "cond1_cond4", "cond2_cond3", "cond2_cond4", "cond3_cond4"]

Is there a better way of converting this dataframe, one that creates the different combinations automatically?

Where does the original dataframe come from? It looks like a product of two original dataframes. This is a trivial problem to solve, but I think you may be trying to solve it in a more complicated way than needed.
– firelynx, Commented Jun 11, 2015 at 13:21
Initially I have a tuple in the following form: (('id2_cond1', [0, 1, 2, 3, 4, 5]), ('id2_cond2', [3, 1, 3, 3, 4, 5]), ('id2_cond3', [9, 1, 2, 3, 0, 0]), ('id2_cond4', [12, 1, 3, 3, 1, 1])). Then I convert it to a dict, and then to a dataframe in order to calculate the correlation coefficient: df=pd.DataFrame(dict(f)).corr(method='spearman')
– fgypas, Commented Jun 11, 2015 at 13:33
Could this question be related to what you want? stackoverflow.com/questions/24002820/…
– firelynx, Commented Jun 11, 2015 at 13:53

Alexander · Accepted Answer · 2015-06-11 13:43:51Z

Original DataFrame:

import pandas as pd

df = pd.DataFrame({'id2_cond1': {'id2_cond1': 1.0, 'id2_cond2': 0.81968899999999989, 'id2_cond3': -0.75370200000000009, 'id2_cond4': -0.61721300000000001},
                   'id2_cond2': {'id2_cond1': 0.81968899999999989, 'id2_cond2': 1.0, 'id2_cond3': -0.55443699999999996, 'id2_cond4': -0.295122},
                   'id2_cond3': {'id2_cond1': -0.75370200000000009, 'id2_cond2': -0.55443699999999996, 'id2_cond3': 1.0, 'id2_cond4': 0.93933600000000006},
                   'id2_cond4': {'id2_cond1': -0.61721300000000001, 'id2_cond2': -0.295122, 'id2_cond3': 0.93933600000000006, 'id2_cond4': 1.0}})

First, let's strip out the name ('id2' in this example):

name = df.index[0].split("_")[0]

Then, let's get the name of each attribute. I've assumed that the name can also include an underscore character (which isn't present in this example), so I've first split based on the underscore, took all elements barring the first, and then joined them back together using an underscore:

conds = ["_".join(i.split("_")[1:]) for i in df.index]

Now, let's use a list comprehension to generate all of the name combinations:

idx = ['{0}_{1}'.format(conds[i], conds[j]) 
        for i in range(len(conds)) 
        for j in range(i + 1, len(conds))]

We'll use the same technique to flatten the data:

data = [df.iat[i, j] 
        for i in range(len(conds)) 
        for j in range(i + 1, len(conds))]

Finally, we'll create a Series from the above information:

corr_matrix_flat = pd.Series(data, index=idx, name=name)
>>> corr_matrix_flat
cond1_cond2    0.819689
cond1_cond3   -0.753702
cond1_cond4   -0.617213
cond2_cond3   -0.554437
cond2_cond4   -0.295122
cond3_cond4    0.939336
Name: id2, dtype: float64
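
The question asked for a one-row dataframe rather than a Series. A minimal sketch of that last step, assuming the labels always have the form prefix_condition; it swaps the nested index loops for itertools.combinations, which yields the same (i, j) pairs in the same order. The names labels, values, flat and wide are illustrative and not from the original answer.

```python
from itertools import combinations

import pandas as pd

# Correlation matrix from the question, rebuilt from its rounded values
labels = ["id2_cond1", "id2_cond2", "id2_cond3", "id2_cond4"]
values = [
    [1.000000, 0.819689, -0.753702, -0.617213],
    [0.819689, 1.000000, -0.554437, -0.295122],
    [-0.753702, -0.554437, 1.000000, 0.939336],
    [-0.617213, -0.295122, 0.939336, 1.000000],
]
df = pd.DataFrame(values, index=labels, columns=labels)

# Assumption: everything before the first underscore is the identifier
name = labels[0].split("_")[0]
conds = [label.split("_", 1)[1] for label in labels]

# All index pairs (i, j) with i < j, i.e. the upper triangle
pairs = list(combinations(range(len(conds)), 2))

flat = pd.Series(
    [df.iat[i, j] for i, j in pairs],
    index=["{0}_{1}".format(conds[i], conds[j]) for i, j in pairs],
    name=name,
)

# A named Series becomes one column; transposing turns it into one row
wide = flat.to_frame().T
print(wide)
```

The Series name ends up as the row label, so the result has a single row indexed by id2 and one column per pair.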

dct · Accepted Answer · 2015-06-13 17:10:49Z

Here is another version using the built-in pandas method stack.

import pandas as pd

df = pd.DataFrame({'id2_cond1': {'id2_cond1': 1.0, 'id2_cond2': 0.81968899999999989, 'id2_cond3': -0.75370200000000009, 'id2_cond4': -0.61721300000000001},
                   'id2_cond2': {'id2_cond1': 0.81968899999999989, 'id2_cond2': 1.0, 'id2_cond3': -0.55443699999999996, 'id2_cond4': -0.295122},
                   'id2_cond3': {'id2_cond1': -0.75370200000000009, 'id2_cond2': -0.55443699999999996, 'id2_cond3': 1.0, 'id2_cond4': 0.93933600000000006},
                   'id2_cond4': {'id2_cond1': -0.61721300000000001, 'id2_cond2': -0.295122, 'id2_cond3': 0.93933600000000006, 'id2_cond4': 1.0}})

Convert df to a Series with df.stack():

s = df.stack()
print(s)

Output

id2_cond1  id2_cond1    1.000000
           id2_cond2    0.819689
           id2_cond3   -0.753702
           id2_cond4   -0.617213
id2_cond2  id2_cond1    0.819689
           id2_cond2    1.000000
           id2_cond3   -0.554437
           id2_cond4   -0.295122
id2_cond3  id2_cond1   -0.753702
           id2_cond2   -0.554437
           id2_cond3    1.000000
           id2_cond4    0.939336
id2_cond4  id2_cond1   -0.617213
           id2_cond2   -0.295122
           id2_cond3    0.939336
           id2_cond4    1.000000
dtype: float64

Next, delete the diagonal and lower-triangle parts.

# Flag the positions above the diagonal, in the same row-by-row order as df.stack()
ind_upper = []
for i in range(len(df)):
    for j in range(len(df)):
        ind_upper.append(i < j)

s = s[ind_upper]
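
The same boolean selection can probably be built without explicit loops. A minimal sketch, assuming NumPy is available and that the stacked Series keeps the row-by-row order shown above; np.triu with k=1 keeps only the entries strictly above the diagonal. The names mask and upper are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

# Correlation matrix from the question, rebuilt from its rounded values
labels = ["id2_cond1", "id2_cond2", "id2_cond3", "id2_cond4"]
values = [
    [1.000000, 0.819689, -0.753702, -0.617213],
    [0.819689, 1.000000, -0.554437, -0.295122],
    [-0.753702, -0.554437, 1.000000, 0.939336],
    [-0.617213, -0.295122, 0.939336, 1.000000],
]
df = pd.DataFrame(values, index=labels, columns=labels)

s = df.stack()

# True strictly above the diagonal, False on and below it
mask = np.triu(np.ones(df.shape, dtype=bool), k=1)

# Flatten row by row so the mask lines up with the stacked Series
upper = s[mask.ravel()]
print(upper)
```

This relies on the matrix having no missing values, because some pandas versions drop NaN entries in stack, which would shift the positions the mask refers to.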

Next, combine the index and columns into one.

index = list(s.index)
print(index)
[('id2_cond1', 'id2_cond2'), ('id2_cond1', 'id2_cond3'), ('id2_cond1', 'id2_cond4'), ('id2_cond2', 'id2_cond3'), ('id2_cond2', 'id2_cond4'), ('id2_cond3', 'id2_cond4')]

# Join each (row, column) pair, then drop the 'id2_' prefix from both halves
index = ['_'.join(pair) for pair in index]
index = [label.replace('id2_', '') for label in index]
print(index)
['cond1_cond2', 'cond1_cond3', 'cond1_cond4', 'cond2_cond3', 'cond2_cond4', 'cond3_cond4']
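
The replace call above hard-codes the 'id2_' prefix. A minimal sketch of a prefix-agnostic variant, assuming the identifier itself contains no underscore; pair_label is a hypothetical helper, not part of pandas or of the original answer.

```python
def pair_label(row_label, col_label, sep="_"):
    """Build 'condA_condB' from 'id_condA' and 'id_condB'.

    Assumption: the identifier is everything before the first separator.
    """
    row_cond = row_label.split(sep, 1)[1]
    col_cond = col_label.split(sep, 1)[1]
    return sep.join([row_cond, col_cond])


# Two of the index tuples printed above
pairs = [("id2_cond1", "id2_cond2"), ("id2_cond3", "id2_cond4")]
labels = [pair_label(row, col) for row, col in pairs]
print(labels)  # ['cond1_cond2', 'cond3_cond4']
```

Splitting only on the first separator keeps condition names that contain underscores intact.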

Assign the new index to s:

s.index = index
print(s)
cond1_cond2    0.819689
cond1_cond3   -0.753702
cond1_cond4   -0.617213
cond2_cond3   -0.554437
cond2_cond4   -0.295122
cond3_cond4    0.939336
dtype: float64

One problem with this solution is that it reports more combinations than it should. For example, it contains both cond1_cond2 and cond2_cond1.
Problem solved: I removed the diagonal and lower-triangle parts of df.
