I have a dataframe like this

group         b         c         d         e  label
A      0.577535  0.299304  0.617103  0.378887      1
       0.167907  0.244972  0.615077  0.311497      0
B      0.640575  0.768187  0.652760  0.822311      0
       0.424744  0.958405  0.659617  0.998765      1
       0.077048  0.407182  0.758903  0.273737      0

I want to reshape it into a 3D array that an LSTM could use as input, padding the shorter groups with zeros. Group A (2 rows) should become a sequence of length 3 after padding, and group B already has length 3. The desired output is something like:

array1 = [[[0.577535, 0.299304, 0.617103, 0.378887],
          [0.167907, 0.244972, 0.615077, 0.311497],
          [0, 0, 0, 0]],
         [[0.640575, 0.768187, 0.652760, 0.822311],
          [0.424744, 0.958405, 0.659617, 0.998765],
          [0.077048, 0.407182, 0.758903, 0.273737]]]

and then the labels have to be reshaped accordingly too

array2 = [[1,
           0,
           0],
          [0,
           1,
           0]]

How can I put in the padding and reshape my data?

  • Could you make your dataframe reproducible, i.e. post the code we should run to build that dataframe? If so, I think I'll be able to help. Commented Aug 23, 2020 at 19:58

2 Answers

You can first use cumcount to create a count for each group, reindex by MultiIndex.from_product and fill with 0, and finally export to list:

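import pandas as pd

# Position of each row within its group (0, 1, 2, ...)
df["count"] = df.groupby("group")["label"].cumcount()
# Every (group, position) pair up to the length of the longest group
mux = pd.MultiIndex.from_product([df["group"].unique(), range(df["count"].max() + 1)], names=["group","count"])

# Positions missing from a shorter group are created and filled with 0
df = df.set_index(["group","count"]).reindex(mux, fill_value=0)

print (df.iloc[:,:4].groupby(level=0).apply(lambda g: g.values.tolist()).tolist())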

[[[0.577535, 0.299304, 0.617103, 0.378887],
  [0.167907, 0.24497199999999997, 0.6150770000000001, 0.31149699999999997],
  [0.0, 0.0, 0.0, 0.0]],
 [[0.640575, 0.768187, 0.65276, 0.822311],
  [0.42474399999999995, 0.958405, 0.659617, 0.998765],
  [0.077048, 0.40718200000000004, 0.758903, 0.273737]]]

print (df.groupby(level=0)["label"].apply(list).tolist())

[[1, 0, 0], [0, 1, 0]]
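
Since an LSTM usually expects a NumPy array of shape (samples, timesteps, features) rather than nested lists, here is a minimal end-to-end sketch of the same reindex idea. The dataframe is my reconstruction of the one in the question, assuming "group" is an ordinary column repeated on every row; the names feature_cols, padded, X and y are only for illustration.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's dataframe ("group" filled on every row).
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "b": [0.577535, 0.167907, 0.640575, 0.424744, 0.077048],
    "c": [0.299304, 0.244972, 0.768187, 0.958405, 0.407182],
    "d": [0.617103, 0.615077, 0.652760, 0.659617, 0.758903],
    "e": [0.378887, 0.311497, 0.822311, 0.998765, 0.273737],
    "label": [1, 0, 0, 1, 0],
})
feature_cols = ["b", "c", "d", "e"]

# Position of each row within its group, then the full (group, position) grid.
df["count"] = df.groupby("group").cumcount()
groups = df["group"].unique()
max_len = df["count"].max() + 1
mux = pd.MultiIndex.from_product([groups, range(max_len)], names=["group", "count"])

# Reindexing onto the grid adds the missing rows, filled with 0.
padded = df.set_index(["group", "count"]).reindex(mux, fill_value=0)

# Rows are ordered group by group, so a plain reshape gives the 3D layout.
X = padded[feature_cols].to_numpy().reshape(len(groups), max_len, len(feature_cols))
y = padded["label"].to_numpy().reshape(len(groups), max_len)

print(X.shape)     # (2, 3, 4)
print(y.tolist())  # [[1, 0, 0], [0, 1, 0]]
```

The reshape relies on the reindexed rows following the order of mux, which is why the group and position levels are built with from_product rather than taken from the data.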

1 Comment

Thanks. I get an error on df.iloc[:,:4].groupby(level=0).apply(pd.Series.tolist).values.tolist(), saying 'DataFrame' object has no attribute 'dtype'. I must admit I replaced df.iloc[:,:4] with df.iloc[:,:-1] for the sake of generality, but I can't see how that should make a difference.

I'm assuming your group column consists of many values and not just 1 'A' and 1 'B'. This code worked for me, you can give it a try as well:

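import pandas as pd

# Assumes 'group' is filled in on every row of the CSV
df = pd.read_csv('file2.csv')
vals = df['group'].unique()

# Everything except the group key and the label is treated as a feature
feature_cols = [c for c in df.columns if c not in ('group', 'label')]
# Length of the longest group; shorter groups are padded up to this
max_len = df.groupby('group').size().max()

array1 = []
array2 = []

for val in vals:
    val_df = df[df.group == val]
    pad = max_len - len(val_df)

    # Labels of this group, padded with 0
    array2.append(val_df.label.tolist() + [0] * pad)

    # Feature rows of this group (values, not column names), padded with all-zero rows
    smaller_array = val_df[feature_cols].values.tolist()
    smaller_array += [[0] * len(feature_cols) for _ in range(pad)]
    array1.append(smaller_array)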
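
If you would rather end up with NumPy arrays directly, the same loop can fill preallocated arrays instead of building nested lists. This is only a sketch: pad_groups is a hypothetical helper name, the toy dataframe is made up for the demonstration, and padding with 0 for both features and labels is an assumption taken from the desired output in the question.

```python
import numpy as np
import pandas as pd

def pad_groups(df, group_col, feature_cols, label_col, pad_value=0):
    """Hypothetical helper: one padded sequence per group, in order of first appearance."""
    groups = df[group_col].unique()
    max_len = df.groupby(group_col).size().max()

    # Preallocate arrays already filled with the padding value.
    X = np.full((len(groups), max_len, len(feature_cols)), pad_value, dtype=float)
    y = np.full((len(groups), max_len), pad_value, dtype=int)

    for i, g in enumerate(groups):
        rows = df[df[group_col] == g]
        n = len(rows)
        # Overwrite the first n timesteps; the rest stay as padding.
        X[i, :n] = rows[feature_cols].to_numpy()
        y[i, :n] = rows[label_col].to_numpy()
    return X, y

# Made-up toy data with the same group sizes as in the question (2 and 3 rows).
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0],
    "c": [10.0, 20.0, 30.0, 40.0, 50.0],
    "label": [1, 0, 0, 1, 0],
})
X, y = pad_groups(toy, "group", ["b", "c"], "label")
print(X.shape)
print(y.tolist())
```

Here X comes out with one row of zeros appended to group A, and y has a 0 in the padded position, matching the layout asked for in the question.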
