Convert Array Lists to dataframe

Question

Hello, I have a dataset that looks like this:

array([['1;"Female";133;132;124;"118";"64.5";816932'],
       ['2;"Male";140;150;124;".";"72.5";1001121'],
       ['3;"Male";139;123;150;"143";"73.3";1038437'],
       ['4;"Male";133;129;128;"172";"68.8";965353'],
       ['5;"Female";137;132;134;"147";"65.0";951545'],
       ['6;"Female";99;90;110;"146";"69.0";928799'],
       ['7;"Female";138;136;131;"138";"64.5";991305']], dtype=object)

I would like to convert it to a dataframe with these columns:

columns = ["Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]

Note: within each array element, the values are separated by a semicolon (;). Please help me organize this into a dataframe with the column names above and the row values taken from the array.

What have you tried so far? Please post your code.

– James, Commented Feb 26, 2020 at 9:25

jezrael · Accepted Answer · 2020-02-26 09:31:00Z

Create a DataFrame from the array, then use Series.str.split with expand=True to split its single column into new columns:

import numpy as np
import pandas as pd

a = np.array([['1;"Female";133;132;124;"118";"64.5";816932'],
       ['2;"Male";140;150;124;".";"72.5";1001121'],
       ['3;"Male";139;123;150;"143";"73.3";1038437'],
       ['4;"Male";133;129;128;"172";"68.8";965353'],
       ['5;"Female";137;132;134;"147";"65.0";951545'],
       ['6;"Female";99;90;110;"146";"69.0";928799'],
       ['7;"Female";138;136;131;"138";"64.5";991305']], dtype=object)

df = pd.DataFrame(a)[0].str.split(';', expand=True)
df.columns = ['ID',"Gender","FSIQ","VIQ","PIQ","Weight","Height","MRI_Count"]
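
To see what expand=True changes, here is a minimal sketch on a two-row Series (the sample strings are shortened versions of the rows above, used only for illustration). Without expand, each row holds a list; with expand=True, each piece gets its own column:

```python
import pandas as pd

# Two shortened sample rows, for illustration only
s = pd.Series(['1;"Female";133', '2;"Male";140'])

# Without expand: a Series whose values are Python lists
as_lists = s.str.split(';')

# With expand=True: a DataFrame with one column per piece
as_frame = s.str.split(';', expand=True)

print(as_lists)
print(as_frame)
```

Note that the pieces are still strings at this point, quotes included, which is why a cleaning step is needed afterwards.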

Finally, some data cleaning: remove the surrounding double quotes with Series.str.strip and convert the columns to numeric with to_numeric via DataFrame.apply:

df['Gender'] = df['Gender'].str.strip('"')
c = ["ID", "FSIQ","VIQ","PIQ","Weight","Height","MRI_Count"]
df[c] = df[c].apply(lambda x: pd.to_numeric(x.str.strip('"'), errors='coerce'))
print(df)
  ID  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0  1  Female   133  132  124   118.0    64.5     816932
1  2    Male   140  150  124     NaN    72.5    1001121
2  3    Male   139  123  150   143.0    73.3    1038437
3  4    Male   133  129  128   172.0    68.8     965353
4  5  Female   137  132  134   147.0    65.0     951545
5  6  Female    99   90  110   146.0    69.0     928799
6  7  Female   138  136  131   138.0    64.5     991305
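
The NaN in the Weight column comes from errors='coerce', which turns anything that cannot be parsed as a number (here the "." placeholder) into NaN instead of raising. A small sketch of just that step, using a made-up three-value Series rather than the full frame:

```python
import pandas as pd

# Hypothetical sample mimicking the quoted Weight values, including the "." placeholder
raw = pd.Series(['"118"', '"."', '"143"'])

# Strip the surrounding quotes, then coerce unparseable values to NaN
cleaned = pd.to_numeric(raw.str.strip('"'), errors='coerce')

print(cleaned)
print(cleaned.dtype)
```

Because NaN is a float, the whole column ends up as float64, which matches the Weight column in the output above.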

Chris Adams · Accepted Answer · 2020-02-26 09:46:41Z

Another potential solution would be to use io.StringIO and pandas.read_csv. Just join each element in the array with a \n character:

from io import StringIO

import numpy as np
import pandas as pd

# Setup
a = np.array([['1;"Female";133;132;124;"118";"64.5";816932'],
       ['2;"Male";140;150;124;".";"72.5";1001121'],
       ['3;"Male";139;123;150;"143";"73.3";1038437'],
       ['4;"Male";133;129;128;"172";"68.8";965353'],
       ['5;"Female";137;132;134;"147";"65.0";951545'],
       ['6;"Female";99;90;110;"146";"69.0";928799'],
       ['7;"Female";138;136;131;"138";"64.5";991305']])

columns = ["Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]

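# Only seven names are given for eight fields, so pandas uses the
# leading field (the row ID) as the index
df = pd.read_csv(StringIO('\n'.join(a.ravel())), header=None,
                 sep=';', names=columns, na_values=['.'])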

[out]

   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
1  Female   133  132  124   118.0    64.5     816932
2    Male   140  150  124     NaN    72.5    1001121
3    Male   139  123  150   143.0    73.3    1038437
4    Male   133  129  128   172.0    68.8     965353
5  Female   137  132  134   147.0    65.0     951545
6  Female    99   90  110   146.0    69.0     928799
7  Female   138  136  131   138.0    64.5     991305

pandas should do a pretty good job of inferring the dtypes:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 1 to 7
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Gender     7 non-null      object 
 1   FSIQ       7 non-null      int64  
 2   VIQ        7 non-null      int64  
 3   PIQ        7 non-null      int64  
 4   Weight     6 non-null      float64
 5   Height     7 non-null      float64
 6   MRI_Count  7 non-null      int64  
dtypes: float64(2), int64(4), object(1)
memory usage: 448.0+ bytes
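
If you would rather have the ID as a named index than an implicit one, one option is to list all eight names and pass index_col. This is a minimal sketch, assuming the same input format; the name "ID" is borrowed from the other answer and does not appear in the question, and only three of the rows are repeated here to keep it short:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Three of the original rows, including the one with the "." placeholder
a = np.array([['1;"Female";133;132;124;"118";"64.5";816932'],
              ['2;"Male";140;150;124;".";"72.5";1001121'],
              ['3;"Male";139;123;150;"143";"73.3";1038437']], dtype=object)

# "ID" is an assumed label for the leading field
names = ["ID", "Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]

# Join the rows into one CSV-like string and let read_csv parse it
df = pd.read_csv(StringIO('\n'.join(a.ravel())), header=None, sep=';',
                 names=names, index_col="ID", na_values=['.'])

print(df)
```

The result should be the same frame as above, except that the index is now labelled ID, so it is clearer where those values came from.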
