How to parse array from .txt file to python DataFrame

Question

Hi, I have an array like this in my .txt file:

n|vechicle.car.characteristics[0].speed|180
n|vechicle.car.characteristics[0].weight|3
c|vechicle.car.characteristics[0].color|black
c|vechicle.car.characteristics[0].fuel|95
n|vechicle.car.characteristics[1].speed|160
c|vechicle.car.characteristics[1].color|green
c|vechicle.car.characteristics[1].fuel|92
n|vechicle.car.characteristics[2].speed|200
n|vechicle.car.characteristics[2].weight|5
c|vechicle.car.characteristics[2].fuel|95
n|vechicle.car.characteristics[3].speed|180
n|vechicle.car.characteristics[3].weight|3
c|vechicle.car.characteristics[3].color|black
c|vechicle.car.characteristics[3].fuel|95
c|vechicle.car.characteristics[3].cost|30000

And I'd like to parse it into a DataFrame like this:

  speed weight  color fuel   cost
0   180      3  black   95    NaN
1   160    NaN  green   92    NaN
2   200      5    NaN   95    NaN
3   180      3  black   95  30000

This is how I solved it:

import re
import pandas as pd

df_output_list  = {}
df_output_dict  = []
match_counter = 1

with open('sample_car.txt',encoding='utf-8') as file:
    line = file.readline()
    while line:

        result = re.split(r'\|',line.rstrip())
        result2 = re.findall(r'.(?<=\[)(\d+)(?=\])',result[1])

        regex = re.compile('vechicle.car.characteristics.')
        match = re.search(regex, result[1])
        if match:

            if match_counter == 1:
                ArrInd = 0
            match_counter+=1
            #print(df_output_list)
            if ArrInd == int(result2[0]):
                df_output_list[result[1].split('.')[3]] = result[2]
                ArrInd = int(result2[0])

            else:
                df_output_dict.append(df_output_list)
                df_output_list  = {}
                df_output_list[result[1].split('.')[3]] = result[2]
                ArrInd = int(result2[0])

        line = file.readline()
    df_output_dict.append(df_output_list)
#print(df_output_dict)
df_output = pd.DataFrame(df_output_dict)
print(df_output)

But I find it too complicated. Is it possible to simplify it?
Column names should be parsed automatically.

sammywemmy · Accepted Answer · 2020-05-26 12:46:04Z

Going the Pandas route on this one. The data has a regular pattern that makes the cleanup fairly easy: the row number sits in square brackets, the header is the last word after the bracketed number and a period, and the actual value comes after the last "|", which we can use as our delimiter.

data = """
n|vechicle.car.characteristics[0].speed|180
n|vechicle.car.characteristics[0].weight|3
c|vechicle.car.characteristics[0].color|black
c|vechicle.car.characteristics[0].fuel|95
n|vechicle.car.characteristics[1].speed|160
c|vechicle.car.characteristics[1].color|green
c|vechicle.car.characteristics[1].fuel|92
n|vechicle.car.characteristics[2].speed|200
n|vechicle.car.characteristics[2].weight|5
c|vechicle.car.characteristics[2].fuel|95
n|vechicle.car.characteristics[3].speed|180
n|vechicle.car.characteristics[3].weight|3
c|vechicle.car.characteristics[3].color|black
c|vechicle.car.characteristics[3].fuel|95
c|vechicle.car.characteristics[3].cost|30000"""

import pandas as pd
from io import StringIO

res = (pd.read_csv(StringIO(data), sep="|", header=None)
       # extract the numbers from col 1
       .assign(number=lambda x: x[1].str.extract(r"(\d+)"),
               # get the tail of the string in column 1
               headers=lambda x: x[1].str.split(r"\[\d+\]\.").str[-1]
              )
       # set numbers and headers as index
       # and keep only the last column, which is relevant
       .set_index(['number', 'headers'])
       .filter([2])
       # unstacking here ensures the headers
       # are directly on top of each related data in column 2
       .unstack()
       # some cleanups
       .droplevel(0, axis=1)
       .rename_axis(None, axis=1)
       .rename_axis(None)
      )

res


   color   cost fuel speed weight
0  black    NaN   95   180      3
1  green    NaN   92   160    NaN
2    NaN    NaN   95   200      5
3  black  30000   95   180      3
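
As a variation on the same idea, the row number and header can be captured in one `str.extract` call with named groups and then reshaped with `pivot`. This is only a minimal sketch, not the code above: the column names `kind`, `path`, `value`, `row` and `header`, and the shortened `sample` string, are assumptions made for illustration. Reindexing on `unique()` keeps the columns in file order instead of alphabetical order.

```python
import pandas as pd
from io import StringIO

# Shortened, hypothetical sample in the same format as the question's file.
sample = """n|vechicle.car.characteristics[0].speed|180
c|vechicle.car.characteristics[0].color|black
n|vechicle.car.characteristics[1].speed|160
n|vechicle.car.characteristics[1].weight|3
"""

# Read everything as strings; the column names are illustrative only.
raw = pd.read_csv(StringIO(sample), sep="|", header=None,
                  names=["kind", "path", "value"], dtype=str)

# Named groups become the column names of the extracted frame.
parts = raw["path"].str.extract(r"\[(?P<row>\d+)\]\.(?P<header>\w+)$")
raw = raw.join(parts)
raw["row"] = raw["row"].astype(int)

# Remember the headers in the order they first appear in the file.
order = raw["header"].unique()

# One output row per bracketed number, one column per header.
wide = (raw.pivot(index="row", columns="header", values="value")
           .reindex(columns=order)
           .rename_axis(index=None, columns=None))

print(wide)
```

Note that `pivot` raises an error if the same row number and header pair occurs twice, so this sketch assumes each pair is unique, as it is in the question's data.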

Serge Ballesta · Accepted Answer · 2020-05-26 13:15:18Z

So you have a text file where each line contains 3 interesting fields that can be easily extracted with a regex. I would directly build a dataframe from that:

import re
import pandas as pd

rx = re.compile(r'.\|vechicle\.car\.characteristics\[(\d+)\]\.(.*)\|(.*?)\s*$')

with open('sample_car.txt', encoding='utf-8') as file:
    df = pd.DataFrame([rx.match(line).groups() for line in file])
columns = df[1].unique()     # store the column names in file order
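
To see what the pattern captures, it can help to try it on a single line first. The sketch below uses the same regex; the helper name `parse_line` is hypothetical and not part of the answer. It returns `None` for a line that does not match, because calling `.groups()` directly on a failed match would raise `AttributeError`.

```python
import re

rx = re.compile(r'.\|vechicle\.car\.characteristics\[(\d+)\]\.(.*)\|(.*?)\s*$')

def parse_line(line):
    """Return (row number, header, value) as strings, or None if no match."""
    m = rx.match(line)
    return m.groups() if m else None

# A line from the question's file, including its trailing newline.
print(parse_line("c|vechicle.car.characteristics[3].cost|30000\n"))
# A line in some other format is reported as None instead of raising.
print(parse_line("# not a data line\n"))
```

If the file may contain blank or unrelated lines, the list comprehension could filter out the `None` results before building the dataframe.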

It is now enough to unstack the dataframe, clean it and reorder the columns:

df = df.set_index([0, 1]).unstack().droplevel(0, axis=1).rename_axis(
    index=None, columns=None).reindex(columns, axis=1)

This gives the expected result:

  speed weight  color fuel   cost
0   180      3  black   95    NaN
1   160    NaN  green   92    NaN
2   200      5    NaN   95    NaN
3   180      3  black   95  30000
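
One thing to be aware of is that every cell in this result is still a string, because the regex groups are strings. If numeric columns are wanted, a possible follow-up is to try `pd.to_numeric` column by column and leave alone any column that cannot be converted. The helper name `to_numeric_where_possible` and the small example frame are assumptions for illustration, not part of the answer.

```python
import pandas as pd

# Small stand-in for the parsed result: all values are strings or missing.
parsed = pd.DataFrame({
    "speed": ["180", "160"],
    "weight": ["3", None],
    "color": ["black", "green"],
})

def to_numeric_where_possible(frame):
    """Convert each column to a numeric dtype if all of its values allow it."""
    out = frame.copy()
    for name in out.columns:
        try:
            out[name] = pd.to_numeric(out[name])
        except (ValueError, TypeError):
            # Non-numeric text such as "black": keep the column unchanged.
            pass
    return out

converted = to_numeric_where_possible(parsed)
print(converted.dtypes)
```

Columns with missing values end up with a floating-point dtype, since an integer column cannot hold `NaN` by default.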
