0

Hi, I have data like this in my .txt file:

n|vechicle.car.characteristics[0].speed|180
n|vechicle.car.characteristics[0].weight|3
c|vechicle.car.characteristics[0].color|black
c|vechicle.car.characteristics[0].fuel|95
n|vechicle.car.characteristics[1].speed|160
c|vechicle.car.characteristics[1].color|green
c|vechicle.car.characteristics[1].fuel|92
n|vechicle.car.characteristics[2].speed|200
n|vechicle.car.characteristics[2].weight|5
c|vechicle.car.characteristics[2].fuel|95
n|vechicle.car.characteristics[3].speed|180
n|vechicle.car.characteristics[3].weight|3
c|vechicle.car.characteristics[3].color|black
c|vechicle.car.characteristics[3].fuel|95
c|vechicle.car.characteristics[3].cost|30000

I'd like to parse it into a DataFrame like this:

  speed weight  color fuel   cost
0   180      3  black   95    NaN
1   160    NaN  green   92    NaN
2   200      5    NaN   95    NaN
3   180      3  black   95  30000

This is how I solved it:

import re
import pandas as pd

df_output_list  = {}
df_output_dict  = []
match_counter = 1

with open('sample_car.txt',encoding='utf-8') as file:
    line = file.readline()
    while line:

        result = re.split(r'\|',line.rstrip())
        result2 = re.findall(r'.(?<=\[)(\d+)(?=\])',result[1])

        regex = re.compile('vechicle.car.characteristics.')
        match = re.search(regex, result[1])
        if match:

            if match_counter == 1:
                ArrInd = 0
            match_counter+=1
            #print(df_output_list)
            if ArrInd == int(result2[0]):
                df_output_list[result[1].split('.')[3]] = result[2]
                ArrInd = int(result2[0])

            else:
                df_output_dict.append(df_output_list)
                df_output_list  = {}
                df_output_list[result[1].split('.')[3]] = result[2]
                ArrInd = int(result2[0])

        line = file.readline()
    df_output_dict.append(df_output_list)
#print(df_output_dict)
df_output = pd.DataFrame(df_output_dict)
print(df_output)

I find this overly complicated. Is it possible to simplify it?
The column names should be parsed automatically.

2 Answers

3

Going the pandas route on this one. The data has a regular structure that makes the cleanup fairly easy: the row number sits in brackets, the column header is the last word after the bracketed number and a period, and the actual value comes after the last "|", which we can use as the delimiter.

data = """
n|vechicle.car.characteristics[0].speed|180
n|vechicle.car.characteristics[0].weight|3
c|vechicle.car.characteristics[0].color|black
c|vechicle.car.characteristics[0].fuel|95
n|vechicle.car.characteristics[1].speed|160
c|vechicle.car.characteristics[1].color|green
c|vechicle.car.characteristics[1].fuel|92
n|vechicle.car.characteristics[2].speed|200
n|vechicle.car.characteristics[2].weight|5
c|vechicle.car.characteristics[2].fuel|95
n|vechicle.car.characteristics[3].speed|180
n|vechicle.car.characteristics[3].weight|3
c|vechicle.car.characteristics[3].color|black
c|vechicle.car.characteristics[3].fuel|95
c|vechicle.car.characteristics[3].cost|30000"""

import pandas as pd
from io import StringIO

res = (pd.read_csv(StringIO(data), sep="|", header=None)
       # extract the row number from column 1
       .assign(number=lambda x: x[1].str.extract(r"(\d+)"),
               # get the tail of the string in column 1
               headers=lambda x: x[1].str.split(r"\[\d+\]\.").str[-1]
              )
       # set numbers and headers as the index
       # and keep only the last column, which holds the values
       .set_index(['number', 'headers'])
       .filter([2])
       # unstacking puts each header
       # directly above its related data from column 2
       .unstack()
       # some cleanup
       .droplevel(0, axis=1)
       .rename_axis(None, axis=1)
       .rename_axis(None)
      )

res


   color   cost fuel speed weight
0  black    NaN   95   180      3
1  green    NaN   92   160    NaN
2    NaN    NaN   95   200      5
3  black  30000   95   180      3
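
For comparison, here is a rough sketch of the same reshaping done with named regex groups and pivot instead of set_index/unstack. This swaps in a different technique, so treat it as a variation rather than the code above. The names kind, path, value, row and field are made up for illustration, and the sample is a shortened copy of the data:

```python
import pandas as pd
from io import StringIO

# Shortened sample, same layout as the data above
sample = """n|vechicle.car.characteristics[0].speed|180
c|vechicle.car.characteristics[0].color|black
n|vechicle.car.characteristics[1].speed|160
c|vechicle.car.characteristics[1].fuel|92"""

# Read everything as strings; the column names are illustrative only
raw = pd.read_csv(StringIO(sample), sep="|", header=None,
                  names=["kind", "path", "value"], dtype=str)

# Named groups become column names in the extracted frame
parts = raw["path"].str.extract(r"\[(?P<row>\d+)\]\.(?P<field>\w+)$")
parts["row"] = parts["row"].astype(int)

# One row per bracketed number, one column per field name
wide = (raw.join(parts)
        .pivot(index="row", columns="field", values="value")
        .rename_axis(index=None, columns=None))
print(wide)
```

As with the unstack version, the columns come out in sorted order rather than file order, and every value is still a string.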

1

So you have a text file where each line contains 3 interesting fields that can be easily extracted with a regex. I would directly build a dataframe from that:

import re
import pandas as pd

rx = re.compile(r'.\|vechicle\.car\.characteristics\[(\d+)\]\.(.*)\|(.*?)\s*$')

with open('sample_car.txt', encoding='utf-8') as file:
    # each match yields a (row number, column name, value) tuple
    df = pd.DataFrame([rx.match(line).groups() for line in file])
columns = df[1].unique()     # store the column names in file order
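
To see what the regex captures, here is a quick check on a single line taken from the sample data. It also shows that a blank line does not match, so the list comprehension above would raise an AttributeError if the file contained one:

```python
import re

rx = re.compile(r'.\|vechicle\.car\.characteristics\[(\d+)\]\.(.*)\|(.*?)\s*$')

# One line from the sample file, including its trailing newline
line = 'c|vechicle.car.characteristics[3].cost|30000\n'
groups = rx.match(line).groups()
print(groups)   # ('3', 'cost', '30000')

# A blank line gives no match object at all
blank = rx.match('\n')
print(blank)    # None
```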

It is now enough to unstack the dataframe, clean it and reorder the columns:

df = df.set_index([0, 1]).unstack().droplevel(0, axis=1).rename_axis(
    index=None, columns=None).reindex(columns, axis=1)

This gives the expected result:

  speed weight  color fuel   cost
0   180      3  black   95    NaN
1   160    NaN  green   92    NaN
2   200      5    NaN   95    NaN
3   180      3  black   95  30000
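
If the chained call is hard to follow, this small sketch walks through the same steps one at a time on a hypothetical three-row frame (the tuples stand in for what the regex would return; the names order, wide, flat and tidy are only for illustration):

```python
import pandas as pd

# Hypothetical miniature of the (row number, column name, value) triples
df = pd.DataFrame([('0', 'speed', '180'),
                   ('0', 'color', 'black'),
                   ('1', 'speed', '160')])
order = df[1].unique()                 # column names in order of appearance

# Step 1: move row number and column name into the index, then unstack
# the column name; the columns become a two-level MultiIndex
wide = df.set_index([0, 1]).unstack()
print(wide.columns.tolist())

# Step 2: drop the outer level (the old column label 2)
flat = wide.droplevel(0, axis=1)

# Step 3: remove the axis names and restore the original column order
tidy = flat.rename_axis(index=None, columns=None).reindex(order, axis=1)
print(tidy)
```

The reindex at the end is what puts speed before color again; without it the columns stay in whatever order unstack produced.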
