I have this vocab file of strings: https://drive.google.com/file/d/1mL461QGC5KcA3M1r8AESaPjZ3D_ufgPA/view?usp=sharing.
I also have this sentences file, built from the words in the vocab file above: https://drive.google.com/file/d/1w5ma4ROjyp6xmZfvnIQjsdH2I_K7lHoo/view?usp=sharing.
I want to map every token in each sentence to its corresponding integer in the vocab file.
What I have tried so far: first, I put all the sentences into a list and loaded it into this DataFrame:
import pandas as pd

output = []
tokens = []
tags = []
with open('./drive/MyDrive/[kepsdataset/train_preprocess.txt', "r") as f:
    for line in f:
        if len(line.strip()) != 0:
            # Each non-empty line holds one token and its tag, separated by a tab
            fields = line.split('\t')
            tokens.append(fields[0].lower())
            tags.append(fields[1].strip())
        else:
            # A blank line marks the end of a sentence
            # 'token' holds the sentences I want to map into integers
            output.append({'token': tokens, 'tag': tags})
            tokens = []
            tags = []
# Keep the last sentence if the file does not end with a blank line
if tokens:
    output.append({'token': tokens, 'tag': tags})
df = pd.DataFrame(output)
df.head(10)
Then I converted the vocabulary list (from the vocab file) into a list of integers:
import numpy as np

with open("vocab_uncased.txt", "r") as my_file:
    data = my_file.read()
data_into_list = data.split("\n")
print(data_into_list)
# Encode each word as its index among the unique vocab words (first-occurrence order)
unique_words = np.array(list(dict.fromkeys(data_into_list)))
encoded_string = [np.where(unique_words == e)[0][0] for e in data_into_list]
print(encoded_string)
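I suspect the same indices could be produced with a plain dictionary instead of the np.where lookup, which would also give me a word-to-integer table to reuse later. A minimal sketch, assuming a small made-up word list (`vocab_words` and `word_to_id` are names I introduced for illustration, not the contents of the real vocab file):

```python
# Toy stand-in for the lines of vocab_uncased.txt (hypothetical words)
vocab_words = ["bca", "kartu", "kredit", "ribet", "kartu"]

# dict.fromkeys drops duplicates while keeping first-occurrence order,
# so enumerate assigns each unique word the same index np.where would find
word_to_id = {word: idx for idx, word in enumerate(dict.fromkeys(vocab_words))}

# Encode the vocabulary itself with constant-time dictionary lookups
encoded_vocab = [word_to_id[word] for word in vocab_words]
print(encoded_vocab)
```

The dictionary is built in one pass over the vocabulary, whereas the np.where version scans the whole array once per word.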
What I want is to add the encoded sentences to the DataFrame above as a new column. How can I do it? Example:
sentence (in token field in DataFrame): ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'ribet']
encoded sentence (using vocab file): [2024, 1317, 1806, 2182, 2400, 1624, 2333, 2107, 1013, 1155, 317, 1853] --> to be put into a new dataframe column
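To make the target concrete, here is a rough sketch of the result I am after, on toy data rather than my real files. The vocabulary, the `encode` helper, the `encoded` column name, and the `UNK_ID` placeholder for tokens missing from the vocab are all assumptions of mine:

```python
import pandas as pd

# Hypothetical toy vocabulary; the real ids would come from vocab_uncased.txt
word_to_id = {"kartu": 0, "kredit": 1, "bca": 2, "ribet": 3}
UNK_ID = -1  # assumed placeholder for tokens that are not in the vocab

# Toy DataFrame shaped like mine: one list of tokens per row
df = pd.DataFrame({"token": [["kartu", "kredit", "bca"], ["bca", "ribet", "mahal"]]})

def encode(tokens):
    """Map each token of one sentence to its integer id."""
    return [word_to_id.get(token.lower(), UNK_ID) for token in tokens]

# Apply the mapping row by row and store the result in a new column
df["encoded"] = df["token"].apply(encode)
print(df)
```

Each row of the new column is a list of integers with the same length as the token list in that row.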