
I have this string vocab file: https://drive.google.com/file/d/1mL461QGC5KcA3M1r8AESaPjZ3D_ufgPA/view?usp=sharing.

I have this sentences file, built from the words in the vocab file above: https://drive.google.com/file/d/1w5ma4ROjyp6xmZfvnIQjsdH2I_K7lHoo/view?usp=sharing.

I want to map every token in each sentence to its corresponding integer in the vocab file.

What I have tried so far: first, I read all the sentences into a list and built this DataFrame from it:

import pandas as pd

output = []
tokens = []
tags = []

# Each non-blank line is "token<TAB>tag"; a blank line ends the current sentence
with open('./drive/MyDrive/[kepsdataset/train_preprocess.txt', 'r') as f:
  for line in f:
    if len(line.strip()) != 0:
      fields = line.split('\t')
      tokens.append(fields[0].lower())
      tags.append(fields[1].strip())
    else:
      # 'token' holds the sentence I want to map into integers
      output.append({'token': tokens, 'tag': tags})
      tokens = []
      tags = []

df = pd.DataFrame(output)

df.head(10)

I have also converted the vocabulary list (from the vocab file) into a list of integers:

import numpy as np

with open("vocab_uncased.txt", "r") as my_file:
  data = my_file.read()

data_into_list = data.split("\n")
print(data_into_list)

# Index of each entry among the de-duplicated vocabulary entries
unique_tokens = np.array(list(dict.fromkeys(data_into_list)))
encoded_string = [np.where(unique_tokens == e)[0][0] for e in data_into_list]
print(encoded_string)
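
The `np.where` comprehension scans the whole vocabulary once for every entry. A minimal sketch of the same encoding with a plain dictionary, assuming one token per line in the vocab file and that the ID is the position of the token's first occurrence. `vocab_lines` and `token_to_id` are hypothetical names, and the list holds made-up stand-in data rather than the contents of the real file:

```python
# Hypothetical stand-in for data_into_list (the real list comes from vocab_uncased.txt).
vocab_lines = ["bca", "kartu", "kredit", "ribet", "bca"]

# Map each distinct token to its position among the de-duplicated tokens,
# mirroring the dict.fromkeys(...) de-duplication used in the question.
token_to_id = {tok: i for i, tok in enumerate(dict.fromkeys(vocab_lines))}

# One dictionary lookup per token instead of one array scan per token.
encoded_string = [token_to_id[tok] for tok in vocab_lines]
print(encoded_string)
```

The dictionary can also be reused directly to look up the tokens of a sentence, which avoids building the intermediate integer list at all.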

What I want to do is put the encoded values into the DataFrame above. How can I do that? Example:

sentence (in token field in DataFrame): ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'ribet']
encoded sentence (using vocab file): [2024, 1317, 1806, 2182, 2400, 1624, 2333, 2107, 1013, 1155, 317, 1853] (to be put into a new DataFrame column)

1 Answer

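IIUC, you can build a lookup Series from the vocabulary, explode the token lists so each token gets its own row, map each token to its integer, and then regroup by the original row (multi-column `explode` needs pandas 1.3+):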

df = pd.DataFrame(output)
vocab = pd.Series(encoded_string, index=data_into_list)

df['encoded'] = df.explode(df.columns.tolist())['token'] \
                  .map(vocab).groupby(level=0).agg(list)

Output:

>>> df
                                                 token                                                tag                                            encoded
0    [setelah, melalui, proses, telepon, yang, panj...               [O, B, B, I, O, O, B, O, B, I, I, B]  [2024, 1317, 1806, 2182, 2400, 1624, 2333, 210...
1    [@halobca, saya, mencoba, mengakses, menu, m-b...  [B, O, O, B, B, I, O, O, O, B, I, O, O, O, O, ...  [130, 1917, 1374, 1403, 1470, 1240, 1917, 1545...
2    [hanya, saya, atau, @halobca, klikbca, bisnis,...                           [O, O, O, B, B, I, O, B]        [857, 1917, 249, 130, 1130, 439, 1332, 767]
3    [teller, bank, bca, ini, menanyakan, kabar, sa...                        [O, O, O, O, O, O, O, B, O]  [2190, 288, 317, 918, 1365, 983, 1917, 2081, 1...
4    [bca, senantiasa, menjaga, rahasia, data, cust...                                 [B, O, B, B, B, I]                  [317, 1983, 1458, 1824, 575, 551]
..                                                 ...                                                ...                                                ...
794  [hi, cs, kenapa, pelayanan, di, bca, kodya, te...  [O, B, O, B, O, B, I, I, I, I, I, I, O, B, O, ...  [873, 540, 1077, 1657, 598, 317, 1136, 2175, 2...
795  [walau, sudah, prioritas, tetap, saja, antreny...            [O, O, B, O, O, B, B, O, O, B, O, O, O]  [2374, 2107, 1791, 2281, 1885, 231, 1183, 282,...
796  [selama, menggunakan, layanan, e-channel, bca,...  [O, B, O, B, I, O, O, O, B, I, B, I, B, B, B, ...  [1966, 1427, 1198, 746, 317, 1520, 2288, 1341,...
797  [mau, menabung, mau, simpan, uang, atau, pun, ...         [O, B, O, B, B, O, O, B, B, I, O, O, O, B]  [1306, 1361, 1306, 2055, 2335, 249, 1817, 1491...
798  [toko, daring, juga, kebanyakan, pakai, bca, m...  [B, I, O, O, B, I, I, O, O, O, B, B, I, I, B, ...  [2297, 569, 976, 1037, 1609, 317, 1238, 258, 1...

[799 rows x 3 columns]
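
To see what the `explode` / `map` / `groupby` chain does step by step, here is a small self-contained sketch on made-up data. The tokens, tags, and IDs below are illustrative only and are not taken from the real files; it assumes pandas 1.3 or newer and a vocabulary with no duplicate tokens:

```python
import pandas as pd

# Toy sentences; illustrative values only.
df = pd.DataFrame({
    "token": [["kartu", "kredit", "bca"], ["bca", "ribet"]],
    "tag": [["B", "I", "I"], ["B", "O"]],
})

# Toy vocabulary: token -> integer ID (the index must be unique for map to work).
vocab = pd.Series([10, 20, 30, 40], index=["bca", "kartu", "kredit", "ribet"])

# One row per token; the original row label is repeated for each token.
exploded = df.explode(["token", "tag"])

# Look up each token, then collect the IDs back into one list per original row.
df["encoded"] = exploded["token"].map(vocab).groupby(level=0).agg(list)

print(df["encoded"].tolist())
```

Note that `map` yields `NaN` for any token that is not in the vocabulary, so it may be worth checking for missing values before using the encoded column.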