I am writing a Python script for data preprocessing. The data is read and stored within the script as a list of lists (a two-dimensional array) consisting of data points similar to the ones below.

[['United', '-27.654379', '152.917741', 'e10', '1459', '2019-03-18'],
['United', '-27.654379', '152.917741', 'e10', '1449', '2019-03-19']]

I need to remove entries from the array that have identical dates, so that

[['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

would become

[['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16']]

My current method of doing so (shown below) appears to identify and remove entries with duplicate dates, but some can still be found within the output.

    for line in Data_text:
        for row in Data_text:
            if line[5] == row[5]:
                Data_text.remove(row)

Any insight into the faults in my algorithm and/or a better way of doing it would be greatly appreciated.

4 Answers

1

Using pure Python, you can use a set to keep track of the dates already seen:

lst = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
       ['Costco', '-27.213607', '152.996416', 'e10', '1297', '2019-03-16']]

seen = set()
print([x for x in lst if not (x[5] in seen or seen.add(x[5]))])

# [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16']]
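
The one-line comprehension above relies on `set.add` returning `None`, so the `or` expression records an unseen date as a side effect and keeps that row. For readers who prefer the steps spelled out, here is a minimal sketch of the same idea as an explicit loop. The function name `dedupe_by_date` and its `date_index` parameter are illustrative, not part of the original answer.

```python
def dedupe_by_date(rows, date_index=5):
    """Return a new list keeping only the first row seen for each date."""
    seen = set()      # dates already encountered
    result = []
    for row in rows:
        date = row[date_index]
        if date not in seen:
            seen.add(date)
            result.append(row)
    return result


lst = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
       ['Costco', '-27.213607', '152.996416', 'e10', '1297', '2019-03-16']]

unique_rows = dedupe_by_date(lst)
print(unique_rows)
```

Because membership tests on a set take constant time on average, this makes a single pass over the data and leaves the input list unmodified.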

0

With Python 3.7 or later, where dicts preserve insertion order, the code below works. However, it keeps the last entry for each date rather than the first.

data = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
        ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

data = list({item[5]: item for item in data}.values())
# [['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]
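
If the first entry for each date should be kept instead, one possible variation (a sketch, not part of the original answer) is to fill the dict with `dict.setdefault`, which only stores a value when the key is not already present. The variable names and the third sample row, taken from the question, are used here only for illustration.

```python
data = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
        ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16'],
        ['United', '-27.654379', '152.917741', 'e10', '1459', '2019-03-18']]

first_by_date = {}
for item in data:
    # setdefault leaves an existing key untouched, so the first row wins
    first_by_date.setdefault(item[5], item)

data_first = list(first_by_date.values())
print(data_first)
```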

0

You might want to consider pandas for this type of data and operations:

a = [['Costco', '-27.213607', '152.996416', 'e10', '1237', '2019-03-16'],
     ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

import pandas as pd

df = pd.DataFrame(a).drop_duplicates(5, keep='first')

Result:

df

        0           1           2    3     4           5
0  Costco  -27.213607  152.996416  e10  1237  2019-03-16

This is especially useful if the dates have different formats:

a2 = [['Costco', '-27.213607', '152.996416', 'e10', '1237', 'March 16, 2019'],
    ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

df = pd.DataFrame(a2)
df[5] = pd.to_datetime(df[5])  # on pandas 2.0+, pass format='mixed' to parse mixed date formats
df.drop_duplicates(5, keep='first')

This still gives the correct result:

        0           1           2    3     4          5
0  Costco  -27.213607  152.996416  e10  1237 2019-03-16
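
If pandas is not available, a similar normalisation can be approximated with the standard library. The sketch below swaps in `datetime.strptime` in place of `pd.to_datetime` and assumes only the two date formats shown above occur; `KNOWN_FORMATS`, `parse_date`, and `dedupe_by_parsed_date` are hypothetical names introduced for illustration.

```python
from datetime import datetime

# Assumption: only these two formats appear in the data.
# Note that %B matches month names in the current locale (English by default).
KNOWN_FORMATS = ("%Y-%m-%d", "%B %d, %Y")


def parse_date(text):
    """Try each known format in turn and return a datetime.date."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


def dedupe_by_parsed_date(rows, date_index=5):
    """Keep the first row for each calendar date, whatever its text format."""
    seen = set()
    result = []
    for row in rows:
        day = parse_date(row[date_index])
        if day not in seen:
            seen.add(day)
            result.append(row)
    return result


a2 = [['Costco', '-27.213607', '152.996416', 'e10', '1237', 'March 16, 2019'],
      ['United', '-25.607894', '150.367213', 'e10', '1297', '2019-03-16']]

deduped = dedupe_by_parsed_date(a2)
print(deduped)
```

Unlike the pandas version, this fails loudly on any format not listed in `KNOWN_FORMATS`, which may or may not be desirable depending on how clean the data is.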

0

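The problem is probably this line: Data_text.remove(row). Each call shortens Data_text by one while the loops are still iterating over it, so the following element shifts into the current position and is skipped.

Try this instead: create a new list, result_list = [], and append only the records whose date is not already in it.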

result_list = []
for line in Data_text:
    # Check whether a record with the same date has already been kept
    is_exist = False
    for row in result_list:
        if line[5] == row[5]:
            is_exist = True
            break

    # Keep only the first record seen for each date
    if not is_exist:
        result_list.append(line)


print(result_list)
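
The effect of removing items from a list while looping over it can be reproduced in isolation. The sketch below uses made-up two-column rows and hypothetical helper names (`remove_in_place`, `remove_via_copy`) purely to illustrate the skipping behaviour; it is not the question's exact algorithm.

```python
def remove_in_place(rows, date):
    """Buggy: mutates the list that the for loop is iterating over."""
    for row in rows:
        if row[1] == date:
            rows.remove(row)   # later items shift left; the next one is skipped
    return rows


def remove_via_copy(rows, date):
    """Iterates over a shallow copy, so removals do not disturb the loop."""
    for row in rows[:]:
        if row[1] == date:
            rows.remove(row)
    return rows


def sample():
    return [['A', '2019-03-16'],
            ['B', '2019-03-16'],
            ['C', '2019-03-16'],
            ['D', '2019-03-17']]


buggy = remove_in_place(sample(), '2019-03-16')
fixed = remove_via_copy(sample(), '2019-03-16')
print(buggy)   # row 'B' survives because it was skipped
print(fixed)   # only row 'D' remains
```

Building a new list, as in the answer above, avoids the issue entirely and is usually easier to reason about than mutating the list in place.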
