Python3 UnicodeDecodeError on utf8

Question

No matter what I do, I can't fix it. This is the script I need to fix:

# Read the original file and write to a new file
input_file = 'input.txt'
output_file = 'output.txt'

with open(input_file, 'rb') as f:
    content = f.read()

# Decode as UTF-8; invalid byte sequences become U+FFFD, which is then replaced with '?'
cleaned_content = content.decode('utf-8', errors='replace').replace('\ufffd', '?')

# Split the cleaned content into lines
lines = cleaned_content.splitlines()

# Sort the lines
sorted_lines = sorted(lines)

# Write the sorted lines to a new file
with open(output_file, 'w', encoding='utf-8') as f:
    for line in sorted_lines:
        f.write(line + '\n')

What I want is for the file to never raise UnicodeDecodeError when I do with open(file_path, 'r', encoding='utf-8') as file:

Long story short, I have a byte-search script that works on the sorted file. If I use with open(file_path, 'r', encoding='utf-8', errors='replace') as file: it doesn't work properly, because it changes the characters that would normally raise UnicodeDecodeError. Imagine the file looks like this, and this is how it gets read:

a
b
�
d

If it is searching for "c" and reaches the line starting with �, it checks whether c comes before or after � and goes in the wrong direction (up instead of down, say), because the file is sorted according to UTF-8.
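To make the ordering problem concrete, here is a minimal sketch comparing the characters involved. The list `as_read` is an illustrative stand-in for the four sample lines above, not data from the real file.

```python
# U+FFFD (the replacement character) has a far higher code point than any
# ASCII letter, while '?' (U+003F) has a lower code point than 'a'.
replacement = '\ufffd'

print(replacement > 'c')   # True: a search for 'c' would move "up" from this line
print('?' < 'a')           # True: after replacement with '?', the line sorts first

# The sample lines as they look when read with errors='replace'.
as_read = ['a', 'b', replacement, 'd']

# The list is no longer in sorted order, so a binary search over it is unreliable.
print(as_read == sorted(as_read))                        # False
print(sorted(as_read) == ['a', 'b', 'd', replacement])   # True
```

A binary search only works when the comparison used while searching is the same one the file was sorted with, which is why a substitution made at read time can send the search in the wrong direction.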

I want to make sure the file can't raise UnicodeDecodeError, because every character that could cause that error has been replaced with "?" before sorting.

No matter what I try, the file still ends up with those weird characters.

How can I do that?

I do not understand your question. If you're getting UnicodeDecodeError, it means your file is not valid UTF-8. Therefore, I do not see how your file can be "sorted regarding utf-8". If you're wanting to just do bytewise operations on the file, use "rb" instead of "r", ditch the encoding, and work with byte strings instead of Unicode strings.
– nneonneo, Commented Nov 19, 2024 at 1:26
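The bytewise approach suggested in this comment could look roughly like the sketch below. The helper names `sort_lines_bytewise` and `contains_line` and the sample data are hypothetical, and the search runs on an in-memory list rather than on the file itself.

```python
import bisect


def sort_lines_bytewise(data: bytes) -> list:
    """Split raw bytes into lines and sort them by byte value."""
    return sorted(data.splitlines())


def contains_line(sorted_lines: list, target: bytes) -> bool:
    """Binary-search a bytewise-sorted list of lines for an exact match."""
    i = bisect.bisect_left(sorted_lines, target)
    return i < len(sorted_lines) and sorted_lines[i] == target


# Sample input with two bytes (0xC0, 0xC1) that are never valid in UTF-8.
raw = b'b\nd\n\xc0\xc1\na\n'

lines = sort_lines_bytewise(raw)
print(lines)                       # [b'a', b'b', b'd', b'\xc0\xc1']
print(contains_line(lines, b'd'))  # True
print(contains_line(lines, b'c'))  # False
```

Because nothing is ever decoded, no UnicodeDecodeError can occur, and the sort and the search use the same byte ordering. For a real file, the data would come from open(path, 'rb').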
Are you sure your file isn't actually cp1252, latin-1, cp437, etc.? And don't just try those encodings out and hope; latin-1 in particular decodes anything, so you need to know the encoding, or verify the results. If it contains non-UTF-8 characters, it usually means either: 1) It's some other encoding, and ignoring the errors is wrong, or 2) It's mostly UTF-8 with transcluded non-UTF-8 data in it (in which case you probably want to fix whatever is generating output with mixed encoding, or in rare cases, it's intended, and you need a parser for whatever format the file is really in).
– ShadowRanger, Commented Nov 19, 2024 at 1:33
Are you saying that output.txt has non-UTF-8 bytes in it? Because, umm... it definitely doesn't. There is no way that the script you wrote, if it runs to completion without an exception, will have produced an output.txt file with any encoding but UTF-8. The data might be garbled (because input.txt isn't actually UTF-8, and you're blithely changing the unrecognized bytes to filler characters), which might mess up your data, but your sorting would be correct, insofar as it is sorting the post-mangling version of the data.
– ShadowRanger, Commented Nov 19, 2024 at 1:43
I am 100% sure the output file is UTF-8; "file -i file.txt" reports that. The fix script I wrote should make it UTF-8 while replacing all the non-UTF-8 characters with question marks. However, I am still getting the UnicodeDecodeError when doing with open(file_path, 'r', encoding='utf-8', errors='replace') as file: It shouldn't happen, but it does, and I haven't been able to figure out why for weeks. It doesn't matter whether the input had non-UTF-8 characters, because they should already have been removed by that script.
– Random Guy, Commented Nov 19, 2024 at 2:54
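One way to test the claim that a file is valid UTF-8 is to decode its bytes strictly and report where decoding first fails. This is a minimal sketch; `first_invalid_utf8` is a hypothetical helper name, and the byte strings are made-up samples rather than the contents of the real file.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8 sequence,
    or None if the data decodes cleanly."""
    try:
        data.decode('utf-8')  # strict decoding: raises on the first bad sequence
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.end]
    return None


print(first_invalid_utf8(b'abc\n'))               # None
print(first_invalid_utf8(b'abcdefg\n\xc0\xc1'))   # (8, b'\xc0')
```

To apply this to a file, read it with open(path, 'rb') and pass the result of read(). If the function returns None for the file being searched, the error is more likely coming from a different file or a different code path than the one shown.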
@pippo1980: I mistyped the first time; I clearly meant non-UTF-8 bytes (files don't store characters in the first place), and typed it correctly in the second comment.
– ShadowRanger, Commented Nov 20, 2024 at 22:08

pippo1980 · Accepted Answer · 2024-11-21 21:24:02Z

with open(input_file, 'r', encoding='utf-8', errors='replace') as f:
    lines = f.read().replace('�','?')

However, I got a different result from the one in your comment above when I created a test file like this:

# Write binary data to the file; the with block closes it automatically
with open('output_.txt', 'wb') as file:
    file.write(b'\x61\x62\x63\x64\x65\x66\x67\x0A\xC0\xC1')

Running file -i output_.txt on it gives:

output_.txt: text/plain; charset=iso-8859-1
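As a rough check of what the replace-then-substitute approach does to these same bytes, the sketch below decodes them in memory instead of going through a file. The variable names are illustrative.

```python
# The same bytes written to output_.txt above: "abcdefg", a newline,
# then 0xC0 and 0xC1, which are never valid in UTF-8.
raw = b'\x61\x62\x63\x64\x65\x66\x67\x0A\xC0\xC1'

# Each undecodable byte becomes U+FFFD, which is then swapped for '?'.
cleaned = raw.decode('utf-8', errors='replace').replace('\ufffd', '?')
print(repr(cleaned))          # 'abcdefg\n??'

# Re-encoding the cleaned text gives bytes that decode strictly without error.
encoded = cleaned.encode('utf-8')
print(encoded.decode('utf-8') == cleaned)   # True
```

After the substitution the text is plain ASCII, so a file written from `cleaned` should open with encoding='utf-8' without raising UnicodeDecodeError.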

edited Nov 21, 2024 at 21:24

answered Nov 21, 2024 at 21:07

pippo1980
