No matter what I do, I can't fix it. The script I need to fix is this:
# Read the original file and write to a new file
input_file = 'input.txt'
output_file = 'output.txt'
with open(input_file, 'rb') as f:
    content = f.read()
# Filter out non-UTF-8 characters
cleaned_content = content.decode('utf-8', errors='replace').replace('�', '?')
# Split the cleaned content into lines
lines = cleaned_content.splitlines()
# Sort the lines
sorted_lines = sorted(lines)
# Write the sorted lines to a new file
with open(output_file, 'w', encoding='utf-8') as f:
    for line in sorted_lines:
        f.write(line + '\n')
What I want is for the file to never give me a UnicodeDecodeError when I do with open(file_path, 'r', encoding='utf-8') as file:
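One way to check whether the produced file really is strict UTF-8 (just a minimal sketch; output.txt is the file written by the script above):

# Re-read the cleaned file and decode it strictly.
# If this raises UnicodeDecodeError, the file is not pure UTF-8 after all.
with open('output.txt', 'rb') as f:
    data = f.read()
try:
    data.decode('utf-8')        # strict decoding; errors='strict' is the default
    print('output.txt is valid UTF-8')
except UnicodeDecodeError as exc:
    print('still not valid UTF-8:', exc)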
Long story short, I have a byte-search script that works on a sorted file. If I open the file with open(file_path, 'r', encoding='utf-8', errors='replace') as file: instead, the search doesn't work properly, because errors='replace' swaps out the characters that would normally raise a UnicodeDecodeError. Imagine the file looks like this; this is how it gets read:
a
b
�
d
If the script is searching for "c" and comes to the line starting with �, it checks whether "c" comes before or after � and then moves in the wrong direction (up instead of down, say), because the file is sorted according to UTF-8.
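Here is a tiny self-contained illustration of the ordering problem (the 0xE9 byte is just a made-up stand-in for whatever invalid byte is really in the file): the replacement moves the bad line to a completely different position in the sort order, so the comparisons the search makes no longer agree with how the file was ordered.

# Hypothetical raw lines; b'\xe9' stands in for an invalid UTF-8 byte.
raw_lines = [b'a', b'b', b'\xe9', b'd']

# Raw byte order: 0xE9 (233) sorts after 'd'.
print(sorted(raw_lines))        # [b'a', b'b', b'd', b'\xe9']

# After decoding with errors='replace' and swapping U+FFFD for '?',
# the same line sorts before 'a', because '?' is byte 0x3F.
cleaned = [l.decode('utf-8', errors='replace').replace('\ufffd', '?') for l in raw_lines]
print(sorted(cleaned))          # ['?', 'a', 'b', 'd']

# And the replacement character itself sorts after every ASCII letter.
print('\ufffd' > 'c')           # True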
I want to make sure the file can never give me a UnicodeDecodeError, because every character that could cause that error should already have been replaced with "?" before sorting.
No matter what I try, those weird characters are still there.
How can I do that?
If you are getting a UnicodeDecodeError, it means your file is not valid UTF-8. Therefore, I do not see how your file can be "sorted regarding utf-8". If you just want to do bytewise operations on the file, use "rb" instead of "r", ditch the encoding, and work with byte strings instead of Unicode strings.

Could the file be cp1252, latin-1, cp437, etc.? And don't just try those encodings out and hope; latin-1 in particular decodes anything. You need to know the encoding, or verify the results. If the file contains non-UTF-8 characters, it usually means either: 1) it's some other encoding, and ignoring the errors is wrong, or 2) it's mostly UTF-8 with transcluded non-UTF-8 data in it (in which case you probably want to fix whatever is generating output with mixed encodings, or, in rare cases, it's intended and you need a parser for whatever format the file really is in).

You're saying output.txt has non-UTF-8 bytes in it? Because, umm... it definitely doesn't. There is no way that the script you wrote, if it runs to completion without an exception, will have produced an output.txt file with any encoding but UTF-8. The data might be garbled (because input.txt isn't actually UTF-8 and you're blithely changing the unrecognized bytes to filler characters), which might mess up your data, but your sorting would be correct, insofar as it is sorting the post-mangling version of the data.

About open(file_path, 'r', encoding='utf-8', errors='replace') as file: yes, it shouldn't have those bytes, but it does, and I haven't been able to figure out how for weeks. It doesn't matter whether the input had non-UTF-8 characters, because they should already have been removed by that script.
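For what it's worth, here is a rough sketch of the byte-string approach suggested above: open the sorted file in 'rb' mode and binary-search it by comparing raw lines as bytes, so nothing is ever decoded and a UnicodeDecodeError cannot happen. It assumes the file is newline-delimited and sorted by raw byte order (for a file that really is pure UTF-8, that matches Python's str sort order, since UTF-8 preserves code-point order); contains_line and the output.txt in the example call are just placeholders.

import os

def contains_line(path, target: bytes) -> bool:
    # Binary-search a newline-delimited file that is sorted by raw byte order.
    # All comparisons are done on bytes, so nothing is decoded and a
    # UnicodeDecodeError cannot occur.
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            if mid == 0:
                f.seek(0)
            else:
                f.seek(mid - 1)
                f.readline()              # finish the line we landed in the middle of
            pos = f.tell()                # start of the first full line at or after mid
            line = f.readline()
            if not line:                  # no line starts at or after mid
                hi = mid
                continue
            key = line.rstrip(b'\r\n')
            if key == target:
                return True
            if key < target:              # this line and everything before it are too small
                lo = pos + len(line)
            else:                         # this line and everything after it are too big
                hi = mid
    return False

# Hypothetical usage: look for the line "c" in the sorted file.
print(contains_line('output.txt', 'c'.encode('utf-8')))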