
All my Python source code is encoded in UTF-8 and declares this encoding at the top of each file.

But sometimes the u prefix before a unicode string literal is missing.

Example:

Umlauts = "üöä"

The literal above is a bytestring containing non-ASCII characters, and this causes trouble (UnicodeDecodeError).

I tried pylint and python -3, but neither produced a warning.

I am looking for an automated way to find non-ASCII characters in bytestring literals.

My source code needs to support Python 2.6 and Python 2.7.

I get this well-known error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
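
For context: in Python 2, mixing a bytestring with a unicode string triggers an implicit decode using the default ascii codec, and that decode fails on the first byte above 0x7F. The following minimal sketch reproduces the same kind of failure with an explicit decode. The sample bytes and the helper name ascii_decode_error are illustrative assumptions, not taken from the code in question.

```python
def ascii_decode_error(raw):
    """Return the UnicodeDecodeError raised by an ASCII decode of raw, or None."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc
    return None


# Hypothetical sample: the UTF-8 bytes for "Umlaut " followed by u-umlaut.
sample = b"Umlaut \xc3\xbc"
error = ascii_decode_error(sample)

# The first non-ASCII byte (0xc3) sits at index 7, after the 7 ASCII bytes.
print(error)
```

The explicit decode is only a stand-in for the implicit conversion; in real code the decode is usually hidden inside an operation such as concatenation or string formatting.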

BTW: This question is only about Python source code, not about strings read from files or sockets.

Solution

  • for projects which need to support Python 2.6+ I will use __future__.unicode_literals
  • for projects which need to support 2.5 I will use the solution from thg435 (module ast)

5 Comments

  • Could you elaborate on "makes trouble"? Commented Sep 28, 2012 at 9:33
  • Finding those strings and sticking a u in front of them is not going to solve your problem. This error appears whenever you do something with your data (like printing) where the accepting function doesn't expect characters encoded that way. You need to make sure that all strings in your program are handled as Unicode as early and for as long as possible, and only encoded to specific, matching encodings when exporting, printing, etc. Commented Sep 28, 2012 at 9:44
  • First of all, I love __future__.unicode_literals. Second: to find those I would probably try using grep like in this example. Of course this will also find those characters outside of bytestrings, but I assume there aren't many variables with umlaut names, are there? Commented Sep 28, 2012 at 9:45
  • @javex: Good point; it's devilishly hard to match all forms of strings in Python with regexes (think of strings like """'"'\""\n'''""")... Commented Sep 28, 2012 at 9:51
  • @TimPietzcker: Correct, that's why you just search for a specific byte range. That will find any non-ASCII characters. Then you can change those that need a change. Commented Sep 28, 2012 at 9:53

1 Answer

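Of course you want to use Python for this!

# Python 2 only (print statement; ast.Str holds a bytestring for plain literals).
import ast
import re

# Parse the script into an AST; ast.parse honours the coding declaration.
with open("your_script.py") as fp:
    tree = ast.parse(fp.read())

# Report every plain (non-unicode) string literal that contains a non-ASCII byte.
for node in ast.walk(tree):
    if (isinstance(node, ast.Str)
            and isinstance(node.s, str)
            and re.search(r'[\x80-\xFF]', node.s)):
        print 'bad string %r line %d col %d' % (node.s, node.lineno, node.col_offset)

Note that this doesn't distinguish between bare and escaped non-ASCII characters (fuß and fu\xdf).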
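
If that distinction matters, one possible alternative is to inspect the source text of each literal with the standard tokenize module instead of the parsed value. The following is a minimal sketch under stated assumptions, not part of the approach above: it is written for Python 3, used only as the checking tool while the scanned files may still be Python 2 code, and the name find_non_ascii_bytestrings is made up for illustration. It reports only literals without a u prefix whose source text contains raw non-ASCII characters, so an escaped literal such as "fu\xdf" is not flagged. Python 2-only prefixes such as ur may not be recognized by the Python 3 tokenizer.

```python
import io
import tokenize


def find_non_ascii_bytestrings(source):
    """Yield (line, col, literal) for string literals without a u prefix
    whose source text contains raw non-ASCII characters."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type != tokenize.STRING:
            continue
        literal = tok.string
        # The prefix is everything before the first quote character.
        quote_pos = min(
            pos for pos in (literal.find("'"), literal.find('"')) if pos != -1
        )
        prefix = literal[:quote_pos].lower()
        if "u" in prefix:
            continue  # explicitly marked as unicode
        if any(ord(ch) > 127 for ch in literal):
            yield tok.start[0], tok.start[1], literal


if __name__ == "__main__":
    # Hypothetical sample source; \u00fc is written as an escape here
    # only to keep this sketch itself ASCII.
    sample = 'a = "\u00fc"\nb = u"\u00fc"\n'
    for line, col, literal in find_non_ascii_bytestrings(sample):
        print("line %d col %d: %s" % (line, col, literal))
```

To check a real file, read it as text with the encoding named in its coding declaration and pass the result to the function.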
