Find non-ASCII bytestrings in Python source code

Question

All my Python source code is encoded in UTF-8 and declares this encoding at the top of the file.

But sometimes the u prefix before a Unicode string literal is missing.

Example: Umlauts = "üöä"

The above is a bytestring containing non-ASCII characters, and this causes trouble (UnicodeDecodeError).

I tried pylint and python -3, but neither produced a warning.

I am looking for an automated way to find non-ASCII characters in bytestrings.

My source code needs to support Python 2.6 and Python 2.7.

I get this well-known error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
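
For reference, the message can be reproduced in isolation. In Python 2 the same decode happens implicitly whenever a bytestring is combined with a unicode object (for example u"x" + "ü"). The following is only a sketch; the helper name ascii_decode_error and the sample text are made up for illustration and are not from the original post.

```python
# -*- coding: utf-8 -*-

def ascii_decode_error(data):
    """Return the error message from decoding data as ASCII, or None if it decodes."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# UTF-8 bytes of u"Umlaut ü": the first non-ASCII byte (0xc3) sits at index 7.
raw = u"Umlaut \u00fc".encode("utf-8")
message = ascii_decode_error(raw)
print(message)
```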

BTW: This question is only about Python source code, not about strings read from files or sockets.

Solution

For projects that need to support Python 2.6+, I will use __future__.unicode_literals.
For projects that need to support Python 2.5, I will use the solution from thg435 (module ast).
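
A minimal sketch of the first option, assuming a UTF-8 encoded file with a coding declaration: with the unicode_literals future import, unprefixed literals in that module become unicode objects on Python 2.6+ (Python 3 accepts the import and it has no effect there). The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# No u prefix, yet this is a unicode object on Python 2.6+ (and a str on Python 3).
umlauts = "üöä"

# An explicit b prefix still gives a bytestring where one is really wanted.
raw = b"abc"

# The type of an unprefixed literal: unicode on Python 2 with the import, str on Python 3.
text_type = type("")
```

Note that the import changes every unprefixed literal in the module, so any place that really needs bytes has to be marked with a b prefix.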

Finding those strings and sticking a u in front of them is not going to solve your problem. This error appears whenever you do something with your data (like printing) where the accepting function doesn't expect characters encoded that way. You need to make sure that all strings in your program are handled as Unicode as soon and as long as possible and only encoded to specific, matching encodings when exporting/printing etc.
– Tim Pietzcker, Commented Sep 28, 2012 at 9:44
First of all, I love __future__.unicode_literals. Second: to find those, I would probably try using grep like in this example. Of course this will also find those characters outside of bytestrings, but I assume there aren't many variables with umlaut names, are there?
– javex, Commented Sep 28, 2012 at 9:45
@javex: Good point; it's devilishly hard to match all forms of strings in Python with regexes (think of strings like """'"'\""\n'''""")...
– Tim Pietzcker, Commented Sep 28, 2012 at 9:51
@TimPietzcker: Correct, that's why you just search for a specific byte range. That will find any non-ASCII characters. Then you can change those that need a change.
– javex, Commented Sep 28, 2012 at 9:53

georg · Accepted Answer · 2012-09-28 13:01:55Z

Of course you want to use Python for this!

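import ast
import re

# Python 2 only: relies on the print statement and on str being a bytestring.
with open("your_script.py") as fp:
    tree = ast.parse(fp.read())

for node in ast.walk(tree):
    # Literals without a u prefix have a str (bytes) value; u"..." literals are unicode.
    if (isinstance(node, ast.Str)
            and isinstance(node.s, str)
            and re.search(r'[\x80-\xFF]', node.s)):
        print 'bad string %r line %d col %d' % (node.s, node.lineno, node.col_offset)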

Note that this doesn't distinguish between bare and escaped non-ASCII characters (fuß and fu\xdf), because the AST only holds the evaluated string value, not the literal as it was typed.
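
If that distinction matters, one possible approach (not part of the original answer) is to look at the raw literal text with the standard tokenize module instead of the evaluated value. The sketch below is written for Python 3 and takes the already decoded source text; the function name find_bare_non_ascii is hypothetical. It assumes the file tokenizes under the interpreter running the check, which may not hold for every piece of Python 2 syntax.

```python
import io
import re
import tokenize

# Prefix letters (u, r, b, ur, ...) that precede the opening quote of a literal.
_PREFIX = re.compile(r'^[A-Za-z]*')
# Any character outside the ASCII range, as typed in the source.
_NON_ASCII = re.compile(r'[^\x00-\x7f]')


def find_bare_non_ascii(source_text):
    """Return (line, col, token_text) for string literals that have no u prefix
    and contain non-ASCII characters typed directly into the source."""
    hits = []
    readline = io.StringIO(source_text).readline
    for tok in tokenize.generate_tokens(readline):
        tok_type, tok_text, (line, col) = tok[0], tok[1], tok[2]
        if tok_type != tokenize.STRING:
            continue
        prefix = _PREFIX.match(tok_text).group(0).lower()
        # Escapes such as \xdf are plain ASCII in the raw token, so they are skipped.
        if 'u' not in prefix and _NON_ASCII.search(tok_text):
            hits.append((line, col, tok_text))
    return hits


sample = u'a = "\u00fc\u00f6\u00e4"\nb = u"\u00fc"\nc = "fu\\xdf"\n'
for line, col, text in find_bare_non_ascii(sample):
    print('bare non-ASCII literal %s at line %d col %d' % (text, line, col))
```

Because this works on tokens rather than regexes over whole lines, comments and identifiers are ignored and multi-line or nested-quote literals are handled by the tokenizer. A b-prefixed literal with raw non-ASCII characters would also be reported, which may or may not be wanted.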

edited Sep 28, 2012 at 13:01

answered Sep 28, 2012 at 10:34
