Last week I decided to update one of my larger Python 2.7 projects to support Unicode fully and to run under Python 3.
Here are the steps that I took and some gotchas I ran into along the way.
Create and fully automate the unit test suite:
Identify all boundaries where data is being exchanged between the program and the outside environment. In my case:
open and file paths)
sys.getfilesystemencoding() is sometimes appropriate. Be sure to test on Windows, where this value is typically not UTF-8.
wb) or text mode (i.e.
json module. Even then, it’s a good idea to restrict text-based output to the ASCII character set since the native text encoding may not support the full Unicode set, particularly on Windows.
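As a rough illustration of handling text at these boundaries, here is a minimal sketch that works on both Python 2.7 and Python 3. The helper names (`write_text`, `read_text`, `to_ascii_json`) and the choice of UTF-8 for the file are assumptions made for the example, not part of the original project.

```python
import io
import json
import os
import tempfile


def write_text(path, text):
    # Encode explicitly at the boundary instead of relying on the
    # platform's default encoding. io.open behaves the same way on
    # Python 2.7 and Python 3.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)


def read_text(path):
    # Decode explicitly when the data comes back in.
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()


def to_ascii_json(value):
    # ensure_ascii=True (the default) escapes every non-ASCII character
    # as \uXXXX, so the output is safe for any ASCII-compatible console.
    return json.dumps(value, ensure_ascii=True)


path = os.path.join(tempfile.mkdtemp(), 'example.txt')
write_text(path, u'caf\xe9')
roundtrip = read_text(path)
encoded = to_ascii_json({u'name': u'caf\xe9'})
```

Inside the program the value stays a Unicode string; bytes only exist at the edges.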
Run the 2to3 converter on your program so that it can be run by Python 3.
Uses of StringIO need to be migrated to BytesIO when used as a byte buffer. If used as a string buffer, the StringIO uses need to be left as StringIO in both Python 2 and 3. I put the following shims in a utility module, and the rest of the program imports the shims instead of using the standard library directly.
    # BytesIO presents a stream interface to an in-memory bytestring.
    #
    # This is equivalent to StringIO in Python 2 and to BytesIO in Python 3.
    try:
        from io import BytesIO  # Python 3
    except ImportError:
        from StringIO import StringIO as BytesIO

    # StringIO presents a stream interface to an in-memory string
    # (which is a bytestring in Python 2 and a unicode string in Python 3).
    #
    # This is equivalent to StringIO in both Python 2 and 3.
    try:
        from StringIO import StringIO
    except ImportError:
        from io import StringIO  # Python 3
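To show the byte-buffer case in use, here is a small sketch that reads a binary field from an in-memory stream. The `read_uint16_be` helper and the sample data are hypothetical; the import fallback is repeated only so the example runs on its own.

```python
import struct

try:
    from io import BytesIO
except ImportError:
    from StringIO import StringIO as BytesIO


def read_uint16_be(stream):
    # Read two bytes and interpret them as a big-endian unsigned integer.
    (value,) = struct.unpack('>H', stream.read(2))
    return value


buffer = BytesIO(b'\x01\x02rest')
first = read_uint16_be(buffer)
remainder = buffer.read()
```

Because the buffer is created from a `b''` literal and only ever read as bytes, the same code behaves identically under both interpreters.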
output.write(bchr(byte_ordinal)) and use the following shim for bchr:
    # bchr() converts the specified byte integer value to a single character
    # bytestring.
    #
    # This is equivalent to chr() in Python 2 but requires special handling in
    # Python 3.
    if bytes == str:
        def bchr(byte_ordinal):
            return chr(byte_ordinal)
    else:
        def bchr(byte_ordinal):
            return bytes([byte_ordinal])  # Python 3
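A short sketch of the shim in use, writing individual byte values to an in-memory stream. The sample ordinals are arbitrary, and the `bytearray` variant at the end is an alternative I am suggesting, not something from the original program. Note that in Python 3 `bytes(65)` would produce 65 zero bytes rather than `b'A'`, which is why the shim wraps the ordinal in a list.

```python
from io import BytesIO

if bytes == str:
    def bchr(byte_ordinal):  # Python 2
        return chr(byte_ordinal)
else:
    def bchr(byte_ordinal):  # Python 3
        return bytes([byte_ordinal])

# Write byte ordinals one at a time, as a binary encoder might.
output = BytesIO()
for byte_ordinal in (0x41, 0x00, 0xFF):
    output.write(bchr(byte_ordinal))
result = output.getvalue()

# Alternative that needs no shim: bytearray accepts a list of ordinals
# on both Python 2 and Python 3.
alternative = bytes(bytearray([0x41, 0x00, 0xFF]))
```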
''. Some of them probably need to be converted to comparisons with b''. For example, I had to adjust my at_eof function:
    def at_eof(input):
        """
        Returns whether the specified input stream is at EOF.
        """
        with save_stream_position(input):
            at_eof = input.read(1) == b''
        return at_eof
if type_code == b'APPL':
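To make the EOF check runnable on its own, here is a sketch with one possible `save_stream_position`. The original helper is not shown in the post, so this context manager is an assumption about what it does: remember the stream position and restore it afterwards. The comparison has to use `b''` because in Python 3 a binary stream returns `bytes`, and `b'' == ''` is `False`.

```python
import contextlib
from io import BytesIO


@contextlib.contextmanager
def save_stream_position(stream):
    # Hypothetical helper: restore the stream position on exit, even if
    # the body raises or returns early.
    position = stream.tell()
    try:
        yield
    finally:
        stream.seek(position)


def at_eof(stream):
    """Returns whether the specified binary stream is at EOF."""
    with save_stream_position(stream):
        return stream.read(1) == b''


stream = BytesIO(b'ab')
at_start = at_eof(stream)
position_after_check = stream.tell()
stream.read()
at_end = at_eof(stream)
```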
for c in data: will give back single-character bytestrings in Python 2 but byte ordinals in Python 3.
A shim can make for b in iterord(data): always iterate over byte ordinals:
    # iterord() iterates over the integer values of the bytes in the specified
    # bytestring.
    if bytes == str:
        def iterord(bytes_value):  # Python 2
            for b in bytes_value:
                yield ord(b)
    else:
        def iterord(bytes_value):  # Python 3
            return bytes_value
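A brief sketch of the shim in a typical loop. The `checksum` function is a made-up example, not code from the original project. Indexing has the same split as iteration: `data[0]` is a one-character bytestring in Python 2 and an integer in Python 3, whereas the slice `data[0:1]` is a bytestring on both.

```python
if bytes == str:
    def iterord(bytes_value):  # Python 2
        for b in bytes_value:
            yield ord(b)
else:
    def iterord(bytes_value):  # Python 3
        return bytes_value


def checksum(data):
    # Hypothetical example: sum of all byte values, modulo 256.
    total = 0
    for b in iterord(data):
        total = (total + b) & 0xFF
    return total


data = b'\x01\x02\xff'
ordinals = list(iterord(data))
total = checksum(data)
first_byte = data[0:1]  # slicing yields a bytestring on both versions
```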
time.strftime returns a bytestring in Python 2 and a Unicode string in Python 3. Usually you want a Unicode string. I didn’t create a shim for this since I only had one use of time.strftime:
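    now_string = time.strftime('%b %d %H:%M', time.localtime(now))
    # Checking against bytes avoids the name unicode, which does not exist in
    # Python 3 unless 2to3 rewrites it.
    if isinstance(now_string, bytes):  # true only in Python 2
        now_string = now_string.decode('ascii')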
Regular expressions that are intended to apply to bytestrings instead of Unicode strings may need updating.
re.compile(r'^Volume name is "(.*)"$'), which is designed to be applied to a MacRoman-encoded bytestring, would need to be updated to read
re.compile(br'^Volume name is "(.*)"$')
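For illustration, here is a minimal sketch of matching with the bytes pattern and decoding the captured group afterwards. The `parse_volume_name` function and the sample line are assumptions for the example. In Python 3, applying a `str` pattern to a `bytes` value raises a `TypeError`, so the pattern and its input must be the same type.

```python
import re

# The br'' prefix makes this a bytes pattern on both Python 2.7 and 3.
VOLUME_NAME_RE = re.compile(br'^Volume name is "(.*)"$')


def parse_volume_name(line):
    # line is a MacRoman-encoded bytestring; the result is a Unicode string.
    match = VOLUME_NAME_RE.match(line)
    if match is None:
        return None
    return match.group(1).decode('mac_roman')


# Build a sample input by encoding a Unicode string as MacRoman.
line = u'Volume name is "Caf\xe9"'.encode('mac_roman')
name = parse_volume_name(line)
missing = parse_volume_name(b'unrelated output')
```

Matching on bytes and decoding only the captured group keeps the encoding decision in one place.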
Don’t forget to update your documentation to specify where Unicode (or ASCII bytestring literals) are expected.