Last week I decided to update one of my larger Python 2.7 projects to support Unicode fully and to run under Python 3.
Here are the steps that I took and some gotchas I ran into along the way.
Create and fully automate the unit test suite:
Identify all boundaries where data is being exchanged between the program and the outside environment. In my case:
read and write calls
open and file paths
the subprocess module
sys.argv
the time module

At boundaries:
When choosing an encoding, sys.getfilesystemencoding() is sometimes appropriate. Be sure to test on Windows, where this value is typically not UTF-8.
Open files explicitly in binary mode (i.e. rb or wb) or text mode (i.e. rt or wt).
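A minimal sketch of those two boundary rules, under some assumptions: the helper names to_unicode, write_bytes and read_text are hypothetical, and the UTF-8 fallback is only a guess for the rare case where no filesystem encoding is reported. It uses io.open for text mode because that accepts an encoding argument on both Python 2 and 3.

```python
import io
import sys


def to_unicode(value, encoding=None):
    """Decode a bytestring received at a boundary; pass Unicode through."""
    if isinstance(value, bytes):
        # Assumption: fall back to UTF-8 if no encoding is given or reported.
        encoding = encoding or sys.getfilesystemencoding() or 'utf-8'
        return value.decode(encoding)
    return value


def write_bytes(path, data):
    """Write a bytestring using an explicit binary mode."""
    with open(path, 'wb') as f:
        f.write(data)


def read_text(path, encoding='utf-8'):
    """Read a Unicode string using an explicit text mode and encoding."""
    with io.open(path, 'rt', encoding=encoding) as f:
        return f.read()


# Command-line arguments are bytestrings in Python 2 and Unicode in Python 3.
arguments = [to_unicode(arg) for arg in sys.argv[1:]]
```

Keeping the decode step in one helper means the rest of the program only ever sees Unicode strings, whichever interpreter is running it.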
json module. Even then, it’s a good idea to restrict text-based output to the ASCII character set since the native text encoding may not support the full Unicode set, particularly on Windows.

Finally:
Run the 2to3 converter on your program so that it can be run by Python 3.
Uses of StringIO need to be migrated to BytesIO when used as a byte buffer. If used as a string buffer, they need to be left as StringIO.
The following shim provides BytesIO and StringIO in both Python 2 and 3. I put it in a utility module and the rest of the program imports the shims instead of using the standard library.
# BytesIO presents a stream interface to an in-memory bytestring.
#
# This is equivalent to StringIO in Python 2 and to BytesIO in Python 3.
try:
    from io import BytesIO  # Python 3
except ImportError:
    from StringIO import StringIO as BytesIO

# StringIO presents a stream interface to an in-memory string
# (which is a bytestring in Python 2 and a unicode string in Python 3).
#
# This is equivalent to StringIO in both Python 2 and 3.
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO  # Python 3
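As a rough illustration of the byte-buffer versus string-buffer distinction, the sketch below uses each shim for its intended purpose. The shim imports are repeated only so the example runs on its own (in a real program they would come from the utility module), and build_record and build_report are hypothetical names.

```python
# Shims repeated here so the example is self-contained.
try:
    from io import BytesIO
except ImportError:
    from StringIO import StringIO as BytesIO
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


def build_record(type_code, payload):
    """Assemble binary data: a byte buffer, so BytesIO."""
    buf = BytesIO()
    buf.write(type_code)
    buf.write(payload)
    return buf.getvalue()


def build_report(lines):
    """Assemble native-string output: a string buffer, so StringIO."""
    buf = StringIO()
    for line in lines:
        buf.write(line)
        buf.write('\n')
    return buf.getvalue()


record = build_record(b'APPL', b'\x00\x01')
report = build_report(['first line', 'second line'])
```

On Python 3, passing a native string to the BytesIO buffer raises a TypeError, which is a quick way to find buffers that were migrated the wrong way.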
Replace output.write(chr(byte_ordinal)) with output.write(bchr(byte_ordinal)) and use the following shim for bchr:
# bchr() converts the specified byte integer value to a single character
# bytestring.
#
# This is equivalent to chr() in Python 2 but requires special handling in
# Python 3.
if bytes == str:
    def bchr(byte_ordinal):
        return chr(byte_ordinal)
else:
    def bchr(byte_ordinal):
        return bytes([byte_ordinal])  # Python 3
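For comparison, here is a sketch of two version-independent alternatives to the shim above, using the standard struct module and bytearray. The names bchr_struct and bchr_bytearray are hypothetical. This is not the approach taken in this post, just another way to get a single-byte bytestring on both interpreters.

```python
import struct


def bchr_struct(byte_ordinal):
    # 'B' packs one unsigned byte (0-255); out-of-range values raise
    # struct.error on both Python 2 and 3.
    return struct.pack('B', byte_ordinal)


def bchr_bytearray(byte_ordinal):
    # bytearray([n]) holds one byte with value n; bytes() copies it into an
    # immutable bytestring (str in Python 2, bytes in Python 3).
    return bytes(bytearray([byte_ordinal]))


# Note the list in bytes([n]) in the Python 3 branch of the shim:
# in Python 3, bytes(n) with a bare integer creates n zero bytes instead.
header = b''.join(bchr_struct(n) for n in (0x41, 0x50, 0x50, 0x4C))
```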
Look for comparisons with ''. Some of them probably need to be converted to comparisons with b''. For example, I had to adjust my at_eof utility function:
def at_eof(input):
    """
    Returns whether the specified input stream is at EOF.
    """
    with save_stream_position(input):
        at_eof = input.read(1) == b''
    return at_eof
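The save_stream_position helper is not shown in this post. The sketch below is one plausible way to write it, assuming the stream is seekable and supports tell() and seek(). It is a hypothetical stand-in, not the original implementation.

```python
import contextlib
import io


@contextlib.contextmanager
def save_stream_position(stream):
    """Restore the stream's position when the with-block exits."""
    position = stream.tell()
    try:
        yield stream
    finally:
        stream.seek(position)


def at_eof(stream):
    """Return whether a binary stream is at EOF, without consuming data."""
    with save_stream_position(stream):
        # In Python 3, b'' == '' is False, so comparing against '' here
        # would make this function always return False.
        return stream.read(1) == b''


stream = io.BytesIO(b'ab')
before = at_eof(stream)   # data remains, and the position is unchanged
stream.read()
after = at_eof(stream)    # everything has been consumed
```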
Mark bytestring literals with the b prefix appropriately. For example:
if type_code == b'APPL':
Iterating over a bytestring with for c in data: will give back single-character bytestrings in Python 2 but byte ordinals in Python 3.
Use the following shim so that for b in iterord(data): always iterates over byte ordinals:
# iterord() iterates over the integer values of the bytes in the specified
# bytestring.
if bytes == str:
    def iterord(bytes_value):  # Python 2
        for b in bytes_value:
            yield ord(b)
else:
    def iterord(bytes_value):  # Python 3
        return bytes_value
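A shim-free variant is also possible: iterating over a bytearray yields integers on both Python 2 and 3. The sketch below uses that property. The names iterord_bytearray and checksum are hypothetical, and the checksum is only there to show the ordinals being consumed.

```python
def iterord_bytearray(bytes_value):
    """Iterate over byte ordinals via bytearray (copies the data once)."""
    return iter(bytearray(bytes_value))


def checksum(data):
    """Sum of all byte values modulo 256, identical on Python 2 and 3."""
    total = 0
    for b in iterord_bytearray(data):
        total = (total + b) % 256
    return total


ordinals = list(iterord_bytearray(b'AB'))
```

The trade-off is an extra copy of the data, which may matter for very large bytestrings. The generator-based shim above avoids that.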
time.strftime returns a bytestring in Python 2 and a Unicode string in Python 3. Usually you want a Unicode string. I didn’t create a shim for this since I only had one use of strftime.
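now_string = time.strftime('%b %d %H:%M', time.localtime(now))
# Python 2 returns a bytestring here; Python 3 already returns Unicode.
# Testing against bytes (rather than unicode) works unmodified on both.
if isinstance(now_string, bytes):
    now_string = now_string.decode('ascii')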
Regular expressions that are intended to apply to bytestrings instead of Unicode strings may need updating.
For example, re.compile(r'^Volume name is "(.*)"$'), which is designed to be applied to a MacRoman-encoded bytestring, would need to be updated to read re.compile(br'^Volume name is "(.*)"$').
Don’t forget to update your documentation to specify where Unicode strings (or ASCII bytestring literals) are expected.
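To make the regular-expression change above concrete, here is a small sketch that matches a bytes pattern against a bytestring and decodes the captured group only afterwards. The sample line and the VOLUME_NAME_RE name are made up for illustration. mac_roman is Python's standard codec name for MacRoman.

```python
import re

# A bytes pattern: required in Python 3 when the subject is a bytestring.
VOLUME_NAME_RE = re.compile(br'^Volume name is "(.*)"$')

# Hypothetical tool output, MacRoman-encoded (0x8e is e-acute in MacRoman).
line = b'Volume name is "Caf\x8e"'

match = VOLUME_NAME_RE.match(line)
raw_name = match.group(1)                  # still a bytestring
volume_name = raw_name.decode('mac_roman') # Unicode from here on
```

In Python 3, applying the original string pattern to this bytestring raises a TypeError rather than silently failing to match, so the unit tests tend to surface these cases quickly.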