Last week I decided to update one of my larger Python 2.7 projects to support Unicode fully and to run under Python 3.
Here are the steps that I took and some gotchas I ran into along the way.
Create and fully automate the unit test suite:
Identify all boundaries where data is being exchanged between the program and the outside environment. In my case:
read
and write
)open
and file paths)subprocess
module)sys.argv
)time
module)At boundaries:
sys.getfilesystemencoding()
is sometimes appropriate. Be sure to test on Windows, where this value is typically not UTF-8.rb
or wb
) or text mode (i.e. rt
or wt
).
json
module. Even then, it’s a good idea to restrict text-based output to the ASCII character set since the native text encoding may not support the full Unicode set, particularly on Windows.Finally:
2to3
converter on your program so that it can be run by Python 3.StringIO
need to be migrated to BytesIO
, when used as a byte buffer. If used as a string buffer, the StringIO
uses need to be left at StringIO
.
BytesIO
and StringIO
in both Python 2 and 3. I put it in a utility module and the rest of the program imports the shims instead of using the standard library.# BytesIO presents a stream interface to an in-memory bytestring. # # This is equivalent to StringIO in Python 2 and to BytesIO in Python 3. try: from io import BytesIO # Python 3 except ImportError: from StringIO import StringIO as BytesIO # StringIO presents a stream interface to an in-memory string # (which is a bytestring in Python 2 and a unicode string in Python 3). # # This is equivalent to StringIO in both Python 2 and 3. try: from StringIO import StringIO except ImportError: from io import StringIO # Python 3
output.write(chr(byte_ordinal))
with output.write(bchr(byte_ordinal))
and use the following shim for bchr
:# bchr() converts the specified byte integer value to a single character # bytestring. # # This is equivalent to chr() in Python 2 but requires special handling in # Python 3. if bytes == str: def bchr(byte_ordinal): return chr(byte_ordinal) else: def bchr(byte_ordinal): return bytes([byte_ordinal]) # Python 3
''
. Some of them probably need to be converted to comparisons with b''
. For example, I had to adjust my at_eof
utility function:def at_eof(input): """ Returns whether the specified input stream is at EOF. """ with save_stream_position(input): at_eof = input.read(1) == b'' return at_eof
b
prefix appropriately.if type_code == b'APPL':
for c in data:
will give back single-character bytestrings in Python 2 but byte ordinals in Python 3.for b in iterord(data):
always iterate over byte ordinals:# iterord() iterates over the integer values of the bytes in the specified # bytestring. if bytes == str: def iterord(bytes_value): # Python 2 for b in bytes_value: yield ord(b) else: def iterord(bytes_value): # Python 3 return bytes_value
time.strftime
returns a bytestring in Python 2 and a Unicode string in Python 3. Usually you want a Unicode string. I didn’t create a shim for this since I only had one use of strftime
.now_string = time.strftime('%b %d %H:%M', time.localtime(now)) if not isinstance(now_string, unicode): now_string = now_string.decode('ascii')
Regular expressions that are intended to apply to bytestrings instead of Unicode strings may need updating.
re.compile(r'^Volume name is "(.*)"$')
which is designed to be applied to a MacRoman-encoded bytestring would need to be updated to read re.compile(br'^Volume name is "(.*)"$')
Don’t forget to update your documentation to specify where Unicode (or ASCII bytestring literals) are expected.