Discussion:
[Tutor] unicode utf-16 and readlines
Poor Yorick
2003-01-04 15:22:01 UTC
On Windows 2000, Python 2.2.1 open.readlines seems to read lines
>>> fh = open('0022data2.txt')
>>> a = fh.readlines()
>>> print a
['\xff\xfe\xfaQ\r\x00\n', '\x00']

In this example, Python seems to have incorrectly parsed the \r\n
characters at the end of the line. It's an error that one can work
around by slicing off the last three characters of every other list
element, but it makes working with utf-16 files non-intuitive,
especially for beginners. Or am I missing something?

Poor Yorick
***@pooryorick.com
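
A minimal sketch of why the list comes out this way, written for Python 3 (which postdates this thread). The sample text and the explicit little-endian encoding are assumptions chosen to reproduce the bytes in the post; the point is only that a byte-oriented readlines() splits after every 0x0A byte, which in UTF-16 is half of a character.

```python
import codecs
import io

# U+51FA followed by a Windows line ending, encoded as UTF-16 little-endian
# with a byte order mark, which matches the bytes shown in the post.
raw = codecs.BOM_UTF16_LE + u"\u51fa\r\n".encode("utf-16-le")

# A byte-oriented readlines() splits right after the 0x0A byte, leaving the
# second (zero) byte of the '\n' code unit stranded in the next element.
lines = io.BytesIO(raw).readlines()
print(lines)

# Decoding first and splitting afterwards keeps the line intact.
text_lines = raw.decode("utf-16").splitlines(True)
print(text_lines)
```

So the '\x00' left over in the second element is the high byte of the newline, not a parsing error in readlines() itself.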
Danny Yoo
2003-01-05 04:23:01 UTC
Post by Poor Yorick
On Windows 2000, Python 2.2.1 open.readlines seems to read lines
fh = open('0022data2.txt')
a = fh.readlines()
print a
['\xff\xfe\xfaQ\r\x00\n', '\x00']
Hi Poor Yorick,


You may want to use a "codec" to decode Unicode from a file. The 'codecs'
module is specifically designed for this:

http://www.python.org/doc/lib/module-codecs.html


For example:

###
>>> import codecs
>>> f = codecs.open('foo.txt', 'w', 'utf-16')
>>> f.write("hello world")
>>> f.close()
>>> open('foo.txt').read()
'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
>>> f2 = codecs.open('foo.txt', 'r', 'utf-16')
>>> f2.readlines()
[u'hello world']
###
Post by Poor Yorick
In this example, Python seems to have incorrectly parsed the \r\n
characters at the end of the line.
If you use 'codecs' and its open() function, you should be all set. I saw
a brief mention of it in a Unicode tutorial here:

http://www.reportlab.com/i18n/python_unicode_tutorial.html

I always wanted to know what 'codecs' did. Now I know. Cool. *grin*


Thanks for the question!
Poor Yorick
2003-01-07 07:01:02 UTC
Post by Danny Yoo
You may want to use a "codec" to decode Unicode from a file. The 'codecs'
http://www.python.org/doc/lib/module-codecs.html
###
import codecs
f = codecs.open('foo.txt', 'w', 'utf-16')
f.write("hello world")
f.close()
open('foo.txt').read()
'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
f2 = codecs.open('foo.txt', 'r', 'utf-16')
f2.readlines()
[u'hello world']
###
Thanks for the info! Using the codecs module is much better. It's
interesting to note, though, that when using the codecs module on a real
utf-16 text file, Python's automatic handling of new line characters
>>> import codecs
>>> fh = codecs.open('0022data2.txt', 'r', 'utf-16')
>>> a = fh.read()
>>> a
u'\u51fa\r\n'
>>> print a
??
>>> a = a.strip()
>>> print a
?
Danny Yoo
2003-01-08 00:34:17 UTC
Post by Poor Yorick
Thanks for the info! Using the codecs module is much better. It's
interesting to note, though, that when using the codecs module on a real
utf-16 text file, Python's automatic handling of new line characters
import codecs
fh = codecs.open('0022data2.txt', 'r', 'utf-16')
a = fh.read()
a
u'\u51fa\r\n'
print a
??
a = a.strip()
print a
?
Hi Poor Yorick!

I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() sucks all the text
out of a file.

Can you explain more what you mean by automatic newline handling? Do you
mean a conversion of '\r\n' to '\n'?
Post by Poor Yorick
a
u'\u51fa\r\n'
print a
??
a = a.strip()
print a
?
I think it's working. If we look at the actual representation of the
strings, both the carriage return and the newline are being removed:

###
>>> a = u"\u51fa\r\n"
>>> a
u'\u51fa\r\n'
>>> a.strip()
u'\u51fa'
###

I can't actually use print or str() to look at the Unicode characters on
my console, since that '\u51fa' character isn't ASCII.
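
A small sketch of one way to inspect such a string on an ASCII-only console, written for Python 3 (an assumption; the thread itself uses Python 2.2). The ascii() builtin and the "backslashreplace" error handler both escape the non-ASCII character instead of trying to print it.

```python
text = u"\u51fa\r\n"

# strip() removes leading and trailing whitespace, which includes both the
# carriage return and the newline.
stripped = text.strip()

# ascii() escapes every non-ASCII character, so the result is safe to print
# on any console.
print(ascii(text))

# Encoding with "backslashreplace" gives the same kind of escaped form as bytes.
safe = stripped.encode("ascii", "backslashreplace")
print(safe)
```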


Best of wishes to you!
Poor Yorick
2003-01-08 01:39:05 UTC
Post by Danny Yoo
I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() sucks all the text
out of a file.
Can you explain more what you mean by automatic newline handling? Do you
mean a conversion of '\r\n' to '\n'?
As you mentioned, strip works correctly with the list items returned by
codec.readlines(), so my problem is entirely resolved. Yes, I meant
that codecs.readlines returns '\r\n' where a standard file object
>>> import codecs
>>> fh = codecs.open('0022data2.txt', 'r', 'utf-16')
>>> a = fh.readlines()
>>> a
[u'\u51fa\r\n']
>>> fh = open('test1.txt', 'r')
>>> a = fh.readlines()
>>> a
['hello\n', 'goodbye\n', 'where\n', 'how\n', 'when']
Perhaps you could tell me if this inconsistency poses any implications
for the Python programmer.

Poor Yorick
***@pooryorick.com
Danny Yoo
2003-01-08 02:49:02 UTC
Post by Poor Yorick
Post by Danny Yoo
I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() sucks all the
text out of a file.
Can you explain more what you mean by automatic newline handling? Do
you mean a conversion of '\r\n' to '\n'?
As you mentioned, strip works correctly with the list items returned by
codec.readlines(), so my problem is entirely resolved. Yes, I meant
that codecs.readlines returns '\r\n' where a standard file object
Ah, I see what you mean now. No, as far as I understand, Python doesn't
do this automatic conversion of newlines. However, Python 2.3's
"Universal Newline" support is probably what you're looking for.

Till then, it's still possible to tie in a "\r\n" --> "\n" sort of scheme,
using the mechanism of 'codecs'. It's not documented too well, but Python
does support a scheme similar to the nested Readers that Java uses.


Here is a module I cooked up called 'u_newlines.py' that provides such a
codec for newline conversions:

###
"""A small demonstration of codec-writing. Transparently converts
"\r\n" into "\n"

Danny Yoo (***@hkn.eecs.berkeley.edu)
"""

import codecs

class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

    def decode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

class StreamWriter(Codec, codecs.StreamWriter): pass
class StreamReader(Codec, codecs.StreamWriter): pass

def getregentry():
    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)

codecs.register(getregentry)
###




Let's see this in action. Let's say that we had a sample file in UTF-16
format:

###
>>> sample_file = open("foo.txt", 'w')
>>> sample_file.write("hello world\r\nThis is a test\r\n".encode("utf-16"))
>>> sample_file.close()
>>> open("foo.txt").read()
'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00\r\x00\n\x00T\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00\r\x00\n\x00'
###


'foo.txt' now contains two lines of UTF-16 encoded text.


Notice that a normal read() on this file isn't working too well now, since
Python doesn't know that it's reading from a UTF-16 source. However, we
can fix this by introducing a codecs.EncodedFile() wrapper around the
object:

###
>>> import codecs
>>> import u_newlines
>>> f = codecs.EncodedFile(open("foo.txt"), 'ascii', 'utf-16')
>>> f.read()
'hello world\r\nThis is a test\r\n'
###

This EncodedFile wraps around our original file and decodes its UTF-16
contents, handing them back to us as ASCII'ish strings.
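
For comparison, a minimal Python 3 sketch of the same wrapper (Python 3 and the in-memory io.BytesIO stand-in for 'foo.txt' are assumptions, not part of the original session). The argument order is EncodedFile(file, data_encoding, file_encoding): the second name is what read() returns, the third is what the underlying file contains.

```python
import codecs
import io

# Two lines of UTF-16 (little-endian, with BOM), as in the sample file above.
raw = codecs.BOM_UTF16_LE + u"hello world\r\nThis is a test\r\n".encode("utf-16-le")

# The wrapped stream holds UTF-16 bytes; read() hands back the same text
# re-encoded as ASCII bytes. The '\r\n' pairs are left untouched.
f = codecs.EncodedFile(io.BytesIO(raw), "ascii", "utf-16")
data = f.read()
print(data)
```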



Here's the kicker: these EncodedFiles can be "composed": we can combine
the effects of several codecs, including that of u_newlines.py above, by
doing a double wrapping of our input file!

###
>>> f = codecs.EncodedFile(open("foo.txt"), 'ascii', 'utf-16')
>>> f2 = codecs.EncodedFile(f, 'u_newlines')
>>> f2.read()
'hello world\nThis is a test\n'
###


Muhahaha. *grin*




I hope this helps!
Poor Yorick
2003-01-08 03:22:34 UTC
Post by Danny Yoo
Ah, I see what you mean now. No, as far as I understand, Python doesn't
do this automatic conversion of newlines. However, Python 2.3's
"Universal Newline" support is probably what you're looking for.
Thanks for the sample code! It'll take me a while to digest....

But about newlines, I thought that '\n' was already a sort of universal
newline for Python. On windows platforms, both open.read and
open.readlines already transform '\r\n' into '\n' unless you use binary
mode. That's why I thought it was a discrepancy for codecs.open to
return '\r\n'.

Poor Yorick
***@pooryorick.com
Danny Yoo
2003-01-08 02:57:04 UTC
Post by Danny Yoo
###
"""A small demonstration of codec-writing. Transparently converts
"\r\n" into "\n"
"""
import codecs
return input.replace("\r\n", "\n"), len(input)
return input.replace("\r\n", "\n"), len(input)
class StreamWriter(Codec, codecs.StreamWriter): pass
class StreamReader(Codec, codecs.StreamWriter): pass
return (Codec().encode,Codec().decode,StreamReader,StreamWriter)
codecs.register(getregentry)
###
Small amendment: that last part of u_newlines.py is wrong; I should not
have touched codecs.register. Sorry about that! The corrected code,
along with a small test case, follows below:



###
"""A small demonstration of codec-writing. Transparently converts
"\r\n" into "\n"

Danny Yoo (***@hkn.eecs.berkeley.edu)
"""

import codecs

class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

    def decode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

class StreamWriter(Codec, codecs.StreamWriter): pass
class StreamReader(Codec, codecs.StreamWriter): pass

def getregentry():
    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)


def _test():
    import StringIO
    f = codecs.EncodedFile(StringIO.StringIO("Hello\r\nWorld\r\n"),
                           "u_newlines")
    assert f.readlines() == ["Hello\n", "World\n"]
    print "All done!"

if __name__ == '__main__':
    _test()
###




Hope this helps!
Poor Yorick
2003-01-21 20:12:08 UTC
Post by Danny Yoo
Small amendment: that last part of u_newlines.py is wrong; I should not
have touched codecs.register. Sorry about that! The corrected code,
Danny, I've modified your code a little to make it platform-independent,
and added my own "overridden" version of codecs.open. My goal is to use
yCodecs.open in all my Python development and to abandon
__builtin__.open completely. For my projects, it's essential to be able
to handle files of various text encodings with grace and in a
platform-independent manner. I have two questions about the following code:

Why is it not necessary to register the codec using codecs.register?
Isn't that essential to Python being able to find the codec?

I figured it out by trial and error, but I don't understand why, in my
class Codec, StreamWriter inherits from StreamReader and StreamReader
inherits from StreamWriter.


### yCodecs.py###
___________________


#! /usr/bin/env python

'''yCodecs.open provides a "wrapper" to codecs.open, adding transparent
newline conversion when files are opened in text mode. Note that
codecs.open actually opens all files in binary mode. '''

import codecs

import yu_newlines

def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    '''same as codecs.open with addition of automatic conversion of
    os.linesep to "\n" in Python
    '''
    fh1 = codecs.open(filename, mode, encoding, errors, buffering)
    if 'b' not in mode:
        fh2 = codecs.EncodedFile(fh1, 'yu_newlines')
        return fh2
    else:
        return fh1

____________________


### yu_newlines.py ###
____________________

#! /usr/bin/env python

'''
used by yCodecs.py to provide transparent conversion from os.linesep
to "\n" in Python.

Based on code by Danny Yoo (***@hkn.eecs.berkeley.edu)
'''

import codecs
import os

class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        linesep = os.linesep
        final = input.replace(linesep, "\n"), len(input)
        return final

    def decode(self, input, errors="strict"):
        linesep = os.linesep
        final = input.replace("\n", linesep), len(input)
        return final

class StreamWriter(Codec, codecs.StreamReader): pass
class StreamReader(Codec, codecs.StreamWriter): pass

def getregentry():
    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)
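
For readers on a current Python, a rough sketch of the same goal (decode an encoded text file and get "\n" line endings on every platform) using the io module, which did not exist when this thread was written. The helper name open_text and the temporary sample file are hypothetical, made up for illustration; this is not a drop-in replacement for yCodecs.open.

```python
import codecs
import io
import os
import tempfile

def open_text(filename, mode="r", encoding="utf-16"):
    """Hypothetical helper: open an encoded text file with newline translation."""
    # newline=None enables universal newlines: when reading, '\r\n' and '\r'
    # are translated to '\n' after the bytes have been decoded.
    return io.open(filename, mode, encoding=encoding, newline=None)

# Write a UTF-16 sample file byte-for-byte so the test does not depend on
# the platform's own line-ending conventions.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "wb") as fh:
    fh.write(codecs.BOM_UTF16_LE + u"\u51fa\r\n".encode("utf-16-le"))

with open_text(path) as fh:
    lines = fh.readlines()
print([ascii(line) for line in lines])
```

Because the translation happens on decoded text, it does not matter that the carriage return and newline are not adjacent in the file's bytes.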



Poor Yorick
***@pooryorick.com

Danny Yoo
2003-01-08 20:14:03 UTC
Post by Poor Yorick
Post by Danny Yoo
Ah, I see what you mean now. No, as far as I understand, Python
doesn't do this automatic conversion of newlines. However, Python
2.3's "Universal Newline" support is probably what you're looking for.
Thanks for the sample code! It'll take me a while to digest....
But about newlines, I thought that '\n' was already a sort of universal
newline for Python. On windows platforms, both open.read and
open.readlines already transform '\r\n' into '\n' unless you use binary
mode. That's why I thought it was a discrepancy for codecs.open to
return '\r\n'.
Ah! I completely forgot about that!


You're right: there's a platform-dependent automatic conversion of the
line-endings if we open a file in "text" mode, which is what we've been
doing in the past examples:

http://www.wdvl.com/Authoring/Languages/Python/Quick/python4_2.html


However, this "\r\n"->"\n" conversion won't take its expected effect
against UTF-16-encoded files because the character sequence isn't '\r\n'
in the file, but rather the four-byte sequence '\r\x00\n\x00' (as in the
little-endian dumps earlier in this thread).

That is, there's padding involved because each character now consists of
two bytes! So even when we open the file in text mode, Python file
operations don't catch and do platform-dependent conversions here.
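
A short sketch of that point, written for Python 3 with made-up sample text (both are assumptions): a byte-level search for '\r\n' finds nothing in UTF-16 data, while the same replacement applied after decoding works as expected.

```python
import codecs

# Two lines of UTF-16 (little-endian, with BOM).
raw = codecs.BOM_UTF16_LE + u"hello world\r\nThis is a test\r\n".encode("utf-16-le")

# In the encoded bytes, CR and LF are separated by a zero byte, so a
# byte-level replacement has nothing to match and returns the data unchanged.
untouched = raw.replace(b"\r\n", b"\n")
print(untouched == raw)

# Decoding first yields real '\r\n' pairs that can be normalized.
text = raw.decode("utf-16")
normalized = text.replace(u"\r\n", u"\n")
print(repr(normalized))
```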


Even the Universal Newlines support of Python 2.3 won't help us here,
since it works on the raw bytes too: normalization will pass over those
four bytes without touching them. Or even worse, it may convert what
looks like a lone "\r" into another newline, since that looks like a
Macintosh newline. So we may really need to open UTF-16 files in binary
mode after all to be careful.


Hmmm... perhaps this is a bug! Perhaps the utf-16 decoder should really
do the '\r\n' normalization if it's running on a Windows platform. I
haven't been able to Google other pages talking about this issue, so I
have no idea what other people think about it. Maybe you might want to
bring it up on the i18n-sig?

http://www.python.org/sigs/i18n-sig/

It would be an interesting thing to talk about. I'm sorry I can't give a
definitive answer on this one; I really don't know what the "right" thing
to do is in this case.


Good luck to you!