[Tutor] unicode utf-16 and readlines

Post by Poor Yorick
On Windows 2000, Python 2.2.1 open.readlines seems to read lines

fh = open('0022data2.txt')
a = fh.readlines()
print a

['\xff\xfe\xfaQ\r\x00\n', '\x00']

Hi Poor Yorick,

You may want to use a "codec" to decode Unicode from a file. The 'codecs'
module is specifically designed for this:

http://www.python.org/doc/lib/module-codecs.html

For example:

###

import codecs
f = codecs.open('foo.txt', 'w', 'utf-16')
f.write("hello world")
f.close()
open('foo.txt').read()

'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'

f2 = codecs.open('foo.txt', 'r', 'utf-16')
f2.readlines()

[u'hello world']
###

Post by Poor Yorick
In this example, Python seems to have incorrectly parsed the \n\r
characters at the end of the line.
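
The two odd strings in the original output can be reproduced by rebuilding the bytes by hand. This is only a sketch, written for a current Python 3 interpreter rather than the 2.2 used in the thread, and the names `text` and `raw` are illustrative. In UTF-16 each of these characters takes two bytes, so the newline is stored as `\n\x00`; a byte-oriented readlines() splits right after the `\n` byte and leaves the trailing `\x00` behind as a separate "line".

```python
import codecs
import io

# The file content from the question: one CJK character plus a Windows line ending.
text = u"\u51fa\r\n"

# UTF-16 little-endian with a byte order mark, matching the bytes shown in the question.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# A byte-oriented readlines() only knows about the single byte b"\n".
byte_lines = io.BytesIO(raw).readlines()

# Decoding first and splitting afterwards keeps the line ending in one piece.
decoded_lines = raw.decode("utf-16").splitlines(True)

print(byte_lines)
print(ascii(decoded_lines))
```

So the line-splitting itself is not wrong; it is simply being applied to undecoded bytes.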

If you use 'codecs' and its open() function, you should be all set. I saw
a brief mention of it in a Unicode tutorial here:

http://www.reportlab.com/i18n/python_unicode_tutorial.html

I always wanted to know what 'codecs' did. Now I know. Cool. *grin*

Thanks for the question!

Poor Yorick

2003-01-07 07:01:02 UTC

Post by Danny Yoo
You may want to use a "codec" to decode Unicode from a file. The 'codecs'
module is specifically designed for this:

http://www.python.org/doc/lib/module-codecs.html

###

import codecs
f = codecs.open('foo.txt', 'w', 'utf-16')
f.write("hello world")
f.close()
open('foo.txt').read()

'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'

f2 = codecs.open('foo.txt', 'r', 'utf-16')
f2.readlines()

[u'hello world']
###

Thanks for the info! Using the codecs module is much better. It's
interesting to note, though, that when using the codecs module on a real
utf-16 text file, Python's automatic handling of new line characters

import codecs
fh = codecs.open('0022data2.txt', 'r', 'utf-16')
a = fh.read()
a

u'\u51fa\r\n'

print a

a = a.strip()
print a

Danny Yoo

2003-01-08 00:34:17 UTC

Post by Poor Yorick
Thanks for the info! Using the codecs module is much better. It's
interesting to note, though, that when using the codecs module on a real
utf-16 text file, Python's automatic handling of new line characters

import codecs
fh = codecs.open('0022data2.txt', 'r', 'utf-16')
a = fh.read()
a

u'\u51fa\r\n'

print a

a = a.strip()
print a

Hi Poor Yorick!

I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() sucks all the text
out of a file.

Can you explain more what you mean by automatic newline handling? Do you
mean a conversion of '\r\n' to '\n'?

u'\u51fa\r\n'

print a

a = a.strip()
print a

I think it's working. If we look at the actual representation of the
strings, both the carriage return and the newline are being removed:

###

a = u"\u51fa\r\n"
a

u'\u51fa\r\n'

a.strip()

u'\u51fa'
###

I can't actually use print or str() to look at the Unicode characters on
my console, since that '\u51fa' character isn't ASCII.
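
The console limitation can be reproduced without a file. A minimal sketch, assuming an ASCII-only output encoding (the names `stripped`, `ascii_ok` and `escaped` are illustrative); it is written so that it also runs on Python 3.

```python
a = u"\u51fa\r\n"

# strip() removes both the carriage return and the newline.
stripped = a.strip()

# Converting the character to ASCII, as an ASCII-only console would require, fails.
try:
    stripped.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit error handler produces a printable escape sequence instead.
escaped = stripped.encode("ascii", "backslashreplace")
print(ascii_ok, escaped)
```

Looking at the repr of the string, as in the session above, avoids the conversion altogether.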

Best of wishes to you!

Poor Yorick

2003-01-08 01:39:05 UTC

Post by Danny Yoo
I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() sucks all the text
out of a file.
Can you explain more what you mean by automatic newline handling? Do you
mean a conversion of '\r\n' to '\n'?

As you mentioned, strip works correctly with the list items returned by
the codecs readlines(), so my problem is entirely resolved. Yes, I meant
that the codecs readlines() returns '\r\n' where a standard file object
returns '\n':

import codecs
fh = codecs.open('0022data2.txt', 'r', 'utf-16')
a = fh.readlines()
a

[u'\u51fa\r\n']

fh = open('test1.txt', 'r')
a = fh.readlines()
a

['hello\n', 'goodbye\n', 'where\n', 'how\n', 'when']

Perhaps you could tell me whether this inconsistency has any implications
for the Python programmer.

Poor Yorick
***@pooryorick.com

Danny Yoo

2003-01-08 02:49:02 UTC

Post by Danny Yoo
I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() sucks all the
text out of a file.
Can you explain more what you mean by automatic newline handling? Do
you mean a conversion of '\r\n' to '\n'?

Ah, I see what you mean now. No, as far as I understand, Python doesn't
do this automatic conversion of newlines. However, Python 2.3's
"Universal Newline" support is probably what you're looking for.

Till then, it's still possible to tie in a "\r\n" --> "\n" sort of scheme,
using the mechanism of 'codecs'. It's not documented too well, but Python
does support a scheme similar to the nested Readers that Java uses.

Here is a module I cooked up called 'u_newlines.py' that provides such a
codec for newline conversions:

###
"""A small demonstration of codec-writing. Transparently converts
"\r\n" into "\n"

Danny Yoo (***@hkn.eecs.berkeley.edu)
"""

import codecs

class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

    def decode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

class StreamWriter(Codec, codecs.StreamWriter): pass
class StreamReader(Codec, codecs.StreamReader): pass

def getregentry():
    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)

codecs.register(getregentry)
###

Let's see this in action. Let's say that we had a sample file in UTF-16
format:

###

sample_file = open("foo.txt", 'w')
sample_file.write("hello world\r\nThis is a test\r\n".encode("utf-16"))
sample_file.close()
open("foo.txt").read()

'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00\r\x00\n\x00T\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00\r\x00\n\x00'
###

'foo.txt' now contains two lines of UTF-16 encoded text.

Notice that a normal read() on this file isn't working too well now, since
Python doesn't know that it's reading from a UTF-16 source. However, we
can fix this by introducing a codecs.EncodedFile() wrapper around the
object:

###

import codecs
import u_newlines
f = codecs.EncodedFile(open("foo.txt"), 'ascii', 'utf-16')
f.read()

'hello world\r\nThis is a test\r\n'
###

This EncodedFile wraps around our original file: the bytes on disk are
decoded as UTF-16, and read() hands them back to us re-encoded as ASCII.
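
The argument order of codecs.EncodedFile() is easy to get backwards: the second argument is the encoding the caller sees (`data_encoding`) and the third is the encoding of the bytes in the file (`file_encoding`). A small sketch for Python 3, using an in-memory io.BytesIO in place of foo.txt and 'utf-8' in place of 'ascii' on the caller's side; both substitutions are assumptions made only to keep it self-contained.

```python
import codecs
import io

text = u"hello world\r\nThis is a test\r\n"

# Stand-in for foo.txt: UTF-16 little-endian bytes with a byte order mark.
stored = io.BytesIO(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))

# data_encoding="utf-8" is what read() hands back;
# file_encoding="utf-16" is how the underlying bytes are interpreted.
f = codecs.EncodedFile(stored, "utf-8", "utf-16")
data = f.read()

print(repr(data))
```

Note that the line endings pass through unchanged; EncodedFile only recodes characters.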

Here's the kicker: these EncodedFiles can be "composed": we can combine
the effects of several codecs, including that of u_newlines.py above, by
doing a double wrapping of our input file!

###

f = codecs.EncodedFile(open("foo.txt"), 'ascii', 'utf-16')
f2 = codecs.EncodedFile(f, 'u_newlines')
f2.read()

'hello world\nThis is a test\n'
###

Muhahaha. *grin*

I hope this helps!
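
The layering idea in this post can also be sketched without a custom codec, by stacking a plain wrapper object on top of a decoding reader. The class name `NewlineReader` is hypothetical and the code targets Python 3; it is meant as an illustration of nested readers, not as a replacement for the module above.

```python
import codecs
import io


class NewlineReader(object):
    """Hypothetical wrapper: turns "\\r\\n" into "\\n" on whatever it wraps."""

    def __init__(self, stream):
        self.stream = stream

    def read(self):
        # Read everything from the inner layer, then normalize line endings.
        return self.stream.read().replace(u"\r\n", u"\n")

    def readlines(self):
        return self.read().splitlines(True)


raw = io.BytesIO(codecs.BOM_UTF16_LE
                 + u"hello world\r\nThis is a test\r\n".encode("utf-16-le"))

# Layer 1 decodes UTF-16 bytes to text; layer 2 normalizes line endings.
decoded = codecs.getreader("utf-16")(raw)
lines = NewlineReader(decoded).readlines()
print(lines)
```

The sketch reads the whole stream at once on purpose: a chunked read would have to handle a "\r\n" pair that straddles two chunks.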

Poor Yorick

2003-01-08 03:22:34 UTC

Post by Danny Yoo
Ah, I see what you mean now. No, as far as I understand, Python doesn't
do this automatic conversion of newlines. However, Python 2.3's
"Universal Newline" support is probably what you're looking for.

Thanks for the sample code! It'll take me a while to digest....

But about newlines, I thought that '\n' was already a sort of universal
newline for Python. On windows platforms, both open.read and
open.readlines already transform '\r\n' into '\n' unless you use binary
mode. That's why I thought it was a discrepancy for codecs.open to
return '\r\n'.
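
The difference comes from where the translation happens: codecs.open() always opens the underlying file in binary mode, so '\r\n' arrives untouched. In later Python versions (2.6 and up, including 3.x), io.open() accepts both an `encoding` and a `newline` argument, so decoding and newline translation can be requested together. A minimal sketch, using a temporary file and illustrative variable names:

```python
import io
import os
import tempfile

# Write a small UTF-16 file with Windows line endings, byte for byte.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(u"hello\r\ngoodbye\r\n".encode("utf-16"))

# newline=None (the default) enables universal newlines:
# "\r\n" and "\r" are translated to "\n" after decoding.
with io.open(path, "r", encoding="utf-16", newline=None) as fh:
    translated = fh.readlines()

# newline="" decodes but leaves the line endings as they are in the file.
with io.open(path, "r", encoding="utf-16", newline="") as fh:
    untouched = fh.readlines()

os.remove(path)
print(translated, untouched)
```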

Poor Yorick
***@pooryorick.com

Danny Yoo

2003-01-08 02:57:04 UTC

Post by Danny Yoo
###
"""A small demonstration of codec-writing. Transparently converts
"\r\n" into "\n"
"""
import codecs

class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

    def decode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

class StreamWriter(Codec, codecs.StreamWriter): pass
class StreamReader(Codec, codecs.StreamReader): pass

def getregentry():
    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)

codecs.register(getregentry)
###

Small amendment: that last part of u_newlines.py is wrong; I should not
have touched codecs.register. Sorry about that! The corrected code,
along with a small test case, follows below:

###
"""A small demonstration of codec-writing. Transparently converts
"\r\n" into "\n"

Danny Yoo (***@hkn.eecs.berkeley.edu)
"""

import codecs

class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

    def decode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

class StreamWriter(Codec, codecs.StreamWriter): pass
class StreamReader(Codec, codecs.StreamReader): pass

def getregentry():
    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)

def _test():
    import StringIO
    f = codecs.EncodedFile(StringIO.StringIO("Hello\r\nWorld\r\n"),
                           "u_newlines")
    assert f.readlines() == ["Hello\n", "World\n"]
    print "All done!"

if __name__ == '__main__':
    _test()
###

Hope this helps!

Poor Yorick

2003-01-21 20:12:08 UTC

Post by Danny Yoo
Small amendment: that last part of u_newlines.py is wrong; I should not
have touched codecs.register. Sorry about that! The corrected code,

Danny, I've modified your code a little to make it platform-independent,
and added my own "overridden" version of codecs.open. My goal is to use
yCodecs.open in all my Python development and to abandon
__builtin__.open completely. For my projects, it's essential to be able
to handle files of various text encodings with grace and in a
platform-independent manner. I have two questions about the following code:

Why is it not necessary to register the codec using codecs.register?
Isn't that essential to Python being able to find the codec?

I figured it out by trial and error, but I don't understand why, in my
class Codec, StreamWriter inherits from StreamReader and StreamReader
inherits from StreamWriter.
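
On the registration question, the documented way to make a codec findable is to pass codecs.register() a search function that maps a codec name to its entry. getregentry() is the hook that the standard encodings package calls on a codec module it has imported by name, which is presumably how 'u_newlines' was found here without an explicit register call. The sketch below shows the explicit route on a current Python 3 interpreter, where the entry is a codecs.CodecInfo rather than the 4-tuple used in this thread; the codec name `crlf_to_lf` and the function names are made up for illustration.

```python
import codecs


def crlf_to_lf(text, errors="strict"):
    # Codec functions return (converted object, length of input consumed).
    return text.replace("\r\n", "\n"), len(text)


def search(name):
    # Called by codecs.lookup() with a normalized codec name;
    # returning None means "not mine".
    if name == "crlf_to_lf":
        return codecs.CodecInfo(encode=crlf_to_lf, decode=crlf_to_lf,
                                name="crlf_to_lf")
    return None


codecs.register(search)

# codecs.decode() accepts arbitrary codecs, including text-to-text ones.
result = codecs.decode("Hello\r\nWorld\r\n", "crlf_to_lf")
print(repr(result))
```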

### yCodecs.py###
___________________

#! /usr/bin/env python

'''yCodecs.open provides a "wrapper" to codecs.open, adding transparent
newline conversion when files are opened in text mode. Note that
codecs.open actually opens all files in binary mode. '''

import codecs

import yu_newlines

def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    '''same as codecs.open with addition of automatic conversion of
    os.linesep to "\n" in Python
    '''
    fh1 = codecs.open(filename, mode, encoding, errors, buffering)
    if 'b' not in mode:
        fh2 = codecs.EncodedFile(fh1, 'yu_newlines')
        return fh2
    else:
        return fh1

____________________

### yu_newlines.py ###
____________________

#! /usr/bin/env python

'''
used by yCodecs.py to provide transparent conversion from os.linesep
to "\n" in Python.

Based on code by Danny Yoo (***@hkn.eecs.berkeley.edu)
'''

import codecs
import os

class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        linesep = os.linesep
        final = input.replace(linesep, "\n"), len(input)
        return final

    def decode(self, input, errors="strict"):
        linesep = os.linesep
        final = input.replace("\n", linesep), len(input)
        return final

class StreamWriter(Codec, codecs.StreamReader): pass
class StreamReader(Codec, codecs.StreamWriter): pass

def getregentry():
    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)

Poor Yorick
***@pooryorick.com

Danny Yoo

2003-01-08 20:14:03 UTC