Post by Poor Yorick
Post by Danny Yoo
I have to admit I'm a bit confused; there shouldn't be any automatic
handling of newlines when we use read(), since read() sucks all the
text out of a file.
Can you explain more what you mean by automatic newline handling? Do
you mean a conversion of '\r\n' to '\n'?
As you mentioned, strip works correctly with the list items returned by
codec.readlines(), so my problem is entirely resolved. Yes, I meant
that codecs.readlines returns '\r\n' where a standard file object
Ah, I see what you mean now. No, as far as I understand, Python doesn't
do this automatic conversion of newlines. However, Python 2.3's
"Universal Newline" support is probably what you're looking for.
Till then, it's still possible to tie in a "\r\n" --> "\n" sort of scheme,
using the mechanism of 'codecs'. It's not documented too well, but Python
does support a scheme similar to the nested Readers that Java uses.
Here is a module I cooked up called 'u_newlines.py' that provides such a
codec for newline conversions:
###
"""A small demonstration of codec-writing. Transparently converts
"\r\n" into "\n"
Danny Yoo (***@hkn.eecs.berkeley.edu)
"""
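import codecs

class Codec(codecs.Codec):
    # Both directions do the same replacement and report all input as consumed.
    def encode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)
    def decode(self, input, errors="strict"):
        return input.replace("\r\n", "\n"), len(input)

class StreamWriter(Codec, codecs.StreamWriter): pass
class StreamReader(Codec, codecs.StreamReader): pass

def getregentry():
    return (Codec().encode, Codec().decode, StreamReader, StreamWriter)

def search(encoding):
    # codecs.register() expects a function that takes the encoding name
    # and returns None for names it doesn't handle.
    if encoding == "u_newlines":
        return getregentry()
    return None

codecs.register(search)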
###
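For anyone reading along on Python 3, here is a rough sketch of the same idea. It is not the module above: it assumes Python 3's codecs.CodecInfo, it works on bytes rather than str, and the codec name 'u_newlines_py3' and the helper names crlf_to_lf and search are made up for illustration.

```python
import codecs

def crlf_to_lf(data, errors="strict"):
    """Stateless bytes-to-bytes transform: returns (output, bytes consumed)."""
    return bytes(data).replace(b"\r\n", b"\n"), len(data)

def search(name):
    # Search functions receive the lower-cased codec name and must return
    # None for any name they do not recognise.
    if name == "u_newlines_py3":  # hypothetical codec name
        return codecs.CodecInfo(encode=crlf_to_lf, decode=crlf_to_lf,
                                name="u_newlines_py3")
    return None

codecs.register(search)

# codecs.decode() accepts arbitrary registered codecs, not just text encodings.
converted = codecs.decode(b"hello world\r\nThis is a test\r\n", "u_newlines_py3")
print(converted)
```

Since the transform keeps no state, a '\r\n' pair that happened to be split across two chunks of a stream would slip through; for whole-buffer calls like the one here that isn't a concern.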
Let's see this in action. Let's say that we had a sample file in UTF-16
format:
###
>>> sample_file = open("foo.txt", 'wb')
>>> sample_file.write("hello world\r\nThis is a test\r\n".encode("utf-16"))
>>> sample_file.close()
>>> open("foo.txt", 'rb').read()
'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00\r\x00\n\x00T\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00\r\x00\n\x00'
###
'foo.txt' now contains two lines of UTF-16 encoded text.
Notice that a normal read() on this file isn't working too well now, since
Python doesn't know that it's reading from a UTF-16 source. However, we
can fix this by introducing a codecs.EncodedFile() wrapper around the
object:
###
>>> import codecs
>>> import u_newlines
>>> f = codecs.EncodedFile(open("foo.txt", 'rb'), 'ascii', 'utf-16')
>>> f.read()
'hello world\r\nThis is a test\r\n'
###
This EncodedFile wraps around our original file: it decodes the UTF-16
bytes stored in the file and hands them back to us as plain ASCII text.
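The argument order is easy to get backwards, so here is a small self-contained sketch of it. It assumes Python 3, and an in-memory io.BytesIO stands in for foo.txt: data_encoding is what read() hands back, file_encoding is what is actually stored in the file.

```python
import codecs
import io

# Stand-in for foo.txt: the same two lines, stored as UTF-16 with a BOM.
utf16_bytes = "hello world\r\nThis is a test\r\n".encode("utf-16")

# read() decodes from file_encoding and re-encodes to data_encoding.
wrapped = codecs.EncodedFile(io.BytesIO(utf16_bytes),
                             data_encoding="ascii",
                             file_encoding="utf-16")
data = wrapped.read()
print(data)
```

Note that the '\r\n' pairs are still there afterwards; the wrapper only changes the character encoding.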
Here's the kicker: these EncodedFiles can be "composed": we can combine
the effects of several codecs, including that of u_newlines.py above, by
doing a double wrapping of our input file!
###
>>> f = codecs.EncodedFile(open("foo.txt", 'rb'), 'ascii', 'utf-16')
>>> f2 = codecs.EncodedFile(f, 'u_newlines')
>>> f2.read()
'hello world\nThis is a test\n'
###
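The same layering can be sketched in Python 3 without registering anything. This is only an illustration under a couple of assumptions: NewlineReader is a made-up wrapper class, not part of the standard library, and an io.BytesIO again stands in for the file on disk.

```python
import codecs
import io

class NewlineReader:
    """Hypothetical wrapper: turns '\\r\\n' into '\\n' in the inner reader's output."""
    def __init__(self, inner):
        self.inner = inner

    def read(self):
        # Read everything in one go, so a '\r\n' pair can never straddle two chunks.
        return self.inner.read().replace("\r\n", "\n")

payload = "hello world\r\nThis is a test\r\n".encode("utf-16")

# Layer 1 decodes UTF-16 bytes to text; layer 2 normalises the line endings.
layered = NewlineReader(codecs.getreader("utf-16")(io.BytesIO(payload)))
layered_text = layered.read()

# The standard library can also do both jobs in a single wrapper:
# newline=None switches on universal-newline translation while reading.
builtin_text = io.TextIOWrapper(io.BytesIO(payload),
                                encoding="utf-16", newline=None).read()

print(repr(layered_text))
print(repr(builtin_text))
```

Each layer only needs a read() method on the thing it wraps, which is what makes the stacking work.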
Muhahaha. *grin*
I hope this helps!