Discussion:
[Tutor] How to parse large files
Mark Lawrence
2015-10-27 22:24:04 UTC
Hi!
I want to read two files and create a simple dictionary. The input file contains more than 10000 rows.
diz5 = {}
lines = i.rstrip("\n").split("\t")
diz5.setdefault(lines[0],set()).add(lines[1])
diz3 = {}
lines = i.rstrip("\n").split("\t")
diz3.setdefault(lines[0],set()).add(lines[1])
How can I manage this reading and writing better?
Thanks so much.
https://docs.python.org/3/library/csv.html
https://docs.python.org/3/library/csv.html#csv.DictReader
https://docs.python.org/3/library/csv.html#csv.DictWriter
https://docs.python.org/3/library/csv.html#csv.excel_tab
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Steven D'Aprano
2015-10-27 22:57:23 UTC
Hi!
I want to read two files and create a simple dictionary. The input file contains more than 10000 rows.
Since you are talking about a tab-delimited CSV file, you should use the
csv module:

https://pymotw.com/2/csv/

https://docs.python.org/2/library/csv.html

Unfortunately the version 2 csv module doesn't deal with non-ASCII
characters very well, but the Python 3 version does:

https://docs.python.org/3/library/csv.html
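
As a rough sketch of what the csv-module suggestion could look like in Python 3: the function name read_pairs is made up for illustration, and the two-column, tab-separated layout is assumed from the code in the original question.

```python
import csv
from collections import defaultdict


def read_pairs(path):
    """Map each first-column key to the set of its second-column values.

    Assumes a tab-delimited text file with exactly two fields per row.
    """
    table = defaultdict(set)
    # newline="" is what the csv docs recommend when opening files for csv.
    with open(path, newline="", encoding="utf-8") as instream:
        for row in csv.reader(instream, delimiter="\t"):
            if not row:
                continue  # skip completely empty lines
            if len(row) != 2:
                raise ValueError("expected 2 fields, got %d: %r" % (len(row), row))
            key, value = row
            table[key].add(value)
    return dict(table)
```

Calling it once per input file would give the two dictionaries from the question. Note that csv.reader also interprets double quotes by default, which may or may not match the data.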
--
Steve
Danny Yoo
2015-10-28 02:32:06 UTC
Hi!
I want to read two files and create a simple dictionary. The input file contains more than 10000 rows.
diz5 = {}
lines = i.rstrip("\n").split("\t")
diz5.setdefault(lines[0],set()).add(lines[1])
diz3 = {}
lines = i.rstrip("\n").split("\t")
diz3.setdefault(lines[0],set()).add(lines[1])
10000 rows today is not a lot of data, since typical computer memories
have grown quite a bit. I get the feeling your program should be able
to handle this all in-memory.

But let's assume, for the moment, that you do need to deal with a lot
of data, where you can't hold the whole thing in memory. Ideally,
you'd like to have access to its contents in a key/value store,
because that feels most like a Python dict. If that's the case, then
what you're looking for is an on-disk database.

There are several out there; one that comes standard in Python 3 is
the "dbm" module:

https://docs.python.org/3.5/library/dbm.html

Instead of doing:

diz5 = {}
...

we'd do something like this:

with diz5 = dbm.open('diz5, 'c'):
...

And otherwise, your code will look very similar! This dictionary-like
object will store its data on disk, rather than in-memory, so that it
can grow fairly large. The other nice thing is that you can do the
dbm creation up front. If you run your program again, you might add a
bit of logic to *reuse* the dbm that's already on disk, so that you
don't have to process your input files all over again.


Databases, too, have capacity limits, but you're unlikely to hit them
unless you're really doing something hefty. And that's out of scope
for ***@python.org. :P
Danny Yoo
2015-10-28 02:34:30 UTC
diz5 = {}
...
Sorry, I'm getting my syntax completely wrong here. My apologies.
This should be:

with dbm.open('diz5', 'c') as diz5:
...

Apologies: I just got back from work and my brain is mush.
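
For illustration, here is a minimal sketch of how the dict-of-sets idea might be carried over to dbm in Python 3. The helper names add_value and get_values are hypothetical, and joining the set with newlines is just one possible encoding, chosen here because dbm stores bytes rather than Python objects. It assumes values never contain a newline.

```python
import dbm


def get_values(db, key):
    """Return the set stored under key, or an empty set if key is absent."""
    raw = db.get(key.encode("utf-8"))
    if raw is None:
        return set()
    return set(raw.decode("utf-8").split("\n"))


def add_value(db, key, value):
    """Add value to the set stored under key in an open dbm database."""
    values = get_values(db, key)
    values.add(value)
    # dbm keeps bytes, so the set is flattened to a newline-joined string.
    db[key.encode("utf-8")] = "\n".join(sorted(values)).encode("utf-8")
```

Inside a block such as "with dbm.open('diz5', 'c') as diz5:" (which needs Python 3.4 or later for the with statement), each input line would then become a call like add_value(diz5, key, value).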
Peter Otten
2015-10-28 09:02:17 UTC
Post by Danny Yoo
There are several out there; one that comes standard in Python 3 is
https://docs.python.org/3.5/library/dbm.html
diz5 = {}
...
...
And otherwise, your code will look very similar! This dictionary-like
object will store its data on disk, rather than in-memory, so that it
can grow fairly large. The other nice thing is that you can do the
dbm creation up front. If you run your program again, you might add a
bit of logic to *reuse* the dbm that's already on disk, so that you
don't have to process your input files all over again.
dbm operates on byte strings for both keys and values, so there are a few
changes. Fortunately there's a wrapper around dbm called shelve that uses
string keys and allows objects that can be pickled as values:

https://docs.python.org/dev/library/shelve.html

With that your code may become

with shelve.open("diz5") as db:
    with open("tmp1.txt") as instream:
        for line in instream:
            assert line.count("\t") == 1
            key, _tab, value = line.rstrip("\n").partition("\t")
            values = db.get(key) or set()
            values.add(value)
            db[key] = values

Note that while shelve has a setdefault() method, it will only work as
expected when you set writeback=True, which in turn may require arbitrary
amounts of memory.
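
The setdefault() caveat can be seen in a small experiment. This is only a sketch for Python 3.4 or later; the function name stored_after_setdefault and the shelf name "demo" are invented for the demonstration.

```python
import os
import shelve
import tempfile


def stored_after_setdefault(writeback):
    """Return what the shelf holds after db.setdefault(k, set()).add(v)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo")
        with shelve.open(path, writeback=writeback) as db:
            # Without writeback, the empty set is pickled immediately and the
            # later add() only changes the in-memory object, not the shelf.
            db.setdefault("k", set()).add("v")
        with shelve.open(path) as db:
            return db["k"]
```

With writeback=False the stored set stays empty. With writeback=True the shelf keeps accessed entries in a cache and writes them back on close, which is where the memory cost comes from.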

Danny Yoo
2015-11-01 18:58:23 UTC
AttributeError Traceback (most recent call last)
<ipython-input-3-f1c2a78eeb9a> in <module>()
4 assert line.count("\t") == 1
5 key, _tab, value = line.rstrip("\n").partition("\t")
AttributeError: DbfilenameShelf instance has no attribute '__exit__'
The error that you're seeing is on this line:

with shelve.open("diz5") as db:

so we should focus on understanding why this line is failing.


The with statement in Python requires that the resource support
context management. Context managers must provide a few methods,
according to:

https://docs.python.org/3/reference/compound_stmts.html#the-with-statement

https://docs.python.org/3/reference/datamodel.html#context-managers


However, the error message reports that it can't find a method that
it's looking for, "__exit__". It looks like shelves don't have the
methods "__enter__" or "__exit__", which context managers must have.
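
To make the two required methods concrete, here is a minimal, made-up class (Resource is not part of any library) that satisfies the with statement. It should run on both Python 2.7 and Python 3.

```python
class Resource(object):
    """Toy object that can be used in a with statement."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        # The return value is what gets bound by "as name".
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        # Returning False lets any exception from the body propagate.
        return False
```

An object lacking __exit__ fails in the same way as the traceback above when used after "with".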

It looks like a deficiency in 'shelve', but one that we can work
around without too much difficulty. We can use the
contextlib.closing() function to adapt a thing that knows how to
close(), so that it works as a context manager.
https://docs.python.org/3/library/contextlib.html#contextlib.closing

It should be a matter of saying:

import contextlib
...
with contextlib.closing(shelve.open("diz5")) as db: ...
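
A slightly fuller sketch of the same workaround, written to run on Python 2.7 as well as Python 3. The function name store_and_load and the use of a temporary directory are assumptions made only to keep the example self-contained.

```python
import contextlib
import os
import shelve
import shutil
import tempfile


def store_and_load(key, value):
    """Write one entry to a temporary shelf and read it back."""
    tmpdir = tempfile.mkdtemp()
    try:
        path = os.path.join(tmpdir, "diz5")
        # closing() calls db.close() on exit, even if the body raises.
        with contextlib.closing(shelve.open(path)) as db:
            db[key] = value
            return db[key]
    finally:
        shutil.rmtree(tmpdir)
```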


If you have questions, please feel free to ask.
Peter Otten
2015-11-01 19:19:25 UTC
Post by Danny Yoo
AttributeError Traceback (most recent call last)
<ipython-input-3-f1c2a78eeb9a> in <module>()
4 assert line.count("\t") == 1
5 key, _tab, value = line.rstrip("\n").partition("\t")
AttributeError: DbfilenameShelf instance has no attribute '__exit__'
so we should focus our efforts to know why this line is failing.
The with statement in Python has a requirement, that the resource
supports context management. Context managers have to have a few
https://docs.python.org/3/reference/compound_stmts.html#the-with-statement
https://docs.python.org/3/reference/datamodel.html#context-managers
However, the error message reports that it can't find a method that
it's looking for, "__exit__". it looks like shelves don't have the
methods "__enter__" or "__exit__", which context managers must have.
It looks like a deficiency in 'shelve', but one that we can work
around without too much difficulty.
You need Python 3.4 or higher to use shelve.open() in a with statement. Quoting
<https://docs.python.org/dev/library/shelve.html#shelve.Shelf>:

"""
class shelve.Shelf(dict, protocol=None, writeback=False, keyencoding='utf-8')
...
Changed in version 3.4: Added context manager support.
"""
Post by Danny Yoo
We can use the
contextlib.closing() function to adapt a thing that knows how to
close(), so that it works as a context manager.
https://docs.python.org/3/library/contextlib.html#contextlib.closing
import contextlib
...
with contextlib.closing(shelve.open("diz5")) as db: ...
Yes, either use contextlib.closing() or close the db manually:

db = shelve.open("diz5")
try:
    ... # use db
finally:
    db.close()


Peter Otten
2015-11-01 23:17:19 UTC
Thanks!!
I use Python 2.7. Can I also use it in that version?
Yes.
I don't understand why you use partition() and not split(). What is the reason
for that?
If there is exactly one "\t" in the line

    key, value = line.split("\t")

and

    key, _tab, value = line.partition("\t")

are equivalent, so it's fine to use split(). I used partition() because the
method name indicates that I want two parts, but given the extra assert,

    parts = line.split("\t")
    if len(parts) != 2:
        raise ValueError
    key, value = parts

would probably have been the better choice because it behaves the same whether
asserts are enabled or disabled.
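
A small sketch of the explicit-check variant as a function, with the differing behaviour of partition() noted in comments. The name parse_line is hypothetical, and it works the same on Python 2.7 and Python 3.

```python
def parse_line(line):
    """Split a 'key<TAB>value' line, rejecting any other number of tabs."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        # Unlike assert, this check still runs under "python -O".
        raise ValueError("expected exactly one tab in %r" % (line,))
    key, value = parts
    return key, value


# partition() never raises; it splits at the first tab only:
#   "a\tb\tc".partition("\t")  gives ("a", "\t", "b\tc")
#   "abc".partition("\t")      gives ("abc", "", "")
```

So without the assert, partition() would silently accept malformed lines, while the unpacking after split() or the explicit length check reports them.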


Danny Yoo
2015-11-02 00:46:44 UTC
Thanks!!
I use Python 2.7. Can I also use it in that version?
Yes.
Hi Jarod,


Also for reference, here are the equivalent 2.7 doc links:

https://docs.python.org/2/reference/compound_stmts.html#the-with-statement

https://docs.python.org/2/reference/datamodel.html#context-managers

https://docs.python.org/2/library/contextlib.html#contextlib.closing


Good luck!

Alan Gauld
2015-11-01 23:56:25 UTC
Thanks!!
I use Python 2.7. Can I also use it in that version?
I don't understand why you use partition() and not split(). What is the reason
for that?
I'm confused? Where did partition v split enter into things?
Am I missing a message some place?
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


Peter Otten
2015-11-02 00:42:09 UTC
Post by Alan Gauld
Thanks!!
I use Python 2.7. Can I also use it in that version?
I don't understand why you use partition() and not split(). What is the reason
for that?
I'm confused? Where did partition v split enter into things?
Am I missing a message some place?
I used partition() instead of split() in my shelve example:

https://mail.python.org/pipermail/tutor/2015-October/107103.html

Sorry for that distraction ;)
