Discussion: [Tutor] memory error
Joshua Valdez
2015-06-30 15:10:37 UTC
So I wrote this script to go over a large wiki XML dump and pull out the
pages I want. However, every time I run it the kernel displays 'Killed'.
After reading around I'm assuming this is a memory issue, but I'm not sure
where the memory problem is in my script or whether there are any tricks to
reduce the virtual memory usage.
Here is my code (I assume the problem is in the BeautifulSoup portion, as
the pages file is pretty small):

from bs4 import BeautifulSoup
import sys

pages_file = open('pages_file.txt', 'r')

#Preprocessing
pages = pages_file.readlines()
pages = map(lambda s: s.strip(), pages)
page_titles = []
for item in pages:
    item = ''.join([i for i in item if not i.isdigit()])
    item = ''.join([i for i in item if ord(i) < 126 and ord(i) > 31])
    item = item.replace(" ", "")
    item = item.replace("_", " ")
    page_titles.append(item)

#####################################

with open(sys.argv[1], 'r') as wiki:
    soup = BeautifulSoup(wiki)
    wiki.closed

wiki_page = soup.find_all("page")
del soup
for item in wiki_page:
    title = item.title.get_text()
    if title in page_titles:
        print item
    del title





Danny Yoo
2015-06-30 23:27:27 UTC
Post by Joshua Valdez
So I wrote this script to go over a large wiki XML dump and pull out the
pages I want. However, every time I run it the kernel displays 'Killed'.
After reading around I'm assuming this is a memory issue, but I'm not sure
where the memory problem is in my script or whether there are any tricks to
reduce the virtual memory usage.
Yes. Unfortunately, this is a common problem with representing a
potentially large stream of data as a single XML document. The
straightforward approach of loading the XML by reading it all into memory
at once doesn't work when files get large.

We can work around this by using a parser that knows how to
progressively read chunks of the document in a streaming or "pulling"
approach. I don't think Beautiful Soup knows how to do this; however, if
you're working with XML, there are other libraries that work similarly to
Beautiful Soup but can operate in a streaming way.

There was a thread about this roughly a year ago with good
references, the "XML Parsing from XML" thread:

https://mail.python.org/pipermail/tutor/2014-May/101227.html

Stefan Behnel's contribution to that thread is probably the most
helpful in seeing example code:

https://mail.python.org/pipermail/tutor/2014-May/101270.html

I think you'll probably want to use xml.etree.cElementTree; I expect
the code for your situation will look something like (untested
though!):

###############################
from xml.etree.cElementTree import iterparse, tostring

## ... later in your code, something like this...

doc = iterparse(wiki)
for _, node in doc:
    if node.tag == "page":
        title = node.find("title").text
        if title in page_titles:
            print tostring(node)
        node.clear()
###############################


Also, don't use "del" unless you know what you're doing. It isn't
helping in this particular scenario, and it is cluttering up the code.


Let us know if this works out or if you're having difficulty, and I'm
sure folks would be happy to help out.


Good luck!
Danny Yoo
2015-07-01 01:20:38 UTC
Please use reply-all: I'm not in front of a keyboard at the moment.
Others on the mailing list should be able to help.
Hi Danny,
So I tried that code snippet you pointed me to and I'm not getting any
output. I tried playing around with the code, and when I tried

doc = etree.iterparse(wiki)
for _, node in doc:
    print node

I got output like this:
<Element '{http://www.mediawiki.org/xml/export-0.10/}sitename' at 0x100602410>
<Element '{http://www.mediawiki.org/xml/export-0.10/}dbname' at 0x1006024d0>
<Element '{http://www.mediawiki.org/xml/export-0.10/}base' at 0x100602590>
<Element '{http://www.mediawiki.org/xml/export-0.10/}generator' at 0x100602710>
<Element '{http://www.mediawiki.org/xml/export-0.10/}case' at 0x100602750>
<Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x1006027d0>
<Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x100602810>
<Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x100602850>
<Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x100602890>
<Element '{http://www.mediawiki.org/xml/export-0.10/}namespace' at 0x1006028d0>
so the .tag attribute is including the whole namespace string in front of
the tag name. Do you know why this is happening and how I can get around it?
Danny Yoo
2015-07-01 05:17:27 UTC
Hi Joshua,



The issue you're encountering sounds like an XML namespace issue.
Post by Joshua Valdez
So I tried that code snippet you pointed me to and I'm not getting any output.
This is probably because the tag names in the XML are being prefixed
with namespaces. That would make the original test on node.tag too
stingy: it wouldn't exactly match the string we want, because the
namespace prefix in front makes the strings mismatch.


Try relaxing the condition from:

if node.tag == "page": ...

to something like:

if node.tag.endswith("page"): ...


This isn't quite technically correct, but we want to confirm whether
namespaces are the issue that's preventing you from seeing those
pages.


If namespaces are the issue, then read:

http://effbot.org/zone/element-namespaces.htm
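
A slightly more robust variant, sketched here and untested, is to compare
only the local part of the tag name. The local_name helper below is just
illustrative (it isn't part of ElementTree), and the snippet assumes the
same wiki and page_titles variables as the earlier code:

###############################
from xml.etree.cElementTree import iterparse, tostring

def local_name(tag):
    # ElementTree spells namespaced tags as '{namespace-uri}tagname',
    # so splitting on the closing brace leaves just the bare tag name.
    return tag.rsplit('}', 1)[-1]

for _, node in iterparse(wiki):
    if local_name(node.tag) == "page":
        # Look for the <title> child without hard-coding the namespace URI.
        for child in node:
            if local_name(child.tag) == "title" and child.text in page_titles:
                print(tostring(node))
        node.clear()
###############################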
Joshua Valdez
2015-07-01 14:13:08 UTC
Hi Danny,

So I got my code working now, and it looks like this:

TAG = '{http://www.mediawiki.org/xml/export-0.10/}page'
doc = etree.iterparse(wiki)

for _, node in doc:
    if node.tag == TAG:
        title = node.find("{http://www.mediawiki.org/xml/export-0.10/}title").text
        if title in page_titles:
            print(etree.tostring(node))
        node.clear()
It's mostly giving me what I want. However, it is adding extra formatting
(I believe namespaces and attributes). I was wondering if there is a way to
strip these out when I print the node with tostring?

Here is an example of the last few lines of my output:

[[Category:Asteroids| ]]
[[Category:Spaceflight]]</ns0:text>
<ns0:sha1>h4rxxfq37qg30eqegyf4vfvkqn3r142</ns0:sha1>
</ns0:revision>
</ns0:page>






Danny Yoo
2015-07-02 17:15:44 UTC
Post by Joshua Valdez
It's mostly giving me what I want. However, it is adding extra formatting
(I believe namespaces and attributes). I was wondering if there is a way to
strip these out when I print the node with tostring?
I suspect that you'll want to do an explicit walk over the node.
Rather than use etree.tostring(), which indiscriminately walks the
entire tree, you'll probably want to write a function to walk over
selected portions of the tree structure. You can see:

https://docs.python.org/2/library/xml.etree.elementtree.html#tutorial

for an introduction to navigating portions of the tree, given a node.
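
As a rough, untested sketch of that idea: the dump_page function below is
only illustrative, and it assumes you want just the title and the revision
text. It matches on local tag names, so the ns0: prefixes never appear in
the output. You would call it in your loop in place of
print(etree.tostring(node)).

###############################
def dump_page(page_node):
    # Walk the whole <page> subtree and print only the parts we care about.
    for elem in page_node.iter():
        name = elem.tag.rsplit('}', 1)[-1]   # strip any '{namespace}' prefix
        if name == "title":
            print("Title: " + elem.text)
        elif name == "text":
            print(elem.text)
###############################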


As a more general response: you have significantly more information
about the problem than we do. At the moment, we don't have enough
context to help effectively; we need more information. Do you have a
sample *input* file that folks here can use to run your program on?
Providing sample input is important if you want
reproducibility. Reproducibility is important because then we'll be
on the same footing in terms of knowing what the problem's inputs are.
See: http://sscce.org/

As for the form of the desired output: can you say more precisely what
parts of the document you want? Rather than just say: "this doesn't
look the way I want it to", it may be more helpful to say: "here's
*exactly* what I'd like it to look like..." and show us the desired
text output.

That is: by expressing what you want as a collection of concrete
input/output examples, you gain the added benefit that once you have
revised your program, you can re-run it and see whether it produces what
you anticipate. In other words, you can use these concrete examples as a
"regression test suite". This is something software engineers do regularly
in their day-to-day work: pretend that their incomplete programs are
already working, write out explicitly what those programs should do, and
then work on the programs until they actually do it.


Good luck to you!
Danny Yoo
2015-07-02 17:38:53 UTC
Post by Joshua Valdez
Hi, so I figured out my problem with this code and it's working great, but
it's still taking a very long time to process... I was wondering if there
was a way to do this with just regular expressions instead of parsing the
text with lxml...
Be careful: there are assumptions here that may not be true.

To be clear: regular expressions are not magic. Just because
something uses regular expressions does not make it fast. Nor are
regular expressions appropriate for parsing tree-structured content.

For a humorous discussion of this, see:
http://blog.codinghorror.com/parsing-html-the-cthulhu-way/
Post by Joshua Valdez
The idea would be to identify a <page> tag and then move to the next line
of the file to see if there is a match between the title text and the pages
in my pages file.
This makes another assumption about the input that isn't necessarily
true. Just because you see tags and content on separate lines now
doesn't mean that this won't change in the future. XML tree structure
does not depend on newlines. Don't try parsing XML files
line-by-line.
Post by Joshua Valdez
wiki --> XML file
page_titles --> array of strings corresponding to titles

tag = r'(<page>)'
wiki = wiki.readlines()
page = re.search(tag, line)
......(I'm not sure what to do here)

Is it possible to look ahead in a loop to discover other lines and then
backtrack? I think this may be the solution, but again I'm not sure how I
would execute such a command structure...
You should probably abandon this line of thinking. From your
initial problem description, the approach you have now should be
computationally *linear* in the size of the input. So maybe your
program is fine. The fact that your program was exhausting your
computer's entire memory in your initial attempt suggests that your
input file is large. But how large? It is much more likely that your
program is slow simply because your input is honking huge.


To support this hypothesis, we need more knowledge about the input
size. How large is your input file? How large is your collection of
page_titles?
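
For example, something quick like this (untested; it just reuses sys.argv[1]
and the page_titles list from your original script) would report both:

###############################
import os
import sys

# Rough size checks: the dump named on the command line, and the number
# of titles loaded from pages_file.txt (page_titles comes from the script).
print("dump size in bytes: %d" % os.path.getsize(sys.argv[1]))
print("number of page titles: %d" % len(page_titles))
###############################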

Alan Gauld
2015-06-30 23:54:08 UTC
Post by Joshua Valdez
So I wrote this script to go over a large wiki XML dump and pull out the
pages I want. However, every time I run it the kernel displays 'Killed'. I'm
assuming this is a memory issue after reading around, but I'm not sure where
the memory problem is in my script
That's quite a big assumption.
How big is the wiki file? How much RAM do you have?
What do your system resource monitoring tools (e.g. top) say?
Post by Joshua Valdez
and if there were any tricks to reduce
the virtual memory usage.
Of course, but as always be sure what you are tweaking before you start.
Otherwise you can waste a lot of time doing nothing useful.
Post by Joshua Valdez
from bs4 import BeautifulSoup
import sys
pages_file = open('pages_file.txt', 'r')
....
Post by Joshua Valdez
#####################################
soup = BeautifulSoup(wiki)
wiki.closed
Is that really what you mean? Or should it be

wiki.close()?
Post by Joshua Valdez
wiki_page = soup.find_all("page")
del soup
title = item.title.get_text()
print item
del title
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

