Joshua Valdez
2015-06-30 15:10:37 UTC
I wrote this script to go over a large wiki XML dump and pull out the
pages I want. However, every time I run it, the process is terminated and
'Killed' is displayed. After reading around, I'm assuming this is a memory
issue, but I'm not sure where the memory problem is in my script, or
whether there are any tricks to reduce the virtual memory usage.

Here is my code (I assume the problem is in the BeautifulSoup portion, as
the pages file is pretty small):

from bs4 import BeautifulSoup
import sys

pages_file = open('pages_file.txt', 'r')

#Preprocessing
pages = pages_file.readlines()
pages = map(lambda s: s.strip(), pages)
page_titles = []
for item in pages:
    item = ''.join([i for i in item if not i.isdigit()])
    item = ''.join([i for i in item if ord(i) < 126 and ord(i) > 31])
    item = item.replace(" ", "")
    item = item.replace("_", " ")
    page_titles.append(item)

#####################################

with open(sys.argv[1], 'r') as wiki:
    soup = BeautifulSoup(wiki)
wiki.closed

wiki_page = soup.find_all("page")
del soup
for item in wiki_page:
    title = item.title.get_text()
    if title in page_titles:
        print item
    del title

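
For comparison, here is a minimal sketch of a streaming approach. It assumes the memory is being consumed by BeautifulSoup building a tree for the entire dump, which the post suspects but does not confirm. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree.iterparse, which hands back each element as soon as it has been read, so a <page> can be checked and then discarded. The names iter_matching_pages and local_name, and the tiny in-memory sample, are hypothetical and exist only for illustration. The sample also assumes the dump's elements may carry an XML namespace, which local_name strips.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit('}', 1)[-1]


def iter_matching_pages(source, wanted_titles):
    """Yield (title, serialized_page) for each <page> whose <title> is wanted.

    `source` is a filename or a binary file object (hypothetical helper).
    """
    wanted = set(wanted_titles)  # set lookup instead of scanning a list
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != 'end' or local_name(elem.tag) != 'page':
            continue
        title = None
        for child in elem:
            if local_name(child.tag) == 'title':
                title = child.text
                break
        if title in wanted:
            yield title, ET.tostring(elem)
        # Drop the pages handled so far so the tree does not keep growing.
        root.clear()


# Tiny in-memory stand-in for a dump file (made-up data, illustration only).
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Alpha</title><text>a</text></page>
<page><title>Beta</title><text>b</text></page>
<page><title>Gamma</title><text>c</text></page>
</mediawiki>"""

found = list(iter_matching_pages(io.BytesIO(sample), ['Alpha', 'Gamma']))
```

With a real dump, the same generator could be driven with something like iter_matching_pages(sys.argv[1], page_titles), printing each serialized page as it is produced rather than collecting them all in a list.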
*Joshua Valdez*
*Computational Linguist : Cognitive Scientist*
(440)-231-0479
***@case.edu <***@uw.edu> | ***@uw.edu | ***@armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor