Joshua Valdez
2015-06-21 20:04:39 UTC
I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).
I know that all the wiki pages I'm going through have a page links section.
However, when I try to iterate through them I get this error message:
Traceback (most recent call last):
  File "./wiki_parent.py", line 37, in <module>
    cleaned = pages.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
Why do I get this error?
The file I'm reading in the first part looks like this:
1 Category:Abrahamic_mythology
2 Category:Abstraction
3 Category:Academic_disciplines
4 Category:Activism
5 Category:Activists
6 Category:Actors
7 Category:Aerobics
8 Category:Aerospace_engineering
9 Category:Aesthetics
and it is stored in the port_ID dict like so:
{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3:
'Category:Academic_disciplines', 4: 'Category:Activism', 5:
'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8:
'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10:
'Category:Agnosticism', 11: 'Category:Agriculture'...}
The desired output is:
parent_num, page_ID, page_num
I realize the code is a little hackish but I'm just trying to get this
working:
#!/usr/bin/env python
import os, re, nltk
from bs4 import BeautifulSoup
from urllib import urlopen

url = "https://en.wikipedia.org/wiki/" + 'Category:Furniture'

rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'
reg = re.compile(r'[\w]+:[\w]+')
number = 1
port_ID = {}

for root, dirs, files in os.walk(rootdir):
    for file in files:
        if reg.match(file):
            port_ID[number] = file
            number += 1

test_file = open('test_file.csv', 'w')

for key, value in port_ID.iteritems():
    url = "https://en.wikipedia.org/wiki/" + str(value)
    raw = urlopen(url).read()
    soup = BeautifulSoup(raw)
    pages = soup.find("div", {"id": "mw-pages"})
    cleaned = pages.get_text()
    cleaned = cleaned.encode('utf-8')
    pages = cleaned.split('\n')
    pages = pages[4:-2]
    test = port_ID.items()[0]
    page_ID = 1
    for item in pages:
        test_file.write('%s %s %s\n' % (test[0], item, page_ID))
        page_ID += 1
    page_ID = 1
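For context on the failing line: BeautifulSoup's find() returns None when no element matches, so I suspect some category pages simply have no "mw-pages" div. A guard for that case might look like this (just a sketch, extract_pages is a made-up helper name; assumes BeautifulSoup 4):

```python
# Sketch of a None-guard for the lookup that raises the AttributeError.
# extract_pages is a hypothetical name; assumes BeautifulSoup 4.
from bs4 import BeautifulSoup


def extract_pages(soup):
    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:  # page has no "Pages in category" section
        return []
    # mirror the original slicing: drop the leading and trailing lines
    return pages.get_text().split('\n')[4:-2]
```

I'm not sure skipping such pages (rather than recording them as empty) is the right behaviour, though.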
Hi, I posted this on Stack Overflow and didn't really get any help, so I
thought I would ask here to see if someone could help! Thanks!
*Joshua Valdez*
*Computational Linguist : Cognitive Scientist*
(440)-231-0479
***@case.edu <***@uw.edu> | ***@uw.edu | ***@armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor