[Tutor] python and Beautiful soup question
Joshua Valdez
2015-06-21 20:04:39 UTC
I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.

What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).

I know that all the wiki pages I'm going through have a page links section.
However, when I try to iterate through them I get this error message:

Traceback (most recent call last):
  File "./wiki_parent.py", line 37, in <module>
    cleaned = pages.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

Why do I get this error?

The file I'm reading in the first part looks like this:

1 Category:Abrahamic_mythology
2 Category:Abstraction
3 Category:Academic_disciplines
4 Category:Activism
5 Category:Activists
6 Category:Actors
7 Category:Aerobics
8 Category:Aerospace_engineering
9 Category:Aesthetics

and it is stored in the port_ID dict like so:

{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3:
'Category:Academic_disciplines', 4: 'Category:Activism', 5:
'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8:
'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10:
'Category:Agnosticism', 11: 'Category:Agriculture'...}

The desired output is:

parent_num, page_ID, page_num

I realize the code is a little hackish, but I'm just trying to get this
working:

#!/usr/bin/env python
import os, re, nltk
from bs4 import BeautifulSoup
from urllib import urlopen

url = "https://en.wikipedia.org/wiki/" + 'Category:Furniture'

rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'

# Collect every file whose name looks like "Category:Something" and
# number them in walk order.
reg = re.compile(r'[\w]+:[\w]+')
number = 1
port_ID = {}
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if reg.match(file):
            port_ID[number] = file
            number += 1


test_file = open('test_file.csv', 'w')
for key, value in port_ID.iteritems():

    # Fetch the category page and pull out the "Pages in category" div.
    url = "https://en.wikipedia.org/wiki/" + str(value)
    raw = urlopen(url).read()
    soup = BeautifulSoup(raw)
    pages = soup.find("div" , { "id" : "mw-pages" })
    cleaned = pages.get_text()  # line 37: where the traceback points
    cleaned = cleaned.encode('utf-8')
    pages = cleaned.split('\n')
    pages = pages[4:-2]
    test = port_ID.items()[0]

    page_ID = 1
    for item in pages:
        test_file.write('%s %s %s\n' % (test[0], item, page_ID))
        page_ID += 1
    page_ID = 1


Hi, I posted this on Stack Overflow and didn't really get any help, so I
thought I would ask here to see if someone could help. Thanks!




Joshua Valdez
Computational Linguist : Cognitive Scientist

(440)-231-0479
***@case.edu | ***@uw.edu | ***@armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>
Mark Lawrence
2015-06-21 22:55:34 UTC
Post by Joshua Valdez
I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).
I know that all the wiki pages I'm going through have a page links section.
File "./wiki_parent.py", line 37, in <module>
cleaned = pages.get_text()AttributeError: 'NoneType' object has no
attribute 'get_text'
Presumably because this line
Post by Joshua Valdez
pages = soup.find("div" , { "id" : "mw-pages" })
doesn't find anything, pages is set to None and hence the attribute
error on the next line. I'm suspicious of { "id" : "mw-pages" } as it's
a Python dict comprehension with one entry of key "id" and value "mw-pages".
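Either way, an untested guard along these lines (names taken from your
script) would skip any page where find() comes up empty instead of
crashing:

pages = soup.find("div", {"id": "mw-pages"})
if pages is None:
    # Nothing matched, e.g. the page has no "mw-pages" div at all;
    # skip it rather than call get_text() on None.
    continue
cleaned = pages.get_text()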
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Alex Kleider
2015-06-22 01:41:54 UTC
Post by Mark Lawrence
Post by Joshua Valdez
I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).
I know that all the wiki pages I'm going through have a page links section.
File "./wiki_parent.py", line 37, in <module>
cleaned = pages.get_text()AttributeError: 'NoneType' object has no
attribute 'get_text'
Presumably because this line
Post by Joshua Valdez
pages = soup.find("div" , { "id" : "mw-pages" })
doesn't find anything, pages is set to None and hence the attribute
error on the next line. I'm suspicious of { "id" : "mw-pages" } as
it's a Python dict comprehension with one entry of key "id" and value
"mw-pages".
Why do you refer to { "id" : "mw-pages" } as a dict comprehension?
Isn't that just a simple dict declaration?

Mark Lawrence
2015-06-22 13:39:07 UTC
Post by Alex Kleider
Post by Mark Lawrence
Post by Joshua Valdez
I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).
I know that all the wiki pages I'm going through have a page links section.
File "./wiki_parent.py", line 37, in <module>
cleaned = pages.get_text()AttributeError: 'NoneType' object has no
attribute 'get_text'
Presumably because this line
Post by Joshua Valdez
pages = soup.find("div" , { "id" : "mw-pages" })
doesn't find anything, pages is set to None and hence the attribute
error on the next line. I'm suspicious of { "id" : "mw-pages" } as
it's a Python dict comprehension with one entry of key "id" and value
"mw-pages".
Why do you refer to { "id" : "mw-pages" } as a dict comprehension?
Isn't that just a simple dict declaration?
No, I'm simply wrong, it's just a plain dict. Please don't ask, as I've
no idea how it got into my head :)
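For the record, the two look like this side by side:

# a plain dict literal, which is what the BeautifulSoup call uses:
attrs = {"id": "mw-pages"}

# a dict comprehension, which builds a dict from an iterable:
squares = {n: n * n for n in range(5)}
# -> {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}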
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Timo
2015-06-22 10:11:30 UTC
Post by Joshua Valdez
I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).
Instead of scraping the webpage, I'd have a look at the API. This might
give much better and more reliable results than relying on parsing HTML.

https://www.mediawiki.org/wiki/API:Main_page

You can try out the huge number of different options (each with a short
description) on the sandbox page:

https://en.wikipedia.org/wiki/Special:ApiSandbox
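As a rough, untested sketch (Python 2, to match the original script),
pulling one category's members through the API could look like this; the
categorymembers query and its cmtitle/cmlimit parameters are covered by
the documentation linked above:

import json
import urllib

def category_members(category):
    """Yield the page titles in one Wikipedia category via the API."""
    base = 'https://en.wikipedia.org/w/api.php'
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmlimit': '500',
        'format': 'json',
        'continue': '',
    }
    while True:
        raw = urllib.urlopen(base + '?' + urllib.urlencode(params)).read()
        data = json.loads(raw)
        for member in data['query']['categorymembers']:
            yield member['title']
        if 'continue' in data:
            # Carry the continuation token into the next request.
            params.update(data['continue'])
        else:
            break

for title in category_members('Category:Furniture'):
    print title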

Timo
Steven D'Aprano
2015-06-23 00:44:35 UTC
Post by Timo
Post by Joshua Valdez
I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).
Instead of scraping the webpage, I'd have a look at the API. This might
give much better and more reliable results than relying on parsing HTML.
https://www.mediawiki.org/wiki/API:Main_page
Seconded, thirded and fourthed!

Please don't scrape wikipedia. It is hard enough for them to deal with
bandwidth requirements and remain responsive for browsers without
badly-written bots trying to suck down pieces of the site. Use the API.

Not only is it the polite thing to do, but it protects you too:
Wikipedia is entitled to block your bot if they think it is not
following the rules.
Post by Timo
You can try out the huge number of different options (each with a short
description) on the sandbox page:
https://en.wikipedia.org/wiki/Special:ApiSandbox
--
Steve