[Tutor] python and Beautiful soup question

Discussion:

Joshua Valdez

2015-06-21 20:04:39 UTC

I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.

What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).

I know that all the wiki pages I'm going through have a page links section.
However when I try to iterate through them I get this error message:

Traceback (most recent call last):
File "./wiki_parent.py", line 37, in <module>
cleaned = pages.get_text()AttributeError: 'NoneType' object has no
attribute 'get_text'

Why do I get this error?

The file I'm reading in the first part looks like this:

1 Category:Abrahamic_mythology2 Category:Abstraction3
Category:Academic_disciplines4 Category:Activism5 Category:Activists6
Category:Actors7 Category:Aerobics8 Category:Aerospace_engineering9
Category:Aesthetics

and it is stored in a the port_ID dict like so:

{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3:
'Category:Academic_disciplines', 4: 'Category:Activism', 5:
'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8:
'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10:
'Category:Agnosticism', 11: 'Category:Agriculture'...}

The desired output is:

parent_num, page_ID, page_num

I realize the code is a little hackish but I'm just trying to get this
working:

#!/usr/bin/env pythonimport os,re,nltkfrom bs4 import
BeautifulSoupfrom urllib import urlopen
url = "https://en.wikipedia.org/wiki/"+'Category:Furniture'

rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'

reg = re.compile('[\w]+:[\w]+')
number=1
port_ID = {}for root,dirs,files in os.walk(rootdir):
for file in files:
if reg.match(file):
port_ID[number]=file
number+=1

test_file = open('test_file.csv', 'w')
for key, value in port_ID.iteritems():

url = "https://en.wikipedia.org/wiki/"+str(value)
raw = urlopen(url).read()
soup=BeautifulSoup(raw)
pages = soup.find("div" , { "id" : "mw-pages" })
cleaned = pages.get_text()
cleaned = cleaned.encode('utf-8')
pages = cleaned.split('\n')
pages = pages[4:-2]
test = test = port_ID.items()[0]

page_ID = 1
for item in pages:
test_file.write('%s %s %s\n' % (test[0],item,page_ID))
page_ID+=1
page_ID = 1

Hi I posted this on stackoverflow and didn't really get any help so I
though I would ask here to see if someone could help! Thanks!

*Joshua Valdez*
*Computational Linguist : Cognitive Scientist
*

(440)-231-0479
***@case.edu <***@uw.edu> | ***@uw.edu | ***@armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Mark Lawrence

2015-06-21 22:55:34 UTC

Permalink

Post by Joshua Valdez
I'm having trouble making this script work to scrape information from a
series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and pull out
the page links on a wiki portal category (e.g.
https://en.wikipedia.org/wiki/Category:Electronic_design).
I know that all the wiki pages I'm going through have a page links section.
File "./wiki_parent.py", line 37, in <module>
cleaned = pages.get_text()AttributeError: 'NoneType' object has no
attribute 'get_text'

Presumably because this line

Post by Joshua Valdez
pages = soup.find("div" , { "id" : "mw-pages" })

doesn't find anything, pages is set to None and hence the attribute
error on the next line. I'm suspicious of { "id" : "mw-pages" } as it's
a Python dict comprehension with one entry of key "id" and value "mw-pages".
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Alex Kleider

2015-06-22 01:41:54 UTC

Permalink

Post by Mark Lawrence

Presumably because this line

Post by Joshua Valdez
pages = soup.find("div" , { "id" : "mw-pages" })

Why do you refer to { "id" : "mw-pages" } as a dict comprehension?
Is that what a simple dict declaration is?

_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Mark Lawrence

2015-06-22 13:39:07 UTC

Permalink

Post by Alex Kleider

Post by Mark Lawrence

Presumably because this line

Post by Joshua Valdez
pages = soup.find("div" , { "id" : "mw-pages" })

Why do you refer to { "id" : "mw-pages" } as a dict comprehension?
Is that what a simple dict declaration is?

No, I'm simply wrong, it's just a plain dict. Please don't ask, as I've
no idea how it got into my head :)
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Timo

2015-06-22 10:11:30 UTC

Permalink

Instead of scraping the webpage, I'd have a look at the API. This might
give much better and more reliable results than to rely on parsing HTML.

https://www.mediawiki.org/wiki/API:Main_page

You can try out the huge amount of different options (with small
descriptions) on the sandbox page:

https://en.wikipedia.org/wiki/Special:ApiSandbox

Timo

Post by Joshua Valdez
*Joshua Valdez*
*Computational Linguist : Cognitive Scientist
*
(440)-231-0479
<http://www.linkedin.com/in/valdezjoshua/>
_______________________________________________
https://mail.python.org/mailman/listinfo/tutor

_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Steven D'Aprano

2015-06-23 00:44:35 UTC

Permalink

Post by Timo

Instead of scraping the webpage, I'd have a look at the API. This might
give much better and more reliable results than to rely on parsing HTML.
https://www.mediawiki.org/wiki/API:Main_page

Seconded, thirded and fourthed!

Please don't scrape wikipedia. It is hard enough for them to deal with
bandwidth requirements and remain responsive for browsers without
badly-written bots trying to suck down pieces of the site. Use the API.

Not only is it the polite thing to do, but it protects you too:
Wikipedia is entitled to block your bot if they think it is not
following the rules.

Post by Timo
You can try out the huge amount of different options (with small
https://en.wikipedia.org/wiki/Special:ApiSandbox

--
Steve
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Continue reading on narkive:

Search results for '[Tutor] python and Beautiful soup question' (Questions and Answers)

replies

Woma Python or Hog Island Boa?

started 2011-01-22 21:25:33 UTC

reptiles

replies

Is a Colt Python a Sturdy Gun?

started 2012-06-19 15:25:23 UTC

hunting

replies

Early December weather is awful in Massachusetts compared to South Florida?