Discussion:
[Tutor] weather scraping with Beautiful Soup
Che M
2009-07-17 03:21:14 UTC
Hi,

I am interested in gathering simple weather data using Beautiful Soup, but am having trouble understanding what I'm doing. I have searched the archives and so far haven't found enough to get me moving forward.

Basically I am trying to start off this example:

Grabbing Weather Underground Data with BeautifulSoup
http://flowingdata.com/2007/07/09/grabbing-weather-underground-data-with-beautifulsoup/

But I get to the exact same problem that this other person got to in this post:
http://groups.google.com/group/beautifulsoup/browse_thread/thread/13eb3dbf713b8a4a

Unfortunately, that post never gives enough help for me to understand how to solve that person's or my problem.

What I want to understand is how to find the bits of data you want--in this case, say, today's average temperature and whether it was clear or cloudy--within a web page, and then indicate that to Beautiful Soup.

Thanks,
Che

Alan Gauld
2009-07-17 06:45:55 UTC
"Che M" <***@hotmail.com> wrote

Grabbing Weather Underground Data with BeautifulSoup
http://flowingdata.com/2007/07/09/grabbing-weather-underground-data-with-beautifulsoup/

But I get to the exact same problem that this other person got to in this
post:
http://groups.google.com/group/beautifulsoup/browse_thread/thread/13eb3dbf713b8a4a
Post by Che M
Unfortunately, that post never gives enough help for me to
understand how to solve that person's or my problem.
The posts basically say to go and look at the HTML and find the
right tags for the data you need. This is fundamental to any kind of web
scraping: you need to understand the HTML tree well enough to identify
where your data exists.

How familiar are you with HTML and its structures?
Can you view the source in your browser and identify the hierarchy
of tags to the place where your data lives?
Post by Che M
What I want to understand is how to find the bits of data you
want--in this case, say, today's average temperature and whether
it was clear or cloudy--within a web page,
Go to the web page and use View Source from your browser.

Then it's down to you to interpret the HTML.
Some experimentation with print statements might be necessary
to check you have the right place.

HTH,
--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/
Che M
2009-07-17 08:01:29 UTC
Post by Alan Gauld
The posts basically say to go and look at the HTML and find the
right tags for the data you need. This is fundamental to any kind of web
scraping: you need to understand the HTML tree well enough to identify
where your data exists.
How familiar are you with HTML and its structures?
Reasonably familiar.
Post by Alan Gauld
Can you view the source in your browser and identify the hierarchy
of tags to the place where your data lives?
I can view the source, and have made my own web pages (HTML, CSS).
I am less sure about the hierarchy of tags. For example, here is the
section around the current temperature:
<div class="blueBox">
<div id="curcondbox">
<div class="subG b">West of Town, Jamestown, Pennsylvania (PWS)</div>
<div class="bm10">Updated: <span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="lu" value="1247814018">3:00 AM EDT on July 17, 2009</span></div>
<table cellspacing="0" cellpadding="0" class="full">
<tr>
<td class="vaT full">
<table cellspacing="0" cellpadding="5" class="full">
<tr>
<td class="vaM taC"><img src="http://icons-pe.wxug.com/i/c/a/nt_clear.gif" width="42" height="42" alt="Clear" class="condIcon" /></td>
<td class="vaM taC full">
<div style="font-size: 17px;"><span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="tempf" english="&deg;F" metric="&deg;C" value="60.3">
<span class="nobr"><span class="b">60.3</span>&nbsp;&#176;F</span>
</span></div>

The 60.3 is the value I want to extract. It appears to be down within a hierarchy
something like:

<body
<div class="blueBox">
<div id="curcondbox">
<table
<table
<div>
<span class="nobr">
<span class="b">


But I am far from sure I got all that right; it is not easy to
look at HTML and match <div> with </div>. Unless I am missing
something? Do I have to use all of the above in my Beautiful
Soup?
CM
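Matching <div> with </div> by eye is indeed tedious. As a rough sketch of how the chain of enclosing tags could be printed instead, the following uses only Python 3's standard html.parser module rather than Beautiful Soup. The class name PathFinder, the VOID_TAGS set, and the trimmed copy of the fragment are illustrative assumptions, not part of the original page or of any library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathFinder(HTMLParser):
    """Record the chain of open tags around each text node containing `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # one entry per matching text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        label = tag
        if attrs.get("id"):
            label += "#" + attrs["id"]
        if attrs.get("class"):
            label += "." + ".".join(attrs["class"].split())
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.target in data:
            self.paths.append(" > ".join(label for _, label in self.stack))


# Trimmed copy of the fragment quoted above.
fragment = """
<div class="blueBox">
<div id="curcondbox">
<table cellspacing="0" cellpadding="0" class="full">
<tr>
<td class="vaT full">
<table cellspacing="0" cellpadding="5" class="full">
<tr>
<td class="vaM taC"><img src="nt_clear.gif" alt="Clear" class="condIcon" /></td>
<td class="vaM taC full">
<div style="font-size: 17px;"><span class="pwsrt" pwsvariable="tempf" value="60.3">
<span class="nobr"><span class="b">60.3</span>&nbsp;&#176;F</span>
</span></div>
"""

finder = PathFinder("60.3")
finder.feed(fragment)
finder.close()
print(finder.paths[0])
```

The printed path starts at div.blueBox > div#curcondbox and ends at span.b. It also shows the tr, td and span.pwsrt levels that the hand-written hierarchy above leaves out. Not every level has to be named in the search: one distinctive ancestor plus the final tag is usually enough.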

Michiel Overtoom
2009-07-17 10:27:36 UTC
Post by Che M
The 60.3 is the value I want to extract.
soup.find("div",id="curcondbox").findNext("span","b").renderContents()
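That one-liner does two things in sequence: it finds the <div> whose id is curcondbox, then takes the next <span> with class b that follows it in document order. To make the two steps visible, here is a sketch of the same search written against Python 3's standard html.parser instead of Beautiful Soup. The class FindThenNext is a made-up name for illustration, and the fragment is a shortened copy of the HTML posted earlier.

```python
from html.parser import HTMLParser


class FindThenNext(HTMLParser):
    """Mimic find("div", id=...) followed by findNext("span", <class>)."""

    def __init__(self, anchor_id, css_class):
        super().__init__()
        self.anchor_id = anchor_id
        self.css_class = css_class
        self.anchor_seen = False   # step 1 done: the anchor <div> was found
        self.capturing = False     # step 2 in progress: just entered the wanted <span>
        self.result = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id") == self.anchor_id:
            self.anchor_seen = True
        elif (self.anchor_seen and self.result is None and tag == "span"
              and self.css_class in (attrs.get("class") or "").split()):
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.result = data
            self.capturing = False


fragment = """
<div class="blueBox">
<div id="curcondbox">
<div class="subG b">West of Town, Jamestown, Pennsylvania (PWS)</div>
<div style="font-size: 17px;"><span class="pwsrt" pwsvariable="tempf" value="60.3">
<span class="nobr"><span class="b">60.3</span>&nbsp;&#176;F</span>
</span></div>
</div>
</div>
"""

parser = FindThenNext("curcondbox", "b")
parser.feed(fragment)
parser.close()
print(parser.result)
```

The <div class="subG b"> line is skipped even though it carries class b, because the tag name has to match as well as the class.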
--
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html
Che M
2009-07-17 17:33:03 UTC
Post by Che M
The 60.3 is the value I want to extract.
soup.find("div",id="curcondbox").findNext("span","b").renderContents()
Thanks, but that isn't working for me. Here's my code:

---------
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.wunderground.com/cgi-bin/findweather/getForecast?query=43085"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

daytemp = soup.find("div",id="curcondbox").findNext("span","b").renderContents()

print "Today's temperature in Worthington is: ", daytemp
---------

When I try this, I get this error:

HTMLParseError: malformed start tag, at line 1516, column 60

Could someone suggest how to proceed?

Thanks.
CM



Michiel Overtoom
2009-07-17 17:40:17 UTC
Post by Che M
Thanks, but that isn't working for me.
That's because BeautifulSoup isn't able to parse that webpage, not
because the statement I posted doesn't work. I had BeautifulSoup parse
the HTML fragment you posted earlier instead of the live webpage.

This is actually the first time I see that BeautifulSoup is NOT able to
parse a webpage...

Greetings,
Sander Sweers
2009-07-17 18:04:00 UTC
Post by Michiel Overtoom
This is actually the first time I see that BeautifulSoup is NOT able to
parse a webpage...
Depends on which version is used. If 3.1 then it is much worse with
malformed html than prior releases. See [1] for more info.

Greets
Sander

[1] http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
Che M
2009-07-17 18:42:27 UTC
Post by Sander Sweers
Post by Michiel Overtoom
This is actually the first time I see that BeautifulSoup is NOT able to
parse a webpage...
Depends on which version is used. If 3.1 then it is much worse with
malformed html than prior releases. See [1] for more info.
Greets
Sander
[1] http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
Yes... I just read about the 3.1 problems, and switched to 3.0.7a, re-ran
the exact same code as before... and it worked perfectly this time.

(For those novices who are unsure of how to install a previous version,
I found that I only had to clear out the BS files I already had
under site-packages, and put the BeautifulSoup.py 3.0.7a file found
on the website into that site-packages folder... very easy).

Lots of useful information today, thank you to everyone.

Che

Che M
2009-07-17 21:27:33 UTC
OK, got the very basic case to work when using Beautiful Soup 3.0.7a and scraping the Weather Underground Lite page. That gives me the current temperature, etc. Good.

But now I would like the average temperature from, say, yesterday. I am again having trouble finding elements in the soup. This is the code:

----------------------
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KPAJAMES1&month=7&day=16&year=2009"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

table = soup.find("td",id="dataTable tm10")
print table

------------------------

When I look at the page source for that page, there is this section, which contains the "dataTable tm10" table I want to zoom in on:
------------
<table cellspacing="0" cellpadding="0" class="dataTable tm10">
<thead>
<tr>
<td style="width: 25%;">&nbsp;</td>
<td>Current:</td>
<td>High:</td>
<td>Low:</td>
<td>Average:</td>
</tr>
</thead>
<tbody>
<tr>
<td>Temperature:</td>
<td>
<span class="nobr"><span class="b">73.6</span>&nbsp;&#176;F</span>
</td>
<td>
<span class="nobr"><span class="b">83.3</span>&nbsp;&#176;F</span>
</td>
<td>
<span class="nobr"><span class="b">64.2</span>&nbsp;&#176;F</span>
</td>
<td>
<span class="nobr"><span class="b">74.1</span>&nbsp;&#176;F</span>
</td>
--------------
But the script prints None. In other words, it is not finding that table. Why?

Help appreciated again.
Che


_________________________________________________________________
Lauren found her dream laptop. Find the PC that’s right for you.
http://www.microsoft.com/windows/choosepc/?ocid=ftp_val_wl_290
Sander Sweers
2009-07-17 23:09:32 UTC
Post by Che M
table = soup.find("td",id="dataTable tm10")
Almost right. attrs should normally be a dict, so
{'class':'dataTable tm10'}, but you can use a shortcut; read on.
Post by Che M
------------------------
When I look at the page source for that page, there is this section, which
------------
<table cellspacing="0" cellpadding="0" class="dataTable tm10">
<thead>
<tr>
<td style="width: 25%;">&nbsp;</td>
<td>Current:</td>
<td>High:</td>
<td>Low:</td>
<td>Average:</td>
</tr>
</thead>
<tbody>
<tr>
<td>Temperature:</td>
<td>
<span class="nobr"><span class="b">73.6</span>&nbsp;&#176;F</span>
</td>
<td>
<span class="nobr"><span class="b">83.3</span>&nbsp;&#176;F</span>
</td>
<td>
<span class="nobr"><span class="b">64.2</span>&nbsp;&#176;F</span>
</td>
<td>
<span class="nobr"><span class="b">74.1</span>&nbsp;&#176;F</span>
</td>
--------------
The tag you are looking for is table not td. The tag td is inside the
table tag. So with shortcut it looks like,

table = soup.find("table","dataTable tm10")

or without shortcut,

table = soup.find("table",{'class':'dataTable tm10'})
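To see why the original call printed None, it may help to spell out the rule a search applies: the tag name and every requested attribute must agree with the element. The sketch below is a plain Python 3 illustration of that rule. The helper matches() is hypothetical and is not Beautiful Soup's own code.

```python
def matches(tag, attrs, want_tag, **want_attrs):
    """Return True only if the tag name and every requested attribute agree."""
    return tag == want_tag and all(
        attrs.get(key) == value for key, value in want_attrs.items()
    )


# The start tag from the page:
# <table cellspacing="0" cellpadding="0" class="dataTable tm10">
tag = "table"
attrs = {"cellspacing": "0", "cellpadding": "0", "class": "dataTable tm10"}

results = [
    matches(tag, attrs, "td", id="dataTable tm10"),                # wrong tag, wrong attribute
    matches(tag, attrs, "table", id="dataTable tm10"),             # right tag, but it has no id
    matches(tag, attrs, "table", **{"class": "dataTable tm10"}),   # tag and class both agree
]
print(results)
```

Because class is a reserved word in Python, it cannot be written as a keyword argument the way id can. That is why the class has to be passed as a dict or through the positional shortcut shown above.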

Greets
Sander
Che M
2009-07-18 00:32:50 UTC
Post by Che M
table = soup.find("td",id="dataTable tm10")
Almost right. attrs should normally be a dict, so
{'class':'dataTable tm10'}, but you can use a shortcut; read on.
The tag you are looking for is table not td. The tag td is inside the
table tag. So with shortcut it looks like,
table = soup.find("table","dataTable tm10")
or without shortcut,
table = soup.find("table",{'class':'dataTable tm10'})
Greets
Sander
Thank you. I was able to find the table in the soup this way.

After a surprising amount of tinkering (for some reason this Soup is more like
chowder than broth for me still), I was able to get my goal, that 74.1 above,
using this:

-------------------
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KPAJAMES1&month=7&day=16&year=2009"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

table = soup.find("table","dataTable tm10") #find the table
tbody = table.find("tbody") #find the table's body

alltd = tbody.findAll('td') #find all the td's
temp_full = alltd[4] #identify the 5th td (index 4), the one I want.
print 'temp_full = ', temp_full
temp = temp_full.findNext('span','b').renderContents() #into the span and b and render
print 'temp = ', temp
----------------------

Does this seem like the right (most efficient/readable) way to do this?

Thanks for your time.
CM
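One possible refinement, offered as a sketch rather than as the right answer: instead of hard-coding index 4, look up the column by its "Average:" header and the row by its "Temperature:" label, so the code keeps working if columns are reordered. The example below is Python 3 and uses the standard html.parser module in place of Beautiful Soup. The class TableRows is an illustrative name, and the page string is a compacted copy of the table quoted above.

```python
from html.parser import HTMLParser
import re


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # list of rows, each a list of cell strings
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


page = """
<table cellspacing="0" cellpadding="0" class="dataTable tm10">
<thead><tr>
<td style="width: 25%;">&nbsp;</td>
<td>Current:</td><td>High:</td><td>Low:</td><td>Average:</td>
</tr></thead>
<tbody><tr>
<td>Temperature:</td>
<td><span class="nobr"><span class="b">73.6</span>&nbsp;&#176;F</span></td>
<td><span class="nobr"><span class="b">83.3</span>&nbsp;&#176;F</span></td>
<td><span class="nobr"><span class="b">64.2</span>&nbsp;&#176;F</span></td>
<td><span class="nobr"><span class="b">74.1</span>&nbsp;&#176;F</span></td>
</tr></tbody>
</table>
"""

parser = TableRows()
parser.feed(page)
parser.close()

# Strip whitespace (including the non-breaking space) from every cell.
rows = [[cell.strip() for cell in row] for row in parser.rows]
column = rows[0].index("Average:")                     # position of the wanted column
temperature_row = next(r for r in rows if r and r[0] == "Temperature:")
average = float(re.search(r"-?\d+(?:\.\d+)?", temperature_row[column]).group())
print(average)
```

The same idea carries over to Beautiful Soup: read the header cells once, find the position of the label you want, and index the data row with that position instead of a literal number.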


Sander Sweers
2009-07-18 12:28:38 UTC
table = soup.find("table","dataTable tm10")  #find the table
tbody = table.find("tbody")                  #find the table's body
You can do this in one step.

tbody = soup.find('tbody')

Greets
Sander
Michiel Overtoom
2009-07-18 12:37:56 UTC
Post by Sander Sweers
Post by Che M
table = soup.find("table","dataTable tm10") #find the table
tbody = table.find("tbody") #find the table's body
You can do this in one step.
tbody = soup.find('tbody')
Yeah, but what if the document contains multiple tables, and you want
explicitly the one with the class 'dataTable tm10'?

In such a case it doesn't hurt to progressively narrow down on the
wanted element.

It also makes the scrape code more robust against page changes. Someone
could add an extra table to the start of the page to host a search box,
for example.
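The search-box scenario can be sketched in a few lines. The page below is invented for illustration. It uses Python 3's xml.etree.ElementTree, which only accepts well-formed markup, so it stands in for Beautiful Soup here purely to show the difference between the two searches.

```python
import xml.etree.ElementTree as ET

# A hypothetical page where a layout table for a search box precedes the data table.
page = """
<body>
  <table class="searchBox"><tbody><tr><td>Search</td></tr></tbody></table>
  <table class="dataTable tm10"><tbody><tr><td>Temperature:</td><td>74.1</td></tr></tbody></table>
</body>
"""
root = ET.fromstring(page)

# One step: the first <tbody> anywhere in the document.
first_tbody = root.find(".//tbody")

# Two steps folded into one path: the <tbody> of the table with the wanted class.
wanted_tbody = root.find(".//table[@class='dataTable tm10']/tbody")

first_cell = first_tbody.find("tr/td").text
wanted_cell = wanted_tbody.findall("tr/td")[1].text
print(first_cell)
print(wanted_cell)
```

The unqualified search lands in the search box, while the narrowed search still reaches the temperature.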

Stefan Behnel
2009-07-17 12:40:40 UTC
Post by Che M
<div class="blueBox">
<div id="curcondbox">
<div class="subG b">West of Town, Jamestown, Pennsylvania (PWS)</div>
<div class="bm10">Updated: <span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="lu" value="1247814018">3:00 AM EDT on July 17, 2009</span></div>
<table cellspacing="0" cellpadding="0" class="full">
<tr>
<td class="vaT full">
<table cellspacing="0" cellpadding="5" class="full">
<tr>
<td class="vaM taC"><img src="http://icons-pe.wxug.com/i/c/a/nt_clear.gif" width="42" height="42" alt="Clear" class="condIcon" /></td>
<td class="vaM taC full">
<div style="font-size: 17px;"><span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="tempf" english="&deg;F" metric="&deg;C" value="60.3">
<span class="nobr"><span class="b">60.3</span>&nbsp;&#176;F</span>
</span></div>
The 60.3 is the value I want to extract. It appears to be down within a hierarchy
<body
<div class="blueBox">
<div id="curcondbox">
<table
<table
<div>
<span class="nobr">
<span class="b">
You may consider using lxml's cssselect module:

from lxml import html
doc = html.parse("http://some/url/to/parse.html").getroot()
spans = doc.cssselect("div.blueBox > #curcondbox span.b")
print spans[0].text

However, I'd rather go for the other "60.3" value using XPath:

print doc.xpath('//span[@pwsvariable="tempf"]/@value')
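The attraction of the value attribute is that it is keyed by pwsvariable="tempf" rather than by how deeply the number happens to be nested. As a sketch of the same lookup without lxml, the following Python 3 code uses the standard html.parser module. The class AttrGrabber is a hypothetical helper, and the fragment is copied from the HTML quoted above.

```python
from html.parser import HTMLParser


class AttrGrabber(HTMLParser):
    """Collect one attribute from every tag whose key attribute has a given value."""

    def __init__(self, tag, key, expected, wanted):
        super().__init__()
        self.tag = tag
        self.key = key
        self.expected = expected
        self.wanted = wanted
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag and attrs.get(self.key) == self.expected:
            self.values.append(attrs.get(self.wanted))


fragment = """
<div class="bm10">Updated: <span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="lu" value="1247814018">3:00 AM EDT on July 17, 2009</span></div>
<div style="font-size: 17px;"><span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="tempf" english="&deg;F" metric="&deg;C" value="60.3">
<span class="nobr"><span class="b">60.3</span>&nbsp;&#176;F</span>
</span></div>
"""

grabber = AttrGrabber("span", "pwsvariable", "tempf", "value")
grabber.feed(fragment)
grabber.close()
temperature = float(grabber.values[0])
print(temperature)
```

Only the span tagged tempf is picked up. The "lu" span just above it has a value attribute too, but it is ignored because its pwsvariable differs.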

Stefan
Kent Johnson
2009-07-17 12:09:10 UTC
Post by Che M
Hi,
I am interested in gathering simple weather data using Beautiful Soup, but
am having trouble understanding what I'm doing.  I have searched the
archives and so far haven't found enough to get me moving forward.
Grabbing Weather Underground Data with BeautifulSoup
http://flowingdata.com/2007/07/09/grabbing-weather-underground-data-with-beautifulsoup/
http://groups.google.com/group/beautifulsoup/browse_thread/thread/13eb3dbf713b8a4a
Unfortunately, that post never gives enough help for me to understand how to
solve that person's or my problem.
What I want to understand is how to find the bits of data you want--in this
case, say, today's average temperature and whether it was clear or
cloudy--within a web page, and then indicate that to Beautiful Soup.
One thing that might help is to use the Lite page, if you are not
already. It has much less formatting and extraneous information to
wade through. You might also look for a site that has weather data
formatted for computer. For example the NOAA has forecast data
available as plain text:
http://forecast.weather.gov/product.php?site=NWS&issuedby=BOX&product=CCF&format=txt&version=1&glossary=0

Kent
Che M
2009-07-17 17:50:42 UTC
Post by Kent Johnson
One thing that might help is to use the Lite page, if you are not
already. It has much less formatting and extraneous information to
wade through.
I was not aware Weather Underground had a Lite page; thank you, that
is good to know. It was easier to figure things out in that HTML.

I am getting closer, but still a bit stuck. Here is my code for the Lite page:

------------
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.wund.com/cgi-bin/findweather/getForecast?query=Worthington%2C+OH"
page = urllib2.urlopen(url)

soup = BeautifulSoup(page)
daytemp = soup.find("div",id="main").findNext("h3").renderContents()

print "Today's temperature in Worthington is: ", daytemp
-------------

This works, but gives this output:
Today's temperature in Worthington is:
<span>75</span>&nbsp;&#176;F

Of course, I just want the 75, not the HTML tags, etc. around it. But I am not sure
how to indicate that in Beautiful Soup. So, for example, if I change the soup.find
line above to this (to incorporate the <span>):

daytemp = soup.find("div",id="main").findNext("h3", "span").renderContents()

then I get the following error:

AttributeError: 'NoneType' object has no attribute 'renderContents'

(I also don't understand the point of having a <span> tag with no style
content in the page.)

Any help is appreciated. This still feels kind of arcane, but I want to understand
the general approach to doing this, as later I want to try other weather facts
or screen scraping generally.

Thanks.
CM




Michiel Overtoom
2009-07-17 18:02:22 UTC
Post by Che M
"http://www.wund.com/cgi-bin/findweather/getForecast?query=Worthington%2C+OH"
Any help is appreciated.
That would be:

daytemp = soup.find("div",id="main").findNext("span").renderContents()
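The earlier attempt, findNext("h3", "span"), returned None because a plain string in the second position is read as a CSS class (the same shortcut as find("table", "dataTable tm10")), so it searched for an <h3> with class "span" instead of a <span> inside the <h3>. Separately, if the rendered string from the first attempt is all that is available, it can be reduced to a number with Python 3's standard library. This is a small sketch, and the variable names are illustrative.

```python
import html
import re

rendered = "<span>75</span>&nbsp;&#176;F"   # the output quoted earlier in the thread

text = re.sub(r"<[^>]+>", "", rendered)     # drop the tags, leaving the entities
text = html.unescape(text)                  # decode &nbsp; and &#176; into characters
temperature = int(re.match(r"-?\d+", text).group())
print(temperature)
```

Stripping tags with a regular expression is only reasonable on a tiny, already-isolated snippet like this one. For anything larger, selecting the <span> directly, as in the line above, is the sturdier choice.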
--
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html
Che M
2009-07-17 18:28:58 UTC
Post by Che M
"http://www.wund.com/cgi-bin/findweather/getForecast?query=Worthington%2C+OH"
Any help is appreciated.
daytemp = soup.find("div",id="main").findNext("span").renderContents()
Thank you, that works! I'll go try some more things and read more of
the documentation and if I bump into more confusions I may have more
questions.

Che
