Discussion:
[Tutor] Beautiful Soup
Crusier
2015-09-29 15:47:02 UTC
Hi

I have recently finished reading "Starting out with Python" and I
really want to do some web scraping. Please kindly advise where I can
get more information about BeautifulSoup. It seems that the documentation
is too hard for me.

Furthermore, I have tried to scrape this site, but it seems that there
is an error (<http.client.HTTPResponse object at 0x02C09F90>). Please
advise what I should do in order to overcome this.


from bs4 import BeautifulSoup
import urllib.request

HKFile = urllib.request.urlopen("https://bochk.etnet.com.hk/content/bochkweb/tc/quote_transaction_daily_history.php?code=2388")
HKHtml = HKFile.read()
HKFile.close()

print(HKFile)

Thank you
Hank
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Joel Goldstick
2015-09-29 16:05:07 UTC
Many people find this package easier to use than Python's built-in
support for reading URLs:

http://docs.python-requests.org/en/latest/
--
Joel Goldstick
http://joelgoldstick.com
Laura Creighton
2015-09-29 18:48:49 UTC
<http.client.HTTPResponse object at 0x02C09F90> is not an error.

If you want to print the contents of the page, change print(HKFile)
to print(HKHtml.decode("some-encoding")), where some-encoding is the
encoding the website uses. These days utf-8 is the most likely.
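
A minimal sketch of that decoding step, for illustration only. The helper name decode_body and the sample bytes are made up here and are not from the original post. The point is that read() hands back bytes, and decode() turns them into text:

```python
def decode_body(raw, charset=None, default="utf-8"):
    """Decode response bytes to str, using the declared charset if one is given."""
    # errors="replace" substitutes a marker for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return raw.decode(charset or default, errors="replace")


# Hypothetical stand-in for HKHtml: read() returns bytes, not str.
# The last three bytes before </td> are one Chinese character in UTF-8.
sample = b"<td>2388 \xe4\xb8\xad</td>"
page_text = decode_body(sample)

print(type(sample).__name__, "->", type(page_text).__name__)
```

With a real response, HKFile.headers.get_content_charset() may give the charset the server declares (it returns None when nothing is declared), and that value could be passed as the charset argument.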

If you want a tutorial on web scraping in general, rather than on
Beautiful Soup, try:
http://doc.scrapy.org/en/latest/intro/tutorial.html

which is about using Scrapy, a set of useful web scraping tools.

The Scrapy wiki is also useful:
https://github.com/scrapy/scrapy/wiki

There are also many video tutorials available if you like that sort of thing.
Just google for "python scrapy tutorial".

Laura

Peter Otten
2015-09-29 16:51:20 UTC
Post by Crusier
I have recently finished reading "Starting out with Python" and I
really want to do some web scraping. Please kindly advise where I can
get more information about BeautifulSoup. It seems that Documentation
is too hard for me.
If you tell us what you don't understand and what you want to achieve we may
be able to help you.
Post by Crusier
from bs4 import BeautifulSoup
import urllib.request
HKFile = urllib.request.urlopen("https://bochk.etnet.com.hk/content/bochkweb/tc/quote_transaction_daily_history.php?code=2388")
HKHtml = HKFile.read()
HKFile.close()
print(HKFile)
Furthermore, I have tried to scrape this site but it seems that there
is an error (<http.client.HTTPResponse object at 0x02C09F90>).
That's not an error, that's what urlopen() returns. If an error occurs,
Python libraries are usually explicit and throw an exception. If the
exception is not handled by your script, Python prints a traceback by default:
>>> import urllib.request
>>> urllib.request.urlopen("http://httpbin.org/status/404")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: NOT FOUND

That's what a well-behaved error looks like ;)
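
If you would rather handle such an exception yourself than let the traceback print, a try/except block is the usual approach. The following is only a sketch. The names fetch_text and always_404 and the opener parameter are illustrative inventions (the parameter exists only so the function can be tried without a network connection), and utf-8 is an assumed encoding:

```python
import io
import urllib.error
import urllib.request


def fetch_text(url, opener=urllib.request.urlopen, encoding="utf-8"):
    """Return the decoded page body, or None if the server reports an HTTP error."""
    try:
        with opener(url) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as err:
        # err.code is the numeric status, err.reason the accompanying text.
        print("Server returned an error:", err.code, err.reason)
        return None


# Offline demonstration: a stand-in opener that always fails with a 404,
# mimicking what urlopen() raised in the session above.
def always_404(url):
    raise urllib.error.HTTPError(url, 404, "NOT FOUND", {}, io.BytesIO(b""))


result = fetch_text("http://httpbin.org/status/404", opener=always_404)
```

Called without the opener argument, fetch_text(url) goes through the real urllib.request.urlopen().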
Post by Crusier
Please advise what I should do in order to overcome this.
If you want to print the contents of the page just replace the line
Post by Crusier
print(HKFile)
in your code with

print(HKHtml.decode("utf-8"))
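
Once the page is available as text, it can be handed to BeautifulSoup, which is what the original script imports but never uses. Below is a small sketch under stated assumptions: the HTML is a made-up placeholder table, not the real page (whose layout is not shown in this thread), and "html.parser" is the parser that ships with Python:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the decoded page text; the values are placeholders.
html_text = """
<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>day-1</td><td>1.00</td></tr>
  <tr><td>day-2</td><td>2.00</td></tr>
</table>
"""

soup = BeautifulSoup(html_text, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Collect the text of every <td> cell in this row.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped
        rows.append(cells)

print(rows)
```

For the real page you would pass HKHtml.decode("utf-8") instead of html_text and adjust the tag names to whatever the page actually contains.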



