[Tutor] Beautiful Soup

Discussion:

Crusier

2015-09-29 15:47:02 UTC

Hi

I have recently finished reading "Starting out with Python" and I
really want to do some web scraping. Please kindly advise where I can
get more information about BeautifulSoup. The documentation seems too
hard for me.

Furthermore, I have tried to scrape this site, but it seems that there
is an error (<http.client.HTTPResponse object at 0x02C09F90>). Please
advise what I should do to overcome this.

from bs4 import BeautifulSoup
import urllib.request

HKFile = urllib.request.urlopen("https://bochk.etnet.com.hk/content/bochkweb/tc/quote_transaction_daily_history.php?code=2388")
HKHtml = HKFile.read()
HKFile.close()

print(HKFile)

Thank you
Hank
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Joel Goldstick

2015-09-29 16:05:07 UTC

Many people find this package easier to use than Python's built-in
support for reading URLs:

http://docs.python-requests.org/en/latest/

--
Joel Goldstick
http://joelgoldstick.com

Laura Creighton

2015-09-29 18:48:49 UTC

<http.client.HTTPResponse object at 0x02C09F90> is not an error.

If you want to print the contents of the page, change print(HKFile)
to print(HKHtml.decode("some-encoding")), where some-encoding is the
encoding the website uses. These days utf-8 is the most likely.
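
To make the bytes-versus-text point concrete, here is a minimal sketch. The helper name decode_body and the sample bytes are made up for illustration; in the original script the bytes would be HKHtml. It assumes utf-8 when no encoding is known, which is a guess and not something the site guarantees.

```python
def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Turn the bytes returned by response.read() into a str.

    declared_charset is whatever encoding the site says it uses (or None);
    fallback is only an assumption for when nothing is declared.
    """
    encoding = declared_charset or fallback
    # errors="replace" substitutes a marker character for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")


# Stand-in for HKHtml: read() gives you bytes, not text.
page_bytes = "<p>報價</p>".encode("utf-8")
print(type(page_bytes).__name__)
print(decode_body(page_bytes))
```

If the page turns out to use another encoding (big5 is common for Traditional Chinese sites), pass it as declared_charset instead of relying on the fallback.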

If you want a tutorial on web scraping in general, rather than on
Beautiful Soup, try:
http://doc.scrapy.org/en/latest/intro/tutorial.html

It covers Scrapy, a set of useful web scraping tools.

The Scrapy wiki is also useful:
https://github.com/scrapy/scrapy/wiki

There are also many video tutorials available if you like that sort of
thing. Just search for "python scrapy tutorial".

Laura

Peter Otten

2015-09-29 16:51:20 UTC

Post by Crusier
I have recently finished reading "Starting out with Python" and I
really want to do some web scraping. Please kindly advise where I can
get more information about BeautifulSoup. The documentation seems too
hard for me.

If you tell us what you don't understand and what you want to achieve we may
be able to help you.

Post by Crusier
from bs4 import BeautifulSoup
import urllib.request
HKFile = urllib.request.urlopen("https://bochk.etnet.com.hk/content/bochkweb/tc/quote_transaction_daily_history.php?code=2388")
HKHtml = HKFile.read()
HKFile.close()
print(HKFile)
Furthermore, I have tried to scrape this site, but it seems that there
is an error (<http.client.HTTPResponse object at 0x02C09F90>).

That's not an error; that's the object urlopen() returns. When an error
occurs, Python libraries are usually explicit and raise an exception. If
your script does not handle the exception, Python prints a traceback by
default:

>>> import urllib.request
>>> urllib.request.urlopen("http://httpbin.org/status/404")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: NOT FOUND

That's what a well-behaved error looks like ;)
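
If you would rather deal with such an exception in your script than see the traceback, something along these lines should work. This is only a sketch: the function name try_fetch, the 10-second timeout, and the wording of the messages are my own choices, not anything prescribed by urllib.

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=10):
    """Return (True, page_bytes) on success or (False, message) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return False, "HTTP error {}: {}".format(err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable answer at all (unknown host, refused connection, ...).
        return False, "could not open the URL: {}".format(err.reason)


ok, detail = try_fetch("data:text/plain,hello")  # data: URLs need no network
print(ok, detail)
```

Calling try_fetch("http://httpbin.org/status/404") should take the HTTPError branch, assuming that service is reachable from your machine.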

Post by Crusier
Please advise what I should do in order to overcome this.

If you want to print the contents of the page, just replace the line

Post by Crusier
print(HKFile)

in your code with

print(HKHtml.decode("utf-8"))
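
Putting it together, the whole script might look like the sketch below. The function name fetch_page and the utf-8 fallback are assumptions for illustration; the encoding is taken from the Content-Type header when the server declares one, and I have not checked what this particular site sends.

```python
import urllib.request

URL = ("https://bochk.etnet.com.hk/content/bochkweb/tc/"
       "quote_transaction_daily_history.php?code=2388")


def fetch_page(url, fallback_encoding="utf-8"):
    """Download url and return its body as text (str), not bytes."""
    # The with-statement closes the response for us, like HKFile.close().
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Use the charset from the Content-Type header if there is one;
        # otherwise fall back to utf-8, which is only a guess.
        encoding = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(encoding)


# Uncomment to fetch and print the real page (needs network access):
# print(fetch_page(URL))
```

The returned string is also what you would hand to BeautifulSoup once you get that far.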

