[Tutor] Problem using lxml

Discussion:

Anthony Papillion

2015-08-22 21:05:53 UTC

Hello Everyone,

I'm pretty new to lxml but I pretty much thought I'd understood the basics.
However, for some reason, my first attempt at using it is failing miserably.

Here's the deal:

I'm parsing a specific page on Craigslist (
http://joplin.craigslist.org/search/rea) and trying to retrieve the text of
each link on that page. When I do an "inspect element" in Firefox, a sample
anchor link looks like this:

<a href="/reb/5185592209.html" data-id="5185592209" class="hdrlnk">FIRST
OPEN HOUSE TOMORROW 2:00pm-4:00pm!!! (8-23-15)</a>

The code I'm using to try to get the link text is this:

from lxml import html
import requests

page = requests.get("http://joplin.craigslist.org/search/rea")
titles = tree.xpath('//a[@title="hdrlnk"]/text()')
print titles

The last line, where it supposedly will print the text of each anchor
returns [].

I can't seem to figure out what I'm doing wrong. lxml seems pretty
straightforward but I can't seem to get this down.

Can anyone make any suggestions?

Thanks!
Anthony
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Joel Goldstick

2015-08-22 21:14:43 UTC

Not an answer, but have you checked out Beautiful Soup? It is a great
html parsing tool, with a good tutorial:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

--
Joel Goldstick
http://joelgoldstick.com

Martin A. Brown

2015-08-22 21:20:23 UTC

Hi there Anthony,

Post by Anthony Papillion
I'm pretty new to lxml but I pretty much thought I'd understood
the basics. However, for some reason, my first attempt at using it
is failing miserably.
I'm parsing a specific page on Craigslist (
http://joplin.craigslist.org/search/rea) and trying to retrieve the text of
each link on that page. When I do an "inspect element" in Firefox, a sample
<a href="/reb/5185592209.html" data-id="5185592209" class="hdrlnk">FIRST
OPEN HOUSE TOMORROW 2:00pm-4:00pm!!! (8-23-15)</a>
from lxml import html
import requests
page = requests.get("http://joplin.craigslist.org/search/rea")

You are missing a step here: something that takes page.content, parses
it, and creates a variable called tree.
And, your xpath is incorrect. Play with this in the interactive
browser and you will be able to correct your xpath. I think you
will notice from the example anchor link above that the attribute of
the <a/> HTML elements you want to grab is "class", not "title".
Therefore:

titles = tree.xpath('//a[@class="hdrlnk"]/text()')

Is probably closer.
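
To make both points concrete, here is a minimal sketch that runs against an inline copy of the sample anchor instead of the live page. The SAMPLE string and the variable names by_title and by_class are illustrative only; in the real script the markup would come from page.content.

```python
from lxml import html

# Inline stand-in for the downloaded page (page.content in the real script).
SAMPLE = """
<html><body>
<a href="/reb/5185592209.html" data-id="5185592209" class="hdrlnk">FIRST
OPEN HOUSE TOMORROW 2:00pm-4:00pm!!! (8-23-15)</a>
</body></html>
"""

# The missing step: parse the markup into an element tree.
tree = html.fromstring(SAMPLE)

# The anchor has no "title" attribute, so this predicate matches nothing.
by_title = tree.xpath('//a[@title="hdrlnk"]/text()')

# Matching on the "class" attribute finds the link text.
by_class = tree.xpath('//a[@class="hdrlnk"]/text()')

print(by_title)
print(by_class)
```

Running the two queries side by side like this is a quick way to see which predicate actually matches before dropping it into the script.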

Post by Anthony Papillion
print titles
The last line, where it supposedly will print the text of each anchor
returns [].
I can't seem to figure out what I'm doing wrong. lxml seems pretty
straightforward but I can't seem to get this down.

Again, I'd recommend playing with the data in an interactive console
session. You will be able to figure out exactly which xpath gets
you the data you would like, and then you can drop it into your
script.

Good luck,

-Martin

--
Martin A. Brown
http://linux-ip.net/

Anthony Papillion

2015-08-22 23:16:34 UTC

Many thanks, Martin! I had indeed skipped creating the tree object and a
few other things you pointed out. Here is my finished simple code that
actually works:

from lxml import html
import requests

page = requests.get("http://joplin.craigslist.org/search/w4m")
tree = html.fromstring(page.text)
titles = tree.xpath('//a[@class="hdrlnk"]/text()')
try:
    for title in titles:
        print title
except:
    pass

Pretty simple. Thanks for the help!


Stefan Behnel

2015-08-23 08:10:53 UTC

Post by Anthony Papillion
from lxml import html
import requests
page = requests.get("http://joplin.craigslist.org/search/w4m")
tree = html.fromstring(page.text)

While requests has its merits, this can be simplified to

tree = html.parse("http://joplin.craigslist.org/search/w4m")

Post by Anthony Papillion
print title

This only works as long as the link tags only contain plain text, no other
tags, because "text()" selects individual text nodes in XPath. Also, using
@class="hdrlnk" will not match link tags that use class=" hdrlnk " or
class="abc hdrlnk other".
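
Both pitfalls can be seen in a small sketch on made-up markup. The SAMPLE string and variable names are hypothetical, and the token-matching XPath expression is a common XPath 1.0 idiom rather than something proposed in this thread; text_content() is the lxml.html method that joins all descendant text.

```python
from lxml import html

# Hypothetical markup: one plain link, one with extra classes and a nested tag.
SAMPLE = """
<html><body>
<a class="hdrlnk">Plain title</a>
<a class="abc hdrlnk other">Nice <b>bold</b> title</a>
</body></html>
"""
tree = html.fromstring(SAMPLE)

# Exact string comparison on @class: only the first link matches.
exact = tree.xpath('//a[@class="hdrlnk"]/text()')

# Token match: pad the normalised class list with spaces and look for " hdrlnk ".
token_links = tree.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " hdrlnk ")]'
)

# text() returns only the direct text nodes, so the nested <b> text is skipped.
pieces = token_links[1].xpath('text()')

# text_content() concatenates all text inside each link, nested tags included.
full = [link.text_content().strip() for link in token_links]

print(exact)
print(pieces)
print(full)
```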

If you want to be on the safe side, I'd use cssselect instead and then
serialise the complete text content of the link tag to a string, i.e.

from lxml.etree import tostring

for link_element in tree.cssselect("a.hdrlnk"):
    title = tostring(
        link_element,
        method="text", encoding="unicode", with_tail=False)
    print(title.strip())

Note that the "cssselect()" feature requires the external "cssselect"
package to be installed. "pip install cssselect" should handle that.

Post by Anthony Papillion
pass

Oh, and bare "except:" clauses are generally frowned upon because they can
easily hide bugs by also catching unexpected exceptions. Better be explicit
about the exception type(s) you want to catch.
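
As a rough sketch of what being explicit can look like: the thread does not say which error the bare "except:" was meant to swallow, so this example assumes, purely for illustration, that it was an encoding failure on a non-ASCII title. The helper name printable_title and the sample titles are hypothetical.

```python
def printable_title(title):
    """Return an ASCII-only version of title (hypothetical helper)."""
    try:
        title.encode("ascii")
        return title
    except UnicodeEncodeError:
        # The one failure we expect: non-ASCII characters in a listing title.
        # Anything else (for example, title being None) still raises loudly.
        return title.encode("ascii", "replace").decode("ascii")


titles = ["OPEN HOUSE TOMORROW", "Caf\u00e9 for sale"]
safe_titles = [printable_title(t) for t in titles]
print(safe_titles)
```

Because only UnicodeEncodeError is caught, an unrelated bug such as passing None in place of a string surfaces as an AttributeError instead of being silently ignored.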

Stefan
