[Tutor] Another regular expression question

Discussion:

Bernard Lebel

2005-09-13 21:57:33 UTC

Hello, yet another regular expression question :-)

So I have this xml file that I'm trying to find a specific tag in. For
this I'm using a regular expression. Right now, the tag I'm trying to
find looks like this:

<sceneobject name="Camera_Root_bernard" type="CameraRoot">

So I'm using a regular expression to find:
sceneobject
type="CameraRoot"

My code looks like this:

import os, re

def searchTag( sPattern, sFile ):

    """
    Scans an XML file to try to find a line that matches the search criteria.

    ARGUMENTS:
    sPattern (string): regular expression pattern string
    sFile (string): full file path to scan

    RETURN VALUE: text line (string) or None
    """

    oRe = re.compile( sPattern )

    if os.path.exists( sFile ) == False: return None
    else:
        oFile = file( sFile, 'r' )

        for sLine in oFile.xreadlines(): # read text
            oMatch = oRe.search( sLine ) # attempt a search
            if oMatch != None: # check if search returned success
                oFile.close()
                return sLine

        # Scan has yielded no result, return None
        oFile.close()
        return None

sLine = searchTag( r'(sceneobject)(type="CameraRoot")', sFile )

The thing is that I suspect my regular expression pattern to be
incorrect because I always get None, but am at a loss here. Any advice
would be welcomed.

Thanks
Bernard

Kent Johnson

2005-09-13 22:18:51 UTC

Post by Bernard Lebel
Hello, yet another regular expression question :-)
So I have this xml file that I'm trying to find a specific tag in. For
this I'm using a regular expression. Right now, the tag I'm trying to
<sceneobject name="Camera_Root_bernard" type="CameraRoot">
sceneobject
type="CameraRoot"
import os, re
"""
Scans a xml file to try to find a line that matches search criterias.
sPattern (string): regular expression pattern string
sFile (string): full file path to scan
RETURN VALUE: text line (string) or None
"""
oRe = re.compile( sPattern )
if os.path.exists( sFile ) == False: return None

No need to compare to False, you can just say
if not os.path.exists( sFile ): return None

Post by Bernard Lebel
oFile = file( sFile, 'r' )
for sLine in oFile.xreadlines(): # read text

for sLine in oFile:
is more idiomatic; xreadlines() is deprecated, and iterating over the file object likewise reads one line at a time rather than the whole file at once.
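
For readers on Python 3, where file() and xreadlines() no longer exist, the two suggestions above might be combined roughly as follows. This is only a sketch: the name search_tag and the use of a with block are assumptions of the example, not part of the original function.

```python
import os
import re


def search_tag(pattern, path):
    """Return the first line of the file at `path` that matches `pattern`, or None."""
    if not os.path.exists(path):
        return None
    regex = re.compile(pattern)
    # Iterating over the file object yields one line at a time, and the
    # with block closes the file on every exit path.
    with open(path, "r") as handle:
        for line in handle:
            if regex.search(line):
                return line
    return None
```

The with block replaces the two explicit oFile.close() calls, so the file is closed whether or not a match is found.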

Post by Bernard Lebel
oMatch = oRe.search( sLine ) # attempt a search
if oMatch != None: # check if search returned success
oFile.close()
return sLine
# Scan has yield no result, return None
oFile.close()
return None
sLine = searchTag( r'(sceneobject)(type="CameraRoot")', sFile )
The thing is that I suspect my regular expression pattern to be
incorrect because I always get None, but am at a loss here. Any advice
would be welcomed.

You need something in the regex to match the part between 'sceneobject' and 'type="CameraRoot"'. The regex you are using expects them to be adjacent. Try
sLine = searchTag( r'(sceneobject).*?(type="CameraRoot")', sFile )

which means, match anything between the two strings, but the smallest amount possible (non-greedy).

It's also possible that the tag you are looking for spans multiple lines. In this case you should look at an XML parsing library.
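
The difference between the two patterns, and the multi-line caveat, can be checked with a short Python 3 sketch. The line is the tag quoted in the question; the split_tag list is a made-up example of the same tag broken across two lines.

```python
import re

line = '<sceneobject name="Camera_Root_bernard" type="CameraRoot">'

# Original pattern: the two groups must be directly adjacent, so it never matches.
adjacent = re.compile(r'(sceneobject)(type="CameraRoot")')
# Suggested pattern: .*? consumes as little as possible between the two groups.
spaced = re.compile(r'(sceneobject).*?(type="CameraRoot")')

print(adjacent.search(line))         # None
print(spaced.search(line).groups())  # ('sceneobject', 'type="CameraRoot"')

# A tag split over two lines defeats a line-by-line scan, because neither
# line on its own contains both pieces.
split_tag = ['<sceneobject name="Camera_Root_bernard"\n', '    type="CameraRoot">\n']
print([bool(spaced.search(part)) for part in split_tag])  # [False, False]
```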

Kent

Alan G

2005-09-14 07:43:58 UTC

Hi Bernard,

Post by Bernard Lebel
Hello, yet another regular expression question :-)
So I have this xml file that I'm trying to find a
specific tag in.

I'm always suspicious when I see regular expression
and xml/html in the same context. regex are not good
for parsing xml/html files and it's usually much easier
to use a proper parser - such as beautiful soup.

http://www.crummy.com/software/BeautifulSoup/

Is there any special reason why you are using a regex
sledgehammer to crack this particular nut? Or is it
just to gain experience using regex?

Alan G.

Bernard Lebel

2005-09-14 13:08:36 UTC

Thanks Alan,

I'll check BeautifulSoup asap.

I'm using regex simply because I have no clue where to start to parse
XML. I have read about the various XML tools available in the Python
library, however I'm at a complete loss as to what to make of them. Many
of them seem to use some programming standards, which I am completely
unfamiliar with (this is the first time that I have dug into XML writing
and parsing).

I don't know where to start to learn about all these standards, and as
usual with new programming things, the documentation is hard to
swallow (it usually is written more as a reference than a proper user
guide/tutorial). I have to admit this is very frustrating, so if I'm
looking at things from a wrong perspective please advise me, I need
it.

So right now I'm just taking a shortcut and using ultra-simple
re-based parser to retrieve the tags I'm looking for. I know it will
probably be slow, but hopefully I'll get familiar with sophisticated
parsing in the future and improve my code. As it stands right now,
even the re syntax is not super easy to learn.

Kent: That works (of course!). Thanks a bunch once again!

Thanks
Bernard

Kent Johnson

2005-09-14 13:36:47 UTC

Post by Bernard Lebel
Thanks Alan,
I'll check BeautifulSoup asap.
I'm using regex simply because I have no clue where to start to parse
XML. I have read about the various XML tools available in the Python
library, however I'm at a complete loss as to what to make of them. Many
of them seem to use some programming standards, which I am completely
unfamiliar with (this is the first time that I have dug into XML writing
and parsing).
I don't know where to start to learn about all these standards, and as
usual with new programming things, the documentation is hard to
swallow (it usually is written more as a reference than a proper user
guide/tutorial). I have to admit this is very frustrating, so if I'm
looking at things from a wrong perspective please advise me, I need
it.

I agree that the Python XML story is confusing even for the files in the standard library. Worse, the (IMO) best solutions are not to be found in the standard lib or PyXML at all.

The std lib and PyXML are based on the DOM and SAX standards. These standards were designed to be "language-neutral" - there are implementations in Python, Java and other languages. The good side of this is, if you learn how to use them, the knowledge is pretty portable to other languages. The bad side is, the APIs defined by the standard are IMO clunky and painful to use, especially in Python.
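
To give a feel for the DOM style, here is a small sketch using xml.dom.minidom from the standard library. The SAMPLE document is hypothetical, modeled on the tag quoted earlier in the thread, and the sketch is written for Python 3.

```python
from xml.dom import minidom

# Hypothetical sample document, modeled on the tag quoted earlier in the thread.
SAMPLE = """<scene>
  <sceneobject name="Camera_Root_bernard" type="CameraRoot"/>
  <sceneobject name="light" type="Light"/>
</scene>"""

doc = minidom.parseString(SAMPLE)
# DOM method names follow the language-neutral standard rather than Python idiom.
camera_names = [
    node.getAttribute("name")
    for node in doc.getElementsByTagName("sceneobject")
    if node.getAttribute("type") == "CameraRoot"
]
print(camera_names)  # ['Camera_Root_bernard']
```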

There is a current thread on comp.lang.python discussing this with good suggestions and pointers to more info:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/a48891aa645ead13/dcd8fdc20b4b191b?hl=en#dcd8fdc20b4b191b

My personal preference is ElementTree. Beautiful Soup is good too though I have only tried it with HTML. If I was running on Linux I would try lxml which uses the ElementTree API and adds full XPath support. Amara looks like the Cadillac solution - big and cushy. I haven't tried it. Uche's articles (referenced in the thread above) have pointers to many other choices but these seem to be the most popular.

My favorite XML lib is actually dom4j which is in Java. It works great with Jython.

Kent

Post by Bernard Lebel
So right now I'm just taking a shortcut and using ultra-simple
re-based parser to retrieve the tags I'm looking for. I know it will
probably be slow, but hopefully I'll get familiar with sophisticated
parsing in the future and improve my code. As it stands right now,
even the re syntax is not super easy to learn.

For what you are doing re seems fine to me. You can get in trouble using re's with XML because of nested tags, variations in spelling and order, and probably a bunch of other things. But for simple stuff it can work fine.
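
As an illustration of the spelling and order problem, the sketch below uses a hypothetical pattern (not one from this thread) that captures the name of a CameraRoot object. It assumes that name comes before type and that attribute values use double quotes, and it stops matching as soon as either assumption fails.

```python
import re

# Hypothetical pattern that captures the name of a CameraRoot object; it
# assumes that name comes before type and that values use double quotes.
pattern = re.compile(r'<sceneobject name="(.*?)".*?type="CameraRoot"')

as_posted = '<sceneobject name="Camera_Root_bernard" type="CameraRoot">'
reordered = '<sceneobject type="CameraRoot" name="Camera_Root_bernard">'
single_quoted = "<sceneobject name='Camera_Root_bernard' type='CameraRoot'>"

results = []
for line in (as_posted, reordered, single_quoted):
    match = pattern.search(line)
    results.append(match.group(1) if match else None)

print(results)  # ['Camera_Root_bernard', None, None]
```

All three lines are equivalent XML, which is why a parser is the safer choice once the input is not under your control.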

Kent

_______________________________________________
http://mail.python.org/mailman/listinfo/tutor

Bernard Lebel

2005-09-14 14:35:38 UTC

Thanks for that pointer Kent, I'll check it out. Also thanks for
letting me know I'm not nuts! :-)

Alan's suggestion about BeautifulSoup is actually excellent. The
documentation is nice and the tool is very easy to use.

However, is it normal for a 2618-line XML file to take 20-30 seconds
or so to parse?

Thanks
Bernard

Kent Johnson

2005-09-14 14:55:43 UTC

Post by Bernard Lebel
Thanks for that pointer Kent, I'll check it out. Also thanks for
letting me know I'm not nuts! :-)
Alan's suggestion about BeautifulSoup is actually excellent. The
documentation is nice and the tool is very easy to use.
However, is it normal for a 2618-line XML file to take 20-30 seconds
or so to parse?

That seems slow to me unless the lines are really long! How many bytes is the file? But I don't have much experience with BeautifulSoup.

ElementTree is fast and cElementTree (the C implementation) is really fast. I have used it to read, process and write a 28 MB XML file; it took about 10 seconds.

Kent

Bernard Lebel

2005-09-14 15:01:59 UTC

The file size is 112 KB. Most lines look like this:

<parameter name="roty" type="Parameter" sourceclassname="nosource">

I'll give ElementTree a try.

Bernard

Kent Johnson

2005-09-14 15:19:44 UTC

Post by Bernard Lebel
<parameter name="roty" type="Parameter" sourceclassname="nosource">
I'll give ElementTree a try.

To get you started:

from elementtree import ElementTree
doc = ElementTree.parse('myfile.xml')
for sceneobject in doc.findall('//sceneobject'):
    if sceneobject.get('type') == 'CameraRoot':
        # this is a sceneobject that you want
        print sceneobject.get('name')
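
On current Python 3, ElementTree ships in the standard library as xml.etree.ElementTree, so the same idea might look like the sketch below. The SAMPLE string is a hypothetical stand-in for myfile.xml, and the path is written as './/sceneobject', the relative form of the search.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for myfile.xml.
SAMPLE = """<scene>
  <model>
    <sceneobject name="Camera_Root_bernard" type="CameraRoot"/>
    <sceneobject name="Camera" type="Camera"/>
  </model>
</scene>"""

# ET.parse() accepts a filename or a file-like object.
doc = ET.parse(io.StringIO(SAMPLE))
# './/sceneobject' searches the whole tree below the root element.
names = [
    obj.get("name")
    for obj in doc.findall(".//sceneobject")
    if obj.get("type") == "CameraRoot"
]
print(names)  # ['Camera_Root_bernard']
```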

One gotcha - if your XML uses namespaces, you have to prefix the tag name with the namespace URI, in braces, in findall(). It will look something like
doc.findall('//{http://www.imsproject.org/xsd/imscp_rootv1p1p2}resource')
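
A small Python 3 sketch of the namespace gotcha, using xml.etree.ElementTree. The SAMPLE document and the "ims" prefix are made up for illustration; only the namespace URI comes from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced document; the URI is the one from the example above.
NS = "http://www.imsproject.org/xsd/imscp_rootv1p1p2"
SAMPLE = f'<manifest xmlns="{NS}"><resources><resource identifier="r1"/></resources></manifest>'

root = ET.fromstring(SAMPLE)

# The bare tag name does not match a namespaced element.
bare = root.findall(".//resource")
# The namespace URI in braces before the tag name does.
qualified = root.findall(f".//{{{NS}}}resource")
# findall() also accepts a prefix-to-URI mapping ("ims" is an arbitrary prefix).
prefixed = root.findall(".//ims:resource", {"ims": NS})

print(len(bare), len(qualified), len(prefixed))  # 0 1 1
```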

Let us know how long that takes...

Kent

Bernard Lebel

2005-09-14 15:53:21 UTC

Hi Kent,

Well even before reading your last email I gave it a go, just parsing
the xml file and trying out some basic functions. It ran in less than
two seconds. I don't know why BeautifulSoup is taking so long...

Thanks for the "to get you started"!

Bernard

Alan G

2005-09-14 16:45:28 UTC

Post by Bernard Lebel
However, is it normal for a 2618-line XML file to take
20-30 seconds or so to parse?

Only if you are running it on an original Palm Pilot!

Seriously, I'd expect it to be more like 2-3 seconds.
Something fishy there.

Alan G.
