[Tutor] find second occurrence of string in line
richard kappler
2015-09-08 16:00:35 UTC
I need to find the index of the second occurrence of a string in an xml file
for parsing. I understand re well enough to do what I want to do on the
first instance, but despite many Google searches have yet to find something
to get the index of the second instance, because split won't really work on
my xml file (if I understand split properly) as there are no spaces.
Specifically I'm looking for the second <timestamp> in an objectdata line.
Not all lines are objectdata lines, though all objectdata lines do have
more than one <timestamp>.

What I have so far:

    import re

    with open("example.xml", 'r') as f:
        for line in f:
            if "objectdata" in line:
                if "<timestamp>" in line:
                    x = "<timestamp>"
                    first = x.index(line)
                    second = x[first+1:].index(line)
                    print first, second
                else:
                    print "no timestamp"
            else:
                print "no objectdata"

my traceback:

    Traceback (most recent call last):
      File "2iter.py", line 10, in <module>
        first = x.index(line)
    ValueError: substring not found



--

All internal models of the world are approximate. ~ Sebastian Thrun
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Alan Gauld
2015-09-08 16:23:31 UTC
On 08/09/15 17:00, richard kappler wrote:
> I need to find the index of the second occurance of a string in an xml file
> for parsing.

Do you want to find just the second occurrence in the *file*
or the second occurrence within a given tag in the file (and
there could be multiple such tags)?


> I understand re well enough to do what I want to do

Using re to parse XML is usually the wrong way to go about it.
Fortunately you are not using re in the code below.
However, a real XML parser such as etree (from the std lib)
or lxml might work better.

> first instance, but despite many Google searches have yet to find something
> to get the index of the second instance, because split won't really work on
> my xml file (if I understand split properly) as there are no spaces.

split can split on any character you want, whitespace
just happens to be the default.

> Specifically I'm looking for the second <timestamp> in an objectdata line.

Is objectdata within a specific tag? Usually when parsing XML it's
the tags you look for first since "lines" can be broken over
multiple lines and multiple tags can exist on one literal line.

> Not all lines are objectdata lines, though all objectdata lines do have
> more than one <timestamp>.

This implies there are many objectdata lines within your file? See the
first comment above... do you want the second index for the first
objectdata line or do you want it for every objectdata line?

> import re

You don't use this.

> with open("example.xml", 'r') as f:
>     for line in f:
>         if "objectdata" in line:
>             if "<timestamp>" in line:
>                 x = "<timestamp>"

You should assign this once above the loops, it saves a lot of
duplicated work.

> first = x.index(line)

This is looking for the index of line within x.
I suspect you really want

first = line.index(x)

> second = x[first+1:].index(line)

You can specify a start position for index directly:

second = line.index(x,first+1)
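
The two corrected calls can be tried on a short string. This is a minimal sketch in Python 3 syntax (the thread's code is Python 2); the sample line is a made-up, heavily shortened stand-in for a real objectdata record.

```python
# Hypothetical sample line; the real log lines are much longer.
line = ("<objectdata><timestamp>A</timestamp>"
        "<general><timestamp>B</timestamp></general></objectdata>")
tag = "<timestamp>"

first = line.index(tag)               # offset of the first <timestamp>
second = line.index(tag, first + 1)   # search resumes one character later

print(first, second)
```

Passing the start position to index() keeps the result relative to the whole line, which slicing first (line[first+1:].index(tag)) would not.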


>                 print first, second
>             else:
>                 print "no timestamp"
>         else:
>             print "no objectdata"
>
> my traceback:
>
> Traceback (most recent call last):
> File "2iter.py", line 10, in <module>
> first = x.index(line)
> ValueError: substring not found

That's what you get when the search fails.
You should use try/except when using index()

Alternatively, try str.find(), which returns -1 when the
substring is not found. But you need to check the result before
using it, because -1 is, of course, a valid string index!
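
A sketch of the find() variant, with the -1 check made explicit. The function name second_index is illustrative and not from the thread; Python 3 syntax is assumed.

```python
def second_index(line, tag):
    """Return the offset of the second occurrence of tag in line, or None."""
    first = line.find(tag)
    if first == -1:          # tag does not occur at all
        return None
    second = line.find(tag, first + 1)
    if second == -1:         # tag occurs only once
        return None
    return second


print(second_index("ab<t>cd<t>", "<t>"))
print(second_index("only one <t> here", "<t>"))
```

Returning None instead of -1 avoids the trap mentioned above: an unchecked -1 used as an index silently selects the last character.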

HTH

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


richard kappler
2015-09-08 16:56:43 UTC
> Do you want to find just the second occurrence in the *file* or the second
> occurrence within a given tag in the file (and there could be multiple such
> tags)?

There are multiple objectdata lines in the file and I wish to find the
second occurrence of timestamp in each of those lines.

> Is objectdata within a specific tag? Usually when parsing XML it's the
> tags you look for first since "lines" can be broken over multiple lines and
> multiple tags can exist on one literal line.

objectdata is within a tag as is timestamp. Here's an example:

<objectdata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="Logging.xsd"
version="1.0"><devicename>0381UDI1</devicename><deviceid>32</deviceid><timestamp>2015-06-18T14:28:06.570</timestamp><incr>53163</incr><tokenid>0381UDI12015-06-18T14:27:50379</tokenid><seqnb>1306</seqnb><general
oi="607360" on="379" ox="02503" oc="0" is="49787" ie="50312" lftf="N"
lfts="7" errornb="0"
iostate="DC00"><timestamp>2015-06-18T14:27:50.811</timestamp><otl
unit="inch"><value>51.45</value></otl><tt
unit="ms"><value>0</value></tt><oga
unit="inch"><value>-10.66</value></oga><devicelist>FFFFF001</devicelist><udi
udi1="2" udi2="1" udi3="20" udi4="10" udi5="0"/><uds uds1="5000"
uds2="1720" uds3="1540" uds5="OK"/><usertag1
name="SE">SE0381icdim01322142C0050.00017.20015.2IN1000000000
0044]C00379622001900000175168000643903637408</usertag1></general><condition>AcSuccess,AC,LFT,Cubic,ValidDim</condition><barcode
cc="1" vcc="1"><codevalid
id="j"><st>0</st><wud>0</wud><cs>495</cs><bdn>18</bdn><cl>34</cl><bc>9622001900000175168000643903637408</bc><condition>FxgEpic,RefBarcode,ScannersTop,Ref,ValidRead,SortableBarcode,EpicBarcode</condition><position
unit="mm" x="119" y="544" z="375" xmin="97"
xmax="174"/><readlist>70066000</readlist><devices><device
dn="14"><dcs>101</dcs><position unit="mm" x="99" y="544"
z="439"/></device><device dn="15"><dcs>99</dcs><position unit="mm" x="174"
y="533" z="312"/></device><device dn="18"><dcs>112</dcs><position unit="mm"
x="119" y="544" z="375"/></device><device dn="19"><dcs>81</dcs><position
unit="mm" x="97" y="606" z="390"/></device><device
dn="2"><dcs>37</dcs><position unit="mm" x="117" y="539"
z="374"/></device><device dn="4"><dcs>29</dcs><position unit="mm" x="119"
y="547" z="356"/></device><device dn="3"><dcs>36</dcs><position unit="mm"
x="119" y="540"
z="358"/></device></devices></codevalid></barcode><volumetric oms1="0000"
oms2="00000000" oms3="00000001"><size unit="inch" ole="50.00" owi="17.20"
ohe="15.20"/><oa unit="degree/10"><value>-50</value></oa><obv
unit="inch3/10"><value>130720</value></obv><orv
unit="inch3/10"><value>0</value></orv><otve
unit="mm/sec"><value>0</value></otve><polygon unit="inch" oc1x="0.14"
oc1y="8.80" oc1z="15.20" oc2x="1.70" oc2y="26.20" oc2z="15.20" oc3x="50.05"
oc3y="4.32" oc3z="15.20" oc4x="51.61" oc4y="21.73"
oc4z="15.20"/></volumetric><sorterstate state="started"><speed
unit="ft/min"><value>84.06</value></speed></sorterstate><sortstate
session="SOS" sortname="2015-06-18
12:47:57"/><hostmessage>]C00379622001900000175168000643903637408</hostmessage></objectdata>

> import re

I know I don't use this in the snippet of code I sent, it is used elsewhere
in the script and I missed that when trimming.

> You should assign this once above the loops, it saves a lot of duplicated
> work.

Yes, I understand, again, it is a snippet of a much larger piece of code.

> second = line.index(x,first+1)

I think that is what I was looking for, off to test now. Thanks Alan!

regards, Richard

Peter Otten
2015-09-08 17:40:03 UTC
richard kappler wrote:

>> Do you want to find just the second occurrence in the *file* or the second
> occurrence within a given tag in the file (and there could be multiple such
> tags)?
>
> There are multiple objectdata lines in the file and I wish to find the
> second occurrence of timestamp in each of those lines.
>
>> Is objectdata within a specific tag? Usually when parsing XML it's the
> tags you look for first since "lines" can be broken over multiple lines
> and multiple tags can exist on one literal line.
>
> objectdata is within a tag as is timestamp. Here's an example:
>
> <objectdata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:noNamespaceSchemaLocation="Logging.xsd"
> version="1.0"><devicename>0381UDI1</devicename><deviceid>32
> </deviceid><timestamp>2015-06-18T14:28:06.570</timestamp>
> <incr>53163</incr><tokenid>0381UDI12015-06-18T14:27:50379</tokenid>
> <seqnb>1306</seqnb><general oi="607360" on="379" ox="02503" oc="0"
> is="49787" ie="50312" lftf="N"
> lfts="7" errornb="0"
> iostate="DC00"><timestamp>2015-06-18T14:27:50.811</timestamp><otl
> unit="inch"><value>51.45</value></otl><tt
> unit="ms"><value>0</value></tt><oga

[snip]

I'm inferring from the above that you do not really want the "second"
timestamp in the line -- there is no line left intact anyway ;) -- but rather
the one in the <general>...</general> part.

Here's a way to get these (all of them) with lxml:

import lxml.etree

tree = lxml.etree.parse("example.xml")
print tree.xpath("//objectdata/general/timestamp/text()")


Albert-Jan Roskam
2015-09-08 19:08:22 UTC
> To: ***@python.org
> From: ***@web.de
> Date: Tue, 8 Sep 2015 19:40:03 +0200
> Subject: Re: [Tutor] Fwd: find second occurance of string in line
>
> richard kappler wrote:
>
> >> Do you want to find just the second occurrence in the *file* or the second
> > occurrence within a given tag in the file (and there could be multiple such
> > tags)?
> >
> > There are multiple objectdata lines in the file and I wish to find the
> > second occurrence of timestamp in each of those lines.
> >
> >> Is objectdata within a specific tag? Usually when parsing XML it's the
> > tags you look for first since "lines" can be broken over multiple lines
> > and multiple tags can exist on one literal line.
> >
> > objectdata is within a tag as is timestamp. Here's an example:
> >
> > <objectdata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> > xsi:noNamespaceSchemaLocation="Logging.xsd"
> > version="1.0"><devicename>0381UDI1</devicename><deviceid>32
> > </deviceid><timestamp>2015-06-18T14:28:06.570</timestamp>
> > <incr>53163</incr><tokenid>0381UDI12015-06-18T14:27:50379</tokenid>
> > <seqnb>1306</seqnb><general oi="607360" on="379" ox="02503" oc="0"
> > is="49787" ie="50312" lftf="N"
> > lfts="7" errornb="0"
> > iostate="DC00"><timestamp>2015-06-18T14:27:50.811</timestamp><otl
> > unit="inch"><value>51.45</value></otl><tt
> > unit="ms"><value>0</value></tt><oga
>
> [snip]
>
> I'm inferring from the above that you do not really want the "second"
> timestamp in the line -- there is no line left intact anyway ;) -- but rather
> the one in the <general>...</general> part.
>
> Here's a way to get these (all of them) with lxml:
>
> import lxml.etree
>
> tree = lxml.etree.parse("example.xml")
> print tree.xpath("//objectdata/general/timestamp/text()")

Nice. I do need to try lxml some time. Is the "text()" part xpath as well?
When I try it with elementtree, I get

* a SyntaxError:

    root.find("//objectdata/general/timestamp/text()")
      File "<string>", line unknown
    SyntaxError: cannot use absolute path on element

* a KeyError:

    root.find("./general/timestamp/text()")
    Traceback (most recent call last):

      File "<ipython-input-47-af8fc012f0f0>", line 1, in <module>
        root.find("./general/timestamp/text()")

      File "/usr/lib/python2.7/xml/etree/ElementPath.py", line 285, in find
        return iterfind(elem, path, namespaces).next()

      File "/usr/lib/python2.7/xml/etree/ElementPath.py", line 263, in iterfind
        selector.append(ops[token[0]](next, token))

    KeyError: '()'


Anyway, I would have done it like below. I added the pretty-printing because I think it's a really useful function.

import xml.etree.cElementTree as et
import xml.dom.minidom

data = """\
* copy-paste xml here
"""
data = xml.dom.minidom.parseString(data).toprettyxml()
print data
root = et.fromstring(data)
#root = et.parse(xml_filename).getroot()
timestamp = root.findtext(".//general/timestamp")


Best wishes,
Albert-Jan






Peter Otten
2015-09-08 19:37:07 UTC
Albert-Jan Roskam wrote:

>> import lxml.etree
>>
>> tree = lxml.etree.parse("example.xml")
>> print tree.xpath("//objectdata/general/timestamp/text()")
>
> Nice. I do need to try lxml some time. Is the "text()" part xpath as well?

Yes. I think ElementTree supports a subset of XPath.
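
Because ElementTree's path language has no text() step, the usual workaround is to select the elements and read their text afterwards. A minimal sketch with the standard library, in Python 3 syntax; SAMPLE is a made-up, heavily trimmed version of the record posted earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed version of the objectdata record from the thread.
SAMPLE = (
    '<objectdata><timestamp>2015-06-18T14:28:06.570</timestamp>'
    '<general oi="607360"><timestamp>2015-06-18T14:27:50.811</timestamp>'
    '</general></objectdata>'
)

root = ET.fromstring(SAMPLE)   # root is the <objectdata> element

# Select the elements, then read .text, instead of using text() in the path.
stamps = [elem.text for elem in root.findall("./general/timestamp")]
print(stamps)

# findtext() returns the text of the first match, or None if nothing matches.
print(root.findtext("./general/timestamp"))
```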

Albert-Jan Roskam
2015-09-09 13:31:22 UTC
> To: ***@python.org
> From: ***@web.de
> Date: Tue, 8 Sep 2015 21:37:07 +0200
> Subject: Re: [Tutor] Fwd: find second occurance of string in line
>
> Albert-Jan Roskam wrote:
>
> >> import lxml.etree
> >>
> >> tree = lxml.etree.parse("example.xml")
> >> print tree.xpath("//objectdata/general/timestamp/text()")
> >
> > Nice. I do need to try lxml some time. Is the "text()" part xpath as well?
>
> Yes. I think ElementTree supports a subset of XPath.

aha, I see. I studied lxml.de a bit last night and it seems to be better in many ways. Writing appears to be mmmuch faster than cElementTree while many other situations are comparable to cElementTree. I love the objectify part of the package. Would you say that (given iterparse) lxml is also the module to process giant (i.e. larger than RAM) xml files?

The webpage recommends a cascade of try-except ImportError statements: first lxml, then cElementTree, then ElementTree. But given that there are slight API differences, is that really a good idea? How would you test whether the code runs under both lxml and under the alternatives? Would you uninstall lxml to force that one of the alternatives is used?

Best wishes,
Albert-Jan

richard kappler
2015-09-09 13:40:16 UTC
> It looks like I was not clear enough: XML doesn't have the concept of lines.
> When you process XML "by line" you have buggy code.

No Peter, I'm pretty sure it was I who was less than clear. The xml data is
generated by events, one line in a log for each event, so while xml doesn't
have the concept of lines, each event creates a new line of xml data in the
log from which I will be reading. Sorry about the confusion. I should have
phrased it better, perhaps calling them events instead of lines?

> Just for fun take the five minutes to install lxml and compare the output
> of the two scripts. If it's the same now there's no harm switching to lxml,
> and you are making future failures less likely.

I'll take a look at it, thanks.

regards, Richard

Peter Otten
2015-09-09 13:53:33 UTC
Albert-Jan Roskam wrote:

>> To: ***@python.org
>> From: ***@web.de
>> Date: Tue, 8 Sep 2015 21:37:07 +0200
>> Subject: Re: [Tutor] Fwd: find second occurance of string in line
>>
>> Albert-Jan Roskam wrote:
>>
>> >> import lxml.etree
>> >>
>> >> tree = lxml.etree.parse("example.xml")
>> >> print tree.xpath("//objectdata/general/timestamp/text()")
>> >
>> > Nice. I do need to try lxml some time. Is the "text()" part xpath as
>> > well?
>>
>> Yes. I think ElementTree supports a subset of XPath.
>
> aha, I see. I studied lxml.de a bit last night and it seems to be better
> in many ways. Writing appears to be mmmuch faster than cElementtree while
> many other situations are comparable to cElementtree. I love the objectify
> part of the package. Would you say that (given iterparse) lxml is also the
> module to process giant (i.e. larger than RAM) xml files?

Yes; but that would not be backed by experience ;)
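
For reference, a sketch of what incremental parsing with iterparse looks like. It uses the standard-library ElementTree so it runs without lxml, and it assumes the records sit inside a single root element (named <log> here purely for illustration), which is not how the log in this thread is laid out. Python 3 syntax.

```python
import io
import xml.etree.ElementTree as ET


def general_timestamps(source):
    """Yield each <general>/<timestamp> text without keeping every record's children in memory."""
    # An "end" event fires once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "objectdata":
            yield elem.findtext("./general/timestamp")
            elem.clear()   # drop the children we no longer need


# Hypothetical document: a <log> root wrapping two small objectdata records.
demo = io.BytesIO(
    b"<log>"
    b"<objectdata><timestamp>t1</timestamp>"
    b"<general><timestamp>g1</timestamp></general></objectdata>"
    b"<objectdata><timestamp>t2</timestamp>"
    b"<general><timestamp>g2</timestamp></general></objectdata>"
    b"</log>"
)
result = list(general_timestamps(demo))
print(result)
```

The emptied objectdata elements stay attached to the root, so memory still grows slowly on very large inputs; truly huge files need the processed elements removed from the root as well.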

> The webpage recommends a cascade of try-except ImportError statements:
> first lxml, then cElementtree, then elementtree. But given that there are
> slight API differences, is that really a good idea?

I tend to shy away from such complications, just as I write Python-3-only
code now.
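
For readers who have not seen it, the cascade under discussion looks roughly like this. It is a sketch in Python 3, where the separate cElementTree step is no longer needed (the module was removed in Python 3.9, and xml.etree.ElementTree picks up its C accelerator automatically). BACKEND is an illustrative name.

```python
try:
    from lxml import etree                   # third-party, may not be installed
    BACKEND = "lxml"
except ImportError:
    import xml.etree.ElementTree as etree    # standard-library fallback
    BACKEND = "ElementTree"

root = etree.fromstring(
    "<general><timestamp>2015-06-18T14:27:50.811</timestamp></general>"
)
print(BACKEND, root.findtext("timestamp"))
```

Only calls that exist in both libraries (fromstring, find, findall, findtext) are safe after such an import; lxml-only features such as xpath() would fail under the fallback.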

> How would you test
> whether the code runs under both lxml and under the alternatives? Would
> you uninstall lxml to force that one of the alternatives is used?

Again, I have not much experience with code that must cope with different
environments.

I have simulated a missing module (not lxml) with the following trick:

    $ mkdir missing_lxml
    $ echo 'raise ImportError' > missing_lxml/lxml.py
    $ python3 -c 'import lxml'
    $ PYTHONPATH=missing_lxml python3 -c 'import lxml'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/peter/missing_lxml/lxml.py", line 1, in <module>
        raise ImportError
    ImportError

Those who regularly need different configurations probably use virtualenv,
or virtual machines when the differences are not limited to Python.
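
A different technique with the same effect, usable inside a test without touching PYTHONPATH: in Python 3, a None entry in sys.modules makes the import of that name raise ImportError. The sketch below is not the shell trick shown above, and lxml_available is an illustrative helper name.

```python
import importlib
import sys
from unittest import mock


def lxml_available():
    """Report whether 'import lxml' succeeds in the current environment."""
    try:
        importlib.import_module("lxml")
    except ImportError:
        return False
    return True


# A None entry in sys.modules makes the import machinery raise ImportError
# for that name, whether or not the package is actually installed.
with mock.patch.dict(sys.modules, {"lxml": None}):
    blocked = lxml_available()

print(blocked)
```

patch.dict restores sys.modules when the with block ends, so the rest of the test run sees the real environment again.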


Laura Creighton
2015-09-09 15:16:09 UTC
Peter Otten
>Those who regularly need different configurations probably use virtualenv,
>or virtual machines when the differences are not limited to Python.

Use tox for this.
https://testrun.org/tox/latest/

However for development purposes it often helps to have a
--force the_one_that_I_want option (for command lines) or
a global variable, or a config file for modules.
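
One way such a "force" switch might look for a module, sketched in Python 3. The environment variable XML_BACKEND, the BACKENDS table and load_etree are all made-up names for illustration.

```python
import importlib
import os

# Hypothetical names: XML_BACKEND and the two keys below are illustrative only.
BACKENDS = {
    "lxml": "lxml.etree",
    "elementtree": "xml.etree.ElementTree",
}


def load_etree(force=None):
    """Import the requested backend, or the first one that is importable."""
    force = force or os.environ.get("XML_BACKEND")
    if force:
        # A forced backend that is missing should fail loudly, not fall back.
        return importlib.import_module(BACKENDS[force])
    for module_name in BACKENDS.values():
        try:
            return importlib.import_module(module_name)
        except ImportError:
            continue
    raise ImportError("no XML backend available")


etree = load_etree(force="elementtree")
print(etree.__name__)
```

With this in place, running the test suite once per backend is a matter of setting the variable, which is also easy to wire into tox environments.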

How badly you want this depends on your own personal development
style, and how happy you are popping in and out of virtualenvs. Many
people prefer to write their whole new thing for one library (say
elementtree) and then test it/port it against the other 2, one at a
time, making a complete set of patches for one adaptation at a time.
Other people prefer to write their code so that, feature by feature
they first get it to work with one library, and then with another, and
then with the third, and then they write the next new bit of code, so
that they never have to do a real port.

Life is messy enough that you often do a bit of this and a bit of the
other thing, even if you would prefer to not need to, especially if
hungry customers are demanding exactly what they need (and we don't
care about the other ways it will eventually work for other people).

Laura

Peter Otten
2015-09-09 07:55:24 UTC
richard kappler wrote:

> On Tue, Sep 8, 2015 at 1:40 PM, Peter Otten <***@web.de> wrote:

>> I'm inferring from the above that you do not really want the "second"
>> timestamp in the line -- there is no line left intact anyway ;) -- but
>> rather
>> the one in the <general>...</general> part.
>>
>> Here's a way to get these (all of them) with lxml:
>>
>> import lxml.etree
>>
>> tree = lxml.etree.parse("example.xml")
>> print tree.xpath("//objectdata/general/timestamp/text()")

> No no, I'm not sure how I can be much more clear, that is one (1) line of
> xml that I provided, not many, and I really do want what I said in the
> very beginning, the second instance of <timestamp> for each of those
> lines.

It looks like I was not clear enough: XML doesn't have the concept of
lines. When you process XML "by line" you have buggy code.

> Got it figured out with guidance from Alan's response though:
>
> #!/usr/bin/env python
>
> with open("example.xml", 'r') as f:
>     for line in f:
>         if "objectdata" in line:
>             if "<timestamp>" in line:
>                 x = "<timestamp>"
>                 a = "</timestamp>"
>                 first = line.index(x)
>                 second = line.index(x, first+1)
>                 b = line.index(a)
>                 c = line.index(a, b+1)
>                 y = second + 11
>                 timestamp = line[y:c]
>                 print timestamp

Just for fun take the five minutes to install lxml and compare the output of
the two scripts. If it's the same now there's no harm switching to lxml, and
you are making future failures less likely.
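
The comparison can be sketched on a single record. The example uses the standard-library ElementTree in place of lxml so that it runs without extra installs, and Python 3 syntax; the record is a made-up, trimmed version of the one posted earlier, and by_slicing and by_parsing are illustrative names.

```python
import xml.etree.ElementTree as ET

TAG, END = "<timestamp>", "</timestamp>"


def by_slicing(line):
    """String approach from the thread: text of the second <timestamp>."""
    start = line.index(TAG, line.index(TAG) + 1) + len(TAG)
    return line[start:line.index(END, start)]


def by_parsing(line):
    """Parser approach: text of the <timestamp> inside <general>."""
    return ET.fromstring(line).findtext("./general/timestamp")


# Hypothetical, trimmed record; real lines carry many more elements.
record = (
    '<objectdata><timestamp>2015-06-18T14:28:06.570</timestamp>'
    '<general oi="607360"><timestamp>2015-06-18T14:27:50.811</timestamp>'
    '</general></objectdata>'
)

print(by_slicing(record) == by_parsing(record))
```

The two agree as long as every record has the <general> timestamp as its second <timestamp>; the parser version keeps working if the element order changes or a record is wrapped over several lines.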
