Discussion:
[Tutor] removing nodes using ElementTree
s***@mailworks.org
2015-06-27 00:03:48 UTC
Hello all,

I'm trying to merge and filter some xml. This is working well, but I'm
getting one node that's not in my list to include. Python version is
3.4.0.

The goal is to merge multiple xml files and then write a new one based
on whether or not <pid> is in an include list. In the mock data below,
the 3 xml files have a total of 8 <rec> nodes, and I have 4 <pid> values
in my list. The output is correctly formed xml, but it includes 5 <rec>
nodes; the 4 in the list, plus 89012 from input1.xml. It runs without
error. I've used type() to compare
rec.find('part').find('pid').text and the items in the list, they're
strings. When the first for loop is done, xmlet has 8 rec nodes. Is
there a problem in the iteration in the second for? Any other
recommendations also welcome. Thanks!


The code itself was cobbled together from two sources,
http://stackoverflow.com/questions/9004135/merge-multiple-xml-files-from-command-line/11315257#11315257
and http://bryson3gps.wordpress.com/tag/elementtree/

Here's the code and data:

#!/usr/bin/env python3

import os, glob
from xml.etree import ElementTree as ET

xmls = glob.glob('input*.xml')
ilf = os.path.join(os.path.expanduser('~'),'include_list.txt')
xo = os.path.join(os.path.expanduser('~'),'mergedSortedOutput.xml')

il = [x.strip() for x in open(ilf)]

xmlet = None

for xml in xmls:
    d = ET.parse(xml).getroot()
    for rec in d.iter('inv'):
        if xmlet is None:
            xmlet = d
        else:
            xmlet.extend(rec)

for rec in xmlet:
    if rec.find('part').find('pid').text not in il:
        xmlet.remove(rec)

ET.ElementTree(xmlet).write(xo)

quit()





include_list.txt

12345
34567
56789
67890

input1.xml

<inv>
  <rec>
    <part>
      <pid>67890</pid>
      <tid>67890t</tid>
    </part>
    <detail>
      <did>67890d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>78901</pid>
      <tid>78901t</tid>
    </part>
    <detail>
      <did>78901d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>89012</pid>
      <tid>89012t</tid>
    </part>
    <detail>
      <did>89012d</did>
    </detail>
  </rec>
</inv>

input2.xml

<inv>
  <rec>
    <part>
      <pid>45678</pid>
      <tid>45678t</tid>
    </part>
    <detail>
      <did>45678d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>56789</pid>
      <tid>56789t</tid>
    </part>
    <detail>
      <did>56789d</did>
    </detail>
  </rec>
</inv>

input3.xml

<inv>
  <rec>
    <part>
      <pid>12345</pid>
      <tid>12345t</tid>
    </part>
    <detail>
      <did>12345d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>23456</pid>
      <tid>23456t</tid>
    </part>
    <detail>
      <did>23456d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>34567</pid>
      <tid>34567t</tid>
    </part>
    <detail>
      <did>34567d</did>
    </detail>
  </rec>
</inv>

mergedSortedOutput.xml:

<inv>
  <rec>
    <part>
      <pid>67890</pid>
      <tid>67890t</tid>
    </part>
    <detail>
      <did>67890d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>89012</pid>
      <tid>89012t</tid>
    </part>
    <detail>
      <did>89012d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>12345</pid>
      <tid>12345t</tid>
    </part>
    <detail>
      <did>12345d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>34567</pid>
      <tid>34567t</tid>
    </part>
    <detail>
      <did>34567d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>56789</pid>
      <tid>56789t</tid>
    </part>
    <detail>
      <did>56789d</did>
    </detail>
  </rec>
</inv>
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Alan Gauld
2015-06-29 00:02:56 UTC
Post by s***@mailworks.org
I'm trying to merge and filter some xml. This is working well, but I'm
getting one node that's not in my list to include.
This may be similar to a recent post with similar issues.
Post by s***@mailworks.org
xmlet.remove(rec)
You're removing records from the thing you are iterating over.
That's never a good idea, and often causes the next record to
be stepped over without being processed.

Try taking a copy before iterating and then modify the original.

for rec in xmlet[:]:  # slice creates a copy
    if rec....
        xmlet.remove(rec)
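
A minimal sketch of why one record slips through, assuming a cut-down version of input1.xml built in memory. The names XML, keep, pids, buggy and fixed are made up for this illustration and are not part of the original script.

```python
from xml.etree import ElementTree as ET

# Three <rec> nodes with the same pid values as input1.xml (illustrative data).
XML = "<inv>" + "".join(
    "<rec><part><pid>%s</pid></part></rec>" % pid
    for pid in ("67890", "78901", "89012")
) + "</inv>"
keep = {"67890"}  # hypothetical include list


def pids(root):
    """Return the pid text of every direct child of root."""
    return [rec.find("part/pid").text for rec in root]


# Buggy: removing a child while iterating shifts the later children left,
# so the child right after each removed one is never visited.
buggy = ET.fromstring(XML)
for rec in buggy:
    if rec.find("part/pid").text not in keep:
        buggy.remove(rec)

# Fixed: iterate over a copy of the child list and modify the original.
fixed = ET.fromstring(XML)
for rec in fixed[:]:
    if rec.find("part/pid").text not in keep:
        fixed.remove(rec)

print(pids(buggy))  # ['67890', '89012']  (89012 was skipped, as in the post)
print(pids(fixed))  # ['67890']
```

Here 78901 is removed at index 1, 89012 moves into that slot, and the loop ends before ever testing it.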

HTH
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


Peter Otten
2015-06-29 08:58:26 UTC
I believe Alan answered your question; I just want to thank you for
taking the time to describe your problem clearly and for providing
all the necessary parts to reproduce it.

Bonus part:

Other options to filter a mutable sequence:

(1) assign to the slice:

items[:] = [item for item in items if is_match(item)]

(2) iterate over it in reverse order:

for item in reversed(items):
    if not is_match(item):
        items.remove(item)
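
Both options can be tried on a plain list. This is only a sketch: is_match here is a stand-in predicate (keep even numbers) and the sample values are invented.

```python
def is_match(item):
    # Hypothetical predicate for the demo: keep even numbers.
    return item % 2 == 0


# (1) Slice assignment replaces the contents in place, so every other
# reference to the same list sees the filtered result too.
items = [1, 2, 3, 4, 5, 6]
alias = items
items[:] = [item for item in items if is_match(item)]

# (2) Walking backwards means a removal only shifts items that
# have already been visited.
others = [1, 3, 2, 5, 7, 4]
for item in reversed(others):
    if not is_match(item):
        others.remove(item)

print(items, alias is items)  # [2, 4, 6] True
print(others)                 # [2, 4]
```

Option 1 builds the new contents in one pass, while option 2 calls remove() once per rejected item, which searches the list each time.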

Below is a way to integrate method 1 in your code:

[...]

# set lookup is more efficient than lookup in a list
il = set(x.strip() for x in open(ilf))


def matching_recs(recs):
    return (rec for rec in recs if rec.find("part/pid").text in il)


xmlet = None
for xml in xmls:
    inv = ET.parse(xml).getroot()
    if xmlet is None:
        xmlet = inv
        # replace all recs with matching recs
        xmlet[:] = matching_recs(inv)
    else:
        # append only matching recs
        xmlet.extend(matching_recs(inv))


ET.ElementTree(xmlet).write(xo)

# the script will end happily without a quit() or exit() call
# quit()
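
To try the same idea without touching the file system, here is a sketch that swaps the files for in-memory strings. DOCS, include, doc and merged are made-up stand-ins for the input*.xml files, include_list.txt and xmlet, and matching_recs returns a list rather than a generator just to keep the demo simple.

```python
from xml.etree import ElementTree as ET


def doc(*pids):
    """Build a small <inv> document holding one <rec> per pid (test data only)."""
    recs = "".join("<rec><part><pid>%s</pid></part></rec>" % p for p in pids)
    return "<inv>%s</inv>" % recs


# Stand-ins for input1.xml, input2.xml and input3.xml.
DOCS = [
    doc("67890", "78901", "89012"),
    doc("45678", "56789"),
    doc("12345", "23456", "34567"),
]
include = {"12345", "34567", "56789", "67890"}  # stand-in for include_list.txt


def matching_recs(recs):
    return [rec for rec in recs if rec.find("part/pid").text in include]


merged = None
for text in DOCS:
    inv = ET.fromstring(text)
    if merged is None:
        merged = inv
        # keep the first root, but only with its matching recs
        merged[:] = matching_recs(inv)
    else:
        # append only the matching recs of the later documents
        merged.extend(matching_recs(inv))

result = [rec.find("part/pid").text for rec in merged]
print(result)  # ['67890', '56789', '12345', '34567']
```

Because each document is filtered before it is merged, nothing is ever removed from a sequence that is being iterated.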


At least with your sample data

    for rec in d.iter('inv'):

iterates over a single node (the root), so I omitted that loop.
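
A quick way to check this, assuming a tiny made-up document in place of the real input files (d, found and target are illustrative names):

```python
from xml.etree import ElementTree as ET

d = ET.fromstring("<inv><rec/><rec/></inv>")

# iter() walks the element itself plus all descendants, filtered by tag.
# Only the root is named 'inv', so the loop body would run exactly once.
found = list(d.iter("inv"))
print(len(found), found[0] is d)  # 1 True

# extend() accepts any iterable of elements; passing the root appends
# its children, which is why xmlet.extend(rec) merged the <rec> nodes.
target = ET.fromstring("<inv/>")
target.extend(d)
print([child.tag for child in target])  # ['rec', 'rec']
```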
