s***@mailworks.org
2015-06-27 00:03:48 UTC
Hello all,
I'm trying to merge and filter some xml. This is working well, but I'm
getting one node that's not in my list to include. Python version is
3.4.0.
The goal is to merge multiple xml files and then write a new one based
on whether or not <pid> is in an include list. In the mock data below,
the 3 xml files have a total of 8 <rec> nodes, and I have 4 <pid> values
in my list. The output is correctly formed xml, but it includes 5 <rec>
nodes; the 4 in the list, plus 89012 from input1.xml. It runs without
error. I've used type() to compare
rec.find('part').find('pid').text and the items in the list; both are
strings. When the first for loop is done, xmlet has 8 rec nodes. Is
there a problem in the iteration in the second for? Any other
recommendations also welcome. Thanks!
The code itself was cobbled together from two sources,
http://stackoverflow.com/questions/9004135/merge-multiple-xml-files-from-command-line/11315257#11315257
and http://bryson3gps.wordpress.com/tag/elementtree/
Here's the code and data:
#!/usr/bin/env python3
import os, glob
from xml.etree import ElementTree as ET
xmls = glob.glob('input*.xml')
ilf = os.path.join(os.path.expanduser('~'),'include_list.txt')
xo = os.path.join(os.path.expanduser('~'),'mergedSortedOutput.xml')
il = [x.strip() for x in open(ilf)]
xmlet = None
for xml in xmls:
    d = ET.parse(xml).getroot()
    for rec in d.iter('inv'):
        if xmlet is None:
            xmlet = d
        else:
            xmlet.extend(rec)
for rec in xmlet:
    if rec.find('part').find('pid').text not in il:
        xmlet.remove(rec)
ET.ElementTree(xmlet).write(xo)
quit()
include_list.txt
12345
34567
56789
67890
input1.xml
<inv>
<rec>
<part>
<pid>67890</pid>
<tid>67890t</tid>
</part>
<detail>
<did>67890d</did>
</detail>
</rec>
<rec>
<part>
<pid>78901</pid>
<tid>78901t</tid>
</part>
<detail>
<did>78901d</did>
</detail>
</rec>
<rec>
<part>
<pid>89012</pid>
<tid>89012t</tid>
</part>
<detail>
<did>89012d</did>
</detail>
</rec>
</inv>
input2.xml
<inv>
<rec>
<part>
<pid>45678</pid>
<tid>45678t</tid>
</part>
<detail>
<did>45678d</did>
</detail>
</rec>
<rec>
<part>
<pid>56789</pid>
<tid>56789t</tid>
</part>
<detail>
<did>56789d</did>
</detail>
</rec>
</inv>
input3.xml
<inv>
<rec>
<part>
<pid>12345</pid>
<tid>12345t</tid>
</part>
<detail>
<did>12345d</did>
</detail>
</rec>
<rec>
<part>
<pid>23456</pid>
<tid>23456t</tid>
</part>
<detail>
<did>23456d</did>
</detail>
</rec>
<rec>
<part>
<pid>34567</pid>
<tid>34567t</tid>
</part>
<detail>
<did>34567d</did>
</detail>
</rec>
</inv>
mergedSortedOutput.xml:
<inv>
<rec>
<part>
<pid>67890</pid>
<tid>67890t</tid>
</part>
<detail>
<did>67890d</did>
</detail>
</rec>
<rec>
<part>
<pid>89012</pid>
<tid>89012t</tid>
</part>
<detail>
<did>89012d</did>
</detail>
</rec>
<rec>
<part>
<pid>12345</pid>
<tid>12345t</tid>
</part>
<detail>
<did>12345d</did>
</detail>
</rec>
<rec>
<part>
<pid>34567</pid>
<tid>34567t</tid>
</part>
<detail>
<did>34567d</did>
</detail>
</rec>
<rec>
<part>
<pid>56789</pid>
<tid>56789t</tid>
</part>
<detail>
<did>56789d</did>
</detail>
</rec>
</inv>
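One likely explanation for the extra 89012 node, sketched here rather than confirmed on the original machine: the second for loop removes children from xmlet while iterating over that same element. When 78901 is removed, 89012 slides into its position and the loop steps past it without ever testing it. The sketch below uses inline strings instead of the real files, assumes the files were read in the order input1, input3, input2 (which is what the output suggests), and keeps only <pid> for brevity. The names SOURCES, INCLUDE, merge, filter_in_place, filter_snapshot and pids are made up for illustration.

```python
from xml.etree import ElementTree as ET

# Mock data: one string per input file, in the assumed read order
# (input1, input3, input2). Only <pid> is kept to stay short.
SOURCES = [
    "<inv><rec><part><pid>67890</pid></part></rec>"
    "<rec><part><pid>78901</pid></part></rec>"
    "<rec><part><pid>89012</pid></part></rec></inv>",
    "<inv><rec><part><pid>12345</pid></part></rec>"
    "<rec><part><pid>23456</pid></part></rec>"
    "<rec><part><pid>34567</pid></part></rec></inv>",
    "<inv><rec><part><pid>45678</pid></part></rec>"
    "<rec><part><pid>56789</pid></part></rec></inv>",
]
# A set gives fast membership tests; a list would also work.
INCLUDE = {"12345", "34567", "56789", "67890"}


def merge(sources):
    """Collect every <rec> from each document under a fresh <inv> root."""
    merged = ET.Element("inv")
    for text in sources:
        merged.extend(ET.fromstring(text).findall("rec"))
    return merged


def filter_in_place(root, include):
    """Same shape as the original loop: removes from the element being iterated."""
    for rec in root:
        if rec.findtext("part/pid") not in include:
            root.remove(rec)
    return root


def filter_snapshot(root, include):
    """Iterate over list(root), a copy, so removals cannot shift the loop."""
    for rec in list(root):
        if rec.findtext("part/pid") not in include:
            root.remove(rec)
    return root


def pids(root):
    """Return the <pid> text of every remaining <rec>."""
    return [rec.findtext("part/pid") for rec in root]


original_style = pids(filter_in_place(merge(SOURCES), INCLUDE))
fixed = pids(filter_snapshot(merge(SOURCES), INCLUDE))
print("remove while iterating:", original_style)
print("iterate over a copy:   ", fixed)
```

In the posted script the equivalent change would be "for rec in list(xmlet):" in the second loop. Building a new <inv> element that holds only the matching <rec> nodes would avoid the removal step altogether.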
_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor