Discussion:
[Tutor] extracting text from word files (.doc, .docx) and pdf
Juan Jose Del Toro
2011-01-25 21:52:56 UTC
Dear List;

I am looking for a way to extract parts of a text from word (.doc,.docx)
files as well as pdf; the idea is to walk through the whole directory tree
and populate a csv file with an excerpt from each file.
For PDF I found pyPdf <http://pybrary.net/pyPdf/>, but I have found nothing
to read doc, docx.
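
The walk-and-collect part of this can be sketched independently of the file formats. What follows is only a minimal sketch, assuming Python 3; `build_excerpt_csv`, `read_plain_text`, and the `extractors` mapping are hypothetical names made up for illustration, and the real per-format readers (PDF, doc, docx) would be registered where the plain-text stand-in is used here.

```python
import csv
import os


def read_plain_text(path):
    """Stand-in extractor (hypothetical): read a file as plain text."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def build_excerpt_csv(root, csv_path, extractors, excerpt_len=200):
    """Walk root and write one CSV row (path, excerpt) per supported file.

    extractors maps a lowercase extension such as ".txt" to a function
    that takes a path and returns the file's text.
    """
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "excerpt"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                ext = os.path.splitext(name)[1].lower()
                extractor = extractors.get(ext)
                if extractor is None:
                    continue  # no reader registered for this extension
                full_path = os.path.join(dirpath, name)
                text = extractor(full_path)
                # Collapse whitespace so each excerpt stays on one line.
                excerpt = " ".join(text.split())[:excerpt_len]
                writer.writerow([full_path, excerpt])
                rows_written += 1
    return rows_written
```

A call might look like `build_excerpt_csv("docs", "excerpts.csv", {".txt": read_plain_text})`; keeping the readers in a dictionary means a new format only needs one more entry.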
--
¡Saludos! / Greetings!
Juan José Del Toro M.
***@gmail.com
Guadalajara, Jalisco MEXICO
Walter Prins
2011-01-25 22:13:26 UTC
Post by Juan Jose Del Toro
Dear List;
I am looking for a way to extract parts of a text from word (.doc,.docx)
files as well as pdf; the idea is to walk through the whole directory tree
and populate a csv file with an excerpt from each file.
For PDF I found pyPdf <http://pybrary.net/pyPdf/>, but I have found nothing
to read doc, docx.
http://www.google.com/search?q=python+read+ms+word+file&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a

which returns this:

http://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python

Additionally -- docx are, IIRC, zipped XML, so you could probably just
uncompress it and scan the XML directly...
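
The unzip-and-scan idea can be sketched with the standard library alone. This is a rough sketch rather than a complete reader: it assumes Python 3 and that the body text lives in `word/document.xml` inside the archive, it ignores headers, footnotes, and formatting, and `docx_text` is a hypothetical helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each w:t element holds one run of text within the paragraph.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

For an excerpt, slicing the result (for example `docx_text(path)[:200]`) would be enough.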

Walter
Corey Richardson
2011-01-25 22:21:13 UTC
Post by Juan Jose Del Toro
Dear List;
I am looking for a way to extract parts of a text from word (.doc,.docx)
files as well as pdf; the idea is to walk through the whole directory tree
and populate a csv file with an excerpt from each file.
For PDF I found pyPdf <http://pybrary.net/pyPdf/>, but I have found nothing
to read doc, docx.
A docx file is a zip-compressed set of XML files. I don't know if there
is a Python module for it, but you could probably whip up your own. I
know 7z on Windows will extract a .docx (probably any unzip tool can if
you point it at the file, not sure).
Emile van Sebille
2011-01-25 23:59:38 UTC
On 1/25/2011 1:52 PM Juan Jose Del Toro said...
Post by Juan Jose Del Toro
Dear List;
I am looking for a way to extract parts of a text from word (.doc,.docx)
I recently did a project extracting data from Word documents with
antiword (http://www.winfield.demon.nl/), which I called like this:

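import commands  # Python 2 only; subprocess replaces it on Python 3

def setContent(self):
    # Run antiword on the document, then keep the non-empty output
    # lines, stripped of whitespace and of the "˚" character.
    # doc is not defined in this snippet; presumably it is the path
    # to the .doc file.
    output = commands.getoutput('/usr/local/bin/antiword "%s"' % doc)
    self.content = [
        ii.strip().replace("˚", "")
        for ii in output.split("\n")
        if ii
    ]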


Emile
Alan Gauld
2011-01-26 00:42:08 UTC
Post by Juan Jose Del Toro
I am looking for a way to extract parts of a text from word
(.doc,.docx)
files as well as pdf;
In addition to the suggestions already given, you can use
COM to drive Word itself if you have Word on the PC on
which you are running the code.

If not, there are ways to drive OpenOffice too, although I've never
used them...

HTH,
--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/