[Tutor] gensim to generate document vectors

Discussion:

Joshua Valdez

2015-08-17 17:50:54 UTC

Okay, so I'm trying to use Doc2Vec to simply read in a file that is a
list of sentences like this:
#

The elephant flaps its large ears to cool the blood in them and its body.

A house is a permanent building or structure for people or families to live
in.

....
#

What I want to do is generate two files: one with the unique words from these
sentences, and another that has one corresponding vector per line (if
there's no vector output, I want to output a vector of 0's).

I'm getting the vocab fine with my code, but I can't seem to figure out how
to print out the individual sentence vectors. I have looked through the
documentation and haven't found much help. Here is what my code looks like
so far.

    sentences = []
    for uid, line in enumerate(open(filename)):
        sentences.append(LabeledSentence(words=line.split(),
                                         labels=['SENT_%s' % uid]))

    model = Doc2Vec(alpha=0.025, min_alpha=0.025)
    model.build_vocab(sentences)
    for epoch in range(10):
        model.train(sentences)
        model.alpha -= 0.002
        model.min_alpha = model.alpha

    sent_reg = r'[SENT].*'
    for item in model.vocab.keys():
        sent = re.search(sent_reg, item)
        if sent:
            continue
        else:
            print item

    ###I'm not sure how to produce the vectors from here and this doesn't work##
    sent_id = 0
    for item in model:
        print model["SENT_"+str(sent_id)]
        sent_id += 1

I'm not sure where to go from here so any help is appreciated.

Thanks!

_______________________________________________
Tutor maillist - ***@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Alan Gauld

2015-08-17 19:42:44 UTC


Post by Joshua Valdez
Okay, so I'm trying to use Doc2Vec to simply read in a file that is a

This list is for folks learning the core Python language and the
standard library.
Doc2Vec is not part of that library, so you might get more
responses by asking on the gensim community forums.
A quick Google search suggests there are several
to choose from.

You might get lucky here, but it's not an area we discuss often.

Post by Joshua Valdez
What I want to do is generate two files: one with the unique words from these
sentences, and another that has one corresponding vector per line (if
there's no vector output, I want to output a vector of 0's).

Don't assume anyone here will know about your area of specialization.
What is a vector in this context?

Post by Joshua Valdez
I'm getting the vocab fine with my code, but I can't seem to figure out how
to print out the individual sentence vectors. I have looked through the
documentation and haven't found much help. Here is what my code looks like
so far.

It seems to have gotten somewhat messed up.
I suspect you are using rich text or HTML formatting.
Try posting again in plain text.

Post by Joshua Valdez
sentences.append(LabeledSentence(words=line.split(),
labels=['SENT_%s' % uid]))
model = Doc2Vec(alpha=0.025, min_alpha=0.025)
model.train(sentences)
model.alpha -= 0.002
model.min_alpha = model.alpha
sent = re.search(sent_reg, item)
continue
print item
###I'm not sure how to produce the vectors from here and this doesn't work##
print model["SENT_"+str(sent_id)]
sent_id += 1

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


Danny Yoo

2015-08-17 20:44:06 UTC


It appears that you're asking this question on Stack Overflow as well as here:

http://stackoverflow.com/questions/32056080/using-gensims-doc2vec-to-produce-sentence-vectors

By "vector", I am assuming you mean them in the sense described in:

https://en.wikipedia.org/wiki/Vector_space_model

where a document can be represented in n-dimensional space, where n is
the size of the vocabulary. (Linear algebra is awesome. I need to
learn it properly.)
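
To make that sense of "vector" concrete, here is a minimal sketch in plain
Python of the vocabulary-sized representation described on the Wikipedia
page. It does not use gensim, and the helper names build_vocab and to_vector
are made up for illustration. Note that this is the bag-of-words sense only:
Doc2Vec learns dense vectors whose length is a model parameter, not the
vocabulary size.

```python
def build_vocab(docs):
    """Return the sorted list of unique words across all documents."""
    return sorted({word for doc in docs for word in doc.split()})


def to_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    words = doc.split()
    return [words.count(term) for term in vocab]


# Two toy sentences, loosely based on the examples in the original post.
docs = ["the elephant flaps its ears", "the house is a building"]

vocab = build_vocab(docs)
# One vector per document, with one dimension per vocabulary word.
vectors = [to_vector(doc, vocab) for doc in docs]
```

Each entry in vectors has exactly len(vocab) components, which is the
"n-dimensional space" the article talks about.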

When you mention that you've looked at the documentation, it can help
to be specific and provide links to the material you're using to
learn. That way, other folks might check those references.

I think you are looking at the following documentation:

http://rare-technologies.com/doc2vec-tutorial/

http://radimrehurek.com/gensim/models/doc2vec.html

but I'm not positive.

I think you want to be reading the tutorial around here:

http://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors

which talks explicitly about using their library to go from strings to
vectors. It's part of the same library that you're using now, so I
think you want to look at that tutorial.

As Alan mentions, the topic you're asking about is highly specialized, so we
might not be able to provide good answers just because we're not
familiar with the domain. It's like asking higher-math questions to
an elementary school classroom: sure, we'll try to help, but we might
not be of much help. :P I really do think you should be talking to
these folks:

http://radimrehurek.com/gensim/support.html

Danny Yoo

2015-08-17 21:09:48 UTC


Followup:

If we want to get at the document vectors after training, I think,
from reading the code here:

https://github.com/piskvorky/gensim/blob/develop/gensim/models/doc2vec.py#L254

that you want to get at the model's 'docvecs' attribute. We know it's
a DocvecsArray because it is assigned here in the model:

https://github.com/piskvorky/gensim/blob/develop/gensim/models/doc2vec.py#L569

Given that, we should be able to just print out the first vector in
the trained model like this:

print(model.docvecs[0])

More generally, we should be able to do something like:

    for index in range(len(model.docvecs)):
        print(model.docvecs[index])

to get at the vectors for all the trained documents.
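
To cover the other half of the original request (one vector per line in a
file, with a vector of 0's when there is none), the same loop could be
wrapped roughly like this. This is only a sketch: write_vectors is a
hypothetical helper, and fake_docvecs is a plain list standing in for
model.docvecs so that the example runs without gensim. It assumes the real
object can be indexed by integer position in the same way.

```python
import os
import tempfile


def write_vectors(docvecs, num_docs, dim, path):
    """Write one space-separated vector per line; use zeros when one is missing."""
    with open(path, "w") as out:
        for index in range(num_docs):
            try:
                vector = list(docvecs[index])
            except (IndexError, KeyError):
                # No vector for this sentence: fall back to a vector of 0's.
                vector = [0.0] * dim
            out.write(" ".join(str(value) for value in vector) + "\n")


# Hypothetical stand-in for model.docvecs: two vectors present, a third missing.
fake_docvecs = [[0.1, 0.2], [0.3, 0.4]]

path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
write_vectors(fake_docvecs, num_docs=3, dim=2, path=path)

# Read the file back to check that there is one line per sentence.
with open(path) as handle:
    lines = handle.read().splitlines()
```

With a trained model, the call would presumably pass model.docvecs, the
number of sentences, and the vector size the model was built with.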

That being said, I have not executed any of this code on my machine.
I'm only going by reading, so I might be misinterpreting something.
Hence the suggestion to talk to folks who have actually used the
library. :P