
# Word2Vec Model for Word Embedding

In the previous post, we discussed that every word in an input sentence is represented as a vector of numeric values (called a word embedding) before being fed to the computer for various natural language processing (NLP) tasks. We also discussed that language modeling predicts the next word given a set of previous words, and that once a language model can predict the next word, we hope it has learned embeddings for all the words in our data. In this post, we will discuss the concepts behind the word embedding model called Word2Vec. We will avoid mathematical details and keep the concepts as simple as possible. So let's begin:

Suppose we have ten unique words in our text. We represent those words as a vocabulary called V.

V= {play, hockey, cricket, tennis, football, Lahore, London, Pakistan, Germany, Frankfurt}

In practice, we may have thousands of unique words in the text data; here we keep things small to make the concepts easy to understand. One way to represent the words in the vocabulary V is with one-hot vectors (don't worry about the terminology). A one-hot vector has a '1' at the index position of the word and '0' everywhere else. So the one-hot vectors for the words in vocabulary V will be as follows:

play = [1 0 0 0 0 0 0 0 0 0 ]

hockey = [0 1 0 0 0 0 0 0 0 0 ]

cricket = [0 0 1 0 0 0 0 0 0 0 ]

.

.

.

Frankfurt = [0 0 0 0 0 0 0 0 0 1 ]

For example, suppose we have a sentence S with only two words: S = play cricket

Each word in the sentence S is replaced with its one-hot vector, so the embedding for the sentence S will be as follows (instead of a single vector, the sentence embedding is now a matrix, which we call E):

E = [1 0 0 0 0 0 0 0 0 0]
    [0 0 1 0 0 0 0 0 0 0]

If we have ten words in a sentence, all ten words will be replaced with their one-hot vectors, and the embedding matrix will contain ten rows, one row per word. If the sentence has five words, the embedding matrix will contain five rows.
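The one-hot scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the post's ten-word vocabulary, not part of any real Word2vec implementation:

```python
import numpy as np

# Vocabulary from the post, in a fixed order
vocab = ["play", "hockey", "cricket", "tennis", "football",
         "Lahore", "London", "Pakistan", "Germany", "Frankfurt"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: 1 at its index, 0 elsewhere."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

def sentence_matrix(sentence):
    """Stack one-hot vectors row by row: one row per word in the sentence."""
    return np.vstack([one_hot(w) for w in sentence.split()])

E = sentence_matrix("play cricket")
print(E)  # 2 rows x 10 columns: a '1' at index 0 ("play") and at index 2 ("cricket")
```

Running this prints the same matrix E shown above: two rows of mostly zeros, one per word.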

Do you see any problem with this one-hot vector approach to represent words? I find three problems with this approach:

1. Each word's vector is just a '1' and a string of '0's, which carries no information about the word's meaning.
2. It is a sparse vector: every vector has a '1' at only one position and '0' everywhere else.
3. Our vocabulary contains only ten words, so each vector has one '1' and nine '0's. What if our vocabulary has thousands of words? Yes, you are right: every vector would contain a single '1' and thousands of '0's.

So researchers came up with other embedding methods, where each word is represented by a vector of, say, 50, 100, 300, or some other small number of values. Moreover, words related to the same topic will have similar values in their embedding vectors. For example, the words "play, football, cricket, tennis, and hockey" are related to sports, so their embedding vector values will be close to each other. On the other hand, the words "London, Lahore, Pakistan, Germany, Frankfurt" are related to cities and countries, so they will have similar values.
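"Close to each other" is usually measured with cosine similarity. The tiny 3-dimensional vectors below are made up purely for illustration (real Word2vec vectors have 50-300 learned dimensions), but they show how related words end up with a high similarity score:

```python
import numpy as np

# Hypothetical embeddings, invented for this example only
embeddings = {
    "football": np.array([0.9, 0.8, 0.1]),
    "cricket":  np.array([0.8, 0.9, 0.2]),
    "Pakistan": np.array([0.1, 0.2, 0.9]),
    "Germany":  np.array([0.2, 0.1, 0.8]),
}

def cosine_similarity(a, b):
    """1.0 means same direction (very similar); near 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["football"], embeddings["cricket"]))   # high
print(cosine_similarity(embeddings["football"], embeddings["Pakistan"]))  # low
```

With these made-up values, football/cricket score near 1.0 while football/Pakistan score much lower, mirroring the grouping described above.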

Figure 1: Example embedding vectors, shown just for understanding. The football and cricket vectors are close to each other; likewise, the Pakistan and Germany vectors are close to each other.

The Word2vec method is one of the embedding methods that address the limitations of the one-hot approach. So how does it work?

The Word2vec method comes in two flavors:

1. Continuous bag of words (CBOW)
2. Skip-gram

CBOW: This method takes a few context words and tries to predict the word that goes with them using a neural network model. (To keep the examples below simple, we will use only the two previous words as context; in practice CBOW uses words on both sides of the target.)

For example, we have a sentence:  “I am confused whether I should drink milk or take tea.”

CBOW will first take the two words "I, am" as input and try to predict the next word (in this example, 'confused'). These two input words are passed to a neural network. In the first attempt, the predicted word will likely be incorrect, so the neural network adjusts its weights to make a better prediction. The next input to the model will be "am, confused", and the model will try to predict the word "whether". In this way, the whole sentence is processed. We have seen the example of one sentence; however, Word2vec is trained on a huge dataset. Once the neural network is trained on that dataset, its weights can be extracted, and those weights serve as the embeddings.
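The sliding-window process just described can be sketched as a simple loop that generates (context, target) training pairs. This follows the post's simplification of using only the two previous words as context:

```python
# Generate CBOW-style training pairs from the example sentence:
# context = two previous words, target = the word to predict.
sentence = "I am confused whether I should drink milk or take tea".split()

pairs = []
for i in range(2, len(sentence)):
    context = (sentence[i - 2], sentence[i - 1])
    target = sentence[i]
    pairs.append((context, target))

print(pairs[0])  # (('I', 'am'), 'confused')
print(pairs[1])  # (('am', 'confused'), 'whether')
```

Each pair is one training step: the network sees the context words and is nudged toward predicting the target.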

Skip-gram: Skip-gram works the other way around. It takes a word as input and tries to predict the surrounding words. For example, take the same sentence: "I am confused whether I should drink milk or take tea."

The model will take "I" as input and try to predict the words "am" and "confused". Then it will take the word "am" as input and try to predict the words "I" and "confused". Then it will take the word "confused" as input and predict the words "am" and "whether". Once training is complete, the network weights serve as the embeddings.
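The skip-gram pairing can be sketched the same way: each word becomes an input, and its neighbors become prediction targets. A window of one word on each side is used here to keep the output short (real implementations typically use windows of 5-10 words):

```python
# Generate skip-gram training pairs: (center word, neighbor word),
# using a window of 1 word on each side for brevity.
sentence = "I am confused whether I should drink milk or take tea".split()

pairs = []
for i, center in enumerate(sentence):
    for j in (i - 1, i + 1):       # one neighbor on each side
        if 0 <= j < len(sentence):
            pairs.append((center, sentence[j]))

print(pairs[:4])
# [('I', 'am'), ('am', 'I'), ('am', 'confused'), ('confused', 'am')]
```

Note the contrast with CBOW: here one word predicts many neighbors, rather than many context words predicting one target.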

Two questions arise:

1. Embedding vectors are learned during training. So how are words initially passed as input to Word2vec to learn those embeddings?
2. Word2vec is a model for learning embeddings. But how can we use it for our NLP tasks?

The answer to the first question: initially, each word is represented as a one-hot vector. Yes, you read that right. Once training is complete, the learned weights, which we call "embeddings", form a matrix that contains a numeric vector representing each word.
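There is a neat reason why one-hot inputs work here: multiplying a one-hot vector by the network's input weight matrix simply selects one row of that matrix, and that row is the word's embedding. A small sketch with a random matrix standing in for the learned weights:

```python
import numpy as np

# W stands in for the learned input weight matrix of the network:
# one row per vocabulary word, one column per embedding dimension.
rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4
W = rng.random((vocab_size, embedding_dim))

one_hot_play = np.zeros(vocab_size)
one_hot_play[0] = 1  # "play" is word 0 in the post's vocabulary

# Multiplying the one-hot vector by W picks out row 0 of W exactly.
embedding_of_play = one_hot_play @ W
print(np.allclose(embedding_of_play, W[0]))
```

So "extracting the weights" after training really is the same thing as looking up each word's row in the weight matrix.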

The answer to the second question: the training of Word2vec was done by someone else (researchers at Google), and it took a lot of time. Now, if we need numeric embeddings for our NLP task, we simply use the pretrained Word2vec model to look up embeddings for our input text. Hence, we do not need to train our own model to learn new embeddings.

One final note: Word2vec embeddings are non-contextual. This means the word "bank" will have the same vector values in "riverbank" and "Habib Bank".

You may also like to read: Intro to Word Representations and Language Modeling
