Neural machine translation approach
-->Represent a sentence as a matrix
How to build this matrix -->the simplest form is to create a vector (embedding) for each word of the sentence and concatenate these vectors into a matrix,
i.e. just stack the vectors next to each other.
For example, if x1, x2, x3, x4 are the words and we have an embedding for each of them, we just concatenate these embeddings into the matrix (a small sketch follows).
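A minimal numpy sketch of this stacking (the 3-dimensional embedding values are made up purely for illustration):

import numpy as np

# toy 3-dimensional embeddings for x1..x4 (made-up numbers)
x1 = np.array([0.1, 0.2, 0.3])
x2 = np.array([0.4, 0.5, 0.6])
x3 = np.array([0.7, 0.8, 0.9])
x4 = np.array([1.0, 1.1, 1.2])

# stack them next to each other -> sentence matrix of shape (4 words, 3 dims)
sentence_matrix = np.stack([x1, x2, x3, x4])
print(sentence_matrix.shape)   # (4, 3)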
We widely use an RNN or LSTM network for the NMT task.
Encoder-Decoder architecture -:
In neural machine translation we map each word to a vector called an embedding, and for the complete sentence we create a fixed-length vector using an RNN (the encoder); the decoder is a language model that decodes that fixed-length vector into the target. In the plain encoder-decoder architecture we pass the hidden state of the last time step (the fixed-length vector) to the decoder, and the decoder converts it into the target sentence.
The problem: while decoding a time step we may need to go back to the first source word for its translation, and a plain RNN fails at this, so we use an LSTM. But for long sentences the problem is still not resolved.
The first word of the target is often strongly correlated with the first word of the source, yet the decoder has to recover that information from a sentence that may be 50 words long. Here we get a long-term dependency problem even with an LSTM when the sentence is long. One partial solution is a bi-LSTM, but that does not resolve the problem fully. The real trouble with seq2seq is that the only information the decoder receives from the encoder is the last encoder hidden state. (A minimal sketch of this plain encoder-decoder is below.)
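A minimal Keras sketch of this plain encoder-decoder, assuming hypothetical vocabulary sizes and dimensions (not the exact model from the linked tutorials):

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

SRC_VOCAB, TGT_VOCAB, EMB, UNITS = 8000, 8000, 128, 256   # assumed sizes

# encoder: compress the whole source sentence into the last hidden/cell state
enc_in = Input(shape=(None,))
enc_emb = Embedding(SRC_VOCAB, EMB)(enc_in)
_, state_h, state_c = LSTM(UNITS, return_state=True)(enc_emb)

# decoder: starts from the encoder's final state only (the fixed-length vector)
dec_in = Input(shape=(None,))
dec_emb = Embedding(TGT_VOCAB, EMB)(dec_in)
dec_out, _, _ = LSTM(UNITS, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = Dense(TGT_VOCAB, activation="softmax")(dec_out)

model = tf.keras.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")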
#---------------------------------------------------------------------------------------------------------------------------------------------
Attention mechanism --: with the attention mechanism we no longer encode the sentence fully into one fixed-length vector. Instead of passing only the last time step's hidden state to the decoder, we pass all the encoder hidden states to the decoder. How it works:
Step 1 --:
Suppose X is the sentence and we have embeddings for all of its words. We pass the word vectors into our LSTM one at a time; it processes each word and outputs a hidden state, which is passed to the next time step along with the next word, giving a new hidden state, and so on. If we have 4 words we get 4 hidden states on the encoder side. We pass the last (consolidated) hidden state to the decoder's first time step and get the first decoder hidden state. (A sketch of an encoder that keeps all hidden states is below.)
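A small Keras sketch of an encoder that keeps every hidden state instead of only the last one (sizes are assumptions):

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM

SRC_VOCAB, EMB, UNITS = 8000, 128, 256   # assumed sizes

enc_in = Input(shape=(None,))
enc_emb = Embedding(SRC_VOCAB, EMB)(enc_in)
# return_sequences=True keeps h1..hT for attention; return_state=True also
# returns the last hidden/cell state to initialise the decoder
enc_hidden_states, last_h, last_c = LSTM(
    UNITS, return_sequences=True, return_state=True)(enc_emb)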
Step 2 --:
Now we calculate the attention score for every encoder hidden state (also called the alignment score). We take the dot product of each encoder hidden state with the decoder hidden state, for example:
decoder state = [10, 5, 10]
encoder states      score (dot product)
[0, 1, 1]           15
[5, 0, 1]           60
[1, 1, 0]           15
[0, 5, 1]           35
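The same dot-product scoring in numpy, reproducing the numbers above:

import numpy as np

decoder_state = np.array([10, 5, 10])
encoder_states = np.array([[0, 1, 1],
                           [5, 0, 1],
                           [1, 1, 0],
                           [0, 5, 1]])

scores = encoder_states @ decoder_state   # one dot product per encoder state
print(scores)                             # [15 60 15 35]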
Step 3 ---
We run all the scores through a softmax layer and get probabilities in the range 0-1:
encoder_hidden   score   score^ (softmaxed)
-------------------------------------------
[0, 1, 1]        15      0
[5, 0, 1]        60      1
[1, 1, 0]        15      0
[0, 5, 1]        35      0
Notice that based on the softmaxed score score^, the attention is placed almost entirely on [5, 0, 1], as expected. In reality these numbers are not binary but floating-point values between 0 and 1.
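The softmax step in numpy, continuing from the scores above (subtracting the max is only a numerical-stability trick):

import numpy as np

scores = np.array([15.0, 60.0, 15.0, 35.0])
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()          # softmax
print(weights.round(3))                    # [0. 1. 0. 0.]  -> almost all mass on 60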
Step 4 -:
Multiply each encoder hidden state by its softmaxed score:
encoder      score   score^   alignment
----------------------------------------
[0, 1, 1]    15      0        [0, 0, 0]
[5, 0, 1]    60      1        [5, 0, 1]
[1, 1, 0]    15      0        [0, 0, 0]
[0, 5, 1]    35      0        [0, 0, 0]
Here we see that the alignment vectors for all encoder hidden states except [5, 0, 1] are reduced to 0 due to low attention scores. So we can expect the first translated word to be driven mainly by the input word with the [5, 0, 1] hidden state.
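Step 4 in numpy, continuing with the weights and encoder states from above:

alignment = weights[:, None] * encoder_states   # scale each hidden state by its weight
print(alignment.round(3))
# [[0. 0. 0.]
#  [5. 0. 1.]
#  [0. 0. 0.]
#  [0. 0. 0.]]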
Step 5 ---:
Sum up the alignment vectors to get the context vector.
The alignment vectors are summed up elementwise to produce the context vector [5, 0, 1]. The context vector aggregates the information of the alignment vectors from the previous step.
context = [0+5+0+0, 0+0+0+0, 0+1+0+0] = [5, 0, 1]
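And step 5 in numpy, summing the alignment vectors elementwise:

context = alignment.sum(axis=0)
print(context.round(3))    # [5. 0. 1.]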
Step 6 - feed the context vector to the decoder:
concatenate the context vector and the decoder hidden state vector into one vector.
Let us now bring the whole thing together and look at how the attention process works at one decoding time step:
The attention decoder RNN takes in the embedding of the <END> token and an initial decoder hidden state.
The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
Attention step: we use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
We concatenate h4 and C4 into one vector.
We pass this vector through a feedforward neural network (one trained jointly with the model).
The output of the feedforward neural network indicates the output word of this time step.
Repeat for the next time steps.
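A rough numpy sketch of this single decoding step; the feedforward weights W, b and the vocabulary size are made up just to show the shapes:

import numpy as np

rng = np.random.default_rng(0)
UNITS, VOCAB = 3, 10                       # toy sizes matching the 3-dim states above

h4 = np.array([10.0, 5.0, 10.0])           # current decoder hidden state
context = np.array([5.0, 0.0, 1.0])        # C4 from the attention steps above

concat = np.concatenate([h4, context])     # [h4; C4], shape (6,)

# hypothetical jointly-trained feedforward layer + softmax over the vocabulary
W = rng.normal(size=(VOCAB, 2 * UNITS))
b = np.zeros(VOCAB)
logits = W @ concat + b
probs = np.exp(logits - logits.max()); probs /= probs.sum()
next_word_id = int(probs.argmax())         # output word of this time step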
#---------------------------latest Update-----------------------------
Starting with NMT Seq2Seq: https://towardsdatascience.com/neural-machine-translation-15ecf6b0b
https://www.analyticsvidhya.com/blog/2019/01/neural-machine-translation-keras/
https://towardsdatascience.com/word-level-english-to-marathi-neural-machine-translation-using-seq2seq-encoder-decoder-lstm-model-1a913f2dc4a7
Code for attention NMT and how it works:
https://towardsdatascience.com/implementing-neural-machine-translation-with-attention-using-tensorflow-fc9c6f26155f
https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f
https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3
My intuition
For seq2seq:
How it works is that the encoder produces a combined context vector, and that context vector is passed to the decoder side.
The decoder starts with a start token and the information from the previous time step (the context vector). Based on that it generates a target word from the probability distribution over the vocabulary. At the next time step it takes that previous word as the new input, together with the information from the previous time step, and makes the next prediction. (A sketch of this greedy decoding loop follows.)
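A hedged sketch of this greedy decoding loop, assuming hypothetical encoder_model / decoder_model objects and token lookup dicts in the style of the linked Keras tutorials (names and signatures are assumptions, not the tutorials' exact API):

import numpy as np

def greedy_decode(encoder_model, decoder_model, src_seq, tgt_index, tgt_rev_index,
                  start_token="<start>", end_token="<end>", max_len=50):
    """Greedy seq2seq decoding: feed back the previous word and state each step."""
    states = encoder_model.predict(src_seq)            # context from the encoder
    prev_word = np.array([[tgt_index[start_token]]])   # start token id
    result = []
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([prev_word] + states)
        word_id = int(np.argmax(probs[0, -1, :]))      # most probable next word
        word = tgt_rev_index[word_id]
        if word == end_token:
            break
        result.append(word)
        prev_word = np.array([[word_id]])              # previous word becomes new input
        states = [h, c]                                # carry the decoder state forward
    return " ".join(result)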
#--------------------------------------------------------------------------
SEQ2SEQ with attention (Bahdanau attention model)
What we do here is take all the encoder hidden states instead of passing only the last one, for example:
We have a sentence:
yashu is my name ----- yashu (Hindi) mera naam hai
We feed these 4 words, as their embeddings, into an LSTM or GRU and we get 4 hidden states,
h1, h2, h3, h4. Here attention comes in: the decoder does not yet have any internal state, so we pass the last encoder state as the current state of the decoder.
Based on that, we train a feedforward neural network to find the relevant hidden state(s) for the first word's translation.
Suppose yashu (Hindi) is determined mainly by h1 or h2; the feedforward network is trained to learn that. The reason this works is that each hidden state carries local information about its
word.
After the network runs we get 4 outputs (scores); we softmax them to bring them into the range 0 to 1.
We multiply each score by its hidden state and add the results up; that is the context vector.
We concatenate the context vector with the start token and pass it to the decoder LSTM; we get an output word at decoder hidden state d0.
Then for the next time step we take the information from the previous time step plus the new input, i.e. the word we just produced.
Again we compute the feedforward scores to get a new context vector, concatenate it with the word obtained at the last time step, and pass it to the LSTM,
and we get a new output. The process goes on until we produce the end token. (A small sketch of the Bahdanau scoring layer is below.)
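A minimal sketch of the Bahdanau scoring network as a Keras layer, in the style of the linked TensorFlow attention tutorial; it implements the additive score v·tanh(W1·h_enc + W2·h_dec) described above (units and names are assumptions):

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: a small feedforward net scores each encoder state."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects encoder states
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder state
        self.V = tf.keras.layers.Dense(1)        # collapses to one score per state

    def call(self, decoder_state, encoder_states):
        # decoder_state: (batch, units) -> (batch, 1, units) to broadcast over time
        dec = tf.expand_dims(decoder_state, 1)
        scores = self.V(tf.nn.tanh(self.W1(encoder_states) + self.W2(dec)))
        weights = tf.nn.softmax(scores, axis=1)                     # attention weights
        context = tf.reduce_sum(weights * encoder_states, axis=1)   # weighted sum = context vector
        return context, weights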