GitHub - ghostagan/NLP-Projects: text preprocess, word2vec, sentence embedding in text similarity, text classification, Chinese word segmentation, Hidden Markov Model, CRFs, named entity recognition, knowledge graph, dialog system

NLP-Projects

Natural Language Processing projects, including concepts and scripts for:

Word2vec: gensim, fastText and TensorFlow implementations. See the Chinese notes.
Text similarity: gensim doc2vec and gensim word2vec averaging implementations.
Text classification: TensorFlow LSTM (see Chinese notes 1 and Chinese notes 2) and fastText implementations.
Chinese word segmentation: HMM Viterbi implementation. See the Chinese notes.
Sequence labeling - NER: brand NER via bidirectional LSTM + CRF, TensorFlow implementation. See the Chinese notes.
..
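
As a rough illustration of the word2vec averaging approach to text similarity, the sketch below averages word vectors into a sentence vector and compares two sentences by cosine similarity. The vectors in `TOY_VECTORS` and the helper names are hypothetical and only stand in for a trained gensim word2vec model; this is not the repository's own code.

```python
import math

# Hypothetical toy embeddings; a real setup would load them from a trained word2vec model.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "sat": [0.1, 0.9, 0.2],
    "ran": [0.2, 0.8, 0.3],
    "stock": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.3, 0.8],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of in-vocabulary tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(s1, s2, vectors=TOY_VECTORS):
    """Similarity of two whitespace-tokenized sentences via averaged word vectors."""
    v1 = sentence_vector(s1.split(), vectors)
    v2 = sentence_vector(s2.split(), vectors)
    if v1 is None or v2 is None:
        return 0.0
    return cosine(v1, v2)
```

With these toy vectors, `similarity("cat sat", "dog ran")` scores higher than `similarity("cat sat", "stock fell")`. Averaging ignores word order, which is one reason doc2vec is listed as an alternative.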

DL best practices in NLP

1. Word embeddings

Use pre-trained embeddings if available.
Embedding dimension is task-dependent:
- Smaller dimensionality (e.g., 100) works well for syntactic tasks (e.g., NER, POS tagging)
- Larger dimensionality (e.g., 300) is useful for semantic tasks (e.g., sentiment analysis)

2. Depth

3 or 4 layer Bi-LSTMs (e.g., POS tagging, semantic role labelling).
8 encoder and 8 decoder layers (e.g., Google's NMT).
In most cases, a shallower model (e.g., 2 layers) is good enough.

3. Layer connections (for avoiding vanishing gradients)

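Highway layer
- h = t * a(WX + b) + (1 - t) * X, where t = sigmoid(W_T X + b_T) is called the transform gate, a is the activation function, and * is element-wise multiplication.
- Application: language modelling and speech recognition.
- Implementation: tf.contrib.rnn.HighwayWrapper (TensorFlow 1.x)
Residual connection
- h = a(WX + b) + X
- Implementation: tf.contrib.rnn.ResidualWrapper (TensorFlow 1.x)
Dense connection
- h_l = a(W[X_1, ..., X_l] + b), where [X_1, ..., X_l] is the concatenation of the outputs of all preceding layers.
- Application: multi-task learning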
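
The three connection types above can be sketched in plain Python on single vectors. This is a minimal illustration of the formulas only, not the TensorFlow wrappers; the function names, the list-of-lists weight layout, and the default tanh activation are assumptions made for the example.

```python
import math


def affine(W, x, b):
    """Compute W x + b for a weight matrix W (list of rows), vector x and bias b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway(x, W, b, W_T, b_T, a=math.tanh):
    """h = t * a(Wx + b) + (1 - t) * x, with transform gate t = sigmoid(W_T x + b_T)."""
    h = [a(z) for z in affine(W, x, b)]
    t = [sigmoid(z) for z in affine(W_T, x, b_T)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def residual(x, W, b, a=math.tanh):
    """h = a(Wx + b) + x; W must be square so the shapes match."""
    return [a(z) + xi for z, xi in zip(affine(W, x, b), x)]


def dense(xs, W, b, a=math.tanh):
    """h_l = a(W [x_1, ..., x_l] + b), where xs holds the outputs of all earlier layers."""
    concat = [v for x in xs for v in x]
    return [a(z) for z in affine(W, concat, b)]
```

When the transform gate saturates near 0 the highway layer passes its input through unchanged, and near 1 it behaves like an ordinary layer. The residual connection is the special case without a gate.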

4. Dropout

Dropout plays a role in NLP similar to that of batch normalization in CV.
A dropout rate of 0.5 is preferred.
Recurrent dropout applies the same dropout mask across timesteps at layer l, whereas traditional dropout samples a new mask at every timestep. Implementation: tf.contrib.rnn.DropoutWrapper with variational_recurrent=True (TensorFlow 1.x)
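
The difference between the two masking schemes can be sketched in plain Python: traditional dropout draws a fresh mask at each timestep, while recurrent dropout draws one mask and reuses it for the whole sequence. The function names are hypothetical, and the sketch masks a plain list of per-timestep vectors rather than reproducing what DropoutWrapper does internally.

```python
import random


def sample_mask(size, keep_prob, rng):
    """Inverted-dropout mask: kept units are scaled by 1 / keep_prob, dropped units are 0."""
    return [1.0 / keep_prob if rng.random() < keep_prob else 0.0 for _ in range(size)]


def standard_dropout(sequence, keep_prob, rng):
    """Traditional dropout: sample a new mask at every timestep."""
    return [
        [x * m for x, m in zip(step, sample_mask(len(step), keep_prob, rng))]
        for step in sequence
    ]


def recurrent_dropout(sequence, keep_prob, rng):
    """Recurrent dropout: sample one mask and reuse it at every timestep."""
    mask = sample_mask(len(sequence[0]), keep_prob, rng)
    return [[x * m for x, m in zip(step, mask)] for step in sequence]
```

Passing a seeded `random.Random` instance as `rng` makes the masks reproducible. With an all-ones input sequence, every timestep of the `recurrent_dropout` output is identical, while the `standard_dropout` output varies from step to step.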

5. LSTM tricks

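Treat the initial state as a trainable variable [2]

import tensorflow as tf  # TensorFlow 1.x; state_size and batch_size are defined elsewhere
# Note: LSTMCell fails here with "TypeError: 'Tensor' object is not iterable", because its state is a (c, h) tuple:
# https://stackoverflow.com/questions/42947351/tensorflow-dynamic-rnn-typeerror-tensor-object-is-not-iterable
cell = tf.nn.rnn_cell.GRUCell(state_size)
# Learn a single initial state of shape [1, state_size].
init_state = tf.get_variable('init_state', [1, state_size], initializer=tf.constant_initializer(0.0))
# Tile it across the batch before using it as the initial state:
# https://stackoverflow.com/questions/44486523/should-the-variables-of-the-initial-state-of-a-dynamic-rnn-among-the-inputs-of
init_state = tf.tile(init_state, [batch_size, 1])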

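Gradient clipping

# Compute the gradients of the cost with respect to all trainable variables.
variables = tf.trainable_variables()
gradients = tf.gradients(ys=cost, xs=variables)
# Rescale all gradients together so that their global norm is at most clip_norm.
clipped_gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=self.clip_norm)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
# Apply the clipped gradients and increment global_step.
optimize = optimizer.apply_gradients(grads_and_vars=zip(clipped_gradients, variables), global_step=self.global_step)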
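
To make the rescaling rule concrete, here is a minimal plain-Python sketch of clipping by global norm: every gradient is multiplied by clip_norm / max(global_norm, clip_norm), where the global norm is taken over all gradients jointly. Representing each gradient as a flat list of floats is an assumption made for the example; tf.clip_by_global_norm operates on tensors.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Rescale a list of gradients (flat lists of floats) so their joint L2 norm
    is at most clip_norm. Returns (clipped_gradients, global_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # The scale is 1.0 when the global norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in gradients], global_norm
```

Because all gradients share a single scale factor, the update direction is preserved, unlike clipping each value independently.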

6. Attention

To do...

Reference:
[1] http://ruder.io/deep-learning-nlp-best-practices/
[2] https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html

Folders and files

CRFs
Chinese word segmentation
Dialog_system
HMM
Knowledge_graph
Network_embedding
Question_answering
Reading_comprehension
Sequence labeling - NER
Text preprocess
Text similarity
Text_classification
Text_generation
Universal language model
Word2vec
README.md

Awesome packages

Chinese

1. pyltp

2. HanLP

English

1. spaCy

2. gensim
