Word2vec
Word2vec is a group of related models used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words: the network is shown a word and must guess which words occurred in adjacent positions in an input text. The order of these context words does not matter (bag-of-words assumption).[1]
After training, word2vec models can be used to map each word to a vector of typically several hundred elements, which represent that word's relation to other words. This vector is the neural network's hidden layer.[2]
Word2vec relies on either skip-grams or continuous bag of words (CBOW) to create neural word embeddings. It was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has been subsequently analysed and explained by other researchers.[3][4]
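As a concrete illustration of this workflow, the sketch below trains a small model and reads back a word's vector. It assumes the third-party gensim Python library (not named in this article) and its 4.x parameter names (vector_size, window, sg); it is a minimal example, not the reference implementation.

```python
from gensim.models import Word2Vec  # assumed third-party library, gensim 4.x

# A tiny corpus of tokenised sentences; real training uses far more text.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "barks", "at", "the", "fox"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["fox"]                        # the learned embedding for "fox"
similar = model.wv.most_similar("fox", topn=3)  # nearest words by cosine similarity
```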
Skip-grams and CBOW
Skip-grams are word windows from which one word is excluded, i.e. n-grams with gaps. In the skip-gram architecture, given a window of n words around a word w, word2vec predicts the contextual words c; that is, it models the probability P(c | w). Conversely, CBOW predicts the current word given the context in the window, i.e. it models P(w | c).
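The difference between the two architectures comes down to which training pairs are extracted from each window. The toy sketch below (an illustration, not the original implementation) enumerates the (target, context) pairs that skip-gram would model as P(c | w) and the (context bag, target) pairs that CBOW would model as P(w | context).

```python
def skipgram_pairs(tokens, window=2):
    """For each target word w, collect (w, c) pairs used to model P(c | w)."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """For each position, collect (context_bag, w) pairs used to model P(w | context)."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs


sentence = "the quick brown fox jumps".split()
print(skipgram_pairs(sentence))  # e.g. ('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ...
print(cbow_pairs(sentence))      # e.g. (['quick', 'brown', 'jumps'], 'fox'), ...
```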
Extensions
An extension of word2vec to construct embeddings from entire documents (rather than the individual words) has been proposed.[5] This extension is called paragraph2vec or doc2vec and has been implemented in the C, Java, Scala and Python tools (see below), with the Java and Python versions also supporting inference of document embeddings on new, unseen documents.
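A minimal sketch of this document-level workflow, assuming the gensim Python library's Doc2Vec class (one possible Python implementation; the class and parameter names are an assumption, not taken from this article):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument  # assumed gensim API

# Each training document is a token list with a tag identifying it.
docs = [
    TaggedDocument(words=["machine", "learning", "with", "text"], tags=["doc0"]),
    TaggedDocument(words=["word", "embeddings", "for", "documents"], tags=["doc1"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Inference of an embedding for a new, unseen document, as mentioned above.
new_vector = model.infer_vector(["embeddings", "for", "new", "text"])
```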
Analysis
The reasons for successful word embedding learning in the word2vec framework are poorly understood. Goldberg and Levy point out that the word2vec objective function causes words that occur in similar contexts to have similar embeddings (as measured by cosine similarity) and note that this is in line with J. R. Firth's distributional hypothesis. They also note that this explanation is "very hand-wavy".
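Cosine similarity, the measure referred to here, is simply the dot product of two embedding vectors divided by the product of their lengths. A short illustration with made-up vectors:

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: u.v / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical 3-dimensional embeddings, for illustration only.
king = np.array([0.9, 0.1, 0.4])
queen = np.array([0.85, 0.15, 0.5])
print(cosine_similarity(king, queen))  # close to 1.0 for words with similar contexts
```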
Implementations
See also
- Autoencoder
- Document-term matrix
- Feature extraction
- Feature learning
- Language modeling § Neural net language models
- Vector space model
References
- ↑ Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
- ↑ Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space" (PDF). Retrieved 2015-08-14.
- ↑ Goldberg, Yoav; Levy, Omer. "word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method" (PDF). Retrieved 2015-08-14.
- ↑ Řehůřek, Radim. "Word2vec and friends". Retrieved 2015-08-14.
- ↑ "Doc2Vec and Paragraph Vectors for Classification". Retrieved 2016-01-13.