Latent Dirichlet allocation

10 Jun 2019

α is the parameter of the Dirichlet prior on the per-document topic distributions,
β is the parameter of the Dirichlet prior on the per-topic word distribution,
$\theta _{i}$ is the topic distribution for document i,
$\varphi _{k}$ is the word distribution for topic k,
$z_{ij}$ is the topic for the j-th word in document i, and
$w_{ij}$ is the specific word.

The fact that W is grayed out means that words $w_{ij}$ are the only observable variables, and the other variables are latent variables.

It is helpful to think of the entities represented by $\theta$ and $\varphi$ as matrices created by decomposing the original document-word matrix that represents the corpus of documents being modeled.

In this view, $\theta$ consists of rows defined by documents and columns defined by topics, while $\varphi$ consists of rows defined by topics and columns defined by words.

Thus, ${\varphi _{1} ,\dots ,\varphi _{K}}$ refers to a set of rows, or vectors, each of which is a distribution over words, and ${\theta _{1},\dots ,\theta _{M}}$ refers to a set of rows, each of which is a distribution over topics.

chensheng@biheap.com:~$

Archive

About

RSS

Latent Dirichlet allocation