Latent Dirichlet allocation

  • α is the parameter of the Dirichlet prior on the per-document topic distributions,
  • β is the parameter of the Dirichlet prior on the per-topic word distribution,
  • θ_i is the topic distribution for document i,
  • φ_k is the word distribution for topic k,
  • z_ij is the topic for the j-th word in document i, and
  • w_ij is the specific word (the j-th word in document i).

In the model's plate diagram, W is grayed out, meaning that the words are the only observed variables; all other variables are latent.
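
The roles of these variables can be illustrated by sampling a toy corpus from the generative process they describe. The following is a minimal sketch, not a reference implementation: the function name `generate_corpus`, the sizes `M`, `K`, `V`, `N`, the symmetric scalar priors, and the fixed document length are all assumptions made for illustration.

```python
import numpy as np

def generate_corpus(M, K, V, N, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process.

    M: documents, K: topics, V: vocabulary size, N: words per document.
    A fixed N and symmetric scalar priors are simplifying assumptions.
    """
    rng = np.random.default_rng(seed)
    # theta[i] is the topic distribution for document i (Dirichlet prior alpha).
    theta = rng.dirichlet(np.full(K, alpha), size=M)   # shape (M, K)
    # phi[k] is the word distribution for topic k (Dirichlet prior beta).
    phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)
    z = np.empty((M, N), dtype=int)
    w = np.empty((M, N), dtype=int)
    for i in range(M):
        # z[i, j] is the topic for the j-th word in document i.
        z[i] = rng.choice(K, size=N, p=theta[i])
        for j in range(N):
            # w[i, j] is the word drawn from the chosen topic's distribution.
            w[i, j] = rng.choice(V, p=phi[z[i, j]])
    return theta, phi, z, w

theta, phi, z, w = generate_corpus(M=4, K=3, V=10, N=20, alpha=0.5, beta=0.5)
```

In a real application only `w` would be available; `theta`, `phi`, and `z` are the latent quantities that inference tries to recover.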

It is helpful to think of the entities represented by θ and φ as matrices created by decomposing the original document-word matrix that represents the corpus of documents being modeled.

In this view, θ consists of rows defined by documents and columns defined by topics, while φ consists of rows defined by topics and columns defined by words.

Thus, φ_1, ..., φ_K (one row per topic) refers to a set of rows, or vectors, each of which is a distribution over words, and θ_1, ..., θ_M (one row per document) refers to a set of rows, each of which is a distribution over topics.
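
The matrix view can be made concrete with a small NumPy sketch. The sizes `M`, `K`, `V` and the uniform Dirichlet parameters below are hypothetical values chosen only for illustration, and the matrices are randomly drawn rather than fitted to data. The sketch shows that multiplying the document-topic matrix by the topic-word matrix yields one word distribution per document, which is the sense in which the two matrices factor a (normalized) document-word matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, V = 5, 3, 8  # hypothetical: 5 documents, 3 topics, 8 vocabulary words

# theta: rows are documents, columns are topics; each row sums to 1.
theta = rng.dirichlet(np.ones(K), size=M)   # shape (M, K)
# phi: rows are topics, columns are words; each row sums to 1.
phi = rng.dirichlet(np.ones(V), size=K)     # shape (K, V)

# Entry (i, v) is sum_k theta[i, k] * phi[k, v], the probability of
# word v in document i after marginalizing over topics.
doc_word = theta @ phi                      # shape (M, V)

print(doc_word.shape)
print(doc_word.sum(axis=1))  # each row is itself a distribution over words
```

Because every row of `theta` and every row of `phi` sums to 1, every row of `doc_word` does too.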