This vignette describes the most basic usage of the sentopics package
by estimating an LDA model and analyzing its output. Two other
vignettes, describing time series and topic models with sentiment, are
also available.
The package is shipped with a sample of press conferences from the
European Central Bank. For ease of use, the press conferences have been
pre-processed into a tokens object from the quanteda package (see
quanteda’s introduction for details on these objects). The press
conferences also contain metadata, which can be accessed using
docvars().
The press conferences were obtained from the ECB’s website. The package
also provides a helper function to replicate the creation of the
dataset: get_ECB_press_conferences().
library("sentopics")
data("ECB_press_conferences_tokens")
print(ECB_press_conferences_tokens, 3)
# Tokens consisting of 3,860 documents and 5 docvars.
# 1_1 :
# [1] "outcome" "meeting" "decision"
# [4] "" "ecb" "general"
# [7] "council" "governing_council" "executive"
# [10] "board" "accordance" "escb"
# [ ... and 7 more ]
#
# 1_2 :
# [1] "" "state" "government" "member"
# [5] "executive" "board" "ecb" "president"
# [9] "vice" "president" "date" "establishment"
# [ ... and 13 more ]
#
# 1_3 :
# [1] "" "meeting" "executive" "board" "meeting" ""
# [7] "general" "" "meeting" ""
#
# [ reached max_ndoc ... 3,857 more documents ]
head(docvars(ECB_press_conferences_tokens))
# .date doc_id title
# 1 1998-06-09 1 ECB Press conference: Introductory statement
# 2 1998-06-09 1 ECB Press conference: Introductory statement
# 3 1998-06-09 1 ECB Press conference: Introductory statement
# 4 1998-06-09 1 ECB Press conference: Introductory statement
# 5 1998-06-09 1 ECB Press conference: Introductory statement
# 6 1998-06-09 1 ECB Press conference: Introductory statement
# section_title
# 1 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 2 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 3 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 4 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 5 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 6 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# .sentiment
# 1 -0.01470588
# 2 -0.02500000
# 3 0.00000000
# 4 0.00000000
# 5 0.00000000
# 6 0.00000000
sentopics implements three types of topic models. The simplest, Latent
Dirichlet Allocation (LDA), assumes that textual documents are
generated by a process involving $K$ topics. A given document $d$
consists of a list of words $d = (w_1, \dots, w_N)$, with $N$ being the
document’s length. Each word $w_i$ originates from a vocabulary of $V$
distinct terms. Documents are then generated from the following random
process:
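Assuming the standard LDA formulation with symmetric Dirichlet priors
$\alpha$ and $\beta$ (the same hyperparameters that appear in the
sampling distribution used in this vignette), the generative process
can be summarized as follows. Here $\phi_k$ denotes the topic-word
mixture of topic $k$ and $\theta_d$ the document-topic mixture of
document $d$; this is a sketch of the textbook model rather than a
statement about the package internals.

$$
\begin{aligned}
\phi_k &\sim \text{Dirichlet}(\beta), & k &= 1, \dots, K, \\
\theta_d &\sim \text{Dirichlet}(\alpha), \\
z_i \mid \theta_d &\sim \text{Multinomial}(\theta_d), & i &= 1, \dots, N, \\
w_i \mid z_i, \phi &\sim \text{Multinomial}(\phi_{z_i}).
\end{aligned}
$$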
In sentopics, the LDA model is estimated through Gibbs sampling, which
iteratively samples the topic assignment $z_i$ of every word of the
corpus until convergence. The topic assignments are sampled from the
following distribution:
$$
p(z_i = k|w,z^{-i}) \propto
\frac{n_{k,v,.}^{-i} + \beta}{n_{k,.,.}^{-i} + V\beta}
\frac{n_{k,.,d}^{-i} + \alpha}{n_{.,.,d}^{-i} + K\alpha},
$$
where $n_{k,v,d}$ is the count of words at index $v$ of the vocabulary,
assigned to topic $k$ and part of document $d$. Replacing one of the
indices $\{k, v, d\}$ by a dot indicates instead the count over all
topics, all vocabulary indices or all documents. The superscript $-i$
indicates that the current word position $i$ is left out of the count
variables.
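To make the sampling distribution concrete, the following base R sketch
evaluates it for a single word of a toy corpus. The corpus, the current
topic assignments and the helper gibbs_conditional() are hypothetical
and only serve as an illustration; they do not reproduce how sentopics
stores its counts or runs its sampler.

```r
# Toy setup: all values are illustrative assumptions, not sentopics internals
K <- 3       # number of topics
V <- 4       # vocabulary size
alpha <- 1   # document-topic hyperparameter
beta <- 0.01 # topic-word hyperparameter

# Two tiny documents encoded as vocabulary indices, with current topic assignments
docs <- list(c(1, 2, 2, 4), c(3, 3, 1))
z    <- list(c(1, 2, 2, 3), c(3, 3, 1))

# Conditional distribution of the topic of word i in document d
gibbs_conditional <- function(d, i) {
  n_kv <- matrix(0, K, V)             # n_{k,v,.}: topic-by-word counts
  n_kd <- matrix(0, K, length(docs))  # n_{k,.,d}: topic-by-document counts
  for (dd in seq_along(docs)) {
    for (ii in seq_along(docs[[dd]])) {
      if (dd == d && ii == i) next    # leave the current position out (superscript -i)
      k <- z[[dd]][ii]
      v <- docs[[dd]][ii]
      n_kv[k, v]  <- n_kv[k, v] + 1
      n_kd[k, dd] <- n_kd[k, dd] + 1
    }
  }
  v <- docs[[d]][i]
  p <- (n_kv[, v] + beta) / (rowSums(n_kv) + V * beta) *
    (n_kd[, d] + alpha) / (sum(n_kd[, d]) + K * alpha)
  p / sum(p)                          # normalize the proportional weights
}

# Probabilities for the second word of the first document, then one draw
p <- gibbs_conditional(d = 1, i = 2)
set.seed(1)
z_new <- sample.int(K, size = 1, prob = p)
```

A full Gibbs sweep would repeat this draw for every word position and
update the assignments in place, which is what one iteration of the
estimation amounts to conceptually.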
The estimation of an LDA model is easily replicated using the LDA() and
fit() functions. The first function prepares the R object and
initializes the assignment of the latent topics. The second function
estimates the model using Gibbs sampling for a given number of
iterations. Note that fit() may be called repeatedly to run additional
iterations without resetting the estimation.
set.seed(123)
lda <- LDA(ECB_press_conferences_tokens)
lda
# An LDA model with 5 topics. Currently fitted by 0 Gibbs sampling iterations.
# ------------------Useful methods------------------
# fit :Estimate the model using Gibbs sampling
# topics :Return the most important topic of each document
# topWords :Return a data.table with the top words of each topic/sentiment
# plot :Plot a sunburst chart representing the estimated mixtures
# This message is displayed once per session, unless calling `print(x, extended = TRUE)`
lda <- fit(lda, iterations = 100)
lda
# An LDA model with 5 topics. Currently fitted by 100 Gibbs sampling iterations.
Internally, the lda
object is stored as a list and
contains the model’s parameters and outputs.
str(lda, max.level = 1, give.attr = FALSE)
# List of 10
# $ tokens :List of 3860
# $ vocabulary :Classes 'data.table' and 'data.frame': 1168 obs. of 3 variables:
# $ K : num 5
# $ alpha : num [1:5, 1] 1 1 1 1 1
# $ beta : num [1:5, 1:1168] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
# $ it : num 100
# $ za :List of 3860
# $ theta : num [1:3860, 1:5] 0.0455 0.0357 0.0909 0.0667 0.0625 ...
# $ phi : num [1:1168, 1:5] 4.32e-07 4.32e-07 4.32e-07 6.47e-03 4.32e-07 ...
# $ logLikelihood: num [1:100, 1] -943778 -927554 -912671 -893733 -864875 ...
tokens is the initial tokens object used to create the model.
vocabulary is a data.frame indexing the set of words.
K is the number of topics.
alpha is the hyperparameter of the document-topic mixtures.
beta is the hyperparameter of the topic-word mixtures.
it is the number of iterations of the model.
za contains the topic assignments of each word of the corpus.
theta are the estimated document-topic mixtures.
phi are the estimated topic-word mixtures.
logLikelihood is the log-likelihood of the model at each iteration.
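As an illustration of how theta relates to the topic assignments, the
sketch below applies the smoothed-count ratio from the second factor of
the sampling distribution to one document. The counts are assumed for
the example (17 words, 16 assigned to the third topic and 1 to the
fifth), and the formula is an assumption about how the mixtures can be
derived from za, not a description of the package code. With these
counts, the first entry matches the 0.0455 shown for theta in the str()
output above.

```r
# Assumed topic counts for one document (illustrative, not read from the model)
n_kd  <- c(0, 0, 16, 0, 1)
K     <- 5
alpha <- 1   # value displayed for alpha in the str() output

# Smoothed share of each topic in the document
theta_d <- (n_kd + alpha) / (sum(n_kd) + K * alpha)
round(theta_d, 4)
# [1] 0.0455 0.0455 0.7727 0.0455 0.0909
```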
Estimated mixtures are easily accessible through the $ operator. The
package also includes the topWords() function to extract the most
probable words of each topic. topWords() supports three types of
output: a long data.table/data.frame, a matrix, or a ggplot object (the
latter also accessible through the alias plot_topWords()).
head(lda$theta)
# topic
# doc_id topic1 topic2 topic3 topic4 topic5
# 1_1 0.04545455 0.04545455 0.7727273 0.04545455 0.09090909
# 1_2 0.03571429 0.14285714 0.7500000 0.03571429 0.03571429
# 1_3 0.09090909 0.09090909 0.6363636 0.09090909 0.09090909
# 1_4 0.06666667 0.06666667 0.7333333 0.06666667 0.06666667
# 1_5 0.06250000 0.06250000 0.7500000 0.06250000 0.06250000
# 1_6 0.05263158 0.10526316 0.5789474 0.10526316 0.15789474
topWords(lda, output = "matrix")
# topic1 topic2 topic3 topic4
# [1,] "price" "fiscal" "governing_council" "growth"
# [2,] "inflation" "euro_area" "ecb" "quarter"
# [3,] "development" "growth" "meeting" "loan"
# [4,] "annual" "country" "president" "financial"
# [5,] "increase" "policy" "bank" "euro_area"
# [6,] "projection" "reform" "operation" "rate"
# [7,] "hicp" "structural" "outcome" "sector"
# [8,] "oil" "market" "press" "condition"
# [9,] "euro_area" "economic" "vice" "annual"
# [10,] "inflation_rate" "measure" "euro" "credit"
# topic5
# [1,] "risk"
# [2,] "economic"
# [3,] "monetary"
# [4,] "price_stability"
# [5,] "euro_area"
# [6,] "development"
# [7,] "interest_rate"
# [8,] "outlook"
# [9,] "monetary_policy"
# [10,] "growth"
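The idea of ranking words within each topic can be sketched in base R
on a toy topic-word matrix laid out like phi (words in rows, topics in
columns). The probabilities below are made up, and topWords() may rank
words differently; this only shows a plain ordering by probability.

```r
# Toy topic-word matrix: rows are words, columns are topics (made-up values)
phi <- matrix(
  c(0.50, 0.05,
    0.30, 0.15,
    0.15, 0.20,
    0.05, 0.60),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("price", "inflation", "growth", "fiscal"),
    c("topic1", "topic2")
  )
)

# For each topic, order words by decreasing probability and keep the top n
top_n <- 3
top_words <- apply(phi, 2, function(p) {
  names(sort(p, decreasing = TRUE))[seq_len(top_n)]
})
top_words
```

Each column of the result lists the top_n words of one topic, which is
the same layout as the matrix output shown above.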
In addition, document-level analysis is facilitated by the melt()
method, which joins the estimated topical proportions to the document
metadata present in the tokens input. This results in a long
data.table/data.frame that can be used for plotting or easily reshaped
to a wide format (for example using data.table::dcast).
melt(lda, include_docvars = TRUE)
# topic prob .date .id doc_id
# <fctr> <num> <Date> <char> <char>
# 1: topic1 0.04545455 1998-06-09 1_1 1
# 2: topic1 0.03571429 1998-06-09 1_2 1
# 3: topic1 0.09090909 1998-06-09 1_3 1
# 4: topic1 0.06666667 1998-06-09 1_4 1
# 5: topic1 0.06250000 1998-06-09 1_5 1
# ---
# 19296: topic5 0.28947368 2021-12-16 260_20 260
# 19297: topic5 0.10526316 2021-12-16 260_21 260
# 19298: topic5 0.05000000 2021-12-16 260_22 260
# 19299: topic5 0.41025641 2021-12-16 260_23 260
# 19300: topic5 0.14285714 2021-12-16 260_24 260
# title
# <char>
# 1: ECB Press conference: Introductory statement
# 2: ECB Press conference: Introductory statement
# 3: ECB Press conference: Introductory statement
# 4: ECB Press conference: Introductory statement
# 5: ECB Press conference: Introductory statement
# ---
# 19296: PRESS CONFERENCE
# 19297: PRESS CONFERENCE
# 19298: PRESS CONFERENCE
# 19299: PRESS CONFERENCE
# 19300: PRESS CONFERENCE
# section_title
# <char>
# 1: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 2: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 3: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 4: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 5: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# ---
# 19296: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# 19297: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# 19298: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# 19299: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# 19300: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# .sentiment
# <num>
# 1: -0.01470588
# 2: -0.02500000
# 3: 0.00000000
# 4: 0.00000000
# 5: 0.00000000
# ---
# 19296: -0.01960784
# 19297: 0.00000000
# 19298: 0.05555556
# 19299: -0.01052632
# 19300: 0.00000000
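Reshaping such a long table to a wide format can be sketched with
data.table::dcast. To keep the example self-contained, it uses a small
hand-made table that only mimics the topic, prob and .id columns of the
melt() output; the values are invented for illustration.

```r
library(data.table)

# Toy long table with the same key columns as the melt() output (made-up values)
long <- data.table(
  topic = factor(rep(c("topic1", "topic2"), each = 2)),
  prob  = c(0.2, 0.7, 0.8, 0.3),
  .id   = rep(c("1_1", "1_2"), times = 2)
)

# One row per document, one column per topic
wide <- dcast(long, .id ~ topic, value.var = "prob")
wide
```

The same call should apply to the real melt() output, possibly with
additional document variables on the left-hand side of the formula.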
To ease the analysis of results, we can rename the default topic labels
using the sentopics_labels() function. As a result, all outputs of the
model will display the custom labels.
sentopics_labels(lda) <- list(
topic = c("Inflation", "Fiscal policy", "Governing council", "Financial sector", "Uncertainty")
)
head(lda$theta)
# topic
# doc_id Inflation Fiscal policy Governing council Financial sector Uncertainty
# 1_1 0.04545455 0.04545455 0.7727273 0.04545455 0.09090909
# 1_2 0.03571429 0.14285714 0.7500000 0.03571429 0.03571429
# 1_3 0.09090909 0.09090909 0.6363636 0.09090909 0.09090909
# 1_4 0.06666667 0.06666667 0.7333333 0.06666667 0.06666667
# 1_5 0.06250000 0.06250000 0.7500000 0.06250000 0.06250000
# 1_6 0.05263158 0.10526316 0.5789474 0.10526316 0.15789474
plot_topWords(lda) + ggplot2::theme_grey(base_size = 9)
Besides modifying topic labels, it is also possible to merge topics
into broader themes. This is often useful when estimating a large
number of topics (e.g., K > 15). The mergeTopics() function does this
job and relabels topics accordingly.
merged <- mergeTopics(lda, list(
`Big big thematic` = c(1, 3:5),
`Fiscal policy` = 2
))
merged
# An LDA model with 2 topics. Currently fitted by 100 Gibbs sampling iterations.
Note that merging topics is only useful for presentation purposes.
Calling fit() again on a model with merged topics will drastically
change the results, as the current state of the model does not result
from a standard estimation with the merged set of parameters.
Provided that the plotly
package is installed, one can
also directly use plot()
on the estimated topic model to
enjoy a dynamic view of topic proportions and their most probable words
(presented as a screenshot hereafter to limit this vignette’s size).