Title: | Tools for Joint Sentiment and Topic Analysis of Textual Data |
---|---|
Description: | A framework that joins topic modeling and sentiment analysis of textual data. The package implements a fast Gibbs sampling estimation of Latent Dirichlet Allocation (Griffiths and Steyvers (2004) <doi:10.1073/pnas.0307752101>) and Joint Sentiment/Topic Model (Lin, He, Everson and Ruger (2012) <doi:10.1109/TKDE.2011.48>). It offers a variety of helpers and visualizations to analyze the result of topic modeling. The framework also allows enriching topic models with dates and externally computed sentiment measures. A flexible aggregation scheme enables the creation of time series of sentiment or topical proportions from the enriched topic models. Moreover, a novel method jointly aggregates topic proportions and sentiment measures to derive time series of topical sentiment. |
Authors: | Olivier Delmarcelle [aut, cre] , Samuel Borms [ctb] , Chengua Lin [cph] (Original JST implementation), Yulan He [cph] (Original JST implementation), Jose Bernardo [cph] (Original JST implementation), David Robinson [cph] (Implementation of reorder_within()), Julia Silge [cph] (Implementation of reorder_within(), <https://orcid.org/0000-0002-3671-836X>) |
Maintainer: | Olivier Delmarcelle <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.7.3 |
Built: | 2024-11-04 21:45:57 UTC |
Source: | https://github.com/odelmarcelle/sentopics |
sentopics provides function to easily estimate a range of topic models and process their output. Particularly, it facilitates the integration of topic analysis with a time dimension through time-series generating functions. In addition, sentopics interacts with sentiment analysis to compute the sentiment conveyed by topics. Finally, the package implements a number of visualization helping interpreting the results of topic models.
Please refer to the vignettes for a comprehensive introduction to the package functions.
Basic usage: Introduction to topic model estimation with sentopics
Topical time series: Integrate topic analysis with sentiment analysis along a time dimension
For further details, you may browse the package documentation.
Please cite the package in publications. Use
citation("sentopics")
.
Maintainer: Olivier Delmarcelle [email protected] (ORCID)
Other contributors:
Samuel Borms [email protected] (ORCID) [contributor]
Chengua Lin [email protected] (Original JST implementation) [copyright holder]
Yulan He [email protected] (Original JST implementation) [copyright holder]
Jose Bernardo (Original JST implementation) [copyright holder]
David Robinson [email protected] (Implementation of reorder_within()) [copyright holder]
Julia Silge [email protected] (ORCID) (Implementation of reorder_within()) [copyright holder]
Useful links:
Report bugs at https://github.com/odelmarcelle/sentopics/issues
These functions converts estimated models from other topic modeling packages to the format used by sentopics.
as.LDA(x, ...) ## S3 method for class 'STM' as.LDA(x, docs, ...) ## S3 method for class 'LDA_Gibbs' as.LDA(x, docs, ...) ## S3 method for class 'LDA_VEM' as.LDA(x, docs, ...) ## S3 method for class 'textmodel_lda' as.LDA(x, ...) as.LDA_lda(list, docs, alpha, eta) ## S3 method for class 'keyATM_output' as.LDA(x, docs, ...)
as.LDA(x, ...) ## S3 method for class 'STM' as.LDA(x, docs, ...) ## S3 method for class 'LDA_Gibbs' as.LDA(x, docs, ...) ## S3 method for class 'LDA_VEM' as.LDA(x, docs, ...) ## S3 method for class 'textmodel_lda' as.LDA(x, ...) as.LDA_lda(list, docs, alpha, eta) ## S3 method for class 'keyATM_output' as.LDA(x, docs, ...)
x |
an estimated topic model from stm, topicmodels or seededlda. |
... |
arguments passed to other methods. |
docs |
for some objects, the documents used to initialize the model. |
list |
the list containing an estimated model from lda. |
alpha |
for lda models, the document-topic mixture hyperparameter.
If missing, the hyperparameter will be set to |
eta |
for lda models, the topic-word mixture hyperparameter. Other
packages refer to this hyperparameter as beta. If missing, the
hyperparameter will be set to |
Some models do not store the topic assignment of each word (for
example, estimated through variational inference). For these, the
conversion is limited and some functionalities of sentopics will be
disabled. The list of affected functions is subject to change and currently
includes fit()
, mergeTopics()
and rJST.LDA()
.
Since models from the lda package are simply lists of outputs, the
function as.LDA_lda()
is not related to the other methods and should be
applied directly on lists containing a model.
A S3 list of class LDA
, as if it was created and estimated using
LDA()
and fit()
.
## stm library("stm") stm <- stm(poliblog5k.docs, poliblog5k.voc, K=25, prevalence=~rating, data=poliblog5k.meta, max.em.its=2, init.type="Spectral") as.LDA(stm, docs = poliblog5k.docs) ## lda library("lda") data("cora.documents") data("cora.vocab") lda <- lda.collapsed.gibbs.sampler(cora.documents, 5, ## Num clusters cora.vocab, 100, ## Num iterations 0.1, 0.1) LDA <- as.LDA_lda(lda, docs = cora.documents, alpha = .1, eta = .1) ## topicmodels data("AssociatedPress", package = "topicmodels") lda <- topicmodels::LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2) LDA <- as.LDA(lda, docs = AssociatedPress[1:20,]) ## seededlda library("seededlda") lda <- textmodel_lda(dfm(ECB_press_conferences_tokens), k = 6, max_iter = 100) LDA <- as.LDA(lda) ## keyATM library("keyATM") data(keyATM_data_bills, package = "keyATM") keyATM_docs <- keyATM_read(keyATM_data_bills$doc_dfm) out <- keyATM(docs = keyATM_docs, model = "base", no_keyword_topics = 5, keywords = keyATM_data_bills$keywords) LDA <- as.LDA(out, docs = keyATM_docs)
## stm library("stm") stm <- stm(poliblog5k.docs, poliblog5k.voc, K=25, prevalence=~rating, data=poliblog5k.meta, max.em.its=2, init.type="Spectral") as.LDA(stm, docs = poliblog5k.docs) ## lda library("lda") data("cora.documents") data("cora.vocab") lda <- lda.collapsed.gibbs.sampler(cora.documents, 5, ## Num clusters cora.vocab, 100, ## Num iterations 0.1, 0.1) LDA <- as.LDA_lda(lda, docs = cora.documents, alpha = .1, eta = .1) ## topicmodels data("AssociatedPress", package = "topicmodels") lda <- topicmodels::LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2) LDA <- as.LDA(lda, docs = AssociatedPress[1:20,]) ## seededlda library("seededlda") lda <- textmodel_lda(dfm(ECB_press_conferences_tokens), k = 6, max_iter = 100) LDA <- as.LDA(lda) ## keyATM library("keyATM") data(keyATM_data_bills, package = "keyATM") keyATM_docs <- keyATM_read(keyATM_data_bills$doc_dfm) out <- keyATM(docs = keyATM_docs, model = "base", no_keyword_topics = 5, keywords = keyATM_data_bills$keywords) LDA <- as.LDA(out, docs = keyATM_docs)
Convert back a dfm to a tokens object
## S3 method for class 'dfm' as.tokens( x, concatenator = NULL, tokens = NULL, ignore_list = NULL, case_insensitive = FALSE, padding = TRUE, ... )
## S3 method for class 'dfm' as.tokens( x, concatenator = NULL, tokens = NULL, ignore_list = NULL, case_insensitive = FALSE, padding = TRUE, ... )
x |
quanteda::dfm to be coerced |
concatenator |
only used for consistency with the generic |
tokens |
optionally, the tokens from which the dfm was created. Providing the initial tokens will ensure that the word order will be respected in the coerced object. |
ignore_list |
a character vector of words that should not be removed
from the initial tokens object. Useful to avoid removing some lexicon word
following the usage of |
case_insensitive |
only used when the |
padding |
if |
... |
unused |
a quanteda quanteda::tokens object.
quanteda::as.tokens()
quanteda::dfm()
library("quanteda") dfm <- dfm(ECB_press_conferences_tokens, tolower = FALSE) dfm <- dfm_trim(dfm, min_termfreq = 200) as.tokens(dfm) as.tokens(dfm, tokens = ECB_press_conferences_tokens) as.tokens(dfm, tokens = ECB_press_conferences_tokens, padding = FALSE)
library("quanteda") dfm <- dfm(ECB_press_conferences_tokens, tolower = FALSE) dfm <- dfm_trim(dfm, min_termfreq = 200) as.tokens(dfm) as.tokens(dfm, tokens = ECB_press_conferences_tokens) as.tokens(dfm, tokens = ECB_press_conferences_tokens, padding = FALSE)
Computes the distance between different estimates of a topic model. Since the estimation of a topic model is random, the results may largely differ as the process is repeated. This function allows to compute the distance between distinct realizations of the estimation process. Estimates are referred to as chains.
chainsDistances( x, method = c("euclidean", "hellinger", "cosine", "minMax", "naiveEuclidean", "invariantEuclidean"), ... )
chainsDistances( x, method = c("euclidean", "hellinger", "cosine", "minMax", "naiveEuclidean", "invariantEuclidean"), ... )
x |
a valid |
method |
the method used to measure the distance between chains. |
... |
further arguments passed to internal distance functions. |
The method
argument determines how are computed distance.
euclidean
finds the pairs of topics that minimizes and returns the total
Euclidean distance.
hellinger
does the same but based on the Hellinger distance.
cosine
does the same but based on the Cosine distance.
minMax
computes the maximum distance among the best pairs of distances.
Inspired by the minimum-matching distance from Tang et al. (2014).
naiveEuclidean
computes the Euclidean distance without searching for the
best pairs of topics.
invariantEuclidean
computes the best pairs of topics for all allowed
permutations of topic indices. For JST and reversed-JST models, the two-
levels hierarchy of document-sentiment-topic leads some permutations of
indices to represent a drastically different outcome. This setting restricts
the set of permutations to the ones that do not change the interpretation of
the model. Equivalent to euclidean
for LDA models.
A matrix of distance between the elements of x
Olivier Delmarcelle
Tang, J., Meng, Z., Nguyen, X., Mei, Q., and Zhang, M. (2014). Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis. In Proceedings of the 31st International Conference on Machine Learning, 32, 90–198.
plot.multiChains()
chainsScores()
model <- LDA(ECB_press_conferences_tokens) model <- fit(model, 10, nChains = 5) chainsDistances(model)
model <- LDA(ECB_press_conferences_tokens) model <- fit(model, 10, nChains = 5) chainsDistances(model)
Compute various scores (log likelihood, coherence) for a list of topic models.
chainsScores(x, window = 110, nWords = 10)
chainsScores(x, window = 110, nWords = 10)
x |
a valid |
window |
optional. If |
nWords |
the number of words used to compute coherence. See
|
A data.table
with some statistics about each chain. For the
coherence metrics, the value shown is the mean coherence across all topics
of a chain
When nChains > 1
, the function can take advantage of
future.apply::future_lapply (if installed) to spread the computation over
multiple processes. This requires the specification of a parallel strategy
using future::plan()
. See the examples below.
model <- LDA(ECB_press_conferences_tokens[1:10]) model <- fit(model, 10, nChains = 5) chainsScores(model, window = 5) chainsScores(model, window = "boolean") # -- Parallel computation -- require(future.apply) future::plan("multisession", workers = 2) # Set up 2 workers chainsScores(model, window = "boolean") future::plan("sequential") # Shut down workers
model <- LDA(ECB_press_conferences_tokens[1:10]) model <- fit(model, 10, nChains = 5) chainsScores(model, window = 5) chainsScores(model, window = "boolean") # -- Parallel computation -- require(future.apply) future::plan("multisession", workers = 2) # Set up 2 workers chainsScores(model, window = "boolean") future::plan("sequential") # Shut down workers
Computes various coherence based metrics for topic models. It
assesses the quality of estimated topics based on co-occurrences of words.
For best results, consider cleaning the initial tokens object with padding = TRUE
.
coherence( x, nWords = 10, method = c("C_NPMI", "C_V"), window = NULL, NPMIs = NULL )
coherence( x, nWords = 10, method = c("C_NPMI", "C_V"), window = NULL, NPMIs = NULL )
x |
a model created from the |
nWords |
the number of words in each topic used for evaluation. |
method |
the coherence method used. |
window |
optional. If |
NPMIs |
optional NPMI matrix. If provided, skip the computation of NPMI between words, substantially decreasing computing time. |
Currently, only C_NPMI and C_V are documented. The implementation follows Röder & al. (2015). For C_NPMI, the sliding window is 10 whereas it is 110 for C_V.
A vector or matrix containing the coherence score of each topic.
Olivier Delmarcelle
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399-–408.
Computes Monetary Policy and Economic Condition scores using the Picault-Renault lexicon for central bank communication.
compute_PicaultRenault_scores(x, min_ngram = 2, return_dfm = FALSE)
compute_PicaultRenault_scores(x, min_ngram = 2, return_dfm = FALSE)
x |
a quanteda::corpus object. |
min_ngram |
the minimum length of n-grams considered in the computation |
return_dfm |
if |
The computation is done on a per-document basis, such as each document is scored with a value between -1 and 1. This is relevant to the computation of the denominator of the score.
It is possible to compute the score for paragraphs and sentences for a quanteda::corpus segmented using quanteda::corpus_reshape. Segmenting a corpus using quanteda's helpers retain track to which document each paragraph/sentence belong. However, in that case, it is possible that paragraphs or sentences are scored outside the (-1,1) interval. In any case, the of the paragraph/sentences scores averaged over documents will be contained in the (-1,1) interval.
A matrix with two columns, indicating respectively the MP (Monetary Policy) and EC (Economic Condition) scores of each document.
Picault, M. & Renault, T. (2017). Words are not all created equal: A new measure of ECB communication. Journal of International Money and Finance, 79, 136–156.
# on documents docs <- quanteda::corpus_reshape(ECB_press_conferences, "documents") compute_PicaultRenault_scores(docs) # on paragraphs compute_PicaultRenault_scores(ECB_press_conferences)
# on documents docs <- quanteda::corpus_reshape(ECB_press_conferences, "documents") compute_PicaultRenault_scores(docs) # on paragraphs compute_PicaultRenault_scores(ECB_press_conferences)
A corpus of 260 ECB press conference, split into 4224 paragraphs. The corpus contains a number of docvars indicating the date of the press conference and a measured sentiment based on the Loughran-McDonald lexicon.
ECB_press_conferences
ECB_press_conferences
A quanteda::corpus object.
https://www.ecb.europa.eu/press/key/date/html/index.en.html.
docvars(ECB_press_conferences)
docvars(ECB_press_conferences)
The pre-processed and tokenized version of the ECB_press_conferences corpus of press conferences. The processing involved the following steps:
Subset paragraphs shorter than 10 words
Removal of stop words
Part-of-speech tagging, following which only nouns, proper nouns and adjective were retained.
Detection and merging of frequent compound words
Frequency-based cleaning of rare and very common words
ECB_press_conferences_tokens
ECB_press_conferences_tokens
A quanteda::tokens object.
https://www.ecb.europa.eu/press/key/date/html/index.en.html.
LDA(ECB_press_conferences_tokens)
LDA(ECB_press_conferences_tokens)
This function is used to estimate a topic model created by
LDA()
, JST()
or rJST()
. In essence, this function iterates a Gibbs
sampler MCMC.
## S3 method for class 'sentopicmodel' fit( object, iterations = 100, nChains = 1, displayProgress = TRUE, computeLikelihood = TRUE, seed = NULL, ... ) ## S3 method for class 'multiChains' fit( object, iterations = 100, nChains = NULL, displayProgress = TRUE, computeLikelihood = TRUE, seed = NULL, ... )
## S3 method for class 'sentopicmodel' fit( object, iterations = 100, nChains = 1, displayProgress = TRUE, computeLikelihood = TRUE, seed = NULL, ... ) ## S3 method for class 'multiChains' fit( object, iterations = 100, nChains = NULL, displayProgress = TRUE, computeLikelihood = TRUE, seed = NULL, ... )
object |
|
iterations |
the number of iterations by which the model should be fitted. |
nChains |
if set above 1, the model will be fitted multiple times
from various starting positions. Latent variables will be re-initialized if
|
displayProgress |
if |
computeLikelihood |
if set to |
seed |
for reproducibility, a seed can be provided. |
... |
arguments passed to other methods. Not used. |
a sentopicmodel
of the relevant model class if nChains
is
unspecified or equal to 1. A multiChains
object if nChains
is greater
than 1.
When nChains > 1
, the function can take advantage of
future.apply::future_lapply (if installed) to spread the computation over
multiple processes. This requires the specification of a parallel strategy
using future::plan()
. See the examples below.
model <- rJST(ECB_press_conferences_tokens) fit(model, 10) # -- Parallel computation -- require(future.apply) future::plan("multisession", workers = 2) # Set up 2 workers fit(model, 10, nChains = 2) future::plan("sequential") # Shut down workers
model <- rJST(ECB_press_conferences_tokens) fit(model, 10) # -- Parallel computation -- require(future.apply) future::plan("multisession", workers = 2) # Set up 2 workers fit(model, 10, nChains = 2) future::plan("sequential") # Shut down workers
This helper function automatically retrieve the full data set of press conferences made available by the ECB. It implements a number of pre-processing steps used to remove the Q&A section from the text.
get_ECB_press_conferences( years = 1998:2021, language = "en", data.table = TRUE )
get_ECB_press_conferences( years = 1998:2021, language = "en", data.table = TRUE )
years |
the years for which press conferences should be retrieved |
language |
the language in which press conferences should be retrieved |
data.table |
if TRUE, returns a data.table. Otherwise, return a list in which each element is a press conference. |
Depending on the arguments, returns either a data.frame or a quanteda::tokens object containing press conferences of the ECB.
This helper function automatically retrieve the full data set of speeches made available by the ECB. In addition, it implements a number of pre-processing steps that may be turned on or off as needed.
get_ECB_speeches( filter_english = TRUE, clean_footnotes = TRUE, compute_sentiment = TRUE, tokenize_w_POS = FALSE )
get_ECB_speeches( filter_english = TRUE, clean_footnotes = TRUE, compute_sentiment = TRUE, tokenize_w_POS = FALSE )
filter_english |
if |
clean_footnotes |
if |
compute_sentiment |
if |
tokenize_w_POS |
if |
Depending on the arguments, returns either a data.frame or a quanteda::tokens object containing speeches of the ECB.
This function initialize a Joint Sentiment/Topic model.
JST( x, lexicon = NULL, S = 3, K = 5, gamma = 1, alpha = 5, beta = 0.01, gammaCycle = 0, alphaCycle = 0 )
JST( x, lexicon = NULL, S = 3, K = 5, gamma = 1, alpha = 5, beta = 0.01, gammaCycle = 0, alphaCycle = 0 )
x |
tokens object containing the texts. A coercion will be attempted if |
lexicon |
a |
S |
the number of sentiments |
K |
the number of topics |
gamma |
the hyperparameter of sentiment-document distribution |
alpha |
the hyperparameter of topic-document distribution |
beta |
the hyperparameter of vocabulary distribution |
gammaCycle |
integer specifying the cycle size between two updates of the hyperparameter alpha |
alphaCycle |
integer specifying the cycle size between two updates of the hyperparameter alpha |
The rJST.LDA
methods enable the transition from a previously
estimated LDA model to a sentiment-aware rJST
model. The function
retains the previously estimated topics and randomly assigns sentiment to
every word of the corpus. The new model will retain the iteration count of
the initial LDA model.
An S3 list containing the model parameter and the estimated mixture.
This object corresponds to a Gibbs sampler estimator with zero iterations.
The MCMC can be iterated using the fit()
function.
tokens
is the tokens object used to create the model
vocabulary
contains the set of words of the corpus
it
tracks the number of Gibbs sampling iterations
za
is the list of topic assignment, aligned to the tokens
object with
padding removed
logLikelihood
returns the measured log-likelihood at each iteration,
with a breakdown of the likelihood into hierarchical components as
attribute
The topWords()
function easily extract the most probables words of each
topic/sentiment.
Olivier Delmarcelle
Lin, C. and He, Y. (2009). Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM conference on Information and knowledge management, 375–384.
Lin, C., He, Y., Everson, R. and Ruger, S. (2012). Weakly Supervised Joint Sentiment-Topic Detection from Text. IEEE Transactions on Knowledge and Data Engineering, 24(6), 1134–-1145.
Fitting a model: fit()
,
extracting top words: topWords()
Other topic models:
LDA()
,
rJST()
,
sentopicmodel()
# creating a JST model JST(ECB_press_conferences_tokens) # estimating a JST model including a lexicon jst <- JST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) jst <- fit(jst, 100)
# creating a JST model JST(ECB_press_conferences_tokens) # estimating a JST model including a lexicon jst <- JST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) jst <- fit(jst, 100)
This function initialize a Latent Dirichlet Allocation model.
LDA(x, K = 5, alpha = 1, beta = 0.01)
LDA(x, K = 5, alpha = 1, beta = 0.01)
x |
tokens object containing the texts. A coercion will be attempted if |
K |
the number of topics |
alpha |
the hyperparameter of topic-document distribution |
beta |
the hyperparameter of vocabulary distribution |
The rJST.LDA
methods enable the transition from a previously
estimated LDA model to a sentiment-aware rJST
model. The function
retains the previously estimated topics and randomly assigns sentiment to
every word of the corpus. The new model will retain the iteration count of
the initial LDA model.
An S3 list containing the model parameter and the estimated mixture.
This object corresponds to a Gibbs sampler estimator with zero iterations.
The MCMC can be iterated using the fit()
function.
tokens
is the tokens object used to create the model
vocabulary
contains the set of words of the corpus
it
tracks the number of Gibbs sampling iterations
za
is the list of topic assignment, aligned to the tokens
object with
padding removed
logLikelihood
returns the measured log-likelihood at each iteration,
with a breakdown of the likelihood into hierarchical components as
attribute
The topWords()
function easily extract the most probables words of each
topic/sentiment.
Olivier Delmarcelle
Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
Fitting a model: fit()
, extracting
top words: topWords()
Other topic models:
JST()
,
rJST()
,
sentopicmodel()
# creating a model LDA(ECB_press_conferences_tokens, K = 5, alpha = 0.1, beta = 0.01) # estimating an LDA model lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100)
# creating a model LDA(ECB_press_conferences_tokens, K = 5, alpha = 0.1, beta = 0.01) # estimating an LDA model lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100)
This function call LDAvis to create a dynamic visualization of an estimated topic model.
LDAvis(x, ...)
LDAvis(x, ...)
x |
an |
... |
further arguments passed on to |
The CRAN release of LDAvis does not support UTF-8 characters
and automatically reorder topics. To solve these two issues, please install
the development version of LDAvis from github
(devtools::install_github("cpsievert/LDAvis")
).
Nothing, called for its side effects.
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) LDAvis(lda)
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) LDAvis(lda)
The Loughran-McDonald lexicon for financial texts adapted for usage in sentopics. The lexicon is enhanced with two list of valence-shifting words.
LoughranMcDonald
LoughranMcDonald
A quanteda::dictionary containing two polarity categories (negative and positive) and two valence-shifting categories (negator and amplifier).
https://sraf.nd.edu/loughranmcdonald-master-dictionary/ for the lexicon and lexicon::hash_valence_shifters for the valence shifters.
Loughran, T. & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65.
JST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald)
JST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald)
data.table::melt()
As of the CRAN release of the 1.14.8 version of data.table, the
data.table::melt()
function is not a generic. This function aims to
temporary provide a generic to this function, so that melt.sentopicmodel()
can be effectively dispatched when used. Expect this function to disappear
shortly after the release of data.table 1.14.9.
melt(data, ...)
melt(data, ...)
data |
an object to melt |
... |
arguments passed to other methods |
An unkeyed data.table
containing the molten data.
data.table::melt()
, melt.sentopicmodel()
This function extracts the estimated document mixtures from a topic model and returns them in a long data.table format.
## S3 method for class 'sentopicmodel' melt(data, ..., include_docvars = FALSE)
## S3 method for class 'sentopicmodel' melt(data, ..., include_docvars = FALSE)
data |
a model created from the |
... |
not used |
include_docvars |
if |
A data.table in the long format, where each line is the estimated proportion of a single topic/sentiment for a document. For JST and rJST models, the probability is also decomposed into 'L1' and 'L2' layers, representing the probability at each layer of the topic-sentiment hierarchy.
Olivier Delmarcelle
topWords()
for extracting representative words,
data.table::melt()
and data.table::dcast()
# only returns topic proportion for LDA models lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 10) melt(lda) # includes sentiment for JST and rJST models jst <- JST(ECB_press_conferences_tokens) jst <- fit(jst, 10) melt(jst)
# only returns topic proportion for LDA models lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 10) melt(lda) # includes sentiment for JST and rJST models jst <- JST(ECB_press_conferences_tokens) jst <- fit(jst, 10) melt(jst)
This operation is especially useful for the analysis of the model's output, by grouping together topics that share a common theme.
mergeTopics(x, merging_list)
mergeTopics(x, merging_list)
x |
|
merging_list |
a list where each element is an integer vector containing the indices of topics to be merged. If named, the list's names become the label of the aggregated themes. |
Topics are aggregated at the word assignment level. New document-topic and topic-word probabilities are derived from the aggregated assignments.
Note that the output of this function does not constitute an estimated
topic model, but merely an aggregation to ease the analysis. It is not
advised to use fit()
on the merged topic
model as it will radically affect the content and proportions of the new
themes.
A LDA()
or rJST()
model with the merged topics.
sentopics_labels
lda <- LDA(ECB_press_conferences_tokens, K = 5) lda <- fit(lda, 100) merging_list <- list( c(1,5), 2:4 ) mergeTopics(lda, merging_list) # also possible with a named list merging_list2 <- list( mytheme_1 = c(1,5), mytheme_2 = 2:4 ) merged <- mergeTopics(lda, merging_list2) sentopics_labels(merged) # implemented for rJST rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) mergeTopics(rjst, merging_list2)
lda <- LDA(ECB_press_conferences_tokens, K = 5) lda <- fit(lda, 100) merging_list <- list( c(1,5), 2:4 ) mergeTopics(lda, merging_list) # also possible with a named list merging_list2 <- list( mytheme_1 = c(1,5), mytheme_2 = 2:4 ) merged <- mergeTopics(lda, merging_list2) sentopics_labels(merged) # implemented for rJST rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) mergeTopics(rjst, merging_list2)
The Picault-Renault lexicon, specialized in the analysis of central bank communication. The lexicon identifies a large number of n-grams and gives their probability to belong to six categories:
Monetary Policy - accommodative
Monetary Policy - neutral
Monetary Policy - restrictive
Economic Condition - negative
Economic Condition - neutral
Economic Condition - positive
PicaultRenault
PicaultRenault
A data.table object.
http://www.cbcomindex.com/lexicon.php
Picault, M. & Renault, T. (2017). Words are not all created equal: A new measure of ECB communication. Journal of International Money and Finance, 79, 136–156.
compute_PicaultRenault_scores()
head(PicaultRenault)
head(PicaultRenault)
A regression dataset built to partially replicate the result of Picault & Renault. This dataset contains, for each press conference published after 2000:
The Main Refinancing Rate (MRR) of the ECB set following the press conference
The change in the MRR following the press conference
The change in the MRR observed at the previous press conference
The Bloomberg consensus on the announced MRR
The Surprise brought by the announcement, computed as the Bloomberg consensus minus the MRR following the conference
The EURO STOXX 50 return on the day of the press conference
The EURO STOXX 50 return on the day preceding the announcement
PicaultRenault_data
PicaultRenault_data
An xts::xts object.
The data was manually prepared by the author of this package.
Picault, M. & Renault, T. (2017). Words are not all created equal: A new measure of ECB communication. Journal of International Money and Finance, 79, 136–156.
head(PicaultRenault_data)
head(PicaultRenault_data)
Plot the results of chainsDistance(x)
using multidimensional
scaling. See chainsDistances()
for details on the distance computation
and stats::cmdscale()
for the implementation of the multidimensional
scaling.
## S3 method for class 'multiChains' plot( x, ..., method = c("euclidean", "hellinger", "cosine", "minMax", "naiveEuclidean", "invariantEuclidean") )
## S3 method for class 'multiChains' plot( x, ..., method = c("euclidean", "hellinger", "cosine", "minMax", "naiveEuclidean", "invariantEuclidean") )
x |
a valid |
... |
not used |
method |
the method used to measure the distance between chains. |
Invisibly, the coordinates of each topic model resulting from the multidimensional scaling.
models <- LDA(ECB_press_conferences_tokens) models <- fit(models, 10, nChains = 5) plot(models)
models <- LDA(ECB_press_conferences_tokens) models <- fit(models, 10, nChains = 5) plot(models)
Summarize and plot a sentopics model using a sunburst chart from the plotly::plotly library.
## S3 method for class 'sentopicmodel' plot(x, nWords = 15, layers = 3, sort = FALSE, ...)
## S3 method for class 'sentopicmodel' plot(x, nWords = 15, layers = 3, sort = FALSE, ...)
x |
a model created from the |
nWords |
the number of words per topic/sentiment to display in the outer layer of the plot |
layers |
specifies the number of layers for the sunburst chart. This
will restrict the output to the |
sort |
if |
... |
not used |
A plotly
sunburst chart.
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) plot(lda, nWords = 5) # only displays the topic proportions plot(lda, layers = 1)
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) plot(lda, nWords = 5) # only displays the topic proportions plot(lda, layers = 1)
Print methods for sentopics models. Once per session (or
forced by using extended = TRUE
), it lists the most important function
related to sentopics models.
## S3 method for class 'sentopicmodel' print(x, extended = FALSE, ...) ## S3 method for class 'rJST' print(x, extended = FALSE, ...) ## S3 method for class 'LDA' print(x, extended = FALSE, ...) ## S3 method for class 'JST' print(x, extended = FALSE, ...)
## S3 method for class 'sentopicmodel' print(x, extended = FALSE, ...) ## S3 method for class 'rJST' print(x, extended = FALSE, ...) ## S3 method for class 'LDA' print(x, extended = FALSE, ...) ## S3 method for class 'JST' print(x, extended = FALSE, ...)
x |
the model to be printed |
extended |
if |
... |
not used |
No return value, called for side effects (printing).
Aggregate the topical or sentiment proportions at the document level into time series.
proportion_topics( x, period = c("year", "quarter", "month", "day", "identity"), rolling_window = 1, complete = TRUE, plot = c(FALSE, TRUE, "silent"), plot_ridgelines = TRUE, as.xts = TRUE, ... ) plot_proportion_topics( x, period = c("year", "quarter", "month", "day"), rolling_window = 1, complete = TRUE, plot_ridgelines = TRUE, ... )
proportion_topics( x, period = c("year", "quarter", "month", "day", "identity"), rolling_window = 1, complete = TRUE, plot = c(FALSE, TRUE, "silent"), plot_ridgelines = TRUE, as.xts = TRUE, ... ) plot_proportion_topics( x, period = c("year", "quarter", "month", "day"), rolling_window = 1, complete = TRUE, plot_ridgelines = TRUE, ... )
x |
a |
period |
the sampling period within which the sentiment of documents
will be averaged. |
rolling_window |
if greater than 1, determines the rolling window to compute a moving average of sentiment. The rolling window is based on the period unit and rely on actual dates (i.e, is not affected by unequally spaced data points). |
complete |
if FALSE, only compute proportions at the upper level of the topic model hierarchy (topics for rJST and sentiment for JST). No effect on LDA models. |
plot |
if |
plot_ridgelines |
if |
as.xts |
if |
... |
other arguments passed on to |
A time series of proportions, stored as an xts::xts object or as a data.frame.
sentopics_sentiment sentopics_date
Other series functions:
sentiment_breakdown()
,
sentiment_series()
,
sentiment_topics()
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) proportion_topics(lda) # plot shortcut plot_proportion_topics(lda, period = "month", rolling_window = 3) # with or without ridgelines plot_proportion_topics(lda, period = "month", plot_ridgelines = FALSE) # also available for rJST and JST models jst <- JST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) jst <- fit(jst, 100) # including both layers proportion_topics(jst) # or not proportion_topics(jst, complete = FALSE)
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) proportion_topics(lda) # plot shortcut plot_proportion_topics(lda, period = "month", rolling_window = 3) # with or without ridgelines plot_proportion_topics(lda, period = "month", plot_ridgelines = FALSE) # also available for rJST and JST models jst <- JST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) jst <- fit(jst, 100) # including both layers proportion_topics(jst) # or not proportion_topics(jst, complete = FALSE)
This function is used re-initialize a topic model, as if it was
created from LDA()
, JST()
or another model. The re-initialized model
retains its original parameter specification.
reset(object)
reset(object)
object |
a model created from the |
a sentopicmodel
of the relevant model class, with the iteration count
reset to zero and re-initialized assignment of latent variables.
Olivier Delmarcelle
model <- LDA(ECB_press_conferences_tokens) model <- fit(model, 10) reset(model)
model <- LDA(ECB_press_conferences_tokens) model <- fit(model, 10) reset(model)
This function initialize a Reversed Joint Sentiment/Topic model.
rJST(x, ...) ## Default S3 method: rJST( x, lexicon = NULL, K = 5, S = 3, alpha = 1, gamma = 5, beta = 0.01, alphaCycle = 0, gammaCycle = 0, ... ) ## S3 method for class 'LDA' rJST(x, lexicon = NULL, S = 3, gamma = 5, ...)
rJST(x, ...) ## Default S3 method: rJST( x, lexicon = NULL, K = 5, S = 3, alpha = 1, gamma = 5, beta = 0.01, alphaCycle = 0, gammaCycle = 0, ... ) ## S3 method for class 'LDA' rJST(x, lexicon = NULL, S = 3, gamma = 5, ...)
x |
tokens object containing the texts. A coercion will be attempted if |
... |
not used |
lexicon |
a |
K |
the number of topics |
S |
the number of sentiments |
alpha |
the hyperparameter of topic-document distribution |
gamma |
the hyperparameter of sentiment-document distribution |
beta |
the hyperparameter of vocabulary distribution |
alphaCycle |
integer specifying the cycle size between two updates of the hyperparameter alpha |
gammaCycle |
integer specifying the cycle size between two updates of the hyperparameter alpha |
The rJST.LDA
methods enable the transition from a previously
estimated LDA model to a sentiment-aware rJST
model. The function
retains the previously estimated topics and randomly assigns sentiment to
every word of the corpus. The new model will retain the iteration count of
the initial LDA model.
An S3 list containing the model parameter and the estimated mixture.
This object corresponds to a Gibbs sampler estimator with zero iterations.
The MCMC can be iterated using the fit()
function.
tokens
is the tokens object used to create the model
vocabulary
contains the set of words of the corpus
it
tracks the number of Gibbs sampling iterations
za
is the list of topic assignment, aligned to the tokens
object with
padding removed
logLikelihood
returns the measured log-likelihood at each iteration,
with a breakdown of the likelihood into hierarchical components as
attribute
The topWords()
function easily extract the most probables words of each
topic/sentiment.
Olivier Delmarcelle
Lin, C. and He, Y. (2009). Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM conference on Information and knowledge management, 375–384.
Lin, C., He, Y., Everson, R. and Ruger, S. (2012). Weakly Supervised Joint Sentiment-Topic Detection from Text. IEEE Transactions on Knowledge and Data Engineering, 24(6), 1134–-1145.
Fitting a model: fit()
, extracting
top words: topWords()
Other topic models:
JST()
,
LDA()
,
sentopicmodel()
# simple rJST model rJST(ECB_press_conferences_tokens) # estimating a rJST model including lexicon rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) # from an LDA model: lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) # creating a rJST model out of it rjst <- rJST(lda, lexicon = LoughranMcDonald) # topic proportions remain identical identical(lda$theta, rjst$theta) # model should be iterated to estimate sentiment proportions rjst <- fit(rjst, 100)
# simple rJST model rJST(ECB_press_conferences_tokens) # estimating a rJST model including lexicon rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) # from an LDA model: lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) # creating a rJST model out of it rjst <- rJST(lda, lexicon = LoughranMcDonald) # topic proportions remain identical identical(lda$theta, rjst$theta) # model should be iterated to estimate sentiment proportions rjst <- fit(rjst, 100)
Break down the sentiment series obtained with
sentiment_series()
into topical components. Sentiment is broken down at
the document level using estimated topic proportions, then processed to
create a time series and its components.
sentiment_breakdown( x, period = c("year", "quarter", "month", "day", "identity"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), plot = c(FALSE, TRUE, "silent"), as.xts = TRUE, ... ) plot_sentiment_breakdown( x, period = c("year", "quarter", "month", "day"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), ... )
sentiment_breakdown( x, period = c("year", "quarter", "month", "day", "identity"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), plot = c(FALSE, TRUE, "silent"), as.xts = TRUE, ... ) plot_sentiment_breakdown( x, period = c("year", "quarter", "month", "day"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), ... )
x |
a |
period |
the sampling period within which the sentiment of documents
will be averaged. |
rolling_window |
if greater than 1, determines the rolling window to compute a moving average of sentiment. The rolling window is based on the period unit and rely on actual dates (i.e, is not affected by unequally spaced data points). |
scale |
if |
scaling_period |
the date range over which the scaling should be applied. Particularly useful to normalize only the beginning of the time series. |
plot |
if |
as.xts |
if |
... |
other arguments passed on to |
The sentiment is broken down at the sentiment level assuming the following composition:
, where
is the sentiment of topic i and
the proportion of
topic i in a given document. For an LDA model, the sentiment of each topic
is considered equal to the document sentiment (i.e.
). The topical sentiment attention, defined by
represent the effective sentiment conveyed by a topic in a
document. The topical sentiment attention of all documents in a period are
averaged to compute the breakdown of the sentiment time series.
A time series of sentiment, stored as an xts::xts object or as a data.frame.
sentopics_sentiment sentopics_date
Other series functions:
proportion_topics()
,
sentiment_series()
,
sentiment_topics()
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) sentiment_breakdown(lda) # plot shortcut plot_sentiment_breakdown(lda) # also available for rJST models (with topic-level sentiment) rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) sentopics_sentiment(rjst, override = TRUE) plot_sentiment_breakdown(rjst)
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) sentiment_breakdown(lda) # plot shortcut plot_sentiment_breakdown(lda) # also available for rJST models (with topic-level sentiment) rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) sentopics_sentiment(rjst, override = TRUE) plot_sentiment_breakdown(rjst)
Compute a sentiment time series based on the internal sentiment
and dates of a sentopicmodel
. The time series computation supports
multiple sampling period and optionally allow computing a moving average.
sentiment_series( x, period = c("year", "quarter", "month", "day"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), as.xts = TRUE, ... )
sentiment_series( x, period = c("year", "quarter", "month", "day"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), as.xts = TRUE, ... )
x |
a |
period |
the sampling period within which the sentiment of documents
will be averaged. |
rolling_window |
if greater than 1, determines the rolling window to compute a moving average of sentiment. The rolling window is based on the period unit and rely on actual dates (i.e, is not affected by unequally spaced data points). |
scale |
if |
scaling_period |
the date range over which the scaling should be applied. Particularly useful to normalize only the beginning of the time series. |
as.xts |
if |
... |
other arguments passed on to |
A time series of sentiment, stored as an xts::xts or data.frame.
sentopics_sentiment sentopics_date
Other series functions:
proportion_topics()
,
sentiment_breakdown()
,
sentiment_topics()
lda <- LDA(ECB_press_conferences_tokens) series <- sentiment_series(lda, period = "month") # JST and rJST models can use computed sentiment from the sentiment layer, # but the model must be estimated first. rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) sentiment_series(rjst) sentopics_sentiment(rjst) <- NULL ## remove existing sentiment rjst <- fit(rjst, 10) ## estimating the model is then needed sentiment_series(rjst) # note the presence of both raw and scaled sentiment values # in the initial object sentopics_sentiment(lda) sentopics_sentiment(rjst)
lda <- LDA(ECB_press_conferences_tokens) series <- sentiment_series(lda, period = "month") # JST and rJST models can use computed sentiment from the sentiment layer, # but the model must be estimated first. rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) sentiment_series(rjst) sentopics_sentiment(rjst) <- NULL ## remove existing sentiment rjst <- fit(rjst, 10) ## estimating the model is then needed sentiment_series(rjst) # note the presence of both raw and scaled sentiment values # in the initial object sentopics_sentiment(lda) sentopics_sentiment(rjst)
Derive topical time series of sentiment from a LDA()
or
rJST()
model. The time series are created by leveraging on estimated
topic proportions and internal sentiment (for LDA
models) or topical
sentiment (for rJST
models).
sentiment_topics( x, period = c("year", "quarter", "month", "day", "identity"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), plot = c(FALSE, TRUE, "silent"), plot_ridgelines = TRUE, as.xts = TRUE, ... ) plot_sentiment_topics( x, period = c("year", "quarter", "month", "day"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), plot_ridgelines = TRUE, ... )
sentiment_topics( x, period = c("year", "quarter", "month", "day", "identity"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), plot = c(FALSE, TRUE, "silent"), plot_ridgelines = TRUE, as.xts = TRUE, ... ) plot_sentiment_topics( x, period = c("year", "quarter", "month", "day"), rolling_window = 1, scale = TRUE, scaling_period = c("1900-01-01", "2099-12-31"), plot_ridgelines = TRUE, ... )
x |
a |
period |
the sampling period within which the sentiment of documents
will be averaged. |
rolling_window |
if greater than 1, determines the rolling window to compute a moving average of sentiment. The rolling window is based on the period unit and rely on actual dates (i.e, is not affected by unequally spaced data points). |
scale |
if |
scaling_period |
the date range over which the scaling should be applied. Particularly useful to normalize only the beginning of the time series. |
plot |
if |
plot_ridgelines |
if |
as.xts |
if |
... |
other arguments passed on to |
A topical sentiment is computed at the document level for each
topic. For an LDA model, the sentiment of each topic is considered equal to
the document sentiment (i.e. ). For a rJST
model, these result from the proportions in the sentiment layer under each
topic. To compute the topical time series, the topical sentiment of all
documents in a period are aggregated according to their respective topic
proportion. For example, for a given topic, the topical sentiment in period
is computed using:
, where is the sentiment of
the topic in document d and
the topic proportion in a
document d.
an xts::xts or data.frame containing the time series of topical sentiments.
sentopics_sentiment sentopics_date
Other series functions:
proportion_topics()
,
sentiment_breakdown()
,
sentiment_series()
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) sentiment_topics(lda) # plot shortcut plot_sentiment_topics(lda, period = "month", rolling_window = 3) # with or without ridgelines plot_sentiment_topics(lda, period = "month", plot_ridgelines = FALSE) # also available for rJST models with internal sentiment computation rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) sentopics_sentiment(rjst, override = TRUE) sentiment_topics(rjst)
lda <- LDA(ECB_press_conferences_tokens) lda <- fit(lda, 100) sentiment_topics(lda) # plot shortcut plot_sentiment_topics(lda, period = "month", rolling_window = 3) # with or without ridgelines plot_sentiment_topics(lda, period = "month", plot_ridgelines = FALSE) # also available for rJST models with internal sentiment computation rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) sentopics_sentiment(rjst, override = TRUE) sentiment_topics(rjst)
Extract or replace the internal dates of a sentopicmodel
. The
internal dates are used to create time series using the functions
sentiment_series()
or sentiment_topics()
. Dates should be provided by
using sentopics_date(x) <- value
or by storing a '.date' docvars in
the tokens object used to create the model.
sentopics_date(x, include_docvars = FALSE) sentopics_date(x) <- value
sentopics_date(x, include_docvars = FALSE) sentopics_date(x) <- value
x |
a |
include_docvars |
if |
value |
a |
a data.frame with the stored date per document.
The internal date is stored internally in the docvars of the topic
model. This means that dates may also be accessed through the docvars()
function, although this is discouraged.
Olivier Delmarcelle
Other sentopics helpers:
sentopics_labels()
,
sentopics_sentiment()
# example dataset already contains ".date" docvar docvars(ECB_press_conferences_tokens) # dates are automatically stored in the sentopicmodel object lda <- LDA(ECB_press_conferences_tokens) sentopics_date(lda) # dates can be removed or modified by the assignment operator sentopics_date(lda) <- NULL sentopics_date(lda) <- docvars(ECB_press_conferences_tokens, ".date")
# example dataset already contains ".date" docvar docvars(ECB_press_conferences_tokens) # dates are automatically stored in the sentopicmodel object lda <- LDA(ECB_press_conferences_tokens) sentopics_date(lda) # dates can be removed or modified by the assignment operator sentopics_date(lda) <- NULL sentopics_date(lda) <- docvars(ECB_press_conferences_tokens, ".date")
Extract or replace the labels of a sentopicmodel
. The replaced
labels will appear in most functions dealing with the output of the
sentomicmodel
.
sentopics_labels(x, flat = TRUE) sentopics_labels(x) <- value
sentopics_labels(x, flat = TRUE) sentopics_labels(x) <- value
x |
a |
flat |
if FALSE, return a list of dimension labels instead of a character vector. |
value |
a list of future labels for the topic model. The list should be named and contain a character vector for each dimension to label. See the examples for a correct usage. |
a character vector of topic/sentiment labels.
Olivier Delmarcelle
mergeTopics
Other sentopics helpers:
sentopics_date()
,
sentopics_sentiment()
# by default, sentopics_labels() generate standard topic names lda <- LDA(ECB_press_conferences_tokens) sentopics_labels(lda) # to change labels, a named list must be provided sentopics_labels(lda) <- list( topic = paste0("superTopic", 1:lda$K) ) sentopics_labels(lda) # using NULL remove labels sentopics_labels(lda) <- NULL sentopics_labels(lda) # also works for JST/rJST models jst <- JST(ECB_press_conferences_tokens) sentopics_labels(jst) <- list( topic = paste0("superTopic", 1:jst$K), sentiment = c("negative", "neutral", "positive") ) sentopics_labels(jst) # setting flat = FALSE return a list or labels for each dimension sentopics_labels(jst, flat = FALSE)
# by default, sentopics_labels() generate standard topic names lda <- LDA(ECB_press_conferences_tokens) sentopics_labels(lda) # to change labels, a named list must be provided sentopics_labels(lda) <- list( topic = paste0("superTopic", 1:lda$K) ) sentopics_labels(lda) # using NULL remove labels sentopics_labels(lda) <- NULL sentopics_labels(lda) # also works for JST/rJST models jst <- JST(ECB_press_conferences_tokens) sentopics_labels(jst) <- list( topic = paste0("superTopic", 1:jst$K), sentiment = c("negative", "neutral", "positive") ) sentopics_labels(jst) # setting flat = FALSE return a list or labels for each dimension sentopics_labels(jst, flat = FALSE)
Compute, extract or replace the internal sentiment of a
sentopicmodel
. The internal sentiment is used to create time series using
the functions sentiment_series()
or sentiment_topics()
. If the input
model contains a sentiment layer, sentiment can be computed directly from
the output of the model. Otherwise, sentiment obtained externally should be
added for each document.
sentopics_sentiment( x, method = c("proportional", "proportionalPol"), override = FALSE, quiet = FALSE, include_docvars = FALSE ) sentopics_sentiment(x) <- value
sentopics_sentiment( x, method = c("proportional", "proportionalPol"), override = FALSE, quiet = FALSE, include_docvars = FALSE ) sentopics_sentiment(x) <- value
x |
a |
method |
the method used to compute sentiment, see "Methods" below.
Ignored if an internal sentiment is already stored, unless |
override |
by default, the function computes sentiment only if no
internal sentiment is already stored within the |
quiet |
if |
include_docvars |
if |
value |
a numeric vector of sentiment to input into the model. |
The computed sentiment varies depending on the model. For LDA, sentiment computation is not possible.
For JST, the sentiment is computed on a per-document basis according to the document-level sentiment mixtures.
For a rJST model, a sentiment is computed for each topic, resulting in
K
sentiment values per document. In that case, the .sentiment
column is
an average of the K
sentiment values, weighted by they respective topical
proportions.
A data.frame with the stored sentiment per document.
The function accepts two methods of computing sentiment:
proportional computes the difference between the estimated positive and negative proportions for each document (and possibly each topic).
proportionalPol computes the difference between positive and negative proportions, divided by the sum of positive and negative proportions. As a result, the computed sentiment lies within the (-1;1) interval.
Both methods will lead to the same result for a JST model containing only negative and positive sentiments.
The internal sentiment is stored internally in the docvars of the
topic model. This means that sentiment may also be accessed through the
docvars()
function, although this is discouraged.
Olivier Delmarcelle
Other sentopics helpers:
sentopics_date()
,
sentopics_labels()
# example dataset already contains ".sentiment" docvar docvars(ECB_press_conferences_tokens) # sentiment is automatically stored in the sentopicmodel object lda <- LDA(ECB_press_conferences_tokens) sentopics_sentiment(lda) # sentiment can be removed or modified by the assignment operator sentopics_sentiment(lda) <- NULL sentopics_sentiment(lda) <- docvars(ECB_press_conferences_tokens, ".sentiment") # for JST models, sentiment can be computed from the output of the model jst <- JST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) jst <- fit(jst, 100) sentopics_sentiment(jst, override = TRUE) # replace existing sentiment ## for rJST models one sentiment value is computed by topic rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) sentopics_sentiment(rjst, override = TRUE)
# example dataset already contains ".sentiment" docvar docvars(ECB_press_conferences_tokens) # sentiment is automatically stored in the sentopicmodel object lda <- LDA(ECB_press_conferences_tokens) sentopics_sentiment(lda) # sentiment can be removed or modified by the assignment operator sentopics_sentiment(lda) <- NULL sentopics_sentiment(lda) <- docvars(ECB_press_conferences_tokens, ".sentiment") # for JST models, sentiment can be computed from the output of the model jst <- JST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) jst <- fit(jst, 100) sentopics_sentiment(jst, override = TRUE) # replace existing sentiment ## for rJST models one sentiment value is computed by topic rjst <- rJST(ECB_press_conferences_tokens, lexicon = LoughranMcDonald) rjst <- fit(rjst, 100) sentopics_sentiment(rjst, override = TRUE)
Extract the top words in each topic/sentiment from a
sentopicmodel
.
topWords( x, nWords = 10, method = c("frequency", "probability", "term-score", "FREX"), output = c("data.frame", "plot", "matrix"), subset, w = 0.5 ) plot_topWords( x, nWords = 10, method = c("frequency", "probability", "term-score", "FREX"), subset, w = 0.5 )
topWords( x, nWords = 10, method = c("frequency", "probability", "term-score", "FREX"), output = c("data.frame", "plot", "matrix"), subset, w = 0.5 ) plot_topWords( x, nWords = 10, method = c("frequency", "probability", "term-score", "FREX"), subset, w = 0.5 )
x |
|
nWords |
the number of top words to extract |
method |
specify if a re-ranking function should be applied before returning the top words. See Details for a description of each method. |
output |
determines the output of the function |
subset |
allows to subset using a logical expression, as in |
w |
only used when |
"frequency"
ranks top words according to their frequency
within a topic. This method also reports the overall frequency of
each word. When returning a plot, the overall frequency is
represented with a grey bar.
"probability"
uses the estimated topic-word mixture to
rank top words.
"term-score"
implements the re-ranking method from Blei and
Lafferty (2009). This method down-weights terms that have high
probability in all topics using the following score:
for
topic , vocabulary word
and number of topics
.
"FREX"
implements the re-ranking method from Bischof and Airoldi
(2012). This method used the weight to balance between
topic-word probability and topic exclusivity using the following
score:
for
topic , vocabulary word
, number of topics
and
weight
, where
is the empirical cumulative
distribution function.
The top words of the topic model. Depending on the output chosen, can
result in either a long-style data.frame, a ggplot2
object or a matrix.
Olivier Delmarcelle
Blei, DM. and Lafferty, JD. (2009). Topic models.. In Text Mining, chapter 4, 101–124.
Bischof JM. and Airoldi, EM. (2012). Summarizing Topical Content with Word Frequency and Exclusivity.. In Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML'12, 9–16.
melt.sentopicmodel()
for extracting estimated mixtures
model <- LDA(ECB_press_conferences_tokens) model <- fit(model, 10) topWords(model) topWords(model, output = "matrix") topWords(model, method = "FREX") plot_topWords(model) plot_topWords(model, subset = topic %in% 1:2) jst <- JST(ECB_press_conferences_tokens) jst <- fit(jst, 10) plot_topWords(jst) plot_topWords(jst, subset = topic %in% 1:2 & sentiment == 3)
model <- LDA(ECB_press_conferences_tokens) model <- fit(model, 10) topWords(model) topWords(model, output = "matrix") topWords(model, method = "FREX") plot_topWords(model) plot_topWords(model, subset = topic %in% 1:2) jst <- JST(ECB_press_conferences_tokens) jst <- fit(jst, 10) plot_topWords(jst) plot_topWords(jst, subset = topic %in% 1:2 & sentiment == 3)