Type: Package
Title: Scaling Models and Classifiers for Textual Data
Version: 0.9.10
Description: Scaling models and classifiers for sparse matrix objects representing textual data in the form of a document-feature matrix. Includes original implementations of 'Laver', 'Benoit', and Garry's (2003) <doi:10.1017/S0003055403000698> 'Wordscores' model, the Perry and 'Benoit' (2017) <doi:10.48550/arXiv.1710.08963> class affinity scaling model, and the 'Slapin' and 'Proksch' (2008) <doi:10.1111/j.1540-5907.2008.00338.x> 'wordfish' model, as well as methods for correspondence analysis, latent semantic analysis, and fast Naive Bayes and linear 'SVMs' specially designed for sparse textual data.
Depends: R (≥ 3.1.0), methods
Imports: glmnet, Matrix (≥ 1.2), quanteda (≥ 4.0.0), RSpectra, Rcpp (≥ 0.12.12), stringi
Suggests: ca, covr, fastNaiveBayes, knitr, lsa, microbenchmark, naivebayes, quanteda.textplots, spelling, testthat, rmarkdown
LinkingTo: Rcpp, RcppArmadillo (≥ 0.7.600.1.0), quanteda
URL: https://github.com/quanteda/quanteda.textmodels
License: GPL-3
Encoding: UTF-8
LazyData: true
Language: en-GB
RoxygenNote: 7.3.2
Collate: 'RcppExports.R' 'quanteda.textmodels-package.R' 'data-documentation.R' 'textmodel-methods.R' 'textmodel_affinity.R' 'textmodel_ca.R' 'textmodel_lsa.R' 'textmodel_lr.R' 'textmodel_nb.R' 'textmodel_svmlin.R' 'textmodel_wordfish.R' 'textmodel_wordscores.R' 'textplot_influence.R' 'utils.R'
VignetteBuilder: knitr
NeedsCompilation: yes
Packaged: 2025-02-10 18:45:31 UTC; kbenoit
Author: Kenneth Benoit
Maintainer: Kenneth Benoit <kbenoit@quanteda.org>
Repository: CRAN
Date/Publication: 2025-02-10 23:50:11 UTC
quanteda.textmodels: Scaling Models and Classifiers for Textual Data
Description
Scaling models and classifiers for sparse matrix objects representing textual data in the form of a document-feature matrix. Includes original implementations of 'Laver', 'Benoit', and Garry's (2003) doi:10.1017/S0003055403000698 'Wordscores' model, the Perry and 'Benoit' (2017) doi:10.48550/arXiv.1710.08963 class affinity scaling model, and the 'Slapin' and 'Proksch' (2008) doi:10.1111/j.1540-5907.2008.00338.x 'wordfish' model, as well as methods for correspondence analysis, latent semantic analysis, and fast Naive Bayes and linear 'SVMs' specially designed for sparse textual data.
Author(s)
Maintainer: Kenneth Benoit <kbenoit@smu.edu.sg> [copyright holder]
Authors:
Kohei Watanabe <watanabe.kohei@gmail.com>
Haiyan Wang <whyinsa@yahoo.com>
Patrick O. Perry <patperry@gmail.com>
Benjamin Lauderdale <b.e.lauderdale@lse.ac.uk>
Johannes Gruber <JohannesB.Gruber@gmail.com>
William Lowe <lowe@hertie-school.org>
Other contributors:
Vikas Sindhwani <vikas.sindhwani@gmail.com> (authored svmlin C++ source code) [copyright holder]
European Research Council (ERC-2011-StG 283794-QUANTESS) [funder]
See Also
Useful links:
https://github.com/quanteda/quanteda.textmodels
Internal function to fit the likelihood scaling mixture model.
Description
This function is internal; use textmodel_affinity() instead.
Usage
affinity(p, x, smooth = 0.5, verbose = FALSE)
Arguments
p: word likelihoods within classes, estimated from training data
x: term-document matrix for document(s) to be scaled
smooth: a smoothing parameter, either a scalar or a vector equal in length to the number of documents
Value
a list containing:
- coefficients: point estimates of theta
- se: (likelihood) standard errors of theta
- cov: covariance matrix
- smooth: values of the smoothing parameter
- support: logical indicating whether each feature was included
Author(s)
Patrick Perry
Examples
p <- matrix(c(c(5/6, 0, 1/6), c(0, 4/5, 1/5)), nrow = 3,
dimnames = list(c("A", "B", "C"), NULL))
theta <- c(.2, .8)
q <- drop(p %*% theta)
x <- 2 * q
(fit <- affinity(p, x))
Coerce various objects to coefficients_textmodel
Description
Helper functions used in summary.textmodel_*().
Usage
as.coefficients_textmodel(x)
Arguments
x: an object to be coerced
Value
an object with the class tag of coefficients_textmodel
Coerce various objects to statistics_textmodel
Description
This is a helper function used in summary.textmodel_*().
Usage
as.statistics_textmodel(x)
Arguments
x: an object to be coerced
Value
an object of class statistics_textmodel
Assign the summary.textmodel class to a list
Description
Assigns the class summary.textmodel to a list.
Usage
as.summary.textmodel(x)
Arguments
x: a named list
Value
an object of class summary.textmodel
Extract model coefficients from a fitted textmodel_ca object
Description
coef() extracts model coefficients from a fitted textmodel_ca object. coefficients() is an alias.
Usage
## S3 method for class 'textmodel_ca'
coef(object, doc_dim = 1, feat_dim = 1, ...)
coefficients.textmodel_ca(object, doc_dim = 1, feat_dim = 1, ...)
Arguments
object: a fitted textmodel_ca object
doc_dim, feat_dim: the document and feature dimension scores to be extracted
...: unused
Value
a list containing numeric vectors of feature and document coordinates. Includes NA vectors of standard errors for consistency with other models' coefficient outputs, and for the possibility of having these computed in the future.
- coef_feature: column coordinates of the features
- coef_feature_se: feature-length vector of NA values
- coef_document: row coordinates of the documents
- coef_document_se: document-length vector of NA values
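Examples
# illustrative sketch (not from the original manual), assuming the
# data_corpus_irishbudget2010 corpus documented in this package
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
tmod <- textmodel_ca(dfmat)
head(coef(tmod, doc_dim = 1)$coef_document)
head(coef(tmod, feat_dim = 1)$coef_feature)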
Crowd-labelled sentence corpus from a 2010 EP debate on coal subsidies
Description
A multilingual text corpus of speeches from a European Parliament debate on coal subsidies in 2010, with individual crowd codings as the unit of observation. The sentences are drawn from officially translated speeches from a European Parliament debate concerning a Commission report proposing an extension to a regulation permitting state aid to uncompetitive coal mines.
Each speech is available in six languages: English, German, Greek, Italian, Polish and Spanish. The unit of observation is the individual crowd coding of each natural sentence. For more information on the coding approach see Benoit et al. (2016).
Usage
data_corpus_EPcoaldebate
Format
The corpus consists of 16,806 documents (i.e. codings of a sentence) and includes the following document-level variables:
- sentence_id: character; a unique identifier for each sentence
- crowd_subsidy_label: factor; whether a coder labelled the sentence as "Pro-Subsidy", "Anti-Subsidy" or "Neutral or inapplicable"
- language: factor; the language (translation) of the speech
- name_last: character; speaker's last name
- name_first: character; speaker's first name
- ep_group: factor; abbreviation of the EP party group of the speaker
- country: factor; the speaker's country of origin
- vote: factor; the speaker's vote on the proposal (For/Against/Abstain/NA)
- coder_id: character; a unique identifier for each crowd coder
- coder_trust: numeric; the "trust score" from the Crowdflower platform used to code the sentences, which can theoretically range between 0 and 1. Only coders with trust scores above 0.8 are included in the corpus.
A corpus object.
References
Benoit, K., Conway, D., Lauderdale, B.E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data. American Political Science Review, 110(2), 278–295. doi:10.1017/S0003055416000058
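Examples
# illustrative sketch: distribution of the crowd labels documented above
library("quanteda")
table(docvars(data_corpus_EPcoaldebate, "crowd_subsidy_label"))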
Confidence debate from 1991 Irish Parliament
Description
Texts of speeches from a no-confidence motion debated in the Irish Dáil from 16-18 October 1991 over the future of the Fianna Fáil-Progressive Democrat coalition. (See Laver and Benoit 2002 for details.)
Usage
data_corpus_dailnoconf1991
Format
data_corpus_dailnoconf1991 is a corpus with 58 texts, including docvars for name, party, and position.
Source
https://www.oireachtas.ie/en/debates/debate/dail/1991-10-16/10/
References
Laver, M. & Benoit, K.R. (2002). Locating TDs in Policy Spaces: Wordscoring Dáil Speeches. Irish Political Studies, 17(1), 59–73.
Laver, M., Benoit, K.R., & Garry, J. (2003). Estimating Policy Positions from Political Text using Words as Data. American Political Science Review, 97(2), 311–331.
Examples
## Not run:
library("quanteda")
data_dfm_dailnoconf1991 <- data_corpus_dailnoconf1991 %>%
tokens(remove_punct = TRUE) %>%
dfm()
tmod <- textmodel_affinity(data_dfm_dailnoconf1991,
c("Govt", "Opp", "Opp", rep(NA, 55)))
(pred <- predict(tmod))
dat <-
data.frame(party = as.character(docvars(data_corpus_dailnoconf1991, "party")),
govt = coef(pred)[, "Govt"],
position = as.character(docvars(data_corpus_dailnoconf1991, "position")))
bymedian <- with(dat, reorder(paste(party, position), govt, median))
oldpar <- par(no.readonly = TRUE)
par(mar = c(5, 6, 4, 2) + .1)
boxplot(govt ~ bymedian, data = dat,
horizontal = TRUE, las = 1,
xlab = "Degree of support for government",
ylab = "")
abline(h = 7.5, col = "red", lty = "dashed")
text(c(0.9, 0.9), c(8.5, 6.5), c("Government", "Opposition"))
par(oldpar)
## End(Not run)
Irish budget speeches from 2010
Description
Speeches and document-level variables from the debate over the Irish budget of 2010.
Usage
data_corpus_irishbudget2010
Format
The corpus object for the 2010 budget speeches, with document-level variables for year, debate, serial number, first and last name of the speaker, and the speaker's party.
Details
At the time of the debate, Fianna Fáil (FF) and the Greens formed the government coalition, while Fine Gael (FG), Labour (LAB), and Sinn Féin (SF) were in opposition.
Source
Dáil Éireann Debate, Budget Statement 2010. 9 December 2009. vol. 697, no. 3.
References
Lowe, W. & Benoit, K.R. (2013). Validating Estimates of Latent Traits From Textual Data Using Human Judgment as a Benchmark. Political Analysis, 21(3), 298–313. doi:10.1093/pan/mpt002.
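Examples
# illustrative sketch (not from the original manual); assumes the corpus
# carries a "party" docvar, per the Format description above
library("quanteda")
summary(data_corpus_irishbudget2010, n = 5)
table(docvars(data_corpus_irishbudget2010, "party"))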
Movie reviews with polarity from Pang and Lee (2004)
Description
A corpus object containing 2,000 movie reviews classified by positive or negative sentiment.
Usage
data_corpus_moviereviews
Format
The corpus includes the following document variables:
- sentiment: factor indicating whether a review was manually classified as positive (pos) or negative (neg)
- id1: character counting the position in the corpus
- id2: random number for each review
Details
For more information, see cat(meta(data_corpus_moviereviews, "readme")).
Source
https://www.cs.cornell.edu/people/pabo/movie-review-data/
References
Pang, B. & Lee, L. (2004). A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the ACL.
Examples
# check polarities
table(data_corpus_moviereviews$sentiment)
# make the data into sentences, because each line is a sentence
data_corpus_moviereviewsents <-
quanteda::corpus_segment(data_corpus_moviereviews, "\n", extract_pattern = FALSE)
print(data_corpus_moviereviewsents, max_ndoc = 3)
Internal function to match a dfm features to a target set
Description
Takes a dfm and a set of features, and makes them match the features listed in the set.
Usage
force_conformance(x, features, force = TRUE)
Arguments
x: input dfm
features: character; a vector of feature names
force: logical; if TRUE, make the features of x conform to the target feature set
Value
a dfm from the quanteda package containing only features as columns, in the same order as features. A warning message is printed if some feature names from features are not matched in x.
Examples
quanteda.textmodels:::force_conformance(quanteda::data_dfm_lbgexample, c("C", "B", "Z"))
Compute feature influence from a predicted textmodel_affinity object
Description
Computes the influence of features on scaled textmodel_affinity()
applications.
Usage
## S3 method for class 'predict.textmodel_affinity'
influence(model, subset = !train, ...)
Arguments
model: a predicted textmodel_affinity() object
subset: whether to use all data or a subset (for instance, exclude the training set)
...: unused
Value
a named list classed as influence.predict.textmodel_affinity that contains:
- norm: a document-by-feature-class sparse matrix of normalised influence measures
- count: a vector of counts of each non-zero feature in the input matrix
- rate: the normalised feature count for each non-zero feature in the input matrix
- mode: an integer vector of 1 or 2 indicating the class which the feature is influencing, for each non-zero feature
- levels: a character vector of the affinity class labels
- subset: a logical vector indicating whether the document was included in the computation of influence; FALSE for documents assigned a class label in training the model
- support: logical vector for each feature matching the same return from predict.textmodel_affinity
Examples
tmod <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA))
pred <- predict(tmod)
influence(pred)
Prediction for a fitted affinity textmodel
Description
Estimate \theta_i
for each document, from a fitted
textmodel_affinity object.
Other methods below provide standard ways to extract or compute quantities from predicted textmodel_affinity objects.
Usage
## S3 method for class 'textmodel_affinity'
predict(object, newdata = NULL, level = 0.95, ...)
## S3 method for class 'predict.textmodel_affinity'
coef(object, ...)
## S3 method for class 'predict.textmodel_affinity'
residuals(object, type = c("response", "pearson"), ...)
## S3 method for class 'predict.textmodel_affinity'
rstandard(model, ...)
Arguments
object: a fitted affinity textmodel
newdata: dfm on which prediction should be made
level: probability level for confidence interval width
...: unused
type: see residuals.lm
Value
predict() returns a list of predicted affinity textmodel quantities, containing:
- coefficients: a numeric matrix of affinity estimates (coefficients) for each class (columns) for each document (rows)
- se: a numeric matrix of likelihood standard errors for the affinity coefficients of each class (columns) for each document (rows)
- cov: an array of covariance matrices for each affinity class, one per document
- smooth: a numeric vector of length two for the smoothing parameters smooth and ref_smooth from textmodel_affinity()
- newdata: a dfm on which prediction has been made
- train: a logical vector indicating which documents were used in training the model
- level: the confidence level for computing standard errors
- p: the p return from textmodel_affinity
- support: logical vector indicating whether a feature was included in computing class affinities
coef() returns a document \times class matrix of class affinities for each document.
residuals() returns a document-by-feature matrix of residuals. resid() is an alias.
rstandard() is a shortcut to return the Pearson residuals.
See Also
influence.predict.textmodel_affinity()
for methods of
computing the influence of particular features from a predicted
textmodel_affinity model.
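Examples
# illustrative sketch of the predict, coef, and residuals methods described above
tmod <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA))
pred <- predict(tmod)
coef(pred)
residuals(pred)[, 1:5]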
Prediction from a fitted textmodel_lr object
Description
predict.textmodel_lr()
implements class predictions from a fitted
logistic regression model.
Usage
## S3 method for class 'textmodel_lr'
predict(
object,
newdata = NULL,
type = c("class", "probability"),
force = TRUE,
...
)
## S3 method for class 'textmodel_lr'
coef(object, ...)
## S3 method for class 'textmodel_lr'
coefficients(object, ...)
Arguments
object: a fitted logistic regression textmodel
newdata: dfm on which prediction should be made
type: the type of predicted values to be returned; see Value
force: make newdata's feature set conformant to the model terms
...: not used
Value
predict.textmodel_lr() returns either a vector of class predictions for each row of newdata (when type = "class"), or a document-by-class matrix of class probabilities (when type = "probability").
coef.textmodel_lr() returns a (sparse) matrix of coefficients for each feature, computed at the value of the penalty parameter fitted in the model. For binary outcomes, results are returned only for the class corresponding to the second level of the factor response; for multinomial outcomes, these are computed for each class.
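Examples
# illustrative sketch (not from the original manual), reusing the
# setup from the textmodel_lr() example elsewhere in this manual
library("quanteda")
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "London England Chinese",
                 d6 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))
dfmat <- dfm(tokens(corp), tolower = FALSE)
set.seed(1)
dfmat <- dfm_sample(dfmat, 50, replace = TRUE)  # enlarge the tiny sample before fitting
tmod <- textmodel_lr(dfmat, docvars(dfmat, "train"))
predict(tmod, type = "probability")
coef(tmod)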
Prediction from a fitted textmodel_nb object
Description
predict.textmodel_nb() implements class predictions from a fitted Naive Bayes model.
Usage
## S3 method for class 'textmodel_nb'
predict(
object,
newdata = NULL,
type = c("class", "probability", "logposterior"),
force = FALSE,
...
)
## S3 method for class 'textmodel_nb'
coef(object, ...)
## S3 method for class 'textmodel_nb'
coefficients(object, ...)
Arguments
object: a fitted Naive Bayes textmodel
newdata: dfm on which prediction should be made
type: the type of predicted values to be returned; see Value
force: make newdata's feature set conformant to the model terms
...: not used
Value
predict.textmodel_nb() returns either a vector of class predictions for each row of newdata (when type = "class"), a document-by-class matrix of class probabilities (when type = "probability"), or log posterior likelihoods (when type = "logposterior").
coef.textmodel_nb() returns a matrix of estimated word likelihoods given the class. (In earlier versions, this was named PwGc.)
Examples
# application to LBG (2003) example data
(tmod <- textmodel_nb(quanteda::data_dfm_lbgexample, y = c("A", "A", "B", "C", "C", NA)))
predict(tmod)
predict(tmod, type = "logposterior")
Prediction from a fitted textmodel_svmlin object
Description
predict.textmodel_svmlin()
implements class predictions from a fitted
linear SVM model.
Usage
## S3 method for class 'textmodel_svmlin'
predict(
object,
newdata = NULL,
type = c("class", "probability"),
force = FALSE,
...
)
Arguments
object: a fitted linear SVM textmodel
newdata: dfm on which prediction should be made
type: the type of predicted values to be returned; see Value
force: logical; if TRUE, make newdata's feature set conformant to the model terms
...: not used
Value
predict.textmodel_svmlin returns either a vector of class predictions for each row of newdata (when type = "class"), or a document-by-class matrix of class probabilities (when type = "probability").
Prediction from a textmodel_wordfish method
Description
predict.textmodel_wordfish()
returns estimated document scores and
confidence intervals. The method is provided for consistency with other
textmodel_*()
methods, but does not currently allow prediction on
out-of-sample data.
Usage
## S3 method for class 'textmodel_wordfish'
predict(
object,
se.fit = FALSE,
interval = c("none", "confidence"),
level = 0.95,
...
)
## S3 method for class 'textmodel_wordfish'
coef(object, margin = c("both", "documents", "features"), ...)
coefficients.textmodel_wordfish(object, ...)
Arguments
object: a fitted wordfish model
se.fit: if TRUE, return standard errors as well
interval: type of confidence interval calculation
level: tolerance/confidence level for intervals
...: not used
margin: which margin of parameter estimates to return: both (in a list), or just document or feature parameters
Value
coef.textmodel_wordfish() returns a matrix of estimated parameter coefficients for the specified margin.
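Examples
# illustrative sketch (not from the original manual): extracting estimates by margin
tmod <- textmodel_wordfish(quanteda::data_dfm_lbgexample, dir = c(1, 5))
coef(tmod, margin = "documents")
head(coef(tmod, margin = "features"))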
Predict textmodel_wordscores
Description
Predict textmodel_wordscores
Usage
## S3 method for class 'textmodel_wordscores'
predict(
object,
newdata = NULL,
se.fit = FALSE,
interval = c("none", "confidence"),
level = 0.95,
rescaling = c("none", "lbg", "mv"),
force = TRUE,
...
)
Arguments
object: a fitted Wordscores textmodel
newdata: dfm on which prediction should be made
se.fit: if TRUE, return standard errors as well
interval: type of confidence interval calculation
level: tolerance/confidence level for intervals
rescaling: "none" (default) for "raw" text scores, "lbg" for the rescaling proposed by Laver, Benoit and Garry (2003), or "mv" for the rescaling proposed by Martin and Vanberg (2007) (see References)
force: make the feature set of newdata conform to the model terms
...: not used
Value
predict.textmodel_wordscores() returns a named vector of predicted document scores ("text scores" S_{vd} in LBG 2003), or a named list if se.fit = TRUE consisting of the predicted scores ($fit) and the associated standard errors ($se.fit). When interval = "confidence", the predicted values will be a matrix. This behaviour matches that of stats::predict.lm().
Examples
tmod <- textmodel_wordscores(quanteda::data_dfm_lbgexample, c(seq(-1.5, 1.5, .75), NA))
predict(tmod)
predict(tmod, rescaling = "mv")
predict(tmod, rescaling = "lbg")
predict(tmod, se.fit = TRUE)
predict(tmod, se.fit = TRUE, interval = "confidence")
predict(tmod, se.fit = TRUE, interval = "confidence", rescaling = "lbg")
Print methods for textmodel features estimates
Description
This is a helper function used in print.summary.textmodel.
Usage
## S3 method for class 'coefficients_textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
Arguments
x: a coefficients_textmodel object
digits: minimal number of significant digits, see print.default()
...: additional arguments not used
Implements print methods for textmodel_statistics
Description
Implements print methods for textmodel_statistics
Usage
## S3 method for class 'statistics_textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
Arguments
x: a textmodel_wordscore_statistics object
digits: minimal number of significant digits, see print.default()
...: further arguments passed to or from other methods
print method for summary.textmodel
Description
print method for summary.textmodel
Usage
## S3 method for class 'summary.textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
Arguments
x: a summary.textmodel object
digits: minimal number of significant digits, see print.default()
...: additional arguments not used
print method for a wordfish model
Description
print method for a wordfish model
Usage
## S3 method for class 'textmodel_wordfish'
print(x, ...)
Arguments
x: for print method, the object to be printed
...: unused
summary method for textmodel_lr objects
Description
summary method for textmodel_lr objects
Usage
## S3 method for class 'textmodel_lr'
summary(object, n = 30, ...)
Arguments
object: output from textmodel_lr()
n: how many coefficients to print before truncating
...: additional arguments not used
Value
a summary.textmodel classed list containing elements from the call to textmodel_lr(), including the call, statistics for lambda, and the estimated feature scores
summary method for textmodel_nb objects
Description
summary method for textmodel_nb objects
Usage
## S3 method for class 'textmodel_nb'
summary(object, n = 30, ...)
Arguments
object: output from textmodel_nb()
n: how many coefficients to print before truncating
...: additional arguments not used
Value
a summary.textmodel classed list containing the call, the class priors, and the estimated feature scores
summary method for textmodel_svmlin objects
Description
summary method for textmodel_svmlin objects
Usage
## S3 method for class 'textmodel_svmlin'
summary(object, n = 30, ...)
Arguments
object: output from textmodel_svmlin()
n: how many coefficients to print before truncating
...: additional arguments not used
Value
a summary.textmodel classed list containing the call and the estimated feature scores
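Examples
# illustrative sketch (not from the original manual), reusing the
# setup from the textmodel_svmlin() example elsewhere in this manual
library("quanteda")
docvars(data_corpus_irishbudget2010, "govtopp") <- c("Govt", "Opp", rep(NA, 12))
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
summary(textmodel_svmlin(dfmat, y = dfmat$govtopp), n = 10)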
summary method for textmodel_wordfish
Description
summary method for textmodel_wordfish
Usage
## S3 method for class 'textmodel_wordfish'
summary(object, n = 30, ...)
Arguments
object: a textmodel_wordfish object
n: maximum number of features to print in summary
...: unused
Value
a summary.textmodel classed list containing the call, the estimated document positions, and the estimated feature scores
Class affinity maximum likelihood text scaling model
Description
textmodel_affinity()
implements the maximum likelihood supervised text
scaling method described in Perry and Benoit (2017).
Usage
textmodel_affinity(
x,
y,
exclude = NULL,
smooth = 0.5,
ref_smooth = 0.5,
verbose = quanteda_options("verbose")
)
Arguments
x: the dfm or bootstrap_dfm object on which the model will be fit. Does not need to contain only the training documents, since the index of these will be matched automatically.
y: vector of training classes/scores associated with each document identified in x
exclude: a set of words to exclude from the model
smooth: a smoothing parameter for class affinities; defaults to 0.5 (Jeffreys prior). A plausible alternative would be 1.0 (Laplace prior).
ref_smooth: a smoothing parameter for token distributions; defaults to 0.5
verbose: logical; if TRUE, print diagnostic information during fitting
Value
A textmodel_affinity class list object, with elements:
- smooth: a numeric vector of length two for the smoothing parameters smooth and ref_smooth
- x: the input model matrix x
- y: the vector of class training labels y
- p: a feature \times class sparse matrix of estimated class affinities
- support: logical vector indicating whether a feature was included in computing class affinities
- call: the model call
Author(s)
Patrick Perry and Kenneth Benoit
References
Perry, P.O. & Benoit, K.R. (2017). Scaling Text with the Class Affinity Model. doi:10.48550/arXiv.1710.08963.
See Also
predict.textmodel_affinity()
for methods of applying a
fitted textmodel_affinity()
model object to predict quantities from
(other) documents.
Examples
(af <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA)))
predict(af)
predict(af, newdata = quanteda::data_dfm_lbgexample[6, ])
## Not run:
# compute bootstrapped SEs
dfmat <- quanteda::bootstrap_dfm(data_corpus_dailnoconf1991, n = 10, remove_punct = TRUE)
textmodel_affinity(dfmat, y = c("Govt", "Opp", "Opp", rep(NA, 55)))
## End(Not run)
Internal methods for textmodel_affinity
Description
Internal print and summary methods for derivative textmodel_affinity objects.
Usage
## S3 method for class 'influence.predict.textmodel_affinity'
print(x, n = 30, ...)
## S3 method for class 'influence.predict.textmodel_affinity'
summary(object, ...)
## S3 method for class 'summary.influence.predict.textmodel_affinity'
print(x, n = 30, ...)
Arguments
n: how many coefficients to print before truncating
Value
summary.influence.predict.textmodel_affinity() returns a list classed as summary.influence.predict.textmodel_affinity that includes:
- word: the feature name
- count: the total counts of each feature for which influence was computed
- mean, median, sd, max: mean, median, standard deviation, and maximum values of influence for each feature, computed across classes
- direction: an integer vector of 1 or 2 indicating the class which the feature is influencing
- rate: the median of rate from influence.predict.textmodel_affinity()
- support: logical vector for each feature matching the same return from predict.textmodel_affinity()
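Examples
# illustrative sketch: summarising feature influence from a predicted model
tmod <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA))
pred <- predict(tmod)
summary(influence(pred))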
Correspondence analysis of a document-feature matrix
Description
textmodel_ca
implements correspondence analysis scaling on a
dfm. The method is a fast/sparse version of function
ca.
Usage
textmodel_ca(x, smooth = 0, nd = NA, sparse = FALSE, residual_floor = 0.1)
Arguments
x: the dfm on which the model will be fit
smooth: a smoothing parameter for word counts; defaults to zero
nd: number of dimensions to be included in output; if NA (default), the maximum possible number of dimensions is included
sparse: retains the sparsity if set to TRUE; defaults to FALSE
residual_floor: specifies the threshold for the residual matrix for calculating the truncated svd. Larger values will reduce memory and time cost but might reduce accuracy; only applicable when sparse = TRUE
Details
svds in the RSpectra package is applied to enable the fast computation of the SVD.
Value
textmodel_ca()
returns a fitted CA textmodel that is a special
class of ca object.
Note
You may need to set sparse = TRUE and increase the value of residual_floor to ignore less important information and hence reduce the memory cost when you have a very big dfm.
If your attempt to fit the model fails due to the matrix being too large, this is probably because of the memory demands of computing the V \times V residual matrix. To avoid this, consider increasing the value of residual_floor by 0.1, until the model can be fit.
Author(s)
Kenneth Benoit and Haiyan Wang
References
Nenadic, O. & Greenacre, M. (2007). Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca package. Journal of Statistical Software, 20(3). doi:10.18637/jss.v020.i03
Examples
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
tmod <- textmodel_ca(dfmat)
summary(tmod)
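# illustrative sketch of the sparse approach described in the Note above:
# a higher residual_floor trades some accuracy for memory (value here is arbitrary)
tmod_sparse <- textmodel_ca(dfmat, sparse = TRUE, residual_floor = 0.2)
summary(tmod_sparse)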
Logistic regression classifier for texts
Description
Fits a fast penalized maximum likelihood estimator to predict discrete categories from sparse dfm objects. Using the glmnet package, the function computes the regularization path for the lasso or elasticnet penalty at a grid of values for the regularization parameter lambda. This is done automatically by testing on several folds of the data at estimation time.
Usage
textmodel_lr(x, y, ...)
Arguments
x: the dfm on which the model will be fit. Does not need to contain only the training documents.
y: vector of training labels associated with each document identified in x
...: additional arguments passed to cv.glmnet()
Value
an object of class textmodel_lr, a list containing:
- x, y: the input model matrix and input training class labels
- algorithm: character; the type and family of logistic regression model used in calling cv.glmnet()
- type: the type associated with algorithm
- classnames: the levels of training classes in y
- lrfitted: the fitted model object from cv.glmnet()
- call: the model call
References
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1), 1-22. doi:10.18637/jss.v033.i01
See Also
cv.glmnet()
, predict.textmodel_lr()
,
coef.textmodel_lr()
Examples
## Example from 13.1 of _An Introduction to Information Retrieval_
library("quanteda")
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
d2 = "Chinese Chinese Shanghai",
d3 = "Chinese Macao",
d4 = "Tokyo Japan Chinese",
d5 = "London England Chinese",
d6 = "Chinese Chinese Chinese Tokyo Japan"),
docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))
dfmat <- dfm(tokens(corp), tolower = FALSE)
## simulate bigger sample as classification on small samples is problematic
set.seed(1)
dfmat <- dfm_sample(dfmat, 50, replace = TRUE)
## train model
(tmod1 <- textmodel_lr(dfmat, docvars(dfmat, "train")))
summary(tmod1)
coef(tmod1)
## predict probability and classes
predict(tmod1, type = "prob")
predict(tmod1)
Latent Semantic Analysis
Description
Fit the Latent Semantic Analysis scaling model to a dfm,
which may be weighted (for instance using quanteda::dfm_tfidf()
).
Usage
textmodel_lsa(x, nd = 10, margin = c("both", "documents", "features"))
Arguments
x: the dfm on which the model will be fit
nd: the number of dimensions to be included in output
margin: margin to be smoothed by the SVD
Details
svds in the RSpectra package is applied to enable the fast computation of the SVD.
Value
a textmodel_lsa class object, a list containing:
- sk: a numeric vector containing the d values from the SVD
- docs: document coordinates from the SVD (u)
- features: feature coordinates from the SVD (v)
- matrix_low_rank: the multiplication of udv'
- data: the input data as a CsparseMatrix from the Matrix package
Note
The number of dimensions nd
retained in LSA is an empirical
issue. While a reduction in k
can remove much of the noise, keeping
too few dimensions or factors may lose important information.
Author(s)
Haiyan Wang and Kohei Watanabe
References
Rosario, B. (2000). Latent Semantic Indexing: An Overview. Technical report INFOSYS 240 Spring Paper, University of California, Berkeley.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6): 391.
See Also
predict.textmodel_lsa()
, coef.textmodel_lsa()
Examples
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
# create an LSA space and return its truncated representation in the low-rank space
tmod <- textmodel_lsa(dfmat[1:10, ])
head(tmod$docs)
# matrix in low_rank LSA space
tmod$matrix_low_rank[,1:5]
# fold queries into the space generated by dfmat[1:10,]
# and return its truncated versions of its representation in the new low-rank space
pred <- predict(tmod, newdata = dfmat[11:14, ])
pred$docs_newspace
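# illustrative sketch for the Note on choosing nd: keep only two dimensions
tmod2 <- textmodel_lsa(dfmat[1:10, ], nd = 2)
head(tmod2$docs)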
Post-estimation methods for textmodel_lsa
Description
Post-estimation methods for fitted textmodel_lsa objects.
Usage
## S3 method for class 'textmodel_lsa'
predict(object, newdata = NULL, ...)
## S3 method for class 'textmodel_lsa'
as.dfm(x)
## S3 method for class 'textmodel_lsa'
coef(object, doc_dim = 1, feat_dim = 1, ...)
coefficients.textmodel_lsa(object, doc_dim = 1, feat_dim = 1, ...)
Arguments
object, x: a previously fitted textmodel_lsa object
newdata: new matrix to be transformed into the LSA space
...: unused
doc_dim, feat_dim: the document and feature dimension scores to be extracted
Value
predict() returns a predicted textmodel_lsa object, projecting the patterns onto new data.
coef.textmodel_lsa() extracts model coefficients from a fitted textmodel_lsa object.
Naive Bayes classifier for texts
Description
Fit a multinomial or Bernoulli Naive Bayes model, given a dfm and some training labels.
Usage
textmodel_nb(
x,
y,
smooth = 1,
prior = c("uniform", "docfreq", "termfreq"),
distribution = c("multinomial", "Bernoulli")
)
Arguments
x: the dfm on which the model will be fit. Does not need to contain only the training documents.
y: vector of training labels associated with each document identified in x
smooth: smoothing parameter for feature counts, added to the feature frequency totals by training class
prior: prior distribution on texts; one of "uniform", "docfreq", or "termfreq" (see Prior distributions below)
distribution: count model for text features, can be "multinomial" or "Bernoulli"
Value
textmodel_nb() returns a list consisting of the following (where I is the total number of documents, J is the total number of features, and k is the total number of training classes):
- call: original function call
- param: class conditional posterior estimates
- x: the training dfm x
- y: the training class vector y
- distribution: character; the distribution of x for the NB model
- priors: numeric; the class prior probabilities
- smooth: numeric; the value of the smoothing parameter
Prior distributions
Prior distributions refer to the prior probabilities assigned to the training classes, and the choice of prior distribution affects the calculation of the fitted probabilities. The default is uniform priors, which sets the unconditional probability of observing any one class to be the same as observing any other class.
"Document frequency" means that the class priors will be taken from the relative proportions of the class documents used in the training set. This approach is so common that it is assumed in many examples, such as the worked example from Manning, Raghavan, and Schütze (2008) below. It is not the default in quanteda, however, since there may be nothing informative in the relative numbers of documents used to train a classifier other than the relative availability of the documents. When training classes are balanced in their number of documents (usually advisable), however, then the empirically computed "docfreq" would be equivalent to "uniform" priors.
Setting prior
to "termfreq" makes the priors equal to the proportions of
total feature counts found in the grouped documents in each training class,
so that the classes with the largest number of features are assigned the
largest priors. If the total count of features in each training class was
the same, then "uniform" and "termfreq" would be the same.
Smoothing parameter
The smooth value is added to the feature frequencies, aggregated by training class, to avoid zero frequencies in any class. This has the effect of giving more weight to infrequent term occurrences.
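As an illustrative numerical sketch (hypothetical counts, not package output), smoothing removes zero class-conditional probabilities:
counts <- c(chinese = 5, tokyo = 0, japan = 0)  # aggregated counts for one class
counts / sum(counts)              # unsmoothed: tokyo and japan get probability 0
(counts + 1) / sum(counts + 1)    # with smooth = 1, all features get non-zero probability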
Author(s)
Kenneth Benoit
References
Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press (Chapter 13). Available at https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf.
Jurafsky, D. & Martin, J.H. (2018). From Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of September 23, 2018 (Chapter 6, Naive Bayes). Available at https://web.stanford.edu/~jurafsky/slp3/.
Examples
## Example from 13.1 of _An Introduction to Information Retrieval_
library("quanteda")
txt <- c(d1 = "Chinese Beijing Chinese",
d2 = "Chinese Chinese Shanghai",
d3 = "Chinese Macao",
d4 = "Tokyo Japan Chinese",
d5 = "Chinese Chinese Chinese Tokyo Japan")
x <- dfm(tokens(txt), tolower = FALSE)
y <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
## replicate IIR p261 prediction for test set (document 5)
(tmod1 <- textmodel_nb(x, y, prior = "docfreq"))
summary(tmod1)
coef(tmod1)
predict(tmod1, type = "prob")
predict(tmod1)
# contrast with other priors
predict(textmodel_nb(x, y, prior = "uniform"))
predict(textmodel_nb(x, y, prior = "termfreq"))
## replicate IIR p264 Bernoulli Naive Bayes
tmod2 <- textmodel_nb(x, y, distribution = "Bernoulli", prior = "docfreq")
predict(tmod2, newdata = x[5, ], type = "prob")
predict(tmod2, newdata = x[5, ])
[experimental] Linear SVM classifier for texts
Description
Fit a fast linear SVM classifier for sparse text matrices, using svmlin C++
code written by Vikas Sindhwani and S. Sathiya Keerthi. This method
implements the modified finite Newton L2-SVM method (L2-SVM-MFN) method
described in Sindhwani and Keerthi (2006). Currently,
textmodel_svmlin()
only works for two-class problems.
Usage
textmodel_svmlin(
x,
y,
intercept = TRUE,
lambda = 1,
cp = 1,
cn = 1,
scale = FALSE,
center = FALSE
)
Arguments
x: the dfm on which the model will be fit. Does not need to contain only the training documents.
y: vector of training labels associated with each document identified in x
intercept: logical; if TRUE, add an intercept
lambda: numeric; regularization parameter lambda (default 1)
cp: numeric; relative cost for "positive" examples (the second factor level)
cn: numeric; relative cost for "negative" examples (the first factor level)
scale: logical; if TRUE, scale the feature values
center: logical; if TRUE, center the feature values
Value
a fitted model object of class textmodel_svmlin
Warning
This function is marked experimental since it's not fully working yet in a way that translates into more standard SVM parameters that we understand. Use with caution after reading the Sindhwani and Keerthi (2006) paper.
References
Vikas Sindhwani and S. Sathiya Keerthi (2006). Large Scale Semi-supervised Linear SVMs. Proceedings of ACM SIGIR. August 6–11, 2006, Seattle.
V. Sindhwani and S. Sathiya Keerthi (2006). Newton Methods for Fast Solution of Semi-supervised Linear SVMs. Book Chapter in Large Scale Kernel Machines, MIT Press, 2006.
Examples
# use Lenihan for govt class and Bruton for opposition
library("quanteda")
docvars(data_corpus_irishbudget2010, "govtopp") <- c("Govt", "Opp", rep(NA, 12))
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
tmod <- textmodel_svmlin(dfmat, y = dfmat$govtopp)
predict(tmod)
Wordfish text model
Description
Estimate Slapin and Proksch's (2008) "wordfish" Poisson scaling model of one-dimensional document positions using conditional maximum likelihood.
Usage
textmodel_wordfish(
x,
dir = c(1, 2),
priors = c(Inf, Inf, 3, 1),
tol = c(1e-06, 1e-08),
dispersion = c("poisson", "quasipoisson"),
dispersion_level = c("feature", "overall"),
dispersion_floor = 0,
abs_err = FALSE,
residual_floor = 0.5
)
Arguments
x: the dfm on which the model will be fit
dir: set global identification by specifying the indexes for a pair of documents such that \hat{\theta}_{dir[1]} < \hat{\theta}_{dir[2]}
priors: prior precisions for the estimated parameters
tol: tolerances for convergence. The first value is a convergence threshold for the log-posterior of the model, the second value is the tolerance in the difference in parameter values from the iterative conditional maximum likelihood (from conditionally estimating document-level, then feature-level parameters).
dispersion: sets whether a quasi-Poisson quasi-likelihood should be used based on a single dispersion parameter ("poisson"), or quasi-Poisson ("quasipoisson")
dispersion_level: sets the unit level for the dispersion parameter; options are "feature" for term-level variances, or "overall" for a single dispersion parameter
dispersion_floor: constraint for the minimal underdispersion multiplier in the quasi-Poisson model. Used to minimize the distorting effect of terms with rare term or document frequencies that appear to be severely underdispersed. Default is 0, but this only applies if dispersion = "quasipoisson".
abs_err: specifies how the convergence is considered
residual_floor: specifies the threshold for the residual matrix when calculating the svds; only applies to the sparse SVD computation
Details
The returns match those of Will Lowe's R implementation of wordfish (see the austin package), except that here we have renamed words to be features. (This return list may change.) We have also followed the practice begun with Slapin and Proksch's early implementation of the model that used a regularization parameter of se(\sigma) = 3, through the third element in priors.
Value
An object of class textmodel_fitted_wordfish. This is a list containing:
- dir: global identification of the dimension
- theta: estimated document positions
- alpha: estimated document fixed effects
- beta: estimated feature marginal effects
- psi: estimated word fixed effects
- docs: document labels
- features: feature labels
- sigma: regularization parameter for betas in Poisson form
- ll: log likelihood at convergence
- se.theta: standard errors for theta-hats
- x: dfm to which the model was fit
Note
In the rare situation where a warning message of "The algorithm did not converge." shows up, removing some documents may work.
Author(s)
Benjamin Lauderdale, Haiyan Wang, and Kenneth Benoit
References
Slapin, J. & Proksch, S.O. (2008). A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3), 705–722. doi:10.1111/j.1540-5907.2008.00338.x
Lowe, W. & Benoit, K.R. (2013). Validating Estimates of Latent Traits from Textual Data Using Human Judgment as a Benchmark. Political Analysis, 21(3), 298–313. doi:10.1093/pan/mpt002
Examples
(tmod1 <- textmodel_wordfish(quanteda::data_dfm_lbgexample, dir = c(1,5)))
summary(tmod1, n = 10)
coef(tmod1)
predict(tmod1)
predict(tmod1, se.fit = TRUE)
predict(tmod1, interval = "confidence")
## Not run:
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
(tmod2 <- textmodel_wordfish(dfmat, dir = c(6,5)))
(tmod3 <- textmodel_wordfish(dfmat, dir = c(6,5),
dispersion = "quasipoisson", dispersion_floor = 0))
(tmod4 <- textmodel_wordfish(dfmat, dir = c(6,5),
dispersion = "quasipoisson", dispersion_floor = .5))
plot(tmod3$phi, tmod4$phi, xlab = "Min underdispersion = 0", ylab = "Min underdispersion = .5",
xlim = c(0, 1.0), ylim = c(0, 1.0))
plot(tmod3$phi, tmod4$phi, xlab = "Min underdispersion = 0", ylab = "Min underdispersion = .5",
xlim = c(0, 1.0), ylim = c(0, 1.0), type = "n")
underdispersedTerms <- sample(which(tmod3$phi < 1.0), 5)
which(featnames(dfmat) %in% names(topfeatures(dfmat, 20)))
text(tmod3$phi, tmod4$phi, tmod3$features,
cex = .8, xlim = c(0, 1.0), ylim = c(0, 1.0), col = "grey90")
text(tmod3$phi[underdispersedTerms], tmod4$phi[underdispersedTerms],
     tmod3$features[underdispersedTerms],
     cex = .8, xlim = c(0, 1.0), ylim = c(0, 1.0), col = "black")
if (requireNamespace("austin")) {
tmod5 <- austin::wordfish(quanteda::as.wfm(dfmat), dir = c(6, 5))
cor(tmod1$theta, tmod5$theta)
}
## End(Not run)
Wordscores text model
Description
textmodel_wordscores
implements Laver, Benoit and Garry's (2003)
"Wordscores" method for scaling texts on a single dimension, given a set of
anchoring or reference texts whose values are set through reference
scores. This scale can be fitted in the linear space (as per LBG 2003) or in
the logit space (as per Beauchamp 2012). Estimates of virgin or
unknown texts are obtained using the predict()
method to score
documents from a fitted textmodel_wordscores
object.
Usage
textmodel_wordscores(x, y, scale = c("linear", "logit"), smooth = 0)
Arguments
x: the dfm on which the model will be trained
y: vector of training scores associated with each document in x
scale: scale on which to score the words; "linear" for classic LBG linear posterior weighted word class differences, or "logit" for log posterior differences
smooth: a smoothing parameter for word counts; defaults to zero to match the LBG (2003) method. See Value below for additional information on the behaviour of this argument.
Details
The textmodel_wordscores() function and the associated predict() method are designed to function in the same manner as stats::predict.lm(). coef() can also be used to extract the word coefficients from the fitted textmodel_wordscores object, and summary() will print a nice summary of the fitted object.
Value
A fitted textmodel_wordscores object. This object will contain a copy of the input data, but in its original form without any smoothing applied. Calling predict.textmodel_wordscores() on this object without specifying a value for newdata, for instance, will predict on the unsmoothed object. This behaviour differs from versions of quanteda <= 1.2.
Author(s)
Kenneth Benoit
References
Laver, M., Benoit, K.R., & Garry, J. (2003). Estimating Policy Positions from Political Text using Words as Data. American Political Science Review, 97(2), 311–331.
Beauchamp, N. (2012). Using Text to Scale Legislatures with Uninformative Voting. New York University Mimeo.
Martin, L.W. & Vanberg, G. (2007). A Robust Transformation Procedure for Interpreting Political Text. Political Analysis 16(1), 93–100. doi:10.1093/pan/mpm010
See Also
predict.textmodel_wordscores()
for methods of applying a
fitted textmodel_wordscores model object to predict quantities from
(other) documents.
Examples
(tmod <- textmodel_wordscores(quanteda::data_dfm_lbgexample, y = c(seq(-1.5, 1.5, .75), NA)))
summary(tmod)
coef(tmod)
predict(tmod)
predict(tmod, rescaling = "lbg")
predict(tmod, se.fit = TRUE, interval = "confidence", rescaling = "mv")
Influence plot for text scaling models
Description
Plot the results of a fitted scaling model, from (e.g.) a predicted textmodel_affinity model.
Usage
textplot_influence(x, n = 30, ...)
Arguments
x: the object output from influence.predict.textmodel_affinity()
n: the number of features whose influence will be plotted
...: additional arguments passed to plot()
Value
Creates a base R plot of feature influences of the median influence
by the log10 median rate of the feature, and invisibly returns the elements
from the call to plot()
.
Author(s)
Patrick Perry and Kenneth Benoit
See Also
influence.predict.textmodel_affinity()
Examples
tmod <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA))
pred <- predict(tmod)
textplot_influence(influence(pred))