Type: Package
Title: Scaling Models and Classifiers for Textual Data
Version: 0.9.10
Description: Scaling models and classifiers for sparse matrix objects representing textual data in the form of a document-feature matrix. Includes original implementations of 'Laver', 'Benoit', and Garry's (2003) <doi:10.1017/S0003055403000698> 'Wordscores' model, the Perry and 'Benoit' (2017) <doi:10.48550/arXiv.1710.08963> class affinity scaling model, and the 'Slapin' and 'Proksch' (2008) <doi:10.1111/j.1540-5907.2008.00338.x> 'wordfish' model, as well as methods for correspondence analysis, latent semantic analysis, and fast Naive Bayes and linear 'SVMs' specially designed for sparse textual data.
Depends: R (≥ 3.1.0), methods
Imports: glmnet, Matrix (≥ 1.2), quanteda (≥ 4.0.0), RSpectra, Rcpp (≥ 0.12.12), stringi
Suggests: ca, covr, fastNaiveBayes, knitr, lsa, microbenchmark, naivebayes, quanteda.textplots, spelling, testthat, rmarkdown
LinkingTo: Rcpp, RcppArmadillo (≥ 0.7.600.1.0), quanteda
URL: https://github.com/quanteda/quanteda.textmodels
License: GPL-3
Encoding: UTF-8
LazyData: true
Language: en-GB
RoxygenNote: 7.3.2
Collate: 'RcppExports.R' 'quanteda.textmodels-package.R' 'data-documentation.R' 'textmodel-methods.R' 'textmodel_affinity.R' 'textmodel_ca.R' 'textmodel_lsa.R' 'textmodel_lr.R' 'textmodel_nb.R' 'textmodel_svmlin.R' 'textmodel_wordfish.R' 'textmodel_wordscores.R' 'textplot_influence.R' 'utils.R'
VignetteBuilder: knitr
NeedsCompilation: yes
Packaged: 2025-02-10 18:45:31 UTC; kbenoit
Author: Kenneth Benoit [cre, aut, cph], Kohei Watanabe [aut], Haiyan Wang [aut], Patrick O. Perry [aut], Benjamin Lauderdale [aut], Johannes Gruber [aut], William Lowe [aut], Vikas Sindhwani [cph] (authored svmlin C++ source code), European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)
Maintainer: Kenneth Benoit <kbenoit@quanteda.org>
Repository: CRAN
Date/Publication: 2025-02-10 23:50:11 UTC

quanteda.textmodels: Scaling Models and Classifiers for Textual Data

Description

Scaling models and classifiers for sparse matrix objects representing textual data in the form of a document-feature matrix. Includes original implementations of 'Laver', 'Benoit', and Garry's (2003) doi:10.1017/S0003055403000698 'Wordscores' model, the Perry and 'Benoit' (2017) doi:10.48550/arXiv.1710.08963 class affinity scaling model, and the 'Slapin' and 'Proksch' (2008) doi:10.1111/j.1540-5907.2008.00338.x 'wordfish' model, as well as methods for correspondence analysis, latent semantic analysis, and fast Naive Bayes and linear 'SVMs' specially designed for sparse textual data.

Author(s)

Maintainer: Kenneth Benoit kbenoit@smu.edu.sg [copyright holder]

Authors:

Kohei Watanabe, Haiyan Wang, Patrick O. Perry, Benjamin Lauderdale, Johannes Gruber, William Lowe

Other contributors:

Vikas Sindhwani (authored svmlin C++ source code) [copyright holder], European Research Council (ERC-2011-StG 283794-QUANTESS) [funder]

See Also

Useful links:

https://github.com/quanteda/quanteda.textmodels


Internal function to fit the likelihood scaling mixture model.

Description

This is an internal function; use textmodel_affinity() instead.

Usage

affinity(p, x, smooth = 0.5, verbose = FALSE)

Arguments

p

word likelihoods within classes, estimated from training data

x

term-document matrix for document(s) to be scaled

smooth

a misnamed smoothing parameter, either a scalar or a vector equal in length to the number of documents

Value

a list containing the estimated scaling quantities

Author(s)

Patrick Perry

Examples

p <- matrix(c(c(5/6, 0, 1/6), c(0, 4/5, 1/5)), nrow = 3,
            dimnames = list(c("A", "B", "C"), NULL))
theta <- c(.2, .8)
q <- drop(p %*% theta)
x <- 2 * q
(fit <- quanteda.textmodels:::affinity(p, x))

Coerce various objects to coefficients_textmodel

Description

A helper function used in summary.textmodel_*() methods.

Usage

as.coefficients_textmodel(x)

Arguments

x

an object to be coerced

Value

an object with the class tag of coefficients_textmodel
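
Examples

A minimal sketch of the coercion (a hypothetical call for illustration; the function is internal, so it is accessed via :::, and the wordscores fit mirrors other examples in this manual):

tmod <- textmodel_wordscores(quanteda::data_dfm_lbgexample, c(seq(-1.5, 1.5, .75), NA))
# coerce the named coefficient vector so the textmodel print method applies
quanteda.textmodels:::as.coefficients_textmodel(coef(tmod))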


Coerce various objects to statistics_textmodel

Description

This is a helper function used in summary.textmodel_*() methods.

Usage

as.statistics_textmodel(x)

Arguments

x

an object to be coerced

Value

an object of class statistics_textmodel


Assign the summary.textmodel class to a list

Description

Assigns the class summary.textmodel to a list

Usage

as.summary.textmodel(x)

Arguments

x

a named list

Value

an object of class summary.textmodel


Extract model coefficients from a fitted textmodel_ca object

Description

coef() extracts model coefficients from a fitted textmodel_ca object. coefficients() is an alias.

Usage

## S3 method for class 'textmodel_ca'
coef(object, doc_dim = 1, feat_dim = 1, ...)

coefficients.textmodel_ca(object, doc_dim = 1, feat_dim = 1, ...)

Arguments

object

a fitted textmodel_ca object

doc_dim, feat_dim

the document and feature dimension scores to be extracted

...

unused

Value

a list containing numeric vectors of feature and document coordinates. Includes NA vectors of standard errors for consistency with other models' coefficient outputs, and for the possibility of having these computed in the future.
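
Examples

A usage sketch (assumes the quanteda package is attached; the structure of the returned list is inspected with str() rather than assumed):

library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
tmod <- textmodel_ca(dfmat)
# coordinates on the first document and feature dimensions
str(coef(tmod, doc_dim = 1, feat_dim = 1))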


Crowd-labelled sentence corpus from a 2010 EP debate on coal subsidies

Description

A multilingual text corpus of speeches from a European Parliament debate on coal subsidies in 2010, with individual crowd codings as the unit of observation. The sentences are drawn from officially translated speeches in a European Parliament debate concerning a Commission report proposing an extension to a regulation permitting state aid to uncompetitive coal mines.

Each speech is available in six languages: English, German, Greek, Italian, Polish and Spanish. The unit of observation is the individual crowd coding of each natural sentence. For more information on the coding approach see Benoit et al. (2016).

Usage

data_corpus_EPcoaldebate

Format

The corpus consists of 16,806 documents (i.e. codings of a sentence) and includes the following document-level variables:

sentence_id

character; a unique identifier for each sentence

crowd_subsidy_label

factor; whether a coder labelled the sentence as "Pro-Subsidy", "Anti-Subsidy" or "Neutral or inapplicable"

language

factor; the language (translation) of the speech

name_last

character; speaker's last name

name_first

character; speaker's first name

ep_group

factor; abbreviation of the EP party group of the speaker

country

factor; the speaker's country of origin

vote

factor; the speaker's vote on the proposal (For/Against/Abstain/NA)

coder_id

character; a unique identifier for each crowd coder

coder_trust

numeric; the "trust score" from the Crowdflower platform used to code the sentences, which can theoretically range between 0 and 1. Only coders with trust scores above 0.8 are included in the corpus.

A corpus object.

References

Benoit, K., Conway, D., Lauderdale, B.E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data. American Political Science Review, 110(2), 278–295. doi:10.1017/S0003055416000058
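
Examples

A quick inspection sketch (assumes the quanteda package is attached for docvars()):

library("quanteda")
# distribution of crowd labels across codings
table(docvars(data_corpus_EPcoaldebate, "crowd_subsidy_label"))
# number of codings per language
table(docvars(data_corpus_EPcoaldebate, "language"))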


Confidence debate from 1991 Irish Parliament

Description

Texts of speeches from a no-confidence motion debated in the Irish Dáil from 16-18 October 1991 over the future of the Fianna Fáil-Progressive Democrat coalition. (See Laver and Benoit 2002 for details.)

Usage

data_corpus_dailnoconf1991

Format

data_corpus_dailnoconf1991 is a corpus with 58 texts, including docvars for name, party, and position.

Source

https://www.oireachtas.ie/en/debates/debate/dail/1991-10-16/10/

References

Laver, M. & Benoit, K.R. (2002). Locating TDs in Policy Spaces: Wordscoring Dáil Speeches. Irish Political Studies, 17(1), 59–73.

Laver, M., Benoit, K.R., & Garry, J. (2003). Estimating Policy Positions from Political Text using Words as Data. American Political Science Review, 97(2), 311–331.

Examples

## Not run: 
library("quanteda")
data_dfm_dailnoconf1991 <- data_corpus_dailnoconf1991 %>%
    tokens(remove_punct = TRUE) %>%
    dfm()
tmod <- textmodel_affinity(data_dfm_dailnoconf1991,
                           c("Govt", "Opp", "Opp", rep(NA, 55)))
(pred <- predict(tmod))
dat <-
    data.frame(party = as.character(docvars(data_corpus_dailnoconf1991, "party")),
               govt = coef(pred)[, "Govt"],
               position = as.character(docvars(data_corpus_dailnoconf1991, "position")))
bymedian <- with(dat, reorder(paste(party, position), govt, median))
oldpar <- par(no.readonly = TRUE)
par(mar = c(5, 6, 4, 2) + .1)
boxplot(govt ~ bymedian, data = dat,
        horizontal = TRUE, las = 1,
        xlab = "Degree of support for government",
        ylab = "")
abline(h = 7.5, col = "red", lty = "dashed")
text(c(0.9, 0.9), c(8.5, 6.5), c("Government", "Opposition"))
par(oldpar)

## End(Not run)

Irish budget speeches from 2010

Description

Speeches and document-level variables from the debate over the Irish budget of 2010.

Usage

data_corpus_irishbudget2010

Format

The corpus object for the 2010 budget speeches, with document-level variables for year, debate, serial number, first and last name of the speaker, and the speaker's party.

Details

At the time of the debate, Fianna Fáil (FF) and the Greens formed the government coalition, while Fine Gael (FG), Labour (LAB), and Sinn Féin (SF) were in opposition.

Source

Dáil Éireann Debate, Budget Statement 2010. 9 December 2009. vol. 697, no. 3.

References

Lowe, W. & Benoit, K.R. (2013). Validating Estimates of Latent Traits From Textual Data Using Human Judgment as a Benchmark. Political Analysis, 21(3), 298–313. doi:10.1093/pan/mpt002.


Movie reviews with polarity from Pang and Lee (2004)

Description

A corpus object containing 2,000 movie reviews classified by positive or negative sentiment.

Usage

data_corpus_moviereviews

Format

The corpus includes the following document variables:

sentiment

factor indicating whether a review was manually classified as positive (pos) or negative (neg).

id1

character; the position of the review in the corpus.

id2

numeric; a random number assigned to each review.

Details

For more information, see cat(meta(data_corpus_moviereviews, "readme")).

Source

https://www.cs.cornell.edu/people/pabo/movie-review-data/

References

Pang, B., Lee, L. (2004) "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts.", Proceedings of the ACL.

Examples

# check polarities
table(data_corpus_moviereviews$sentiment)

# make the data into sentences, because each line is a sentence
data_corpus_moviereviewsents <-
    quanteda::corpus_segment(data_corpus_moviereviews, "\n", extract_pattern = FALSE)
print(data_corpus_moviereviewsents, max_ndoc = 3)

Internal function to match a dfm's features to a target set

Description

Takes a dfm and a set of features, and makes them match the features listed in the set.

Usage

force_conformance(x, features, force = TRUE)

Arguments

x

input dfm

features

character; a vector of feature names

force

logical; if TRUE, make the new dfm conform to the vector of features, otherwise return an error message

Value

a dfm from the quanteda package containing only features as columns, in the same order as features. A warning message is printed if some feature names from features are not matched in x.

Examples

quanteda.textmodels:::force_conformance(quanteda::data_dfm_lbgexample, c("C", "B", "Z"))

Compute feature influence from a predicted textmodel_affinity object

Description

Computes the influence of features on scaled textmodel_affinity() applications.

Usage

## S3 method for class 'predict.textmodel_affinity'
influence(model, subset = !train, ...)

Arguments

model

a predicted textmodel_affinity() object

subset

whether to use all data or a subset (for instance, exclude the training set)

...

unused

Value

a named list classed as influence.predict.textmodel_affinity containing the computed influence statistics

See Also

influence.lm()

Examples

tmod <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA))
pred <- predict(tmod)
influence(pred)

Prediction for a fitted affinity textmodel

Description

Estimate \theta_i for each document, from a fitted textmodel_affinity object.

Other methods below provide standard ways to extract or compute quantities from predicted textmodel_affinity objects.

Usage

## S3 method for class 'textmodel_affinity'
predict(object, newdata = NULL, level = 0.95, ...)

## S3 method for class 'predict.textmodel_affinity'
coef(object, ...)

## S3 method for class 'predict.textmodel_affinity'
residuals(object, type = c("response", "pearson"), ...)

## S3 method for class 'predict.textmodel_affinity'
rstandard(model, ...)

Arguments

object

a fitted affinity textmodel

newdata

dfm on which prediction should be made

level

probability level for confidence interval width

...

unused

type

see residuals.lm

Value

predict() returns a list of predicted affinity textmodel quantities.

coef() returns a document \times class matrix of class affinities for each document.

residuals() returns a document-by-feature matrix of residuals. resid() is an alias.

rstandard() is a shortcut to return the Pearson residuals.

See Also

influence.predict.textmodel_affinity() for methods of computing the influence of particular features from a predicted textmodel_affinity model.
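
Examples

A short sketch tying these methods together, using the data_dfm_lbgexample object from quanteda as in other examples in this manual:

tmod <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA))
pred <- predict(tmod)
coef(pred)                 # document-by-class matrix of affinities
residuals(pred)[1:2, 1:5]  # response residuals
rstandard(pred)[1:2, 1:5]  # Pearson residuals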


Prediction from a fitted textmodel_lr object

Description

predict.textmodel_lr() implements class predictions from a fitted logistic regression model.

Usage

## S3 method for class 'textmodel_lr'
predict(
  object,
  newdata = NULL,
  type = c("class", "probability"),
  force = TRUE,
  ...
)

## S3 method for class 'textmodel_lr'
coef(object, ...)

## S3 method for class 'textmodel_lr'
coefficients(object, ...)

Arguments

object

a fitted logistic regression textmodel

newdata

dfm on which prediction should be made

type

the type of predicted values to be returned; see Value

force

make newdata's feature set conformant to the model terms

...

not used

Value

predict.textmodel_lr() returns either a vector of class predictions for each row of newdata (when type = "class"), or a document-by-class matrix of class probabilities (when type = "probability").

coef.textmodel_lr() returns a (sparse) matrix of coefficients for each feature, computed at the value of the penalty parameter fitted in the model. For binary outcomes, results are returned only for the class corresponding to the second level of the factor response; for multinomial outcomes, these are computed for each class.

See Also

textmodel_lr()
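
Examples

A brief sketch mirroring the textmodel_lr() example later in this manual (assumes the quanteda package is attached):

library("quanteda")
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "London England Chinese",
                 d6 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))
dfmat <- dfm(tokens(corp), tolower = FALSE)
set.seed(1)
dfmat <- dfm_sample(dfmat, 50, replace = TRUE)  # enlarge the toy sample
tmod <- textmodel_lr(dfmat, docvars(dfmat, "train"))
predict(tmod)                        # class predictions
predict(tmod, type = "probability")  # document-by-class probabilities
coef(tmod)                           # sparse coefficient matrix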


Prediction from a fitted textmodel_nb object

Description

predict.textmodel_nb() implements class predictions from a fitted Naive Bayes model.

Usage

## S3 method for class 'textmodel_nb'
predict(
  object,
  newdata = NULL,
  type = c("class", "probability", "logposterior"),
  force = FALSE,
  ...
)

## S3 method for class 'textmodel_nb'
coef(object, ...)

## S3 method for class 'textmodel_nb'
coefficients(object, ...)

Arguments

object

a fitted Naive Bayes textmodel

newdata

dfm on which prediction should be made

type

the type of predicted values to be returned; see Value

force

make newdata's feature set conformant to the model terms

...

not used

Value

predict.textmodel_nb() returns a vector of class predictions for each row of newdata (when type = "class"), a document-by-class matrix of class probabilities (when type = "probability"), or a matrix of log posterior likelihoods (when type = "logposterior").

coef.textmodel_nb() returns a matrix of estimated word likelihoods given the class. (In earlier versions, this was named PwGc.)

See Also

textmodel_nb()

Examples

# application to LBG (2003) example data
(tmod <- textmodel_nb(quanteda::data_dfm_lbgexample, y = c("A", "A", "B", "C", "C", NA)))
predict(tmod)
predict(tmod, type = "logposterior")

Prediction from a fitted textmodel_svmlin object

Description

predict.textmodel_svmlin() implements class predictions from a fitted linear SVM model.

Usage

## S3 method for class 'textmodel_svmlin'
predict(
  object,
  newdata = NULL,
  type = c("class", "probability"),
  force = FALSE,
  ...
)

Arguments

object

a fitted linear SVM textmodel

newdata

dfm on which prediction should be made

type

the type of predicted values to be returned; see Value

force

logical, if TRUE, make newdata's feature set conformant to the model terms

...

not used

Value

predict.textmodel_svmlin() returns either a vector of class predictions for each row of newdata (when type = "class"), or a document-by-class matrix of class probabilities (when type = "probability").

See Also

textmodel_svmlin()
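
Examples

A brief sketch mirroring the textmodel_svmlin() example later in this manual:

library("quanteda")
docvars(data_corpus_irishbudget2010, "govtopp") <- c("Govt", "Opp", rep(NA, 12))
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
tmod <- textmodel_svmlin(dfmat, y = dfmat$govtopp)
predict(tmod)                        # class predictions
predict(tmod, type = "probability")  # document-by-class probabilities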


Prediction from a textmodel_wordfish method

Description

predict.textmodel_wordfish() returns estimated document scores and confidence intervals. The method is provided for consistency with other textmodel_*() methods, but does not currently allow prediction on out-of-sample data.

Usage

## S3 method for class 'textmodel_wordfish'
predict(
  object,
  se.fit = FALSE,
  interval = c("none", "confidence"),
  level = 0.95,
  ...
)

## S3 method for class 'textmodel_wordfish'
coef(object, margin = c("both", "documents", "features"), ...)

coefficients.textmodel_wordfish(object, ...)

Arguments

object

a fitted wordfish model

se.fit

if TRUE, return standard errors as well

interval

type of confidence interval calculation

level

tolerance/confidence level for intervals

...

not used

margin

which margin of parameter estimates to return: both (in a list), or just document or feature parameters

Value

coef.textmodel_wordfish() returns a matrix of estimated parameter coefficients for the specified margin.
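
Examples

A short sketch of these methods, following the textmodel_wordfish() examples later in this manual:

(tmod <- textmodel_wordfish(quanteda::data_dfm_lbgexample, dir = c(1, 5)))
predict(tmod, se.fit = TRUE)            # document scores with standard errors
predict(tmod, interval = "confidence")  # scores with confidence bounds
coef(tmod, margin = "documents")        # document parameters only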


Predict textmodel_wordscores

Description

Predict textmodel_wordscores

Usage

## S3 method for class 'textmodel_wordscores'
predict(
  object,
  newdata = NULL,
  se.fit = FALSE,
  interval = c("none", "confidence"),
  level = 0.95,
  rescaling = c("none", "lbg", "mv"),
  force = TRUE,
  ...
)

Arguments

object

a fitted Wordscores textmodel

newdata

dfm on which prediction should be made

se.fit

if TRUE, return standard errors as well

interval

type of confidence interval calculation

level

tolerance/confidence level for intervals

rescaling

"none" for "raw" scores; "lbg" for LBG (2003) rescaling; or "mv" for the rescaling proposed by Martin and Vanberg (2007). See References.

force

make the feature set of newdata conform to the model terms. The default of TRUE means that a fitted model can be applied to scale a dfm that does not contain a 1:1 match of features in the training and prediction data.

...

not used

Value

predict.textmodel_wordscores() returns a named vector of predicted document scores ("text scores" S_{vd} in LBG 2003), or a named list if se.fit = TRUE consisting of the predicted scores ($fit) and the associated standard errors ($se.fit). When interval = "confidence", the predicted values will be a matrix. This behaviour matches that of stats::predict.lm().

Examples

tmod <- textmodel_wordscores(quanteda::data_dfm_lbgexample, c(seq(-1.5, 1.5, .75), NA))
predict(tmod)
predict(tmod, rescaling = "mv")
predict(tmod, rescaling = "lbg")
predict(tmod, se.fit = TRUE)
predict(tmod, se.fit = TRUE, interval = "confidence")
predict(tmod, se.fit = TRUE, interval = "confidence", rescaling = "lbg")

Print methods for textmodel feature estimates

Description

This is a helper function used in print.summary.textmodel.

Usage

## S3 method for class 'coefficients_textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

Arguments

x

a coefficients_textmodel object

digits

minimal number of significant digits, see print.default()

...

additional arguments not used


Implements print methods for textmodel_statistics

Description

Implements print methods for textmodel_statistics

Usage

## S3 method for class 'statistics_textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

Arguments

x

a statistics_textmodel object

digits

minimal number of significant digits, see print.default()

...

further arguments passed to or from other methods


print method for summary.textmodel

Description

print method for summary.textmodel

Usage

## S3 method for class 'summary.textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

Arguments

x

a summary.textmodel object

digits

minimal number of significant digits, see print.default()

...

additional arguments not used


print method for a wordfish model

Description

print method for a wordfish model

Usage

## S3 method for class 'textmodel_wordfish'
print(x, ...)

Arguments

x

for print method, the object to be printed

...

unused


summary method for textmodel_lr objects

Description

summary method for textmodel_lr objects

Usage

## S3 method for class 'textmodel_lr'
summary(object, n = 30, ...)

Arguments

object

output from textmodel_lr()

n

how many coefficients to print before truncating

...

additional arguments not used

Value

a summary.textmodel classed list containing elements from the call to textmodel_lr(), including the call, statistics for lambda, and the estimated feature scores


summary method for textmodel_nb objects

Description

summary method for textmodel_nb objects

Usage

## S3 method for class 'textmodel_nb'
summary(object, n = 30, ...)

Arguments

object

output from textmodel_nb()

n

how many coefficients to print before truncating

...

additional arguments not used

Value

a summary.textmodel classed list containing the call, the class priors, and the estimated feature scores


summary method for textmodel_svmlin objects

Description

summary method for textmodel_svmlin objects

Usage

## S3 method for class 'textmodel_svmlin'
summary(object, n = 30, ...)

Arguments

object

output from textmodel_svmlin()

n

how many coefficients to print before truncating

...

additional arguments not used

Value

a summary.textmodel classed list containing the call and the estimated feature scores


summary method for textmodel_wordfish

Description

summary method for textmodel_wordfish

Usage

## S3 method for class 'textmodel_wordfish'
summary(object, n = 30, ...)

Arguments

object

a textmodel_wordfish object

n

maximum number of features to print in summary

...

unused

Value

a summary.textmodel classed list containing the call, the estimated document positions, and the estimated feature scores


Class affinity maximum likelihood text scaling model

Description

textmodel_affinity() implements the maximum likelihood supervised text scaling method described in Perry and Benoit (2017).

Usage

textmodel_affinity(
  x,
  y,
  exclude = NULL,
  smooth = 0.5,
  ref_smooth = 0.5,
  verbose = quanteda_options("verbose")
)

Arguments

x

the dfm or bootstrap_dfm object on which the model will be fit. Does not need to contain only the training documents, since the index of these will be matched automatically.

y

vector of training classes/scores associated with each document identified in x

exclude

a set of words to exclude from the model

smooth

a smoothing parameter for class affinities; defaults to 0.5 (Jeffreys prior). A plausible alternative would be 1.0 (Laplace prior).

ref_smooth

a smoothing parameter for token distributions; defaults to 0.5

verbose

logical; if TRUE print diagnostic information during fitting.

Value

A textmodel_affinity class list object containing the fitted model elements.

Author(s)

Patrick Perry and Kenneth Benoit

References

Perry, P.O. & Benoit, K.R. (2017). Scaling Text with the Class Affinity Model. doi:10.48550/arXiv.1710.08963.

See Also

predict.textmodel_affinity() for methods of applying a fitted textmodel_affinity() model object to predict quantities from (other) documents.

Examples

(af <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA)))
predict(af)
predict(af, newdata = quanteda::data_dfm_lbgexample[6, ])

## Not run: 
# compute bootstrapped SEs
dfmat <- quanteda::bootstrap_dfm(data_corpus_dailnoconf1991, n = 10, remove_punct = TRUE)
textmodel_affinity(dfmat, y = c("Govt", "Opp", "Opp", rep(NA, 55)))

## End(Not run)

Internal methods for textmodel_affinity

Description

Internal print and summary methods for derivative textmodel_affinity objects.

Usage

## S3 method for class 'influence.predict.textmodel_affinity'
print(x, n = 30, ...)

## S3 method for class 'influence.predict.textmodel_affinity'
summary(object, ...)

## S3 method for class 'summary.influence.predict.textmodel_affinity'
print(x, n = 30, ...)

Arguments

n

how many coefficients to print before truncating

Value

summary.influence.predict.textmodel_affinity() returns a list classed as summary.influence.predict.textmodel_affinity that includes the mean, the standard deviation, the direction of the influence, the rate, and the support.


Correspondence analysis of a document-feature matrix

Description

textmodel_ca() implements correspondence analysis scaling on a dfm. The method is a fast, sparse version of the ca function from the ca package.

Usage

textmodel_ca(x, smooth = 0, nd = NA, sparse = FALSE, residual_floor = 0.1)

Arguments

x

the dfm on which the model will be fit

smooth

a smoothing parameter for word counts; defaults to zero.

nd

Number of dimensions to be included in output; if NA (the default) then the maximum possible dimensions are included.

sparse

retains the sparsity if TRUE; set to TRUE if x (the dfm) is too large to be allocated as a dense matrix

residual_floor

specifies the threshold for the residual matrix for calculating the truncated SVD. Larger values reduce memory and time costs but may reduce accuracy; only applicable when sparse = TRUE

Details

The svds function in the RSpectra package is used to compute the SVD efficiently.

Value

textmodel_ca() returns a fitted CA textmodel that is a special class of ca object.

Note

You may need to set sparse = TRUE and increase the value of residual_floor to ignore less important information and hence reduce the memory cost when you have a very big dfm. If your attempt to fit the model fails because the matrix is too large, this is probably due to the memory demands of computing the V \times V residual matrix. To avoid this, consider increasing the value of residual_floor in steps of 0.1 until the model can be fit.
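
For instance, a sketch of the sparse pathway (the residual_floor value here is illustrative):

library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
# keep the residual matrix sparse; raise residual_floor if memory is still tight
tmod_sparse <- textmodel_ca(dfmat, sparse = TRUE, residual_floor = 0.2)
summary(tmod_sparse)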

Author(s)

Kenneth Benoit and Haiyan Wang

References

Nenadic, O. & Greenacre, M. (2007). Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca package. Journal of Statistical Software, 20(3). doi:10.18637/jss.v020.i03

See Also

coef.textmodel_lsa(), ca

Examples

library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
tmod <- textmodel_ca(dfmat)
summary(tmod)

Logistic regression classifier for texts

Description

Fits a fast penalized maximum likelihood estimator to predict discrete categories from sparse dfm objects. Using the glmnet package, the function computes the regularization path for the lasso or elasticnet penalty at a grid of values for the regularization parameter lambda. The value of lambda is chosen automatically by cross-validation on several folds of the data at estimation time.

Usage

textmodel_lr(x, y, ...)

Arguments

x

the dfm on which the model will be fit. Does not need to contain only the training documents.

y

vector of training labels associated with each document in x. (These will be converted to factors if not already factors.)

...

additional arguments passed to cv.glmnet()

Value

an object of class textmodel_lr, a list containing the fitted model components

References

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1), 1-22. doi:10.18637/jss.v033.i01

See Also

cv.glmnet(), predict.textmodel_lr(), coef.textmodel_lr()

Examples

## Example from 13.1 of _An Introduction to Information Retrieval_
library("quanteda")
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "London England Chinese",
                 d6 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))
dfmat <- dfm(tokens(corp), tolower = FALSE)

## simulate bigger sample as classification on small samples is problematic
set.seed(1)
dfmat <- dfm_sample(dfmat, 50, replace = TRUE)

## train model
(tmod1 <- textmodel_lr(dfmat, docvars(dfmat, "train")))
summary(tmod1)
coef(tmod1)

## predict probability and classes
predict(tmod1, type = "prob")
predict(tmod1)

Latent Semantic Analysis

Description

Fit the Latent Semantic Analysis scaling model to a dfm, which may be weighted (for instance using quanteda::dfm_tfidf()).

Usage

textmodel_lsa(x, nd = 10, margin = c("both", "documents", "features"))

Arguments

x

the dfm on which the model will be fit

nd

the number of dimensions to be included in output

margin

margin to be smoothed by the SVD

Details

The svds function in the RSpectra package is used to compute the SVD efficiently.

Value

a textmodel_lsa class object, a list containing the components of the decomposition

Note

The number of dimensions nd retained in LSA is an empirical issue. While a reduction in nd can remove much of the noise, keeping too few dimensions or factors may lose important information.

Author(s)

Haiyan Wang and Kohei Watanabe

References

Rosario, B. (2000). Latent Semantic Indexing: An Overview. Technical report INFOSYS 240 Spring Paper, University of California, Berkeley.

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6): 391.

See Also

predict.textmodel_lsa(), coef.textmodel_lsa()

Examples

library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
# create an LSA space and return its truncated representation in the low-rank space
tmod <- textmodel_lsa(dfmat[1:10, ])
head(tmod$docs)

# matrix in low_rank LSA space
tmod$matrix_low_rank[,1:5]

# fold queries into the space generated by dfmat[1:10,]
# and return truncated versions of their representations in the new low-rank space
pred <- predict(tmod, newdata = dfmat[11:14, ])
pred$docs_newspace


Post-estimation methods for textmodel_lsa

Description

Post-estimation methods for fitted textmodel_lsa objects.

Usage

## S3 method for class 'textmodel_lsa'
predict(object, newdata = NULL, ...)

## S3 method for class 'textmodel_lsa'
as.dfm(x)

## S3 method for class 'textmodel_lsa'
coef(object, doc_dim = 1, feat_dim = 1, ...)

coefficients.textmodel_lsa(object, doc_dim = 1, feat_dim = 1, ...)

Arguments

object, x

previously fitted textmodel_lsa object

newdata

new matrix to be transformed into the lsa space

...

unused

doc_dim, feat_dim

the document and feature dimension scores to be extracted

Value

predict() returns a predicted textmodel_lsa object, projecting the patterns onto new data.

coef.textmodel_lsa() extracts model coefficients from a fitted textmodel_lsa object.
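
Examples

A short sketch, following the textmodel_lsa() example in this manual:

library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
tmod <- textmodel_lsa(dfmat[1:10, ], nd = 5)
# project two held-out documents into the fitted space
pred <- predict(tmod, newdata = dfmat[11:12, ])
pred$docs_newspace
# return the low-rank approximation as a dfm
as.dfm(tmod)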


Naive Bayes classifier for texts

Description

Fit a multinomial or Bernoulli Naive Bayes model, given a dfm and some training labels.

Usage

textmodel_nb(
  x,
  y,
  smooth = 1,
  prior = c("uniform", "docfreq", "termfreq"),
  distribution = c("multinomial", "Bernoulli")
)

Arguments

x

the dfm on which the model will be fit. Does not need to contain only the training documents.

y

vector of training labels associated with each document in x. (These will be converted to factors if not already factors.)

smooth

smoothing parameter for feature counts, added to the feature frequency totals by training class

prior

prior distribution on texts; one of "uniform", "docfreq", or "termfreq". See Prior Distributions below.

distribution

count model for text features, can be multinomial or Bernoulli. To fit a "binary multinomial" model, first convert the dfm to a binary matrix using quanteda::dfm_weight(x, scheme = "boolean").

Value

textmodel_nb() returns a list consisting of the following (where N is the total number of documents, V is the total number of features, and k is the total number of training classes):

call

original function call

param

k \times V; class conditional posterior estimates

x

the N \times V training dfm x

y

the N-length y training class vector; documents with NA values in y are not used in training but are retained in the saved x matrix

distribution

character; the distribution of x for the NB model

priors

numeric; the class prior probabilities

smooth

numeric; the value of the smoothing parameter

Prior distributions

Prior distributions refer to the prior probabilities assigned to the training classes, and the choice of prior distribution affects the calculation of the fitted probabilities. The default is uniform priors, which set the unconditional probability of observing any one class to be the same as observing any other class.

"Document frequency" means that the class priors will be taken from the relative proportions of the class documents used in the training set. This approach is so common that it is assumed in many examples, such as the worked example from Manning, Raghavan, and Schütze (2008) below. It is not the default in quanteda, however, since there may be nothing informative in the relative numbers of documents used to train a classifier other than the relative availability of the documents. When training classes are balanced in their number of documents (usually advisable), however, then the empirically computed "docfreq" would be equivalent to "uniform" priors.

Setting prior to "termfreq" makes the priors equal to the proportions of total feature counts found in the grouped documents in each training class, so that the classes with the largest number of features are assigned the largest priors. If the total count of features in each training class was the same, then "uniform" and "termfreq" would be the same.

Smoothing parameter

The smooth value is added to the feature frequencies, aggregated by training class, to avoid zero frequencies in any class. This has the effect of giving more weight to infrequent term occurrences.
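
For concreteness, a worked sketch of this arithmetic for the multinomial model, checked against coef(); the explicit formula in the comments is a reading of the description above, not a quotation from the package:

library("quanteda")
x <- dfm(tokens(c(d1 = "Chinese Beijing Chinese",
                  d2 = "Tokyo Japan Chinese")), tolower = FALSE)
tmod <- textmodel_nb(x, y = c("Y", "N"), smooth = 1)
# assumed smoothed likelihood for class "Y" (trained on d1 alone):
# (feature count + smooth) / (class total + smooth * number of features)
counts <- as.numeric(as.matrix(x)[1, ])
(counts + 1) / (sum(counts) + 1 * nfeat(x))
coef(tmod)  # compare with the class "Y" estimates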

Author(s)

Kenneth Benoit

References

Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press (Chapter 13). Available at https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf.

Jurafsky, D. & Martin, J.H. (2018). From Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of September 23, 2018 (Chapter 6, Naive Bayes). Available at https://web.stanford.edu/~jurafsky/slp3/.

See Also

predict.textmodel_nb()

Examples

## Example from 13.1 of _An Introduction to Information Retrieval_
library("quanteda")
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
x <- dfm(tokens(txt), tolower = FALSE)
y <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

## replicate IIR p261 prediction for test set (document 5)
(tmod1 <- textmodel_nb(x, y, prior = "docfreq"))
summary(tmod1)
coef(tmod1)
predict(tmod1, type = "prob")
predict(tmod1)

# contrast with other priors
predict(textmodel_nb(x, y, prior = "uniform"))
predict(textmodel_nb(x, y, prior = "termfreq"))

## replicate IIR p264 Bernoulli Naive Bayes
tmod2 <- textmodel_nb(x, y, distribution = "Bernoulli", prior = "docfreq")
predict(tmod2, newdata = x[5, ], type = "prob")
predict(tmod2, newdata = x[5, ])

[experimental] Linear SVM classifier for texts

Description

Fit a fast linear SVM classifier for sparse text matrices, using svmlin C++ code written by Vikas Sindhwani and S. Sathiya Keerthi. This method implements the modified finite Newton L2-SVM method (L2-SVM-MFN) method described in Sindhwani and Keerthi (2006). Currently, textmodel_svmlin() only works for two-class problems.

Usage

textmodel_svmlin(
  x,
  y,
  intercept = TRUE,
  lambda = 1,
  cp = 1,
  cn = 1,
  scale = FALSE,
  center = FALSE
)

Arguments

x

the dfm on which the model will be fit. Does not need to contain only the training documents.

y

vector of training labels associated with each document in x. (These will be converted to factors if not already factors.)

intercept

logical; if TRUE, add an intercept to the data

lambda

numeric; regularization parameter lambda (default 1)

cp

numeric; Relative cost for "positive" examples (the second factor level)

cn

numeric; Relative cost for "negative" examples (the first factor level)

scale

logical; if TRUE, normalize the feature counts

center

logical; if TRUE, centre the feature counts

Value

a fitted model object of class textmodel_svmlin

Warning

This function is marked experimental because it does not yet work in a way that translates into the more standard SVM parameterizations. Use with caution, after reading Sindhwani and Keerthi (2006).

References

Vikas Sindhwani and S. Sathiya Keerthi (2006). Large Scale Semi-supervised Linear SVMs. Proceedings of ACM SIGIR. August 6–11, 2006, Seattle.

V. Sindhwani and S. Sathiya Keerthi (2006). Newton Methods for Fast Solution of Semi-supervised Linear SVMs. Book Chapter in Large Scale Kernel Machines, MIT Press, 2006.

See Also

predict.textmodel_svmlin()

Examples

# use Lenihan for govt class and Bruton for opposition
library("quanteda")
docvars(data_corpus_irishbudget2010, "govtopp") <- c("Govt", "Opp", rep(NA, 12))
dfmat <- dfm(tokens(data_corpus_irishbudget2010))

tmod <- textmodel_svmlin(dfmat, y = dfmat$govtopp)
predict(tmod)

Wordfish text model

Description

Estimate Slapin and Proksch's (2008) "wordfish" Poisson scaling model of one-dimensional document positions using conditional maximum likelihood.

Usage

textmodel_wordfish(
  x,
  dir = c(1, 2),
  priors = c(Inf, Inf, 3, 1),
  tol = c(1e-06, 1e-08),
  dispersion = c("poisson", "quasipoisson"),
  dispersion_level = c("feature", "overall"),
  dispersion_floor = 0,
  abs_err = FALSE,
  residual_floor = 0.5
)

Arguments

x

the dfm on which the model will be fit

dir

set global identification by specifying the indexes for a pair of documents such that \hat{\theta}_{dir[1]} < \hat{\theta}_{dir[2]}.

priors

prior precisions for the estimated parameters \alpha_i, \psi_j, \beta_j, and \theta_i, where i indexes documents and j indexes features

tol

tolerances for convergence. The first value is a convergence threshold for the log-posterior of the model, the second value is the tolerance in the difference in parameter values from the iterative conditional maximum likelihood (from conditionally estimating document-level, then feature-level parameters).

dispersion

sets whether a standard Poisson likelihood should be used ("poisson"), or a quasi-Poisson quasi-likelihood with an estimated dispersion parameter ("quasipoisson")

dispersion_level

sets the unit level for the dispersion parameter, options are "feature" for term-level variances, or "overall" for a single dispersion parameter

dispersion_floor

constraint for the minimal underdispersion multiplier in the quasi-Poisson model. Used to minimize the distorting effect of terms with rare term or document frequencies that appear to be severely underdispersed. Default is 0, but this only applies if dispersion = "quasipoisson".

abs_err

specifies how the convergence is considered

residual_floor

specifies the threshold for the residual matrix when calculating the svds; only applies when sparse = TRUE

Details

The returned values match those of Will Lowe's R implementation of wordfish (see the austin package), except that here we have renamed words to be features. (This return list may change.) We have also followed the practice, begun with Slapin and Proksch's early implementation of the model, of using a regularization parameter of se(\sigma) = 3, set through the third element of priors.

Value

An object of class textmodel_fitted_wordfish. This is a list containing:

dir

global identification of the dimension

theta

estimated document positions

alpha

estimated document fixed effects

beta

estimated feature marginal effects

psi

estimated word fixed effects

docs

document labels

features

feature labels

sigma

regularization parameter for betas in Poisson form

ll

log likelihood at convergence

se.theta

standard errors for theta-hats

x

dfm to which the model was fit

Note

In the rare situation where the warning "The algorithm did not converge." appears, removing some documents may solve the problem.

Author(s)

Benjamin Lauderdale, Haiyan Wang, and Kenneth Benoit

References

Slapin, J. & Proksch, S.O. (2008). A Scaling Model for Estimating Time-Series Party Positions from Texts. doi:10.1111/j.1540-5907.2008.00338.x. American Journal of Political Science, 52(3), 705–722.

Lowe, W. & Benoit, K.R. (2013). Validating Estimates of Latent Traits from Textual Data Using Human Judgment as a Benchmark. doi:10.1093/pan/mpt002. Political Analysis, 21(3), 298–313.

See Also

predict.textmodel_wordfish()

Examples

(tmod1 <- textmodel_wordfish(quanteda::data_dfm_lbgexample, dir = c(1,5)))
summary(tmod1, n = 10)
coef(tmod1)
predict(tmod1)
predict(tmod1, se.fit = TRUE)
predict(tmod1, interval = "confidence")

## Not run: 
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
(tmod2 <- textmodel_wordfish(dfmat, dir = c(6,5)))
(tmod3 <- textmodel_wordfish(dfmat, dir = c(6,5),
                             dispersion = "quasipoisson", dispersion_floor = 0))
(tmod4 <- textmodel_wordfish(dfmat, dir = c(6,5),
                             dispersion = "quasipoisson", dispersion_floor = .5))
plot(tmod3$phi, tmod4$phi, xlab = "Min underdispersion = 0", ylab = "Min underdispersion = .5",
     xlim = c(0, 1.0), ylim = c(0, 1.0))
plot(tmod3$phi, tmod4$phi, xlab = "Min underdispersion = 0", ylab = "Min underdispersion = .5",
     xlim = c(0, 1.0), ylim = c(0, 1.0), type = "n")
underdispersedTerms <- sample(which(tmod3$phi < 1.0), 5)
which(featnames(dfmat) %in% names(topfeatures(dfmat, 20)))
text(tmod3$phi, tmod4$phi, tmod3$features,
     cex = .8, xlim = c(0, 1.0), ylim = c(0, 1.0), col = "grey90")
text(tmod3$phi[underdispersedTerms], tmod4$phi[underdispersedTerms],
     tmod3$features[underdispersedTerms],
     cex = .8, xlim = c(0, 1.0), ylim = c(0, 1.0), col = "black")
if (requireNamespace("austin")) {
    tmod5 <- austin::wordfish(quanteda::as.wfm(dfmat), dir = c(6, 5))
    cor(tmod1$theta, tmod5$theta)
}
## End(Not run)

Wordscores text model

Description

textmodel_wordscores implements Laver, Benoit and Garry's (2003) "Wordscores" method for scaling texts on a single dimension, given a set of anchoring or reference texts whose values are set through reference scores. This scale can be fitted in the linear space (as per LBG 2003) or in the logit space (as per Beauchamp 2012). Estimates of virgin or unknown texts are obtained using the predict() method to score documents from a fitted textmodel_wordscores object.

Usage

textmodel_wordscores(x, y, scale = c("linear", "logit"), smooth = 0)

Arguments

x

the dfm on which the model will be trained

y

vector of training scores associated with each document in x

scale

scale on which to score the words; "linear" for classic LBG linear posterior weighted word class differences, or "logit" for log posterior differences

smooth

a smoothing parameter for word counts; defaults to zero to match the LBG (2003) method. See Value below for additional information on the behaviour of this argument.

Details

The textmodel_wordscores() function and the associated predict() method are designed to function in the same manner as stats::predict.lm(). coef() can also be used to extract the word coefficients from the fitted textmodel_wordscores object, and summary() will print a nice summary of the fitted object.

Value

A fitted textmodel_wordscores object. This object will contain a copy of the input data, but in its original form without any smoothing applied. Calling predict.textmodel_wordscores() on this object without specifying a value for newdata, for instance, will predict on the unsmoothed object. This behaviour differs from versions of quanteda <= 1.2.

Author(s)

Kenneth Benoit

References

Laver, M., Benoit, K.R., & Garry, J. (2003). Estimating Policy Positions from Political Text using Words as Data. American Political Science Review, 97(2), 311–331.

Beauchamp, N. (2012). Using Text to Scale Legislatures with Uninformative Voting. New York University Mimeo.

Martin, L.W. & Vanberg, G. (2007). A Robust Transformation Procedure for Interpreting Political Text. Political Analysis 16(1), 93–100. doi:10.1093/pan/mpm010

See Also

predict.textmodel_wordscores() for methods of applying a fitted textmodel_wordscores model object to predict quantities from (other) documents.

Examples

(tmod <- textmodel_wordscores(quanteda::data_dfm_lbgexample, y = c(seq(-1.5, 1.5, .75), NA)))
summary(tmod)
coef(tmod)
predict(tmod)
predict(tmod, rescaling = "lbg")
predict(tmod, se.fit = TRUE, interval = "confidence", rescaling = "mv")

Influence plot for text scaling models

Description

Plot the results of a fitted scaling model, from (e.g.) a predicted textmodel_affinity model.

Usage

textplot_influence(x, n = 30, ...)

Arguments

x

the object output from influence() run on the fitted or predicted scaling model object to be plotted

n

the number of features whose influence will be plotted

...

additional arguments passed to plot()

Value

Creates a base R plot of feature influences, plotting the median influence of each feature against the log10 of the feature's median rate, and invisibly returns the elements from the call to plot().

Author(s)

Patrick Perry and Kenneth Benoit

See Also

textmodel_affinity()

influence.predict.textmodel_affinity()

Examples

tmod <- textmodel_affinity(quanteda::data_dfm_lbgexample, y = c("L", NA, NA, NA, "R", NA))
pred <- predict(tmod)
textplot_influence(influence(pred))