Type: Package
Title: Word and Document Vector Models
Version: 0.5.1
Maintainer: Kohei Watanabe <watanabe.kohei@gmail.com>
Description: Create dense vector representation of words and documents using 'quanteda'. Currently implements Word2vec (Mikolov et al., 2013) <doi:10.48550/arXiv.1310.4546> and Latent Semantic Analysis (Deerwester et al., 1990) <doi:10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9>.
URL: https://github.com/koheiw/wordvector
License: Apache License (≥ 2.0)
Encoding: UTF-8
RoxygenNote: 7.3.2
Depends: R (≥ 3.5.0)
Imports: quanteda (≥ 4.1.0), methods, stringi, Matrix, proxyC, RSpectra, irlba, rsvd
Suggests: testthat, word2vec, spelling
LinkingTo: Rcpp, quanteda
Language: en-US
LazyData: true
NeedsCompilation: yes
Packaged: 2025-06-19 23:32:47 UTC; watan
Author: Kohei Watanabe
Repository: CRAN
Date/Publication: 2025-06-20 08:50:02 UTC
Convert formula to named character vector
Description
Convert a formula to a named character vector for use in analogy tasks.
Usage
analogy(formula)
Arguments
formula |
a formula object that defines the relationship between words using + and - operators. |
Value
a named character vector to be passed to similarity().
See Also
similarity()
Examples
analogy(~ berlin - germany + france)
analogy(~ quick - quickly + slowly)
Extract word vectors
Description
Extract word vectors from a textmodel_wordvector or textmodel_docvector object.
Usage
## S3 method for class 'textmodel_wordvector'
as.matrix(x, normalize = TRUE, ...)
Arguments
x |
a textmodel_wordvector or textmodel_docvector object. |
normalize |
if TRUE, word vectors are normalized to unit length before being returned. |
... |
not used. |
Value
a matrix that contains the word vectors in rows.
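Examples
A minimal sketch, assuming w2v is a Word2vec model trained as in the textmodel_word2vec() examples below:
mat <- as.matrix(w2v, normalize = TRUE)
dim(mat)          # vocabulary size x dim
mat["germany", ]  # the vector for "germany", if present in the vocabulary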
Yahoo News summaries from 2014
Description
A corpus object containing 20,000 news summaries collected from Yahoo News via RSS feeds in 2014. The title and description of each summary are concatenated.
Usage
data_corpus_news2014
Format
An object of class corpus (inherits from character) of length 20000.
References
Watanabe, K. (2018). Newsmap: A semi-supervised approach to geographical news classification. Digital Journalism, 6(3), 294–309. https://doi.org/10.1080/21670811.2017.1293487
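Examples
A quick inspection of the corpus using standard quanteda functions:
library(quanteda)
library(wordvector)
ndoc(data_corpus_news2014)                 # 20000
print(data_corpus_news2014, max_ndoc = 2)  # show the first two summaries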
Print method for trained document vectors
Description
Print method for trained document vectors
Usage
## S3 method for class 'textmodel_docvector'
print(x, ...)
Arguments
x |
the object to be printed. |
... |
not used. |
Value
an invisible copy of x.
Print method for trained word vectors
Description
Print method for trained word vectors
Usage
## S3 method for class 'textmodel_wordvector'
print(x, ...)
Arguments
x |
the object to be printed. |
... |
not used. |
Value
an invisible copy of x.
Compute probability of words
Description
Compute the probability of words given other words.
Usage
probability(x, words, mode = c("words", "values"))
Arguments
x |
a textmodel_wordvector object. |
words |
words for which probability is computed. |
mode |
specify the type of resulting object. |
Value
a matrix of probability scores when mode = "values", or of words sorted in descending order by the probability scores when mode = "words". When words is a named numeric vector, probability scores are weighted by the values.
See Also
similarity()
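Examples
A minimal sketch, assuming w2v is a Word2vec model trained as in the textmodel_word2vec() examples:
head(probability(w2v, c("berlin", "germany"), mode = "words"))
head(probability(w2v, c("berlin" = 1, "germany" = -1), mode = "values"))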
Compute similarity between word vectors
Description
Compute the cosine similarity between word vectors for selected words.
Usage
similarity(x, words, mode = c("words", "values"))
Arguments
x |
a textmodel_wordvector object. |
words |
words for which similarity is computed. |
mode |
specify the type of resulting object. |
Value
a matrix of cosine similarity scores when mode = "values", or of words sorted in descending order by the similarity scores when mode = "words". When words is a named numeric vector, word vectors are weighted and summed before computing similarity scores.
See Also
analogy(), probability()
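Examples
A minimal sketch, assuming w2v is a Word2vec model trained as in the textmodel_word2vec() examples:
head(similarity(w2v, c("berlin", "germany"), mode = "words"))
# a named numeric vector weights and sums the word vectors before scoring
head(similarity(w2v, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))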
Create distributed representation of documents
Description
Create distributed representation of documents as weighted word vectors.
Usage
textmodel_doc2vec(
x,
model,
normalize = FALSE,
weights = 1,
pattern = NULL,
group_data = FALSE,
...
)
Arguments
x |
a quanteda::tokens or quanteda::dfm object. |
model |
a textmodel_wordvector object. |
normalize |
if TRUE, normalized document vectors are returned. |
weights |
weight the word vectors by user-provided values; either a single value or multiple values sorted in the same order as the word vectors. |
pattern |
quanteda::pattern to select words to which the weights are applied. |
group_data |
if TRUE, group the documents in x before computing document vectors. |
... |
additional arguments passed to quanteda::object2id. |
Value
Returns a textmodel_docvector object with the following elements:
values |
a matrix for document vectors. |
dim |
the size of the document vectors. |
concatenator |
the concatenator in x. |
docvars |
document variables copied from x. |
normalize |
whether the document vectors are normalized. |
call |
the command used to execute the function. |
version |
the version of the wordvector package. |
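Examples
A minimal sketch, assuming toks and w2v from the textmodel_word2vec() examples:
d2v <- textmodel_doc2vec(toks, model = w2v, normalize = TRUE)
dim(as.matrix(d2v))  # number of documents x dim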
Latent Semantic Analysis model
Description
Train a Latent Semantic Analysis model (Deerwester et al., 1990) on a quanteda::tokens object.
Usage
textmodel_lsa(
x,
dim = 50,
min_count = 5L,
engine = c("RSpectra", "irlba", "rsvd"),
weight = "count",
tolower = TRUE,
verbose = FALSE,
...
)
Arguments
x |
a quanteda::tokens or quanteda::tokens_xptr object. |
dim |
the size of the word vectors. |
min_count |
the minimum frequency of the words. Words less frequent than this in x are removed before training. |
engine |
the engine used to perform SVD to generate word vectors. |
weight |
weighting scheme passed to quanteda::dfm_weight(). |
tolower |
if TRUE, lower-case all the tokens before fitting the model. |
verbose |
if TRUE, print the progress of training. |
... |
additional arguments. |
Value
Returns a textmodel_wordvector object with the following elements:
values |
a matrix for word vectors values. |
weights |
a matrix for word vectors weights. |
frequency |
the frequency of words in x. |
engine |
the SVD engine used. |
weight |
weighting scheme. |
min_count |
the value of min_count. |
concatenator |
the concatenator in x. |
call |
the command used to execute the function. |
version |
the version of the wordvector package. |
References
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.
Examples
library(quanteda)
library(wordvector)
# pre-processing
corp <- corpus_reshape(data_corpus_news2014)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE) %>%
    tokens_tolower()
# train LSA
lsa <- textmodel_lsa(toks, dim = 50, min_count = 5, verbose = TRUE)
# find similar words
head(similarity(lsa, c("berlin", "germany", "france"), mode = "words"))
head(similarity(lsa, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(lsa, analogy(~ berlin - germany + france)))
Word2vec model
Description
Train a Word2vec model (Mikolov et al., 2013) in different architectures on a quanteda::tokens object.
Usage
textmodel_word2vec(
x,
dim = 50,
type = c("cbow", "skip-gram"),
min_count = 5,
window = ifelse(type == "cbow", 5, 10),
iter = 10,
alpha = 0.05,
model = NULL,
use_ns = TRUE,
ns_size = 5,
sample = 0.001,
tolower = TRUE,
include_data = FALSE,
verbose = FALSE,
...
)
Arguments
x |
a quanteda::tokens or quanteda::tokens_xptr object. |
dim |
the size of the word vectors. |
type |
the architecture of the model; either "cbow" (continuous bag of words) or "skip-gram". |
min_count |
the minimum frequency of the words. Words less frequent than this in x are removed before training. |
window |
the size of the word window. Words within this window are considered to be the context of a target word. |
iter |
the number of iterations in model training. |
alpha |
the initial learning rate. |
model |
a trained Word2vec model; if provided, its word vectors are updated for the new data in x. |
use_ns |
if TRUE, use negative sampling; otherwise, use hierarchical softmax. |
ns_size |
the size of negative samples. Only used when use_ns = TRUE. |
sample |
the rate of sampling of words based on their frequency. Sampling is disabled when sample = 1.0. |
tolower |
lower-case all the tokens before fitting the model. |
include_data |
if TRUE, include the original data in the result object. |
verbose |
if TRUE, print the progress of training. |
... |
additional arguments. |
Details
Users can change the number of threads used for parallel computing via options(wordvector_threads).
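For example, to train with four threads (set this before calling textmodel_word2vec()):
options(wordvector_threads = 4L)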
Value
Returns a textmodel_wordvector object with the following elements:
values |
a matrix for word vector values. |
weights |
a matrix for word vector weights. |
dim |
the size of the word vectors. |
type |
the architecture of the model. |
frequency |
the frequency of words in x. |
window |
the size of the word window. |
iter |
the number of iterations in model training. |
alpha |
the initial learning rate. |
use_ns |
the use of negative sampling. |
ns_size |
the size of negative samples. |
min_count |
the value of min_count. |
concatenator |
the concatenator in x. |
data |
the original data supplied as x. |
call |
the command used to execute the function. |
version |
the version of the wordvector package. |
References
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. https://arxiv.org/abs/1310.4546.
Examples
library(quanteda)
library(wordvector)
# pre-processing
corp <- data_corpus_news2014
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE) %>%
    tokens_tolower()
# train word2vec
w2v <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, sample = 0.001)
# find similar words
head(similarity(w2v, c("berlin", "germany", "france"), mode = "words"))
head(similarity(w2v, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(w2v, analogy(~ berlin - germany + france), mode = "words"))