punycoder (used for IDNA/Punycode encoding and
decoding) is now on CRAN. DESCRIPTION requires
punycoder (>= 1.0.0).case_handling is now
"lower_host" (was "keep" for
safe_parse_url(), safe_parse_urls(),
get_clean_url(), and the get_*() accessors,
and "lower" for get_path()). This is the RFC
3986 §6.2.2.1 normalization: the case-insensitive scheme and host fold
to lowercase while the case-sensitive path is preserved. With the
previous defaults, hosts such as WWW.Example.COM and
www.example.com did not fold to one identity, and
get_path() silently lowercased paths (two pages that differ
only by path casing collapsed to one). Pass
case_handling = "keep" to restore the previous
reconstruction, or "lower" to lowercase the whole URL
including the path. (RURL-lzepdnmm)canonical_join() gains name_A /
name_B arguments to set the output original-URL column
names explicitly. They default to NULL, preserving the
previous deparse(substitute()) behavior; supply them for
stable names when piping or passing anonymous inputs
(e.g. canonical_join(df[df$x > 1, ], get_b())), which
otherwise produced unstable column names. (RURL-fsygrelr)
canonical_join() gains a
join_parse_status argument controlling which parse statuses
yield joinable keys. The default "ok" preserves the
previous behavior (only ok* statuses join);
"ok_or_warning" additionally treats the
parseable-but-suspicious warning-* statuses
(warning-no-tld, warning-invalid-tld,
warning-public-suffix) as joinable, at the cost of more
potential false-positive matches. (RURL-edqdrvfu)
Cache introspection and configuration.
rurl_cache_info() reports the entry count, enabled state,
and any bound for each memoization cache (full_parse,
domain, tld). rurl_cache_config()
enables or disables individual caches and sets an optional
max_full_parse bound on the full-parse cache (default
Inf, preserving the previous unbounded behavior); when the
bound is reached the cache is reset so peak memory stays bounded. The
domain and tld caches remain unbounded by
design — they grow with the number of unique hosts, not with URL/option
combinations — and can be disabled for workloads with very many unique
hosts. (RURL-iuotpaqs)
safe_parse_url() now returns port as an
integer (or NA_integer_), and
safe_parse_urls() no longer errors on URLs that contain an
explicit port (e.g. http://example.com:8080/path).
Previously the scalar parser returned the port as a character string and
the vectorized parser aborted. (RURL-fxyzanfg)http://[2001:db8::1]/) are
now correctly detected as IP hosts: is_ip_host is
TRUE, parse_status is "ok", and
no TLD/domain derivation is attempted — matching how IPv4 hosts were
already handled. An over-escaped detection pattern previously prevented
this. (RURL-jpqjndld)subdomain_levels_to_keep = N (for
N > 0) now keeps the N rightmost subdomain
labels as documented, instead of silently retaining all subdomains. For
example,
safe_parse_url("http://deep.sub.domain.example.com", subdomain_levels_to_keep = 1)
now returns host domain.example.com (was
deep.sub.domain.example.com). N = 0 (strip
all) is unchanged. Code that relied on the previous no-op behavior for
N > 0 will see different output. (RURL-szumhumv)clean_url composition: it is a normalized
canonical key built from scheme, host, and path only. Port, query,
fragment, and userinfo are intentionally excluded, and with
path_encoding = "decode" the path is shown decoded
(human-readable, not guaranteed URL-safe). This matches the existing
behavior and the key used by canonical_join() — no behavior
change. Corrected a lower_host description that implied
userinfo could be retained in clean_url, and fixed a README
example whose input contained a literal space (now percent-encoded) so
it parses as documented. (RURL-jnboujtd)v1.RELEASE_NOTES_v1.md.1.0.0 (see
DESCRIPTION).This release adds powerful capabilities for URL normalization and canonical dataset joining. It significantly improves robustness in handling malformed or inconsistent URLs.
case_handling and
trailing_slash_handling parameters in
safe_parse_url() and get_clean_url() provide
greater control over URL formatting.canonical_join() for joining datasets on
normalized URL keys.htp://.example.com:8080/path).curl::curl_parse_url()
fails internally.This release adds robust support for internationalized domain names (IDNs), improves punycode handling, and ensures accurate extraction of TLDs and registered domains.
urltools is unavailablestringipsl package.update_psl.R script to fetch and process
the PSL during development.@param, @return,
etc.) for CRAN compliance.NAMESPACE and removed unnecessary functions
like hello().get_*() functions are now vectorized and work on
character vectors.curl and
psl.mutate() and other tidy
workflows.