The rurl package provides tools to parse, normalize, and
extract information from URLs using a consistent and safe API.
It is fully vectorized and uses a bundled copy of the Public Suffix List for accurate
domain handling.
Use safe_parse_url() to parse URLs robustly:
safe_parse_url("https://sub.example.co.uk/path?q=1")
#> $original_url
#> [1] "https://sub.example.co.uk/path?q=1"
#>
#> $scheme
#> [1] "https"
#>
#> $host
#> [1] "sub.example.co.uk"
#>
#> $port
#> [1] NA
#>
#> $path
#> [1] "/path"
#>
#> $query
#> [1] "q=1"
#>
#> $fragment
#> [1] NA
#>
#> $user
#> [1] NA
#>
#> $password
#> [1] NA
#>
#> $domain
#> [1] "example.co.uk"
#>
#> $tld
#> [1] "co.uk"
#>
#> $is_ip_host
#> [1] FALSE
#>
#> $clean_url
#> [1] "https://sub.example.co.uk/path"
#>
#> $parse_status
#> [1] "ok"
The protocol_handling argument controls how schemes are handled:
- "keep" (default; keeps the current protocol or prepends http:// if missing)
- "none" (doesn’t add, remove, or change protocols)
- "strip" (removes protocols)
- "http" (changes protocols to http:// or adds it if missing)
- "https" (changes protocols to https:// or adds it if missing)
get_scheme("https://sub.example.com")
#> https://sub.example.com
#> "https"
get_host("https://sub.example.com")
#> https://sub.example.com
#> "sub.example.com"
get_path("https://sub.example.com/path/to/page")
#> https://sub.example.com/path/to/page
#> "/path/to/page"
Each function works on vectors of URLs and gracefully handles NA.
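To make the protocol_handling modes listed earlier concrete, here is a minimal base R sketch of the same rules. The helper apply_protocol_handling() is hypothetical and is not part of rurl; the regular expression used to detect a scheme is also an assumption. It is vectorized and passes NA through unchanged, mirroring the behaviour described above.

```r
# Hypothetical helper, NOT part of rurl: mimics the five protocol_handling
# modes with base R regular expressions.
apply_protocol_handling <- function(urls,
                                    mode = c("keep", "none", "strip", "http", "https")) {
  mode <- match.arg(mode)
  # Assumption: a scheme looks like "<letter><letters/digits/+.->://" at the start.
  scheme_re <- "^[A-Za-z][A-Za-z0-9+.-]*://"
  has_scheme <- grepl(scheme_re, urls)
  bare <- sub(scheme_re, "", urls)  # URL with any leading scheme removed
  out <- switch(mode,
    none  = urls,
    strip = bare,
    keep  = ifelse(has_scheme, urls, paste0("http://", urls)),
    http  = paste0("http://", bare),
    https = paste0("https://", bare)
  )
  # paste0() would turn NA into "http://NA", so restore missing values.
  out[is.na(urls)] <- NA_character_
  out
}

urls <- c("https://example.com/a", "example.com/a", NA)
apply_protocol_handling(urls, "keep")
apply_protocol_handling(urls, "strip")
apply_protocol_handling(urls, "https")
```

The real package may treat edge cases (for example protocol-relative URLs such as //example.com) differently; the sketch only covers the cases named in the list.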
These functions rely on the Public Suffix List:
Extracting TLDs from different sources:
Sources include:
- "all" (default; matches the longest available TLD)
- "private" (only extracts private TLDs)
- "icann" (only extracts ICANN TLDs)
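The difference between the three sources can be illustrated with a small base R sketch of longest-suffix matching. Everything here is an assumption made for illustration: the helper longest_suffix(), its source argument name, and the two toy suffix vectors are hypothetical, and the sketch ignores the wildcard and exception rules of the real Public Suffix List. It is not rurl's implementation.

```r
# Toy suffix lists (assumption); the real Public Suffix List is far larger.
icann_suffixes   <- c("com", "io", "uk", "co.uk")
private_suffixes <- c("github.io")

# Hypothetical helper, NOT rurl's implementation: returns the longest suffix
# from the chosen section that matches the end of a single host on a label
# boundary, or NA if no suffix in that section matches.
longest_suffix <- function(host, source = c("all", "private", "icann")) {
  source <- match.arg(source)
  candidates <- switch(source,
    all     = c(icann_suffixes, private_suffixes),
    private = private_suffixes,
    icann   = icann_suffixes
  )
  # Prefix each suffix with "." so that matches respect label boundaries.
  matches <- candidates[endsWith(host, paste0(".", candidates))]
  if (length(matches) == 0) return(NA_character_)
  matches[which.max(nchar(matches))]
}

longest_suffix("myapp.github.io", "all")       # private suffix is the longest match
longest_suffix("myapp.github.io", "icann")     # only the ICANN section is consulted
longest_suffix("sub.example.co.uk", "private") # no private suffix applies
```

With "all", the private entry github.io wins because it is longer than the ICANN entry io; restricting the source changes which candidates are considered, not the matching rule.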
All core functions support vectors and handle malformed inputs safely:
subdomain_levels_to_keep
Several functions, including safe_parse_url(),
get_host(), and get_clean_url(), support the
subdomain_levels_to_keep argument. This allows for
fine-grained control over how many subdomain levels are preserved in the
host component of a URL, after initial
www_handling has been applied.
- NULL (default): No specific subdomain stripping is performed beyond www_handling.
- 0: All subdomains are stripped. If www_handling preserved or added ‘www.’, it remains (e.g., ‘www.sub.example.com’ becomes ‘www.example.com’; ‘sub.example.com’ becomes ‘example.com’).
- N > 0: Keeps up to N levels of subdomains, counted from right to left (closest to the registered domain), in addition to any ‘www.’ prefix.
Here are some examples demonstrating its effect on get_host():
get_host(
"http://www.three.two.one.example.com",
subdomain_levels_to_keep = 0
) # www_handling default is "none"
#> http://www.three.two.one.example.com
#> "www.example.com"
# Expected: "www.example.com"
get_host(
"http://three.two.one.example.com",
www_handling = "strip",
subdomain_levels_to_keep = 0
)
#> http://three.two.one.example.com
#> "example.com"
# Expected: "example.com"
get_host("http://www.three.two.one.example.com", subdomain_levels_to_keep = 1)
#> http://www.three.two.one.example.com
#> "www.one.example.com"
# Expected: "www.one.example.com"
get_host(
"http://three.two.one.example.com",
www_handling = "strip",
subdomain_levels_to_keep = 1
)
#> http://three.two.one.example.com
#> "one.example.com"
# Expected: "one.example.com"
get_host(
"http://www.three.two.one.example.com",
www_handling = "keep",
subdomain_levels_to_keep = 2
)
#> http://www.three.two.one.example.com
#> "www.two.one.example.com"
# Expected: "www.two.one.example.com"
And its effect on get_clean_url():
get_clean_url(
"http://www.deep.sub.example.com/some/path",
subdomain_levels_to_keep = 0,
www_handling = "keep"
)
#> http://www.deep.sub.example.com/some/path
#> "http://www.example.com/some/path"
# Expected: "http://www.example.com/some/path"
get_clean_url(
"http://deep.sub.example.com/some/path",
subdomain_levels_to_keep = 1
)
#> http://deep.sub.example.com/some/path
#> "http://sub.example.com/some/path"
# Expected: "http://sub.example.com/some/path"
Note that get_domain() also accepts subdomain_levels_to_keep, but the argument does not change the
returned domain value, because the domain is derived from the host
before this host modification occurs. The parameter
influences the host component that may be used in other parts of the
safe_parse_url() output, such as clean_url.
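The subdomain_levels_to_keep rules described in this section can be sketched in base R as follows. The helper keep_subdomain_levels() is hypothetical and is not how rurl implements the argument; it assumes the registered domain is already known, that the host ends with it, and that www_handling has already been applied to the host.

```r
# Hypothetical helper, NOT rurl's implementation. `domain` is the registered
# domain (e.g. "example.com") and `host` is assumed to end with it.
keep_subdomain_levels <- function(host, domain, n = NULL) {
  # NULL means "no stripping"; a bare domain has no subdomains to strip.
  if (is.null(n) || identical(host, domain)) return(host)
  # Labels to the left of the registered domain,
  # e.g. c("www", "three", "two", "one").
  prefix <- substr(host, 1, nchar(host) - nchar(domain) - 1)
  labels <- strsplit(prefix, ".", fixed = TRUE)[[1]]
  # A leading "www" is kept separately and does not count towards n.
  has_www <- length(labels) > 0 && labels[1] == "www"
  if (has_www) labels <- labels[-1]
  # Keep up to n labels, counted from the right (closest to the domain).
  kept <- utils::tail(labels, n)
  paste(c(if (has_www) "www", kept, domain), collapse = ".")
}

keep_subdomain_levels("www.three.two.one.example.com", "example.com", 0)
keep_subdomain_levels("www.three.two.one.example.com", "example.com", 2)
keep_subdomain_levels("three.two.one.example.com", "example.com", 1)
```

Under these assumptions the three calls reproduce the hosts shown in the get_host() examples above: "www.example.com", "www.two.one.example.com", and "one.example.com".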