Safe URL Parsing

safe_parse_url() splits a URL into a named list of components and records the outcome in parse_status:

safe_parse_url("https://sub.example.co.uk/path?q=1")
#> $original_url
#> [1] "https://sub.example.co.uk/path?q=1"
#> 
#> $scheme
#> [1] "https"
#> 
#> $host
#> [1] "sub.example.co.uk"
#> 
#> $port
#> [1] NA
#> 
#> $path
#> [1] "/path"
#> 
#> $query
#> [1] "q=1"
#> 
#> $fragment
#> [1] NA
#> 
#> $user
#> [1] NA
#> 
#> $password
#> [1] NA
#> 
#> $domain
#> [1] "example.co.uk"
#> 
#> $tld
#> [1] "co.uk"
#> 
#> $is_ip_host
#> [1] FALSE
#> 
#> $clean_url
#> [1] "https://sub.example.co.uk/path"
#> 
#> $parse_status
#> [1] "ok"

The protocol_handling argument controls how schemes are handled:

"keep" (default; keeps the current protocol or prepends http:// if missing)
"none" (doesn’t add, remove, or change protocols)
"strip" (removes protocols)
"http" (changes protocols to http:// or adds it if missing)
"https" (changes protocols to https:// or adds it if missing)
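The five modes can be summarised with a small base R sketch. It is not rurl's implementation: apply_protocol() is a hypothetical helper written only to illustrate the rules listed above, and it assumes a protocol is simply a leading scheme:// prefix.

```r
# Hypothetical helper illustrating the five protocol_handling modes.
# Illustration only; this is not how rurl implements them.
apply_protocol <- function(url,
                           protocol_handling = c("keep", "none", "strip",
                                                 "http", "https")) {
  protocol_handling <- match.arg(protocol_handling)
  scheme_pattern <- "^[A-Za-z][A-Za-z0-9+.-]*://"
  has_scheme <- grepl(scheme_pattern, url)
  bare <- sub(scheme_pattern, "", url)  # URL with any protocol removed

  out <- switch(
    protocol_handling,
    keep  = ifelse(has_scheme, url, paste0("http://", url)),
    none  = url,
    strip = bare,
    http  = paste0("http://", bare),
    https = paste0("https://", bare)
  )
  out[is.na(url)] <- NA_character_  # missing input stays missing
  out
}

urls <- c("example.com", "https://example.com", NA)
apply_protocol(urls, "keep")
apply_protocol(urls, "strip")
apply_protocol(urls, "https")
```

The sketch makes the difference between "keep" and "none" visible: both leave an existing protocol alone, but only "keep" adds http:// to a bare host such as example.com.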

Extracting URL Components

get_scheme("https://sub.example.com")
#> https://sub.example.com 
#>                 "https"
get_host("https://sub.example.com")
#> https://sub.example.com 
#>       "sub.example.com"
get_path("https://sub.example.com/path/to/page")
#> https://sub.example.com/path/to/page 
#>                      "/path/to/page"

Each function works on vectors of URLs and gracefully handles NA.

Domain and TLD Parsing

These functions rely on the Public Suffix List:

get_domain("https://a.b.example.co.uk")
#> https://a.b.example.co.uk 
#>           "example.co.uk"

Extracting TLDs from different sources:

get_tld("https://foo.blogspot.com")
#> https://foo.blogspot.com 
#>           "blogspot.com"

Sources include:

"all" (default; matches the longest available TLD)
"private" (extracts only private TLDs)
"icann" (extracts only ICANN TLDs)
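The difference between the three sources can be illustrated with a longest-suffix match against a tiny suffix table. This is a sketch under stated assumptions, not rurl's code: the four-entry table, the match_tld() helper, and its source argument name are all hypothetical, and the wildcard and exception rules of the real Public Suffix List are ignored.

```r
# Hypothetical four-entry excerpt of a suffix list, tagged by section.
# The real Public Suffix List is far larger and also has wildcard and
# exception rules, which this sketch ignores.
suffixes <- data.frame(
  suffix  = c("com", "uk", "co.uk", "blogspot.com"),
  section = c("icann", "icann", "icann", "private"),
  stringsAsFactors = FALSE
)

# Return the longest suffix of `host` found in the chosen section(s).
match_tld <- function(host, source = c("all", "private", "icann")) {
  source <- match.arg(source)
  pool <- if (source == "all") {
    suffixes$suffix
  } else {
    suffixes$suffix[suffixes$section == source]
  }
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  # Try candidates from longest to shortest, so the first hit is the longest.
  for (i in seq_along(labels)) {
    candidate <- paste(labels[i:length(labels)], collapse = ".")
    if (candidate %in% pool) return(candidate)
  }
  NA_character_
}

match_tld("foo.blogspot.com", "all")
match_tld("foo.blogspot.com", "icann")
match_tld("a.b.example.co.uk", "private")
```

With this table, "all" picks the private suffix blogspot.com because it is longer than com, "icann" falls back to com, and "private" finds nothing for a host that has no private suffix.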

Vectorization and Edge Cases

All core functions support vectors and handle malformed inputs safely:

urls <- c("example.com", "http://example.com", NA)
get_clean_url(urls)
#>           example.com    http://example.com                  <NA> 
#> "http://example.com/" "http://example.com/"                    NA

Advanced Host Manipulation with `subdomain_levels_to_keep`

Several functions, including safe_parse_url(), get_host(), and get_clean_url(), support the subdomain_levels_to_keep argument. It controls how many subdomain levels are kept in the host component of a URL and is applied after www_handling.

NULL (Default): No specific subdomain stripping is performed beyond www_handling.
0: All subdomains are stripped. If www_handling preserved or added ‘www.’, it remains (e.g., ‘www.sub.example.com’ becomes ‘www.example.com’; ‘sub.example.com’ becomes ‘example.com’).
N > 0: Keeps up to N levels of subdomains, counted from right-to-left (closest to the registered domain), in addition to any ‘www.’ prefix.
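The rules above can be sketched in base R. This is a rough illustration, not rurl's implementation: trim_subdomains() is a hypothetical helper, it assumes www_handling has already been applied, and it expects the caller to supply the registered domain instead of looking it up in the Public Suffix List.

```r
# Hypothetical helper illustrating subdomain_levels_to_keep.
# Assumes `host` ends with `domain` and www_handling was already applied.
trim_subdomains <- function(host, domain, levels_to_keep = NULL) {
  # NULL means no trimming; a bare registered domain has nothing to trim.
  if (is.null(levels_to_keep) || host == domain) return(host)

  # Labels to the left of the registered domain, e.g. "www.three.two.one".
  prefix <- substr(host, 1, nchar(host) - nchar(domain) - 1)
  subs <- strsplit(prefix, ".", fixed = TRUE)[[1]]

  # A leading "www" is set aside and does not count as a level.
  has_www <- length(subs) > 0 && subs[1] == "www"
  if (has_www) subs <- subs[-1]

  # Keep the N labels closest to the registered domain.
  kept <- utils::tail(subs, levels_to_keep)
  paste(c(if (has_www) "www", kept, domain), collapse = ".")
}

trim_subdomains("www.three.two.one.example.com", "example.com", 0)
trim_subdomains("www.three.two.one.example.com", "example.com", 2)
trim_subdomains("three.two.one.example.com", "example.com", 1)
```

Counting from the right is what makes level 1 keep one rather than three: the label nearest the registered domain is always the last to be dropped.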

Here are some examples demonstrating its effect on get_host():

get_host(
  "http://www.three.two.one.example.com",
  subdomain_levels_to_keep = 0
) # www_handling default is "none"
#> http://www.three.two.one.example.com 
#>                    "www.example.com"
# Expected: "www.example.com"

get_host(
  "http://three.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 0
)
#> http://three.two.one.example.com 
#>                    "example.com"
# Expected: "example.com"

get_host("http://www.three.two.one.example.com", subdomain_levels_to_keep = 1)
#> http://www.three.two.one.example.com 
#>                "www.one.example.com"
# Expected: "www.one.example.com"

get_host(
  "http://three.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 1
)
#> http://three.two.one.example.com 
#>                "one.example.com"
# Expected: "one.example.com"

get_host(
  "http://www.three.two.one.example.com",
  www_handling = "keep",
  subdomain_levels_to_keep = 2
)
#> http://www.three.two.one.example.com 
#>            "www.two.one.example.com"
# Expected: "www.two.one.example.com"

And its effect on get_clean_url():

get_clean_url(
  "http://www.deep.sub.example.com/some/path",
  subdomain_levels_to_keep = 0,
  www_handling = "keep"
)
#> http://www.deep.sub.example.com/some/path 
#>        "http://www.example.com/some/path"
# Expected: "http://www.example.com/some/path"

get_clean_url(
  "http://deep.sub.example.com/some/path",
  subdomain_levels_to_keep = 1
)
#> http://deep.sub.example.com/some/path 
#>    "http://sub.example.com/some/path"
# Expected: "http://sub.example.com/some/path"

Note that get_domain() also accepts subdomain_levels_to_keep, but the argument does not change the returned domain, because the domain is derived from the host before the host is trimmed. The argument affects only the host component and the parts of the safe_parse_url() output built from it, such as clean_url.
