Getting Started with rurl

Introduction

The rurl package provides tools to parse, normalize, and extract information from URLs using a consistent and safe API.
It is fully vectorized and uses a bundled copy of the Public Suffix List for accurate domain handling.

Safe URL Parsing

Use safe_parse_url() to parse URLs robustly:

safe_parse_url("https://sub.example.co.uk/path?q=1")
#> $original_url
#> [1] "https://sub.example.co.uk/path?q=1"
#> 
#> $scheme
#> [1] "https"
#> 
#> $host
#> [1] "sub.example.co.uk"
#> 
#> $port
#> [1] NA
#> 
#> $path
#> [1] "/path"
#> 
#> $query
#> [1] "q=1"
#> 
#> $fragment
#> [1] NA
#> 
#> $user
#> [1] NA
#> 
#> $password
#> [1] NA
#> 
#> $domain
#> [1] "example.co.uk"
#> 
#> $tld
#> [1] "co.uk"
#> 
#> $is_ip_host
#> [1] FALSE
#> 
#> $clean_url
#> [1] "https://sub.example.co.uk/path"
#> 
#> $parse_status
#> [1] "ok"

The protocol_handling argument controls how schemes are handled.

Extracting URL Components

get_scheme("https://sub.example.com")
#> https://sub.example.com 
#>                 "https"
get_host("https://sub.example.com")
#> https://sub.example.com 
#>       "sub.example.com"
get_path("https://sub.example.com/path/to/page")
#> https://sub.example.com/path/to/page 
#>                      "/path/to/page"

Each function works on vectors of URLs and gracefully handles NA.
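As a minimal sketch of the vectorized usage described above, the three extractors can be applied to one character vector and collected side by side. The URLs are made up for illustration, the code assumes each function returns one element per input (as the text states), and the block is skipped if rurl is not installed.

```r
# Illustrative input: two URLs and a missing value.
urls <- c("https://sub.example.com/path/to/page", "http://example.org", NA)

if (requireNamespace("rurl", quietly = TRUE)) {
  # unname() drops the URL names that the extractors attach to their results.
  components <- data.frame(
    url    = urls,
    scheme = unname(rurl::get_scheme(urls)),
    host   = unname(rurl::get_host(urls)),
    path   = unname(rurl::get_path(urls)),
    stringsAsFactors = FALSE
  )
  print(components)
}
```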

Domain and TLD Parsing

These functions rely on the Public Suffix List:

get_domain("https://a.b.example.co.uk")
#> https://a.b.example.co.uk 
#>           "example.co.uk"

Extracting TLDs from different sources:

get_tld("https://foo.blogspot.com")
#> https://foo.blogspot.com 
#>           "blogspot.com"

Sources include:

- "all" (default; matches the longest available TLD)
- "private" (extracts only private TLDs)
- "icann" (extracts only ICANN TLDs)
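The idea behind longest-suffix matching can be sketched in base R with a tiny hand-written suffix table. This is an illustration only, not rurl's implementation: toy_psl, toy_tld(), and toy_domain() are hypothetical names, the four-row table stands in for the real Public Suffix List, and wildcard and exception rules are ignored.

```r
# Toy suffix table; the real Public Suffix List is far larger.
toy_psl <- data.frame(
  suffix  = c("com", "uk", "co.uk", "blogspot.com"),
  section = c("icann", "icann", "icann", "private"),
  stringsAsFactors = FALSE
)

# Return the longest suffix of `host` found in the chosen section(s).
toy_tld <- function(host, source = c("all", "private", "icann")) {
  source <- match.arg(source)
  rules <- if (source == "all") {
    toy_psl$suffix
  } else {
    toy_psl$suffix[toy_psl$section == source]
  }
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  # Candidate suffixes, longest first: "a.b.c", "b.c", "c".
  candidates <- vapply(
    seq_along(labels),
    function(i) paste(labels[i:length(labels)], collapse = "."),
    character(1)
  )
  hit <- candidates[candidates %in% rules]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Registrable domain: the matched suffix plus one more label to its left.
toy_domain <- function(host) {
  tld <- toy_tld(host, "all")
  if (is.na(tld)) return(NA_character_)
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n_tld <- length(strsplit(tld, ".", fixed = TRUE)[[1]])
  if (length(labels) <= n_tld) return(NA_character_)
  paste(tail(labels, n_tld + 1), collapse = ".")
}

toy_tld("foo.blogspot.com")            # "blogspot.com"
toy_tld("foo.blogspot.com", "icann")   # "com"
toy_tld("foo.blogspot.com", "private") # "blogspot.com"
toy_domain("a.b.example.co.uk")        # "example.co.uk"
```

In this toy version a host with no matching suffix in the requested section yields NA; how rurl itself behaves in that case is not covered here.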

Vectorization and Edge Cases

All core functions support vectors and handle malformed inputs safely:

urls <- c("example.com", "http://example.com", NA)
get_clean_url(urls)
#>           example.com    http://example.com                  <NA> 
#> "http://example.com/" "http://example.com/"                    NA
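Because missing inputs come back as NA rather than raising an error, results can be filtered with ordinary vector operations. The following is a small sketch of that pattern using the same input as above; it runs only if rurl is installed.

```r
urls <- c("example.com", "http://example.com", NA)

if (requireNamespace("rurl", quietly = TRUE)) {
  cleaned <- rurl::get_clean_url(urls)
  # Keep only the inputs that produced a usable URL, then deduplicate.
  usable <- unique(unname(cleaned[!is.na(cleaned)]))
  print(usable)
}
```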

Advanced Host Manipulation with subdomain_levels_to_keep

Several functions, including safe_parse_url(), get_host(), and get_clean_url(), support the subdomain_levels_to_keep argument. It controls how many subdomain levels are preserved in the host component of a URL and takes effect after www_handling has been applied. In the examples below, a leading www that survives www_handling is kept in addition to the requested number of levels.

Here are some examples demonstrating its effect on get_host():

get_host(
  "http://www.three.two.one.example.com",
  subdomain_levels_to_keep = 0
) # www_handling default is "none"
#> http://www.three.two.one.example.com 
#>                    "www.example.com"
# Expected: "www.example.com"

get_host(
  "http://three.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 0
)
#> http://three.two.one.example.com 
#>                    "example.com"
# Expected: "example.com"

get_host("http://www.three.two.one.example.com", subdomain_levels_to_keep = 1)
#> http://www.three.two.one.example.com 
#>                "www.one.example.com"
# Expected: "www.one.example.com"

get_host(
  "http://three.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 1
)
#> http://three.two.one.example.com 
#>                "one.example.com"
# Expected: "one.example.com"

get_host(
  "http://www.three.two.one.example.com",
  www_handling = "keep",
  subdomain_levels_to_keep = 2
)
#> http://www.three.two.one.example.com 
#>            "www.two.one.example.com"
# Expected: "www.two.one.example.com"

And its effect on get_clean_url():

get_clean_url(
  "http://www.deep.sub.example.com/some/path",
  subdomain_levels_to_keep = 0,
  www_handling = "keep"
)
#> http://www.deep.sub.example.com/some/path 
#>        "http://www.example.com/some/path"
# Expected: "http://www.example.com/some/path"

get_clean_url(
  "http://deep.sub.example.com/some/path",
  subdomain_levels_to_keep = 1
)
#> http://deep.sub.example.com/some/path 
#>    "http://sub.example.com/some/path"
# Expected: "http://sub.example.com/some/path"
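The trimming behavior in the examples above can be approximated in base R. This is a sketch inferred from those outputs, not rurl's actual code: toy_trim_host() and its keep argument are hypothetical, the registrable domain is passed in by hand, and no difference between www_handling = "none" and "keep" is modeled.

```r
# Keep the rightmost `keep` subdomain labels of `host`, given its domain.
toy_trim_host <- function(host, domain, keep,
                          www_handling = c("none", "strip", "keep")) {
  www_handling <- match.arg(www_handling)
  # Labels to the left of the registrable domain.
  if (identical(host, domain)) {
    labels <- character(0)
  } else {
    prefix <- substr(host, 1, nchar(host) - nchar(domain) - 1)
    labels <- strsplit(prefix, ".", fixed = TRUE)[[1]]
  }
  # Set a leading "www" aside so it does not count as a subdomain level.
  has_www <- length(labels) > 0 && labels[1] == "www"
  if (has_www) labels <- labels[-1]
  kept <- tail(labels, keep)
  # Assumption: "none" and "keep" both preserve an existing "www".
  if (has_www && www_handling != "strip") kept <- c("www", kept)
  paste(c(kept, domain), collapse = ".")
}

toy_trim_host("www.three.two.one.example.com", "example.com", 0)
# "www.example.com"
toy_trim_host("three.two.one.example.com", "example.com", 1, "strip")
# "one.example.com"
toy_trim_host("www.three.two.one.example.com", "example.com", 2, "keep")
# "www.two.one.example.com"
```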

Note that get_domain() also accepts subdomain_levels_to_keep, but the argument does not change the returned domain, because the domain is derived from the host before the subdomain trimming is applied. The parameter affects only the host component, which may be used in other parts of the safe_parse_url() output, such as clean_url.

Summary