--- title: "Convenience Functions: Joins, Schemas, and Data Wrangling" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Convenience Functions: Joins, Schemas, and Data Wrangling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` brightspaceR ships with convenience functions that handle the repetitive work of joining Brightspace Data Sets (BDS) and parsing column types. This article walks through each one. ## Fetching datasets ### Single dataset by name `bs_get_dataset()` is the workhorse. It looks up a dataset by name, downloads the latest full extract, parses the CSV with the correct column types, and returns a tidy tibble with snake_case column names: ```{r} library(brightspaceR) users <- bs_get_dataset("Users") users #> # A tibble: 12,430 x 14 #> user_id first_name last_name org_defined_id ... #> ... ``` ### Discovery If you don't know the exact dataset name, use `bs_list_datasets()` to browse everything available, or `bs_search_datasets()` to filter by keyword: ```{r} # All datasets bs_list_datasets() # Find grade-related datasets bs_list_datasets() |> dplyr::filter(grepl("grade", name, ignore.case = TRUE)) ``` ### Bulk download Download everything at once as a named list. Useful for building a local data warehouse or for pipelines that need multiple datasets: ```{r} all_data <- bs_download_all() names(all_data) #> [1] "users" "user_enrollments" "org_units" #> [4] "grade_results" "grade_objects" ... ``` ## Joining datasets BDS datasets are normalized -- users, enrollments, grades, and org units live in separate tables linked by ID columns. brightspaceR provides two ways to join them. ### Smart join: `bs_join()` `bs_join()` examines both data frames, finds columns ending in `_id` that appear in both, and performs a left join on those columns: ```{r} users <- bs_get_dataset("Users") enrollments <- bs_get_dataset("User Enrollments") # Automatically joins on user_id combined <- bs_join(users, enrollments) ``` This works well for most pairs of datasets. Under the hood it uses the schema registry to identify key columns. ### Named join functions For explicit, self-documenting code, use the named join functions. Each specifies exactly which key columns are used: ```{r} # Users + Enrollments (by user_id) bs_join_users_enrollments(users, enrollments) # Enrollments + Grades (by org_unit_id and user_id) grades <- bs_get_dataset("Grade Results") bs_join_enrollments_grades(enrollments, grades) # Grades + Grade Objects (by grade_object_id and org_unit_id) grade_objects <- bs_get_dataset("Grade Objects") bs_join_grades_objects(grades, grade_objects) # Enrollments + Org Units (by org_unit_id) org_units <- bs_get_dataset("Org Units") bs_join_enrollments_orgunits(enrollments, org_units) # Enrollments + Roles (by role_id) roles <- bs_get_dataset("Role Details") bs_join_enrollments_roles(enrollments, roles) # Content Objects + User Progress (by content_object_id and org_unit_id) content <- bs_get_dataset("Content Objects") progress <- bs_get_dataset("Content User Progress") bs_join_content_progress(content, progress) ``` All join functions use `dplyr::left_join()`, so the first argument determines which rows are preserved. ### Chaining joins Build a complete grade report by chaining joins with the pipe: ```{r} grade_report <- bs_get_dataset("User Enrollments") |> bs_join_enrollments_grades(bs_get_dataset("Grade Results")) |> bs_join_grades_objects(bs_get_dataset("Grade Objects")) grade_report #> # A tibble: 523,041 x 28 #> user_id org_unit_id role_id grade_object_id points_numerator ... ``` ## Schemas and column types ### Why schemas matter BDS exports everything as CSV with PascalCase column names and string values. Without schema information, dates come through as character, booleans as "True"/"False" strings, and IDs as text. brightspaceR's schema registry fixes this automatically. ### Registered schemas The package knows the column types for ~20 common BDS datasets: ```{r} bs_list_schemas() #> [1] "users" "user_enrollments" #> [3] "org_units" "org_unit_types" #> [5] "grade_objects" "grade_results" #> [7] "content_objects" "content_user_progress" #> [9] "quiz_attempts" "quiz_user_answers" #> [11] "discussion_posts" "discussion_topics" #> [13] "assignment_submissions" "attendance_registers" #> [15] "attendance_records" "role_details" #> [17] "course_offerings" "final_grades" #> [19] "enrollments_and_withdrawals" "organizational_unit_ancestors" ``` ### Inspecting a schema Each schema defines column types, date columns, boolean columns, and key columns (used for joining): ```{r} schema <- bs_get_schema("grade_results") schema$key_cols #> [1] "GradeObjectId" "OrgUnitId" "UserId" schema$date_cols #> [1] "LastModified" schema$bool_cols #> [1] "IsReleased" "IsDropped" ``` ### Unknown datasets For datasets without a registered schema, brightspaceR applies intelligent type coercion: - Columns that look numeric are parsed as `double` - Columns containing "True"/"False" or "0"/"1" become `logical` - Columns matching ISO 8601 date patterns become `POSIXct` - Everything else stays `character` This means `bs_get_dataset()` returns usable tibbles even for datasets the package doesn't know about. ### Column name conversion All BDS PascalCase names are automatically converted to snake_case: ```{r} bs_clean_names(c("UserId", "OrgUnitId", "FirstName", "LastAccessedDate")) #> [1] "user_id" "org_unit_id" "first_name" #> [4] "last_accessed_date" ``` ## Common patterns ### Enrollment counts by role ```{r} library(dplyr) bs_get_dataset("User Enrollments") |> bs_join_enrollments_roles(bs_get_dataset("Role Details")) |> count(role_name, sort = TRUE) #> # A tibble: 8 x 2 #> role_name n #> #> 1 Student 108330 #> 2 Instructor 5753 #> 3 ... ``` ### Grade summary for a course ```{r} org_units <- bs_get_dataset("Org Units") grades <- bs_get_dataset("Grade Results") grade_objs <- bs_get_dataset("Grade Objects") # Find the course course <- org_units |> filter(grepl("STAT101", name)) # Build grade report grades |> filter(org_unit_id %in% course$org_unit_id) |> bs_join_grades_objects(grade_objs) |> group_by(name) |> summarise( n_students = n_distinct(user_id), mean_score = mean(points_numerator, na.rm = TRUE), .groups = "drop" ) ``` ### Active users in the last 90 days ```{r} bs_get_dataset("Users") |> filter( is_active == TRUE, last_accessed >= Sys.time() - as.difftime(90, units = "days") ) |> nrow() ``` ### Content completion rates ```{r} content <- bs_get_dataset("Content Objects") progress <- bs_get_dataset("Content User Progress") bs_join_content_progress(content, progress) |> group_by(title) |> summarise( n_users = n_distinct(user_id), n_completed = sum(!is.na(completed_date)), completion_rate = n_completed / n_users, .groups = "drop" ) |> arrange(desc(n_users)) ```