Extract Values from an Excel File Based on Row and Column Identifiers
Source:R/extract_values.R
extract_values.Rd
This function reads an Excel file and extracts structured data by identifying row and column labels. It ensures that each row label is only matched with the closest column label to its right and only with column labels from the nearest preceding row that contains column headers.
You can use both regex patterns and explicit identifiers together.
If
row_identifiers
and/orcol_identifiers
are provided, they take priority overrow_pattern
andcol_pattern
.If both explicit identifiers and regex patterns are provided, the function will first match on identifiers. Any additional matches from regex patterns will be included.
Usage
extract_values(
file,
row_pattern = NULL,
col_pattern = NULL,
row_identifiers = NULL,
col_identifiers = NULL,
clean_description = TRUE
)
Arguments
- file
Path to the Excel file.
- row_pattern
A regex pattern to identify row labels. Used only if
row_identifiers
does not fully capture all row labels. (Example:".*_row$"
).- col_pattern
A regex pattern to identify column labels. Used only if
col_identifiers
does not fully capture all column labels. (Example:".*_col$"
).- row_identifiers
A character vector of row label identifiers. If provided, takes precedence over
row_pattern
, but does not disable it.- col_identifiers
A character vector of column label identifiers. If provided, takes precedence over
col_pattern
, but does not disable it.- clean_description
Logical. If TRUE, removes text matching
row_pattern
andcol_pattern
from row and column labels (default: TRUE).
Examples
# Extract values using regex patterns
dataset <- extract_values(rowcolr_example("example.xlsx"), row_pattern = ".*_row$", col_pattern = ".*_col$")
dataset
#> # A tibble: 28 × 13
#> filename sheet row col row_label col_label description data_type error
#> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 example.xl… Bala… 2 4 name_row value_col _ character NA
#> 2 example.xl… Bala… 3 4 year_row value_col _ numeric NA
#> 3 example.xl… Bala… 4 4 kvk_row value_col _ numeric NA
#> 4 example.xl… Bala… 7 10 test1_row 2026_col _ character NA
#> 5 example.xl… Bala… 8 10 tes2_row 2026_col _ character NA
#> 6 example.xl… Bala… 9 4 int_fixe… 2025_col _ numeric NA
#> 7 example.xl… Bala… 10 4 tang_fix… 2025_col _ numeric NA
#> 8 example.xl… Bala… 11 4 fin_fixe… 2025_col _ numeric NA
#> 9 example.xl… Bala… 13 4 tot_fixe… 2025_col _ numeric NA
#> 10 example.xl… Bala… 16 4 stock_row 2025_col _ numeric NA
#> # ℹ 18 more rows
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> # character <chr>
# Extract values using a predefined list of labels (ignoring regex)
dataset <- extract_values(rowcolr_example("example.xlsx"),
row_identifiers = c("Total assets (4+9)", "Total equity (10+11+12)"),
col_identifiers = c("2025_col"))
dataset
#> # A tibble: 2 × 13
#> filename sheet row col row_label col_label description data_type error
#> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 example.xlsx Bala… 23 4 Total as… 2025_col Total asse… numeric NA
#> 2 example.xlsx Bala… 31 4 Total eq… 2025_col Total equi… numeric NA
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> # character <chr>
# Extract values using BOTH identifiers and regex
dataset <- extract_values(rowcolr_example("example.xlsx"),
row_identifiers = c("Total assets (4+9)"),
row_pattern = ".*_row$")
#> Warning: There were 29 warnings in `dplyr::mutate()`.
#> The first warning was:
#> ℹ In argument: `closest_col_row = purrr::map_dbl(row,
#> ~max(cols$col_row[cols$col_row < .x], na.rm = TRUE))`.
#> ℹ In group 1: `sheet_rows = "Balance"`.
#> Caused by warning in `max()`:
#> ! no non-missing arguments to max; returning -Inf
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 28 remaining warnings.
dataset
#> # A tibble: 0 × 13
#> # ℹ 13 variables: filename <chr>, sheet <chr>, row <int>, col <int>,
#> # row_label <chr>, col_label <chr>, description <chr>, data_type <chr>,
#> # error <chr>, logical <lgl>, numeric <dbl>, date <dttm>, character <chr>