Skip to contents

[Experimental]

This function reads an Excel file and extracts structured data by identifying row and column labels. It ensures that each row label is only matched with the closest column label to its right and only with column labels from the nearest preceding row that contains column headers.

You can use both regex patterns and explicit identifiers together.

  • If row_identifiers and/or col_identifiers are provided, they take priority over row_pattern and col_pattern.

  • If both explicit identifiers and regex patterns are provided, the function will first match on identifiers. Any additional matches from regex patterns will be included.

Usage

extract_values(
  file,
  row_pattern = NULL,
  col_pattern = NULL,
  row_identifiers = NULL,
  col_identifiers = NULL,
  clean_description = TRUE
)

Arguments

file

Path to the Excel file.

row_pattern

A regex pattern to identify row labels. Used only if row_identifiers does not fully capture all row labels. (Example: ".*_row$").

col_pattern

A regex pattern to identify column labels. Used only if col_identifiers does not fully capture all column labels. (Example: ".*_col$").

row_identifiers

A character vector of row label identifiers. If provided, takes precedence over row_pattern, but does not disable it.

col_identifiers

A character vector of column label identifiers. If provided, takes precedence over col_pattern, but does not disable it.

clean_description

Logical. If TRUE, removes text matching row_pattern and col_pattern from row and column labels (default: TRUE).

Value

A tibble containing extracted values, including sheet name, row, column, and description.

Examples

# Extract values using regex patterns
dataset <- extract_values(rowcolr_example("example.xlsx"), row_pattern = ".*_row$", col_pattern = ".*_col$")
dataset
#> # A tibble: 28 × 13
#>    filename    sheet   row   col row_label col_label description data_type error
#>    <chr>       <chr> <int> <int> <chr>     <chr>     <chr>       <chr>     <chr>
#>  1 example.xl… Bala…     2     4 name_row  value_col _           character NA   
#>  2 example.xl… Bala…     3     4 year_row  value_col _           numeric   NA   
#>  3 example.xl… Bala…     4     4 kvk_row   value_col _           numeric   NA   
#>  4 example.xl… Bala…     7    10 test1_row 2026_col  _           character NA   
#>  5 example.xl… Bala…     8    10 tes2_row  2026_col  _           character NA   
#>  6 example.xl… Bala…     9     4 int_fixe… 2025_col  _           numeric   NA   
#>  7 example.xl… Bala…    10     4 tang_fix… 2025_col  _           numeric   NA   
#>  8 example.xl… Bala…    11     4 fin_fixe… 2025_col  _           numeric   NA   
#>  9 example.xl… Bala…    13     4 tot_fixe… 2025_col  _           numeric   NA   
#> 10 example.xl… Bala…    16     4 stock_row 2025_col  _           numeric   NA   
#> # ℹ 18 more rows
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> #   character <chr>

# Extract values using a predefined list of labels (ignoring regex)
dataset <- extract_values(rowcolr_example("example.xlsx"),
                          row_identifiers = c("Total assets (4+9)", "Total equity (10+11+12)"),
                          col_identifiers = c("2025_col"))
dataset
#> # A tibble: 2 × 13
#>   filename     sheet   row   col row_label col_label description data_type error
#>   <chr>        <chr> <int> <int> <chr>     <chr>     <chr>       <chr>     <chr>
#> 1 example.xlsx Bala…    23     4 Total as… 2025_col  Total asse… numeric   NA   
#> 2 example.xlsx Bala…    31     4 Total eq… 2025_col  Total equi… numeric   NA   
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> #   character <chr>

# Extract values using BOTH identifiers and regex
dataset <- extract_values(rowcolr_example("example.xlsx"),
                          row_identifiers = c("Total assets (4+9)"),
                          row_pattern = ".*_row$")
#> Warning: There were 29 warnings in `dplyr::mutate()`.
#> The first warning was:
#>  In argument: `closest_col_row = purrr::map_dbl(row,
#>   ~max(cols$col_row[cols$col_row < .x], na.rm = TRUE))`.
#>  In group 1: `sheet_rows = "Balance"`.
#> Caused by warning in `max()`:
#> ! no non-missing arguments to max; returning -Inf
#>  Run `dplyr::last_dplyr_warnings()` to see the 28 remaining warnings.
dataset
#> # A tibble: 0 × 13
#> # ℹ 13 variables: filename <chr>, sheet <chr>, row <int>, col <int>,
#> #   row_label <chr>, col_label <chr>, description <chr>, data_type <chr>,
#> #   error <chr>, logical <lgl>, numeric <dbl>, date <dttm>, character <chr>