Extract Values from an Excel File Based on Row and Column Identifiers

This function reads an Excel file and extracts structured data by identifying row and column labels. It ensures that each row label is only matched with the closest column label to its right and only with column labels from the nearest preceding row that contains column headers.

You can use both regex patterns and explicit identifiers together.

If row_identifiers and/or col_identifiers are provided, they take priority over row_pattern and col_pattern.
If both explicit identifiers and regex patterns are provided, the function will first match on identifiers. Any additional matches from regex patterns will be included.

Usage

extract_values(
  file,
  row_pattern = NULL,
  col_pattern = NULL,
  row_identifiers = NULL,
  col_identifiers = NULL,
  clean_description = TRUE,
  fuzzy_threshold = NULL,
  fuzzy_method = "jw",
  sheets = NULL
)

Arguments

file: Path to the Excel file.
row_pattern: A regex pattern to identify row labels. Used only if row_identifiers does not fully capture all row labels. (Example: ".*_row$").
col_pattern: A regex pattern to identify column labels. Used only if col_identifiers does not fully capture all column labels. (Example: ".*_col$").
row_identifiers: A character vector of row label identifiers. If provided, takes precedence over row_pattern, but does not disable it.
col_identifiers: A character vector of column label identifiers. If provided, takes precedence over col_pattern, but does not disable it.
clean_description: Logical. If TRUE, removes text matching row_pattern and col_pattern from row and column labels (default: TRUE).
fuzzy_threshold: Numeric between 0 and 1. If set, enables fuzzy matching for identifiers. 0 = exact match required, 1 = any match accepted. Recommended: 0.8 (default: NULL, disabled).
fuzzy_method: Character. Method for fuzzy matching from stringdist package: "osa", "lv", "dl", "jaccard", "jw" (default: "jw" for Jaro-Winkler).
sheets: Character vector of sheet names or numeric indices to process. NULL processes all sheets (default: NULL).

Value

A tibble containing extracted values with the following columns:

filename: Name of the source Excel file
sheet: Sheet name where the value was found
row, col: Row and column coordinates of the value
row_label, col_label: The matched row and column labels
description: Combined row and column label
data_type: Excel data type of the cell
error, logical, numeric, date, character: Cell values by type
fuzzy_threshold: The fuzzy matching threshold used (if any)
row_similarity: Similarity score for row label match (0-1)
col_similarity: Similarity score for column label match (0-1)

Examples

# Extract values using regex patterns
dataset <- extract_values(rowcolr_example("example.xlsx"), row_pattern = ".*_row$", col_pattern = ".*_col$")
dataset
#> # A tibble: 28 × 13
#>    filename    sheet   row   col row_label col_label description data_type error
#>    <chr>       <chr> <int> <int> <chr>     <chr>     <chr>       <chr>     <chr>
#>  1 example.xl… Bala…     2     4 name_row  value_col _           character NA   
#>  2 example.xl… Bala…     3     4 year_row  value_col _           numeric   NA   
#>  3 example.xl… Bala…     4     4 kvk_row   value_col _           numeric   NA   
#>  4 example.xl… Bala…     7    10 test1_row 2026_col  _           character NA   
#>  5 example.xl… Bala…     8    10 tes2_row  2026_col  _           character NA   
#>  6 example.xl… Bala…     9     4 int_fixe… 2025_col  _           numeric   NA   
#>  7 example.xl… Bala…    10     4 tang_fix… 2025_col  _           numeric   NA   
#>  8 example.xl… Bala…    11     4 fin_fixe… 2025_col  _           numeric   NA   
#>  9 example.xl… Bala…    13     4 tot_fixe… 2025_col  _           numeric   NA   
#> 10 example.xl… Bala…    16     4 stock_row 2025_col  _           numeric   NA   
#> # ℹ 18 more rows
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> #   character <chr>

# Extract values using a predefined list of labels (ignoring regex)
dataset <- extract_values(rowcolr_example("example.xlsx"),
                          row_identifiers = c("Total assets (4+9)", "Total equity (10+11+12)"),
                          col_identifiers = c("2025_col"))
dataset
#> # A tibble: 2 × 13
#>   filename     sheet   row   col row_label col_label description data_type error
#>   <chr>        <chr> <int> <int> <chr>     <chr>     <chr>       <chr>     <chr>
#> 1 example.xlsx Bala…    23     4 Total as… 2025_col  Total asse… numeric   NA   
#> 2 example.xlsx Bala…    31     4 Total eq… 2025_col  Total equi… numeric   NA   
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> #   character <chr>

# Extract values using BOTH identifiers and regex
dataset <- extract_values(rowcolr_example("example.xlsx"),
                          row_identifiers = c("Total assets (4+9)"),
                          row_pattern = ".*_row$")
#> Warning: There were 29 warnings in `dplyr::mutate()`.
#> The first warning was:
#> ℹ In argument: `closest_col_row = purrr::map_dbl(row,
#>   ~max(cols$col_row[cols$col_row < .x], na.rm = TRUE))`.
#> ℹ In group 1: `sheet_rows = "Balance"`.
#> Caused by warning in `max()`:
#> ! no non-missing arguments to max; returning -Inf
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 28 remaining warnings.
#> Warning: No data extracted. Check your identifiers/patterns.
dataset
#> # A tibble: 0 × 13
#> # ℹ 13 variables: filename <chr>, sheet <chr>, row <int>, col <int>,
#> #   row_label <chr>, col_label <chr>, description <chr>, data_type <chr>,
#> #   error <chr>, logical <lgl>, numeric <dbl>, date <dttm>, character <chr>

# Extract values with fuzzy matching enabled
dataset <- extract_values(rowcolr_example("example.xlsx"),
                          row_identifiers = c("Total assets", "Total equity"),
                          col_identifiers = c("2025"),
                          fuzzy_threshold = 0.8)
#> ℹ Fuzzy matching enabled with threshold 0.8
#> ℹ Found 2 row labels and 1 column label
dataset
#> # A tibble: 2 × 16
#>   filename     sheet   row   col row_label col_label description data_type error
#>   <chr>        <chr> <int> <int> <chr>     <chr>     <chr>       <chr>     <chr>
#> 1 example.xlsx Bala…    23     4 Total as… 2025_col  Total asse… numeric   NA   
#> 2 example.xlsx Bala…    31     4 Total eq… 2025_col  Total equi… numeric   NA   
#> # ℹ 7 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> #   character <chr>, fuzzy_threshold <dbl>, row_similarity <dbl>,
#> #   col_similarity <dbl>

# Extract values from specific sheets only
dataset <- extract_values(rowcolr_example("example.xlsx"),
                          row_pattern = ".*_row$", col_pattern = ".*_col$",
                          sheets = c("Balance"))
#> Processing 1 sheet: "Balance"
dataset
#> # A tibble: 28 × 13
#>    filename    sheet   row   col row_label col_label description data_type error
#>    <chr>       <chr> <int> <int> <chr>     <chr>     <chr>       <chr>     <chr>
#>  1 example.xl… Bala…     2     4 name_row  value_col _           character NA   
#>  2 example.xl… Bala…     3     4 year_row  value_col _           numeric   NA   
#>  3 example.xl… Bala…     4     4 kvk_row   value_col _           numeric   NA   
#>  4 example.xl… Bala…     7    10 test1_row 2026_col  _           character NA   
#>  5 example.xl… Bala…     8    10 tes2_row  2026_col  _           character NA   
#>  6 example.xl… Bala…     9     4 int_fixe… 2025_col  _           numeric   NA   
#>  7 example.xl… Bala…    10     4 tang_fix… 2025_col  _           numeric   NA   
#>  8 example.xl… Bala…    11     4 fin_fixe… 2025_col  _           numeric   NA   
#>  9 example.xl… Bala…    13     4 tot_fixe… 2025_col  _           numeric   NA   
#> 10 example.xl… Bala…    16     4 stock_row 2025_col  _           numeric   NA   
#> # ℹ 18 more rows
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> #   character <chr>