Extract Values from an Excel File Based on Row and Column Identifiers
Source:R/extract_values.R
extract_values.RdThis function reads an Excel file and extracts structured data by identifying row and column labels. It ensures that each row label is only matched with the closest column label to its right and only with column labels from the nearest preceding row that contains column headers.
You can use both regex patterns and explicit identifiers together.
If
row_identifiersand/orcol_identifiersare provided, they take priority overrow_patternandcol_pattern.If both explicit identifiers and regex patterns are provided, the function will first match on identifiers. Any additional matches from regex patterns will be included.
Usage
extract_values(
file,
row_pattern = NULL,
col_pattern = NULL,
row_identifiers = NULL,
col_identifiers = NULL,
clean_description = TRUE,
fuzzy_threshold = NULL,
fuzzy_method = "jw",
sheets = NULL
)Arguments
- file
Path to the Excel file.
- row_pattern
A regex pattern to identify row labels. Used only if
row_identifiersdoes not fully capture all row labels. (Example:".*_row$").- col_pattern
A regex pattern to identify column labels. Used only if
col_identifiersdoes not fully capture all column labels. (Example:".*_col$").- row_identifiers
A character vector of row label identifiers. If provided, takes precedence over
row_pattern, but does not disable it.- col_identifiers
A character vector of column label identifiers. If provided, takes precedence over
col_pattern, but does not disable it.- clean_description
Logical. If TRUE, removes text matching
row_patternandcol_patternfrom row and column labels (default: TRUE).- fuzzy_threshold
Numeric between 0 and 1. If set, enables fuzzy matching for identifiers. 0 = exact match required, 1 = any match accepted. Recommended: 0.8 (default: NULL, disabled).
- fuzzy_method
Character. Method for fuzzy matching from stringdist package: "osa", "lv", "dl", "jaccard", "jw" (default: "jw" for Jaro-Winkler).
- sheets
Character vector of sheet names or numeric indices to process. NULL processes all sheets (default: NULL).
Value
A tibble containing extracted values with the following columns:
filename: Name of the source Excel file
sheet: Sheet name where the value was found
row, col: Row and column coordinates of the value
row_label, col_label: The matched row and column labels
description: Combined row and column label
data_type: Excel data type of the cell
error, logical, numeric, date, character: Cell values by type
fuzzy_threshold: The fuzzy matching threshold used (if any)
row_similarity: Similarity score for row label match (0-1)
col_similarity: Similarity score for column label match (0-1)
Examples
# Extract values using regex patterns
dataset <- extract_values(rowcolr_example("example.xlsx"), row_pattern = ".*_row$", col_pattern = ".*_col$")
dataset
#> # A tibble: 28 × 13
#> filename sheet row col row_label col_label description data_type error
#> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 example.xl… Bala… 2 4 name_row value_col _ character NA
#> 2 example.xl… Bala… 3 4 year_row value_col _ numeric NA
#> 3 example.xl… Bala… 4 4 kvk_row value_col _ numeric NA
#> 4 example.xl… Bala… 7 10 test1_row 2026_col _ character NA
#> 5 example.xl… Bala… 8 10 tes2_row 2026_col _ character NA
#> 6 example.xl… Bala… 9 4 int_fixe… 2025_col _ numeric NA
#> 7 example.xl… Bala… 10 4 tang_fix… 2025_col _ numeric NA
#> 8 example.xl… Bala… 11 4 fin_fixe… 2025_col _ numeric NA
#> 9 example.xl… Bala… 13 4 tot_fixe… 2025_col _ numeric NA
#> 10 example.xl… Bala… 16 4 stock_row 2025_col _ numeric NA
#> # ℹ 18 more rows
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> # character <chr>
# Extract values using a predefined list of labels (ignoring regex)
dataset <- extract_values(rowcolr_example("example.xlsx"),
row_identifiers = c("Total assets (4+9)", "Total equity (10+11+12)"),
col_identifiers = c("2025_col"))
dataset
#> # A tibble: 2 × 13
#> filename sheet row col row_label col_label description data_type error
#> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 example.xlsx Bala… 23 4 Total as… 2025_col Total asse… numeric NA
#> 2 example.xlsx Bala… 31 4 Total eq… 2025_col Total equi… numeric NA
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> # character <chr>
# Extract values using BOTH identifiers and regex
dataset <- extract_values(rowcolr_example("example.xlsx"),
row_identifiers = c("Total assets (4+9)"),
row_pattern = ".*_row$")
#> Warning: There were 29 warnings in `dplyr::mutate()`.
#> The first warning was:
#> ℹ In argument: `closest_col_row = purrr::map_dbl(row,
#> ~max(cols$col_row[cols$col_row < .x], na.rm = TRUE))`.
#> ℹ In group 1: `sheet_rows = "Balance"`.
#> Caused by warning in `max()`:
#> ! no non-missing arguments to max; returning -Inf
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 28 remaining warnings.
#> Warning: No data extracted. Check your identifiers/patterns.
dataset
#> # A tibble: 0 × 13
#> # ℹ 13 variables: filename <chr>, sheet <chr>, row <int>, col <int>,
#> # row_label <chr>, col_label <chr>, description <chr>, data_type <chr>,
#> # error <chr>, logical <lgl>, numeric <dbl>, date <dttm>, character <chr>
# Extract values with fuzzy matching enabled
dataset <- extract_values(rowcolr_example("example.xlsx"),
row_identifiers = c("Total assets", "Total equity"),
col_identifiers = c("2025"),
fuzzy_threshold = 0.8)
#> ℹ Fuzzy matching enabled with threshold 0.8
#> ℹ Found 2 row labels and 1 column label
dataset
#> # A tibble: 2 × 16
#> filename sheet row col row_label col_label description data_type error
#> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 example.xlsx Bala… 23 4 Total as… 2025_col Total asse… numeric NA
#> 2 example.xlsx Bala… 31 4 Total eq… 2025_col Total equi… numeric NA
#> # ℹ 7 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> # character <chr>, fuzzy_threshold <dbl>, row_similarity <dbl>,
#> # col_similarity <dbl>
# Extract values from specific sheets only
dataset <- extract_values(rowcolr_example("example.xlsx"),
row_pattern = ".*_row$", col_pattern = ".*_col$",
sheets = c("Balance"))
#> Processing 1 sheet: "Balance"
dataset
#> # A tibble: 28 × 13
#> filename sheet row col row_label col_label description data_type error
#> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 example.xl… Bala… 2 4 name_row value_col _ character NA
#> 2 example.xl… Bala… 3 4 year_row value_col _ numeric NA
#> 3 example.xl… Bala… 4 4 kvk_row value_col _ numeric NA
#> 4 example.xl… Bala… 7 10 test1_row 2026_col _ character NA
#> 5 example.xl… Bala… 8 10 tes2_row 2026_col _ character NA
#> 6 example.xl… Bala… 9 4 int_fixe… 2025_col _ numeric NA
#> 7 example.xl… Bala… 10 4 tang_fix… 2025_col _ numeric NA
#> 8 example.xl… Bala… 11 4 fin_fixe… 2025_col _ numeric NA
#> 9 example.xl… Bala… 13 4 tot_fixe… 2025_col _ numeric NA
#> 10 example.xl… Bala… 16 4 stock_row 2025_col _ numeric NA
#> # ℹ 18 more rows
#> # ℹ 4 more variables: logical <lgl>, numeric <dbl>, date <dttm>,
#> # character <chr>