Extract Structured Data from Excel Using Row and Column Identifiers • rowcolr

The goal of rowcolr is to extract structured data from semi-structured Excel files by identifying row and column labels using predefined identifiers or regex patterns. Processing such data can be challenging with existing packages. rowcolr simplifies this process by efficiently locating values at the intersection of row and column labels, making data wrangling and cleaning more seamless.

Built on the powerful tidyxl package, rowcolr ensures robust and reliable extraction of cell contents from Excel files.

Origin

The development of rowcolr was driven by a specific use case: existing Excel files used as input forms by organizations in the arts and culture sector to provide information to the Dutch Ministry of Education, Culture, and Science (OCW). To minimize disruptions to the existing workflow, the forms were kept as similar as possible to their original format while still allowing structured extraction of all relevant data from the Excel files.

Installation

You can install the development version of rowcolr from GitHub with:

devtools::install_github("coeneisma/rowcolr")

Usage

The primary function of rowcolr is extract_values(), which allows you to extract structured data from Excel files by identifying row and column labels. The function works by drawing a horizontal line from a detected row label and a vertical line from a detected column label. The value located at their intersection is extracted and stored.

The package can be loaded with:

library(rowcolr)

Example: Extracting Data from an Excel File

Values can be extracted from an Excel file using row and column identifiers or regex patterns:

# Extract values from an example Excel file with regex patterns
dataset <- extract_values(rowcolr_example("example.xlsx"), 
                          row_pattern = ".*_row$", 
                          col_pattern = ".*_col$")

head(dataset |> dplyr::select(-c(filename, sheet, row, col)))
#> # A tibble: 6 × 9
#>   row_label            col_label description data_type error logical  numeric
#>   <chr>                <chr>     <chr>       <chr>     <chr> <lgl>      <dbl>
#> 1 name_row             value_col _           character <NA>  NA            NA
#> 2 year_row             value_col _           numeric   <NA>  NA          2025
#> 3 kvk_row              value_col _           numeric   <NA>  NA      12345678
#> 4 test1_row            2026_col  _           character <NA>  NA            NA
#> 5 tes2_row             2026_col  _           character <NA>  NA            NA
#> 6 int_fixed_assets_row 2025_col  _           numeric   <NA>  NA        100000
#> # ℹ 2 more variables: date <dttm>, character <chr>

For more details on using the package, refer to the vignette: vignette("rowcolr").