[Experimental]

This function detects columns and text boxes in a PDF file. Before calling it, read the file with pdftools::pdf_data() from the pdftools package.

The function works on both a list of pages, as returned by the pdftools::pdf_data() function, and individual pages extracted from that list. This makes it flexible for use on either the entire document or specific pages within it.
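The two input forms described above can be sketched as follows. This is a minimal illustration, not a tested recipe: the path "report.pdf" is a placeholder, and the code assumes that the package providing pdf_detect_clusters() is already attached.

```r
# Assumption: "report.pdf" is a placeholder path; point it at a real PDF.
pdf_path <- "report.pdf"

if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_path)) {
  # pdf_data() returns a list with one data frame of words per page
  pages <- pdftools::pdf_data(pdf_path)

  # Whole document: pass the full list of pages
  all_clusters <- pdf_detect_clusters(pages)

  # Single page: pass one element of that list
  first_page_clusters <- pdf_detect_clusters(pages[[1]])
}
```

The guard only keeps the sketch from failing when pdftools or the file is missing; it is not required by the function itself.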

This function uses the clustering algorithms implemented in the dbscan package directly. To do so, it first creates a stats::dist() object.
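To make the idea concrete, here is a rough sketch of clustering word positions with dbscan::dbscan() on a stats::dist() object. It is not the package's internal code: the choice of the x and y columns as input, and the values of eps and minPts, are assumptions made for illustration. The word positions are copied from page 1 of the Examples section.

```r
library(dbscan)

# Word positions copied from page 1 of the Examples output
words <- data.frame(
  text = c("Terugblik", "2023", "Terugblik", "2023", "Ons", "verhaal", "12"),
  x    = c(85, 814, 84, 154, 530, 562, 1821),
  y    = c(748, 748, 1029, 1029, 1029, 1029, 1029)
)

# Pairwise Euclidean distances between the (x, y) positions of the words
d <- stats::dist(words[, c("x", "y")])

# eps and minPts are illustrative values, not the package defaults
fit <- dbscan::dbscan(d, eps = 100, minPts = 2)

# dbscan labels noise points with cluster id 0
words$.cluster <- factor(fit$cluster)
words$noise <- fit$cluster == 0
words
```

Words that have no neighbour within eps end up in cluster 0, which is how dbscan marks noise.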

Usage

pdf_detect_clusters(
  pdf_data,
  algorithm = "dbscan",
  tolerance_factor = 0.1,
  ...
)

Arguments

pdf_data

the result of pdftools::pdf_data(), or a single page taken from that result.

algorithm

the algorithm used to detect text columns or text boxes. Default is "dbscan".

tolerance_factor

numeric; tolerance used for column detection when clusters are renumbered, expressed as a fraction of the page width. Higher values allow more variation in x-coordinates. Default is 0.1 (10% of the page width).

...

algorithm-specific arguments. See dbscan::dbscan(), dbscan::jpclust(), dbscan::sNNclust() and dbscan::hdbscan() for more information.
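The role of tolerance_factor can be pictured with a small sketch. Everything here is an assumption: renumber_by_column() is a hypothetical helper, not a function of this package, and the actual renumbering logic may differ. The sketch treats clusters whose left edges lie within tolerance_factor times the page width of each other as one column, then numbers clusters column by column, top to bottom.

```r
# Hypothetical helper, not part of the package: assigns clusters to columns
# based on their left edge and renumbers them column by column.
renumber_by_column <- function(cluster_x, cluster_y, page_width,
                               tolerance_factor = 0.1) {
  tolerance <- tolerance_factor * page_width

  # Walk through the clusters from left to right; a new column starts
  # whenever the gap to the previous left edge exceeds the tolerance
  ord <- order(cluster_x)
  gaps <- diff(cluster_x[ord])
  column <- integer(length(cluster_x))
  column[ord] <- cumsum(c(TRUE, gaps > tolerance))

  # Number clusters by column first, then by vertical position
  new_id <- integer(length(cluster_x))
  new_id[order(column, cluster_y)] <- seq_along(cluster_x)

  data.frame(column = column, new_id = new_id)
}

# Four clusters on a page 1000 units wide, with left edges near 85 and 530
res <- renumber_by_column(
  cluster_x  = c(530, 85, 90, 535),
  cluster_y  = c(100, 400, 100, 400),
  page_width = 1000
)
res
```

Under this reading, a larger tolerance_factor merges left edges that are further apart into the same column, which matches the description that higher values allow more variation in x-coordinates.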

Value

If the input is a list of pages, a list is returned with one tibble per page, in which each word is assigned to a cluster. If the input is a single page, a tibble is returned directly, with each word assigned to a cluster.
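One possible way to use the returned tibble is to rebuild the text of each cluster. The sketch below uses base R on a hand-built data frame that mirrors page 1 of the Examples section, so it runs without a PDF. The noise column appears in the dbscan example output only, so the filtering step is an assumption that applies to that case.

```r
# Page 1 of the Examples output, rebuilt by hand as a plain data frame
page <- data.frame(
  x        = c(85L, 814L, 84L, 154L, 530L, 562L, 1821L),
  y        = c(748L, 748L, 1029L, 1029L, 1029L, 1029L, 1029L),
  text     = c("Terugblik", "2023", "Terugblik", "2023", "Ons", "verhaal", "12"),
  .cluster = factor(c(0, 0, 1, 1, 2, 2, 0)),
  noise    = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Drop noise words, then sort by cluster and reading order (top to bottom,
# left to right)
clustered <- page[!page$noise, ]
clustered <- clustered[order(clustered$.cluster, clustered$y, clustered$x), ]

# Paste the words of each cluster into one string
cluster_text <- tapply(
  clustered$text,
  droplevels(clustered$.cluster),
  paste,
  collapse = " "
)
cluster_text
```

For a list of pages, the same steps could be wrapped in a function and applied to each element with lapply().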

Examples

# First 3 pages
head(npo, 3) |>
   pdf_detect_clusters()
#>  Processing 3 pages
#>  Clusters successfully detected and renumbered on 3 pages.
#> [[1]]
#> # A tibble: 7 × 8
#>   width height     x     y space text      .cluster noise
#>   <int>  <int> <int> <int> <lgl> <chr>     <fct>    <lgl>
#> 1   693    184    85   748 TRUE  Terugblik 0        TRUE 
#> 2   362    184   814   748 FALSE 2023      0        TRUE 
#> 3    65     18    84  1029 TRUE  Terugblik 1        FALSE
#> 4    34     18   154  1029 FALSE 2023      1        FALSE
#> 5    28     18   530  1029 TRUE  Ons       2        FALSE
#> 6    54     18   562  1029 FALSE verhaal   2        FALSE
#> 7    13     18  1821  1029 FALSE 12        0        TRUE 
#> 
#> [[2]]
#> # A tibble: 242 × 8
#>    width height     x     y space text       .cluster noise
#>    <int>  <int> <int> <int> <lgl> <chr>      <fct>    <lgl>
#>  1    41     14    85    36 FALSE Inhoud     0        TRUE 
#>  2    56     14   197    36 FALSE Uitgelicht 0        TRUE 
#>  3    62     14   322    36 FALSE Verdieping 0        TRUE 
#>  4    46     14   458    36 FALSE Bijlagen   0        TRUE 
#>  5   246     59    85   100 FALSE Leeswijzer 0        TRUE 
#>  6    24     22    84   187 TRUE  De         1        FALSE
#>  7    83     22   112   187 TRUE  Terugblik  1        FALSE
#>  8    13     22   199   187 TRUE  is         1        FALSE
#>  9    22     22   217   187 TRUE  de         1        FALSE
#> 10    81     22   245   187 TRUE  jaarlijkse 1        FALSE
#> # ℹ 232 more rows
#> 
#> [[3]]
#> # A tibble: 226 × 8
#>    width height     x     y space text        .cluster noise
#>    <int>  <int> <int> <int> <lgl> <chr>       <fct>    <lgl>
#>  1    40     14    85    36 FALSE Inhoud      0        TRUE 
#>  2    58     14   197    36 FALSE Uitgelicht  0        TRUE 
#>  3    62     14   322    36 FALSE Verdieping  0        TRUE 
#>  4    46     14   458    36 FALSE Bijlagen    0        TRUE 
#>  5   559    110    85    91 FALSE Voortdurend 1        FALSE
#>  6    81    110    85   176 TRUE  in          1        FALSE
#>  7   480    110   186   176 FALSE verbinding  1        FALSE
#>  8    16     22    84   310 TRUE  In          2        FALSE
#>  9    32     22   105   310 TRUE  een         2        FALSE
#> 10   112     22   141   310 TRUE  samenleving 2        FALSE
#> # ℹ 216 more rows
#> 

# 3rd page with the sNNclust algorithm and minPts = 5
npo[[3]] |>
   pdf_detect_clusters(algorithm = "sNNclust", minPts = 5)
#>  Clusters detected and renumbered: 5 on this page.
#> # A tibble: 226 × 7
#>    width height     x     y space text        .cluster
#>    <int>  <int> <int> <int> <lgl> <chr>       <fct>   
#>  1    40     14    85    36 FALSE Inhoud      1       
#>  2    58     14   197    36 FALSE Uitgelicht  1       
#>  3    62     14   322    36 FALSE Verdieping  1       
#>  4    46     14   458    36 FALSE Bijlagen    1       
#>  5   559    110    85    91 FALSE Voortdurend 1       
#>  6    81    110    85   176 TRUE  in          1       
#>  7   480    110   186   176 FALSE verbinding  1       
#>  8    16     22    84   310 TRUE  In          1       
#>  9    32     22   105   310 TRUE  een         1       
#> 10   112     22   141   310 TRUE  samenleving 1       
#> # ℹ 216 more rows