Detect Columns and Text Boxes in PDF Document
Source: R/pdf_detect_clusters.R
This function detects columns and text boxes in a PDF file. Before calling it,
read the file with the pdftools::pdf_data() function from the pdftools package.
The function works on both a list of pages, as returned by the
pdftools::pdf_data() function, and individual pages extracted from that
list. This makes it flexible for use on either the entire document or
specific pages within it.
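The list-versus-page behaviour can be pictured with a small base R sketch. `process_page()` and `process_pages()` are hypothetical names used for illustration only, not the package's internals. The sketch assumes only that a single page is a data frame and that a document is a list of such data frames, as pdftools::pdf_data() returns them.

```r
# Hypothetical per-page worker: here it only counts the words on a page.
process_page <- function(page) {
  stopifnot(is.data.frame(page))
  nrow(page)
}

# Hypothetical wrapper: accept either one page (a data frame) or a list of pages.
# The data frame check comes first because a data frame is itself a list.
process_pages <- function(pdf_data) {
  if (is.data.frame(pdf_data)) {
    process_page(pdf_data)
  } else {
    lapply(pdf_data, process_page)
  }
}

# Two toy pages shaped like pdftools::pdf_data() output (only some columns).
page1 <- data.frame(x = c(84L, 154L), y = c(1029L, 1029L),
                    text = c("Terugblik", "2023"))
page2 <- data.frame(x = 85L, y = 36L, text = "Inhoud")

single <- process_pages(page1)               # one page in, one result out
many   <- process_pages(list(page1, page2))  # list in, list out
```

A single page yields a single result, while a list of pages yields a list of the same length. This mirrors the return shapes described under Value.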
This package directly uses the clustering algorithms implemented in the
dbscan package. For this purpose, a stats::dist() object is created.
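As a rough illustration of clustering on a distance object, the sketch below builds a stats::dist() object from toy word coordinates and passes it to dbscan::dbscan(). The coordinates, the choice of x and y as the only features, and the values of `eps` and `minPts` are assumptions made for this example. They are not necessarily what pdf_detect_clusters() uses internally.

```r
library(dbscan)

# Toy word positions (assumed data): two pairs of nearby words on one line
# and one isolated word far to the right.
words <- data.frame(
  x = c(84, 154, 530, 562, 1821),
  y = c(1029, 1029, 1029, 1029, 1029)
)

# Pairwise Euclidean distances between the word positions.
d <- stats::dist(words[, c("x", "y")])

# dbscan() accepts a dist object; eps is then a threshold on those distances.
# minPts counts the point itself, so minPts = 2 needs one neighbour within eps.
cl <- dbscan::dbscan(d, eps = 100, minPts = 2)

# dbscan labels noise points with cluster 0.
words$.cluster <- factor(cl$cluster)
words$noise    <- cl$cluster == 0
```

With these toy values the two word pairs form separate clusters and the isolated word is marked as noise. Other algorithms in the dbscan package take different tuning arguments.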
Arguments
- pdf_data
result of the pdftools::pdf_data() function, or a single page of that result.
- algorithm
the algorithm to be used to detect text columns or text boxes.
- tolerance_factor
numeric; factor used for column detection when renumbering. Higher values allow more variation in x-coordinates. Default is 0.1 (10% of page width).
- ...
algorithm-specific arguments. See dbscan::dbscan(), dbscan::jpclust(), dbscan::sNNclust() and dbscan::hdbscan() for more information.
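To give a feel for what a tolerance expressed as a fraction of the page width means, here is one possible reading, sketched in base R. The grouping rule (start a new column when the horizontal gap exceeds the tolerance), the variable names, and the toy values are all assumptions for illustration. The package's actual renumbering logic may differ.

```r
# Assumed inputs: left edge (x) of each detected cluster and the page width.
cluster_x        <- c(84, 90, 530, 536, 1000)
page_width       <- 1920
tolerance_factor <- 0.1

# Tolerance in page units: 10% of the page width.
tolerance <- tolerance_factor * page_width

# Sort clusters by x and start a new column whenever the gap to the
# previous cluster is larger than the tolerance (assumed rule).
ord           <- order(cluster_x)
gaps          <- diff(cluster_x[ord])
column_sorted <- cumsum(c(TRUE, gaps > tolerance))

# Map the column numbers back to the original cluster order.
column      <- integer(length(cluster_x))
column[ord] <- column_sorted
```

Under this reading, a larger `tolerance_factor` merges clusters whose x-coordinates lie further apart into the same column, and a smaller one splits them.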
Value
If the input is a list of pages, a list is returned in which each element is a tibble for one page, with each word assigned to a cluster. If the input is a single page, a tibble is returned directly, with each word assigned to a cluster.
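A returned page can be post-processed with ordinary data frame tools. The sketch below uses a plain data frame that mimics one page of the result, with values copied from the first page in the Examples output and only some columns kept. It pastes the words of each cluster into one string. It assumes the `noise` column is present, which the Examples output shows for the default algorithm but not for sNNclust.

```r
# Toy page shaped like one element of the returned list.
page <- data.frame(
  x        = c(85L, 814L, 84L, 154L, 530L, 562L, 1821L),
  y        = c(748L, 748L, 1029L, 1029L, 1029L, 1029L, 1029L),
  text     = c("Terugblik", "2023", "Terugblik", "2023", "Ons", "verhaal", "12"),
  .cluster = factor(c(0, 0, 1, 1, 2, 2, 0)),
  noise    = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Drop noise words and the now-unused factor level.
kept          <- page[!page$noise, ]
kept$.cluster <- droplevels(kept$.cluster)

# Paste the remaining words together per cluster, in row order.
text_per_cluster <- tapply(kept$text, kept$.cluster, paste, collapse = " ")
```

For a list of pages, the same steps could be applied to each element with lapply().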
Examples
# First 3 pages
head(npo, 3) |>
pdf_detect_clusters()
#> ℹ Processing 3 pages
#> ✔ Clusters successfully detected and renumbered on 3 pages.
#> [[1]]
#> # A tibble: 7 × 8
#> width height x y space text .cluster noise
#> <int> <int> <int> <int> <lgl> <chr> <fct> <lgl>
#> 1 693 184 85 748 TRUE Terugblik 0 TRUE
#> 2 362 184 814 748 FALSE 2023 0 TRUE
#> 3 65 18 84 1029 TRUE Terugblik 1 FALSE
#> 4 34 18 154 1029 FALSE 2023 1 FALSE
#> 5 28 18 530 1029 TRUE Ons 2 FALSE
#> 6 54 18 562 1029 FALSE verhaal 2 FALSE
#> 7 13 18 1821 1029 FALSE 12 0 TRUE
#>
#> [[2]]
#> # A tibble: 242 × 8
#> width height x y space text .cluster noise
#> <int> <int> <int> <int> <lgl> <chr> <fct> <lgl>
#> 1 41 14 85 36 FALSE Inhoud 0 TRUE
#> 2 56 14 197 36 FALSE Uitgelicht 0 TRUE
#> 3 62 14 322 36 FALSE Verdieping 0 TRUE
#> 4 46 14 458 36 FALSE Bijlagen 0 TRUE
#> 5 246 59 85 100 FALSE Leeswijzer 0 TRUE
#> 6 24 22 84 187 TRUE De 1 FALSE
#> 7 83 22 112 187 TRUE Terugblik 1 FALSE
#> 8 13 22 199 187 TRUE is 1 FALSE
#> 9 22 22 217 187 TRUE de 1 FALSE
#> 10 81 22 245 187 TRUE jaarlijkse 1 FALSE
#> # ℹ 232 more rows
#>
#> [[3]]
#> # A tibble: 226 × 8
#> width height x y space text .cluster noise
#> <int> <int> <int> <int> <lgl> <chr> <fct> <lgl>
#> 1 40 14 85 36 FALSE Inhoud 0 TRUE
#> 2 58 14 197 36 FALSE Uitgelicht 0 TRUE
#> 3 62 14 322 36 FALSE Verdieping 0 TRUE
#> 4 46 14 458 36 FALSE Bijlagen 0 TRUE
#> 5 559 110 85 91 FALSE Voortdurend 1 FALSE
#> 6 81 110 85 176 TRUE in 1 FALSE
#> 7 480 110 186 176 FALSE verbinding 1 FALSE
#> 8 16 22 84 310 TRUE In 2 FALSE
#> 9 32 22 105 310 TRUE een 2 FALSE
#> 10 112 22 141 310 TRUE samenleving 2 FALSE
#> # ℹ 216 more rows
#>
# 3rd page with the sNNclust algorithm and minPts = 5
npo[[3]] |>
pdf_detect_clusters(algorithm = "sNNclust", minPts = 5)
#> ℹ Clusters detected and renumbered: 5 on this page.
#> # A tibble: 226 × 7
#> width height x y space text .cluster
#> <int> <int> <int> <int> <lgl> <chr> <fct>
#> 1 40 14 85 36 FALSE Inhoud 1
#> 2 58 14 197 36 FALSE Uitgelicht 1
#> 3 62 14 322 36 FALSE Verdieping 1
#> 4 46 14 458 36 FALSE Bijlagen 1
#> 5 559 110 85 91 FALSE Voortdurend 1
#> 6 81 110 85 176 TRUE in 1
#> 7 480 110 186 176 FALSE verbinding 1
#> 8 16 22 84 310 TRUE In 1
#> 9 32 22 105 310 TRUE een 1
#> 10 112 22 141 310 TRUE samenleving 1
#> # ℹ 216 more rows