A Neural Approach for Text Extraction from Scholarly Figures
David Morris | 2019 | DOI: 10.25835/0030443
This is the readme for the supplemental data for our ICDAR 2019 paper.
We used different sources of data for testing, validation, and training. Our test set was assembled in the work by Böschen et al. that we cite; we excluded their DeGruyter dataset and used it as our validation set.
The DeGruyter dataset does not include the labeled images due to license restrictions. As of this writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that, depending on which program you use to extract the images from the PDFs they are provided in, you may have to re-number the images.
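Re-numbering might be done with a small helper like the one below. This is only a sketch: the filename pattern (a single embedded number) and the output scheme are assumptions, so adapt the regex to whatever your PDF extraction tool actually emits.

```python
import re

def renumber(filenames, width=3):
    """Sort extracted image files by the number embedded in each name and
    map each old name to a zero-padded sequential name.

    The naming scheme here is an assumption, not the dataset's actual one.
    """
    def key(name):
        m = re.search(r"(\d+)", name)
        return int(m.group(1)) if m else 0  # files without a number sort first
    ordered = sorted(filenames, key=key)
    return {old: f"img_{i:0{width}d}.png" for i, old in enumerate(ordered, start=1)}
```

Sorting numerically (rather than lexically) keeps `fig-10.png` after `fig-2.png`, which is the usual pitfall when re-numbering extracted images.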
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on the detected text rows. This script is included in our code.
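Running Tesseract on a single detected row typically means invoking the CLI with page segmentation mode 7 ("treat the image as a single text line"). The helper below, which only builds the command line, is an illustrative sketch rather than the paper's actual script; the function name and defaults are assumptions.

```python
import subprocess

def tesseract_cmd(row_image, psm=7, lang="eng"):
    """Build the command line for recognizing one detected text row.

    --psm 7 tells Tesseract the image is a single text line, and 'stdout'
    sends the recognized text to standard output instead of a file.
    """
    return ["tesseract", row_image, "stdout", "--psm", str(psm), "-l", lang]

def ocr_row(row_image):
    """Run Tesseract on one row image and return the recognized text.

    Requires the tesseract binary to be installed and on PATH.
    """
    result = subprocess.run(tesseract_cmd(row_image), capture_output=True, text=True)
    return result.stdout.strip()
```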
We used a Java evaluation script provided by Falk Böschen, adapted to our file structure. We included this as
Parameter sweeps are automated by param_sweep.rb, which also shows how to invoke all of these components.
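A parameter sweep of this kind amounts to enumerating the Cartesian product of the candidate values for each setting. The sketch below shows the idea in Python; the parameter names are made-up examples, not the actual knobs exposed by param_sweep.rb.

```python
from itertools import product

def sweep(grid):
    """Yield one dict of settings per point in the Cartesian product of
    the value lists in `grid`. Parameter names are illustrative only."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Hypothetical grid: two thresholds, four combinations in total.
example_grid = {"score_thresh": [0.7, 0.8], "nms_thresh": [0.1, 0.2]}
```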
Data and Resources
Cite this as
David Morris (2019). Dataset: A Neural Approach for Text Extraction from Scholarly Figures. https://doi.org/10.25835/0030443
Retrieved: 14:15 12 Nov 2019 (GMT)
Last Updated: June 27, 2019, 18:36 (CEST)
Created: June 27, 2019, 18:29 (CEST)
License: Creative Commons Attribution 3.0