A Neural Approach for Text Extraction from Scholarly Figures

This is the readme for the supplemental data for our ICDAR 2019 paper.


We used different data sources for testing, validation, and training. Our testing set consists of the datasets assembled by Böschen et al. in the work we cite in the paper. We excluded the DeGruyter dataset from the testing set and used it as our validation dataset instead.


These datasets contain a readme with license information. Further information about the associated project can be found on the authors' project page.

- EconBiz
- CHIME-R
- CHIME-S


The DeGruyter dataset does not include the labeled images due to license restrictions. As of this writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that, depending on which program you use to extract the images from the PDF they are provided in, you may have to renumber the images.
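If your extraction tool numbers the images differently from the labels, a small renaming helper can bring them back in line. The following is only a sketch under stated assumptions: the function names (renumber_plan, apply_plan), the zero-padded target names such as 001.png, and the starting index are all hypothetical and not part of our code. Check the numbering scheme in the dataset's readme before renaming anything.

```python
import re
from pathlib import Path


def natural_key(name):
    # Split into digit and non-digit runs so "img-10" sorts after "img-2".
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def renumber_plan(filenames, start=1, width=3):
    """Return (old_name, new_name) pairs that number the files consecutively.

    The zero-padded target pattern is an assumption made for illustration.
    """
    ordered = sorted(filenames, key=natural_key)
    plan = []
    for offset, old in enumerate(ordered):
        suffix = Path(old).suffix
        plan.append((old, f"{start + offset:0{width}d}{suffix}"))
    return plan


def apply_plan(directory, plan):
    """Rename files in two passes so no file is overwritten mid-way."""
    directory = Path(directory)
    staged = []
    # Pass 1: move every file to a temporary name.
    for index, (old, new) in enumerate(plan):
        temporary = directory / f".renumber-tmp-{index}"
        (directory / old).rename(temporary)
        staged.append((temporary, directory / new))
    # Pass 2: move the temporary files to their final names.
    for temporary, final in staged:
        temporary.rename(final)
```

Inspect the list returned by renumber_plan before passing it to apply_plan, since the renaming is not undone automatically.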


We used the dataset generated by label_generator, which the author made available in a requester-pays Amazon S3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.


We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.

Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

We used a Tesseract script to run text extraction on the detected text rows. It is included in our code archive, code.tar, as text_recognition_multipro.py.
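For readers who want to see the general idea without unpacking the archive, the sketch below runs the tesseract command-line tool on cropped text-row images in parallel. It is not a copy of text_recognition_multipro.py: the function names, the worker count, the language setting, and the choice of page segmentation mode 7 (treat the image as a single text line) are assumptions made for illustration, and the tesseract executable is assumed to be on your PATH.

```python
import subprocess
from multiprocessing import Pool


def tesseract_command(image_path, psm=7, lang="eng"):
    # "stdout" as the output base makes tesseract print the recognized text.
    # --psm 7 treats the image as a single text line (assumed setting).
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def recognize(image_path):
    """Run tesseract on one cropped text row and return (path, text)."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return image_path, result.stdout.strip()


def recognize_all(image_paths, workers=4):
    """Recognize many text rows in parallel, one tesseract process per image."""
    with Pool(processes=workers) as pool:
        return dict(pool.map(recognize, image_paths))
```

Calling recognize_all with a list of image paths returns a dictionary that maps each path to its recognized text. On platforms that start worker processes by spawning, call it from inside an `if __name__ == "__main__":` guard.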

For evaluation, we used a Java program provided by Falk Böschen, which we adapted to our file structure. We included it as evaluator.jar.

Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

Cite this as

David Morris (2019). Dataset: A Neural Approach for Text Extraction from Scholarly Figures. https://doi.org/10.25835/0030443

Retrieved: 15:50 27 Jan 2020 (GMT)

Additional Info

Field Value
Author David Morris
Last Updated June 27, 2019, 18:36 (CEST)
Created June 27, 2019, 18:29 (CEST)
License Creative Commons Attribution 3.0