DP-Bench: Document Parsing Benchmark

Document parsing refers to the process of converting complex documents, such as PDFs and scanned images, into structured text formats like HTML and Markdown. It is especially useful as a preprocessor for RAG systems, as it preserves key structural information from visually rich documents.

While various parsers are available on the market, there is currently no standard evaluation metric to assess their performance. To address this gap, we propose a set of new evaluation metrics along with a benchmark dataset designed to measure parser performance.

Metrics

We propose assessing parser performance with three key metrics: NID for element detection and serialization, and TEDS and TEDS-S for table structure recognition.

Element detection and serialization

NID (Normalized Indel Distance). NID evaluates how well a parser detects and serializes document elements according to human reading order. NID is similar to the normalized edit distance metric but excludes substitutions during evaluation, making it more sensitive to length differences between strings.

The NID metric is computed as follows:

$$ NID = 1 - \frac{\text{distance}}{\text{len(reference)} + \text{len(prediction)}} $$

The normalized distance (the fraction in the equation) measures how far the predicted text is from the reference text, with values ranging from 0 to 1, where 0 represents perfect alignment and 1 denotes complete dissimilarity. Here, the distance is the number of character-level insertions and deletions needed to turn the predicted text into the reference text. Because NID subtracts the normalized distance from 1, a higher NID score reflects better performance in both recognizing and ordering the text within the document's detected layout regions.
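As a concrete illustration of the formula, the following is a minimal Python sketch, not the benchmark's own implementation. It relies on the standard identity that an insertion/deletion-only distance equals len(reference) + len(prediction) minus twice the length of the longest common subsequence. The function names and the handling of two empty strings are assumptions made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Number of character insertions and deletions needed to turn a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def nid(reference: str, prediction: str) -> float:
    """NID = 1 - distance / (len(reference) + len(prediction))."""
    total = len(reference) + len(prediction)
    if total == 0:
        # Assumption for this sketch: two empty strings count as a perfect match.
        return 1.0
    return 1.0 - indel_distance(reference, prediction) / total


if __name__ == "__main__":
    print(nid("kitten", "sitten"))  # one substituted character costs 2 edits
    print(nid("abc", "abc"))
    print(nid("abc", "xyz"))
```

Since substitutions are not allowed, replacing one character costs one deletion plus one insertion, which is why a single wrong character in a six-character string already lowers the score by 2/12.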

Table structure recognition

Tables are one of the most complex elements in documents, often presenting both structural and content-related challenges. Yet, during NID evaluation, table elements (as well as figures and charts) are excluded, allowing the metric to focus on text elements such as paragraphs, headings, indexes, and footnotes. To specifically evaluate table structure and content extraction, we use the TEDS and TEDS-S metrics.

Traditional metrics fail to account for the hierarchical nature of tables (rows, columns, cells), whereas TEDS and TEDS-S measure the similarity between the predicted and ground-truth tables by comparing both structural layout and content, offering a more comprehensive evaluation.

TEDS (Tree Edit Distance-based Similarity). The TEDS metric is computed as follows:

$$ TEDS(T_a, T_b) = 1 - \frac{EditDist(T_a, T_b)}{\max(|T_a|, |T_b|)} $$

The equation evaluates the similarity between two tables by modeling them as tree structures $T_a$ and $T_b$, where $EditDist$ is the tree edit distance between them and $|T|$ is the number of nodes in tree $T$. This metric evaluates how accurately the table structure is predicted, including the content of each cell. A higher TEDS score indicates better overall performance in capturing both the table structure and the content of each cell.

TEDS-S (Tree Edit Distance-based Similarity-Struct). TEDS-S measures the structural similarity between the predicted and reference tables. While the metric formulation is identical to TEDS, it uses modified tree representations, denoted as $T_a'$ and $T_b'$, where the nodes correspond solely to the table structure, omitting the content of each cell. This allows TEDS-S to concentrate on assessing the structural similarity of the tables, such as row and column alignment, without being influenced by the contents within the cells.
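To make the difference between TEDS and TEDS-S concrete, here is a small Python sketch under simplifying assumptions. It is not the benchmark's evaluation code: every insert, delete, and relabel costs 1 (a full TEDS implementation may score differences in cell text more finely), and the tree encoding, with cell text appended to the tag as `td:text`, is hypothetical and chosen only for this example.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for caching.


def node(label, *children):
    return (label, tuple(children))


def size(forest):
    """Total number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Ordered tree (forest) edit distance with unit costs."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert everything in f2
    if not f2:
        return size(f1)  # delete everything in f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        # Delete the root of the last tree in f1; its children move up.
        forest_dist(f1[:-1] + c1, f2) + 1,
        # Insert the root of the last tree in f2.
        forest_dist(f1, f2[:-1] + c2) + 1,
        # Match the two roots (relabel cost 1 if labels differ).
        forest_dist(c1, c2) + forest_dist(f1[:-1], f2[:-1]) + (l1 != l2),
    )


def teds(ta, tb):
    """TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|)."""
    return 1.0 - forest_dist((ta,), (tb,)) / max(size((ta,)), size((tb,)))


def strip_content(tree):
    """Structure-only tree for TEDS-S: keep the tag, drop the cell text."""
    label, children = tree
    return (label.split(":", 1)[0], tuple(strip_content(c) for c in children))


# One-row, two-cell tables that differ only in the text of the second cell.
reference = node("table", node("tr", node("td:a"), node("td:b")))
prediction = node("table", node("tr", node("td:a"), node("td:x")))

if __name__ == "__main__":
    print("TEDS  :", teds(reference, prediction))
    print("TEDS-S:", teds(strip_content(reference), strip_content(prediction)))
```

In this example the two tables have the same structure, so the structure-only trees are identical and TEDS-S is 1.0, while the one differing cell costs one edit out of four nodes and TEDS drops to 0.75.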

Benchmark dataset

Document sources

The benchmark dataset is gathered from three sources: 90 samples from the Library of Congress; 90 samples from Open Educational Resources; and 20 samples from Upstage's internal documents. Together, these sources provide a broad and specialized range of information.

| Source | Count |
| --- | --- |
| Library of Congress | 90 |
| Open Educational Resources | 90 |
| Upstage | 20 |

Layout elements

While works like ReadingBank often focus solely on text conversion in document parsing, we have taken a more detailed approach by dividing the document into specific elements, with a particular emphasis on table performance.

This benchmark dataset was created by extracting pages with various layout elements from multiple types of documents. The layout elements consist of 12 element types: Table, Paragraph, Figure, Chart, Header, Footer, Caption, Equation, Heading1, List, Index, Footnote. This diverse set of layout elements ensures that our evaluation covers a wide range of document structures and complexities, providing a comprehensive assessment of document parsing capabilities.

Note that only Heading1 is included among various heading sizes because it represents the main structural divisions in most documents, serving as the primary section title. This high-level segmentation is sufficient for assessing the core structure without adding unnecessary complexity. Detailed heading levels like Heading2 and Heading3 are omitted to keep the evaluation focused and manageable.

| Category | Count |
| --- | --- |
| Paragraph | 804 |
| Heading1 | 194 |
| Footer | 168 |
| Caption | 154 |
| Header | 101 |
| List | 91 |
| Chart | 67 |
| Footnote | 63 |
| Equation | 58 |
| Figure | 57 |
| Table | 55 |
| Index | 10 |

Dataset format

The dataset is in JSON format. Each top-level key is a PDF file name, and its elements list holds the elements extracted from that file, each defined by its position, layout class, and content. The category field gives the layout class, including but not limited to text regions, headings, footers, captions, and tables. The content field has three subfields: text contains text-based content; html represents layout regions where equations are in LaTeX and tables in HTML; and markdown distinguishes between regions like Heading1 and other text-based regions such as paragraphs, captions, and footers. Each element also includes its coordinates (a list of x, y points; four corner points in the example below), a unique ID, and the page number it appears on. The dataset's structure supports flexible representation of layout classes and content formats for document parsing.

{
    "01030000000001.pdf": {
        "elements": [
            {
                "coordinates": [
                    {
                        "x": 170.9176246670229,
                        "y": 102.3493458064781
                    },
                    {
                        "x": 208.5023846755278,
                        "y": 102.3493458064781
                    },
                    {
                        "x": 208.5023846755278,
                        "y": 120.6598699131856
                    },
                    {
                        "x": 170.9176246670229,
                        "y": 120.6598699131856
                    }
                ],
                "category": "Header",
                "id": 0,
                "page": 1,
                "content": {
                    "text": "314",
                    "html": "",
                    "markdown": ""
                }
            },
            ...
    ...
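A minimal Python sketch of reading this structure is shown here. It assumes only the fields visible in the excerpt above; the helper names (`load_reference`, `count_categories`, `bounding_box`) are hypothetical and are not part of the repository's scripts. The inline `sample` repeats the excerpt so the block runs without the dataset files.

```python
import json
from collections import Counter


def load_reference(path):
    """Load a DP-Bench style JSON file, e.g. dataset/reference.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_categories(data):
    """Count elements per layout category across all documents."""
    counts = Counter()
    for doc in data.values():
        for element in doc["elements"]:
            counts[element["category"]] += 1
    return counts


def bounding_box(element):
    """Return (x_min, y_min, x_max, y_max) from the element's corner points."""
    xs = [point["x"] for point in element["coordinates"]]
    ys = [point["y"] for point in element["coordinates"]]
    return min(xs), min(ys), max(xs), max(ys)


# Inline sample mirroring the excerpt above (a single Header element).
sample = {
    "01030000000001.pdf": {
        "elements": [
            {
                "coordinates": [
                    {"x": 170.9176246670229, "y": 102.3493458064781},
                    {"x": 208.5023846755278, "y": 102.3493458064781},
                    {"x": 208.5023846755278, "y": 120.6598699131856},
                    {"x": 170.9176246670229, "y": 120.6598699131856},
                ],
                "category": "Header",
                "id": 0,
                "page": 1,
                "content": {"text": "314", "html": "", "markdown": ""},
            }
        ]
    }
}

if __name__ == "__main__":
    print(count_categories(sample))
    print(bounding_box(sample["01030000000001.pdf"]["elements"][0]))
```

To run the same helpers on the full benchmark, pass the result of `load_reference("dataset/reference.json")` to `count_categories` after cloning the repository.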

Document domains

| Domain | Subdomain | Count |
| --- | --- | --- |
| Social Sciences | Economics | 26 |
| Social Sciences | Political Science | 18 |
| Social Sciences | Sociology | 16 |
| Social Sciences | Law | 12 |
| Social Sciences | Cultural Anthropology | 11 |
| Social Sciences | Education | 8 |
| Social Sciences | Psychology | 4 |
| Natural Sciences | Environmental Science | 26 |
| Natural Sciences | Biology | 10 |
| Natural Sciences | Astronomy | 4 |
| Technology | Technology | 33 |
| Mathematics and Information Sciences | Mathematics | 13 |
| Mathematics and Information Sciences | Informatics | 9 |
| Mathematics and Information Sciences | Computer Science | 8 |
| Mathematics and Information Sciences | Statistics | 2 |

Usage

Setup

Before setting up the environment, make sure to install Git LFS, which is required for handling large files. Once installed, you can clone the repository and install the necessary dependencies by running the following commands:

$ git clone https://huggingface.co/datasets/upstage/dp-bench.git
$ cd dp-bench
$ pip install -r requirements.txt

The repository includes necessary scripts for inference and evaluation, as described in the following sections.

Inference

We offer inference scripts that let you request results from various document parsing services. For more details, refer to this README.

Evaluation

The benchmark dataset can be found in the dataset folder. It contains a wide range of document layouts, from text-heavy pages to complex tables, enabling a thorough evaluation of the parser’s performance. The dataset comes with annotations for layout elements such as paragraphs, headings, and tables.

The following options are required for evaluation:

  • --ref_path: Specifies the path to the reference JSON file, predefined as dataset/reference.json for evaluation purposes.
  • --pred_path: Indicates the path to the predicted JSON file. You can either use a sample result located in the dataset/sample_results folder, or generate your own by using the inference script provided in the scripts folder.

Element detection and serialization evaluation

This evaluation will compute the NID metric to assess how accurately the text in the document is recognized considering the structure and order of the document layout. To evaluate the document layout results, run the following command:

$ python evaluate.py \
  --ref_path <path to the reference json file> \
  --pred_path <path to the predicted json file> \
  --mode layout

Table structure recognition evaluation

This will compute TEDS-S (structural accuracy) and TEDS (structural and textual accuracy). To evaluate table recognition performance, use the following command:

$ python evaluate.py \
  --ref_path <path to the reference json file> \
  --pred_path <path to the predicted json file> \
  --mode table

Leaderboard

| Source | Request date | TEDS ↑ | TEDS-S ↑ | NID ↑ | Avg. Time (secs) ↓ |
| --- | --- | --- | --- | --- | --- |
| upstage | 2024-10-10 | 92.06 | 93.81 | 96.23 | 3.79 |
| aws | 2024-10-10 | 86.39 | 90.22 | 95.94 | 14.47 |
| llamaparse | 2024-10-10 | 73.36 | 76.29 | 92.22 | 4.14 |
| unstructured | 2024-10-10 | 64.49 | 69.90 | 90.42 | 13.14 |
| google | 2024-10-10 | 64.64 | 70.95 | 90.09 | 5.85 |
| microsoft | 2024-10-10 | 85.54 | 89.07 | 87.03 | 4.44 |