A Coding Implementation to Benchmark Document Parsing with LlamaIndex ParseBench Using Python, Hugging Face, and Evaluation Metrics



In this tutorial, we explore how to use ParseBench, a dataset for evaluating document parsing systems, in a structured and practical way. We start by loading the dataset directly from Hugging Face, examining its multiple dimensions, such as text, tables, charts, and layout, and transforming it into a unified data frame for deeper analysis. As we progress, we identify key fields, detect associated PDFs, and build a lightweight baseline using PyMuPDF to extract and compare text. Throughout the process, we focus on creating a flexible pipeline that lets us understand the dataset's schema, evaluate parsing quality, and prepare inputs for more advanced OCR or vision-language models.

!pip install -q -U datasets huggingface_hub pandas matplotlib rich pymupdf rapidfuzz tqdm


import json, re, textwrap, random, math
from pathlib import Path
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from huggingface_hub import hf_hub_download, list_repo_files
from rapidfuzz import fuzz
import fitz


console = Console()
DATASET_ID = "llamaindex/ParseBench"
WORKDIR = Path("/content/parsebench_tutorial")
WORKDIR.mkdir(parents=True, exist_ok=True)


console.print(Panel.fit("Advanced ParseBench Tutorial on Google Colab", style="bold green"))


files = list_repo_files(DATASET_ID, repo_type="dataset")
jsonl_files = [f for f in files if f.endswith(".jsonl")]
pdf_files = [f for f in files if f.endswith(".pdf")]


console.print(f"Found {len(jsonl_files)} JSONL files")
console.print(f"Found {len(pdf_files)} PDF files")


table = Table(title="ParseBench JSONL Files")
table.add_column("File")
table.add_column("Dimension")
for f in jsonl_files:
   table.add_row(f, Path(f).stem)
console.print(table)

We install all the required libraries and prepare our working environment for the tutorial. We initialize the dataset source and set up a workspace to store all outputs. We also list the JSONL and PDF files in the ParseBench repository to understand the structure of the dataset.
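
It can also help to see how the repository organizes its PDFs before loading any records. The sketch below is a small optional addition, assuming the paths returned by list_repo_files are repo-relative (for example, dimension_name/file.pdf, which is an assumption about this repo's layout, not a documented guarantee); it simply counts PDFs per top-level folder:

# Optional sketch: count PDFs per top-level repo folder to see how documents
# are grouped. Assumes pdf_files holds repo-relative paths.
pdf_folders = Counter(Path(f).parts[0] for f in pdf_files)
for folder, count in pdf_folders.most_common():
   console.print(f"{folder}: {count} PDFs")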

def load_jsonl_from_hf(filename, max_rows=None):
   path = hf_hub_download(repo_id=DATASET_ID, filename=filename, repo_type="dataset")
   rows = []
   with open(path, "r", encoding="utf-8") as fp:
       for i, line in enumerate(fp):
           if max_rows and i >= max_rows:
               break
           line = line.strip()
           if line:
               rows.append(json.loads(line))
   return rows, path


def flatten_dict(d, parent_key="", sep="."):
   items = {}
   if isinstance(d, dict):
       for k, v in d.items():
           new_key = f"{parent_key}{sep}{k}" if parent_key else str(k)
           if isinstance(v, dict):
               items.update(flatten_dict(v, new_key, sep=sep))
           else:
               items[new_key] = v
   return items


dimension_data = {}
for jf in jsonl_files:
   rows, local_path = load_jsonl_from_hf(jf)
   dimension_data[Path(jf).stem] = rows
   console.print(f"{jf}: {len(rows)} examples loaded")


summary_rows = []
for dim, rows in dimension_data.items():
   keys = Counter()
   for r in rows[:100]:
       keys.update(flatten_dict(r).keys())
   summary_rows.append({
       "dimension": dim,
       "examples": len(rows),
       "top_fields": ", ".join([k for k, _ in keys.most_common(12)])
   })


summary_df = pd.DataFrame(summary_rows)
display(summary_df)


plt.figure(figsize=(10, 5))
plt.bar(summary_df["dimension"], summary_df["examples"])
plt.title("ParseBench Examples by Dimension")
plt.xlabel("Dimension")
plt.ylabel("Number of Examples")
plt.xticks(rotation=30, ha="right")
plt.show()


for dim, rows in dimension_data.items():
   console.print(Panel.fit(f"Sample schema for {dim}", style="bold cyan"))
   if rows:
       console.print(json.dumps(rows[0], indent=2)[:3000])

We load the JSONL files from the dataset and convert them into usable Python objects. We flatten nested structures for easy analysis in tabular form. We also summarize each dimension and visualize the distribution of examples across the different parsing tasks.
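
To make the flattening behavior concrete, here is a tiny worked example on an invented nested record (the record is hypothetical, used only to illustrate how nested keys are joined with the dot separator):

# A hypothetical record to illustrate flatten_dict's key joining.
example = {"doc": {"id": 7, "meta": {"pages": 3}}, "label": "table"}
print(flatten_dict(example))
# {'doc.id': 7, 'doc.meta.pages': 3, 'label': 'table'}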

all_records = []
for dim, rows in dimension_data.items():
   for i, r in enumerate(rows):
       flat = flatten_dict(r)
       flat["_dimension"] = dim
       flat["_row_id"] = i
       all_records.append(flat)


df = pd.DataFrame(all_records)
console.print(f"Combined dataframe shape: {df.shape}")
display(df.head())


missing_report = []
for col in df.columns:
   missing_report.append({
       "column": col,
       "non_null": int(df[col].notna().sum()),
       "missing": int(df[col].isna().sum()),
       "coverage_pct": round(100 * df[col].notna().mean(), 2)
   })


missing_df = pd.DataFrame(missing_report).sort_values("coverage_pct", ascending=False)
display(missing_df.head(40))


def find_candidate_columns(df, keywords):
   cols = []
   for c in df.columns:
       lc = c.lower()
       if any(k.lower() in lc for k in keywords):
           cols.append(c)
   return cols


doc_cols = find_candidate_columns(df, ["doc", "pdf", "file", "path", "source", "image"])
text_cols = find_candidate_columns(df, ["text", "content", "markdown", "ground", "answer", "expected", "target", "reference"])
rule_cols = find_candidate_columns(df, ["rule", "check", "assert", "criteria", "question", "prompt"])
bbox_cols = find_candidate_columns(df, ["bbox", "box", "polygon", "coordinates", "layout"])


console.print("[bold]Possible document columns:[/bold]", doc_cols[:30])
console.print("[bold]Possible text/reference columns:[/bold]", text_cols[:30])
console.print("[bold]Possible rule/question columns:[/bold]", rule_cols[:30])
console.print("[bold]Possible layout columns:[/bold]", bbox_cols[:30])

We combine all flattened records into a single data frame for unified analysis. We evaluate missing values and identify the most informative fields across the dataset. We also detect candidate columns related to documents, text, rules, and layout to guide downstream processing.
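
As a follow-up check, a short sketch like the one below reports, per dimension, what fraction of rows populate each detected reference column; it assumes the df and text_cols objects from the cells above and helps reveal which dimensions actually carry reference text:

# Per-dimension coverage of the detected text/reference columns.
# Assumes df and text_cols from the cells above.
if text_cols:
   coverage = df.groupby("_dimension")[text_cols].agg(lambda s: s.notna().mean()).round(2)
   display(coverage)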

def pick_first_existing(row, candidates):
   for c in candidates:
       if c in row and pd.notna(row[c]):
           value = row[c]
           if isinstance(value, str) and value.strip():
               return value
           if not isinstance(value, str):
               return value
   return None


def normalize_text(x):
   if x is None or (isinstance(x, float) and math.isnan(x)):
       return ""
   x = str(x)
   x = re.sub(r"\s+", " ", x)
   return x.strip().lower()


def simple_text_similarity(a, b):
   a = normalize_text(a)
   b = normalize_text(b)
   if not a or not b:
       return None
   return fuzz.token_set_ratio(a, b) / 100


def locate_pdf_path(value):
   if value is None:
       return None
   value = str(value)
   candidates = []
   if value.endswith(".pdf"):
       candidates.append(value)
       candidates.extend([f for f in pdf_files if f.endswith(value.split("/")[-1])])
   else:
       candidates.extend([
           f for f in pdf_files
           if value in f or Path(f).stem in value or value in Path(f).stem
       ])
   return candidates[0] if candidates else None


def extract_pdf_text_from_hf(pdf_repo_path, max_pages=2):
   local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
   doc = fitz.open(local_pdf)
   texts = []
   for page_idx in range(min(max_pages, len(doc))):
       texts.append(doc[page_idx].get_text("text"))
   doc.close()
   return "\n".join(texts), local_pdf


def render_pdf_first_page(pdf_repo_path, zoom=2):
   local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
   doc = fitz.open(local_pdf)
   page = doc[0]
   pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
   out_path = WORKDIR / (Path(pdf_repo_path).stem + "_page1.png")
   pix.save(out_path)
   doc.close()
   return out_path


sample_records = df.sample(min(25, len(df)), random_state=42).to_dict("records")
pdf_candidates = []


for row in sample_records:
   for c in doc_cols:
       pdf_path = locate_pdf_path(row.get(c))
       if pdf_path:
           pdf_candidates.append((row["_dimension"], row["_row_id"], pdf_path))
           break


pdf_candidates = list(dict.fromkeys(pdf_candidates))
console.print(f"Detected {len(pdf_candidates)} PDF-linked sampled records")


if pdf_candidates:
   dim, row_id, pdf_path = pdf_candidates[0]
   console.print(Panel.fit(f"Rendering sample PDF\nDimension: {dim}\nRow: {row_id}\nPDF: {pdf_path}", style="bold yellow"))
   image_path = render_pdf_first_page(pdf_path)
   img = plt.imread(image_path)
   plt.figure(figsize=(10, 12))
   plt.imshow(img)
   plt.axis("off")
   plt.title(f"{dim}: {Path(pdf_path).name}")
   plt.show()
else:
   console.print("[yellow]No PDF-linked rows were detected from the sample.[/yellow]")

We define helper functions for text normalization, similarity scoring, and PDF processing. We locate PDF files associated with dataset entries, download them, and extract their textual content. We also render a sample PDF page for visual inspection of the document structure.
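
Before running the baseline at scale, a quick sanity check of the similarity helper on two invented strings shows what the token-set ratio rewards: it ignores word order, so reordered but matching text still scores near 1.0. Both strings here are made up for illustration:

# Sanity check on invented strings: token_set_ratio is order-insensitive.
a = "Total revenue for 2023 was $4.2M"
b = "was $4.2M total revenue for 2023"
print(simple_text_similarity(a, b))                        # same tokens, ~1.0
print(simple_text_similarity(a, "completely unrelated"))   # much lower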

preferred_gt_cols = [
   c for c in text_cols
   if any(k in c.lower() for k in ["ground", "expected", "target", "answer", "content", "text", "markdown", "reference"])
]


evaluation_rows = []
eval_sample = df.sample(min(50, len(df)), random_state=7).to_dict("records")


for row in tqdm(eval_sample, desc="Running lightweight PDF text extraction baseline"):
   pdf_path = None
   for c in doc_cols:
       pdf_path = locate_pdf_path(row.get(c))
       if pdf_path:
           break


   if not pdf_path:
       evaluation_rows.append({
           "dimension": row.get("_dimension"),
           "row_id": row.get("_row_id"),
           "pdf": None,
           "ground_truth_column": None,
           "similarity_score": None,
           "status": "no_pdf_detected"
       })
       continue


   gt_col = None
   gt = None
   for c in preferred_gt_cols:
       if c in row and pd.notna(row[c]):
           gt_col = c
           gt = row[c]
           break


   if gt is None:
       evaluation_rows.append({
           "dimension": row.get("_dimension"),
           "row_id": row.get("_row_id"),
           "pdf": pdf_path,
           "ground_truth_column": None,
           "similarity_score": None,
           "status": "no_reference_detected"
       })
       continue


   try:
       extracted, local_pdf = extract_pdf_text_from_hf(pdf_path, max_pages=2)
       score = simple_text_similarity(extracted, gt)
       evaluation_rows.append({
           "dimension": row.get("_dimension"),
           "row_id": row.get("_row_id"),
           "pdf": pdf_path,
           "ground_truth_column": gt_col,
           "similarity_score": score,
           "extracted_chars": len(extracted),
           "ground_truth_chars": len(str(gt)),
           "status": "scored"
       })
   except Exception as e:
       evaluation_rows.append({
           "dimension": row.get("_dimension"),
           "row_id": row.get("_row_id"),
           "pdf": pdf_path,
           "ground_truth_column": gt_col,
           "similarity_score": None,
           "status": "error",
           "error": str(e)
       })


eval_df = pd.DataFrame(evaluation_rows)


if eval_df.empty:
   eval_df = pd.DataFrame(columns=[
       "dimension", "row_id", "pdf", "ground_truth_column",
       "similarity_score", "extracted_chars", "ground_truth_chars",
       "status", "error"
   ])


display(eval_df.head(30))


if "status" in eval_df.columns:
   display(eval_df["status"].value_counts().reset_index().rename(columns={"index": "status", "status": "count"}))


if not eval_df.empty and "similarity_score" in eval_df.columns:
   valid_eval = eval_df.dropna(subset=["similarity_score"])


   if len(valid_eval):
       console.print(f"Average lightweight text similarity: {valid_eval['similarity_score'].mean():.3f}")


       plt.figure(figsize=(8, 5))
       plt.hist(valid_eval["similarity_score"], bins=10)
       plt.title("Lightweight Baseline Similarity Distribution")
       plt.xlabel("RapidFuzz Token Set Similarity")
       plt.ylabel("Count")
       plt.show()


       per_dim = valid_eval.groupby("dimension")["similarity_score"].mean().reset_index()
       display(per_dim)


       plt.figure(figsize=(9, 5))
       plt.bar(per_dim["dimension"], per_dim["similarity_score"])
       plt.title("Average Baseline Similarity by Dimension")
       plt.xlabel("Dimension")
       plt.ylabel("Average Similarity")
       plt.xticks(rotation=30, ha="right")
       plt.show()
   else:
       console.print("[yellow]No valid similarity scores were produced. This usually means sampled rows did not contain both detectable PDFs and reference text.[/yellow]")
else:
   console.print("[yellow]No similarity_score column found.[/yellow]")

We run a lightweight evaluation pipeline by comparing the extracted text with available reference fields. We calculate similarity scores and analyze how well simple extraction performs across different dimensions. We also visualize results to understand performance trends and limitations.
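
If we want to keep these results for later comparison, one small optional step is to persist the evaluation table next to the other outputs; the filename below is our own choice, not part of the dataset:

# Optional: persist the baseline results. The filename is arbitrary;
# eval_df and WORKDIR come from the cells above.
eval_out = WORKDIR / "parsebench_baseline_eval.csv"
eval_df.to_csv(eval_out, index=False)
console.print(f"Saved baseline evaluation to: {eval_out}")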

def inspect_dimension(dimension_name, n=3):
   rows = dimension_data.get(dimension_name, [])
   console.print(Panel.fit(f"Inspecting {dimension_name}: {len(rows)} rows", style="bold magenta"))
   for idx, row in enumerate(rows[:n]):
       console.print(f"\n[bold]Example {idx}[/bold]")
       console.print(json.dumps(row, indent=2)[:2500])


for dim in list(dimension_data.keys())[:5]:
   inspect_dimension(dim, n=1)


def make_parsebench_subset(dimension=None, n=20, seed=123):
   subset = df.copy()
   if dimension:
       subset = subset[subset["_dimension"] == dimension]
   if len(subset) == 0:
       return subset
   return subset.sample(min(n, len(subset)), random_state=seed)


subset = make_parsebench_subset(n=20)
display(subset.head())


def create_llm_parser_prompt(row):
   dimension = row.get("_dimension", "unknown")
   candidate_truth = pick_first_existing(row, preferred_gt_cols)
   rule_hint = pick_first_existing(row, rule_cols)


   prompt = f"""
You are evaluating a document parser on ParseBench.


Dimension:
{dimension}


Task:
Parse the PDF page into a structured representation that preserves the information needed for agentic workflows.


Relevant benchmark hint or rule:
{rule_hint if rule_hint is not None else "No obvious rule field detected."}


Reference field preview:
{str(candidate_truth)[:1000] if candidate_truth is not None else "No obvious reference field detected."}


Return:
1. Markdown representation
2. Extracted tables as JSON arrays when tables exist
3. Extracted chart values as JSON when charts exist
4. Layout-sensitive notes when visual grounding matters
"""
   return textwrap.dedent(prompt).strip()


prompt_examples = []
if len(subset):
   for _, row in subset.head(3).iterrows():
       prompt_examples.append(create_llm_parser_prompt(row.to_dict()))


if prompt_examples:
   console.print(Panel.fit("Example prompt for testing an external OCR or VLM parser", style="bold blue"))
   console.print(prompt_examples[0])
else:
   console.print("[yellow]No prompt examples could be created because the subset is empty.[/yellow]")


def compare_parser_outputs(reference, candidate):
   return {
       "token_set_similarity": simple_text_similarity(reference, candidate),
       "partial_ratio": fuzz.partial_ratio(normalize_text(reference), normalize_text(candidate)) / 100 if reference and candidate else None,
       "candidate_length": len(str(candidate)) if candidate else 0,
       "reference_length": len(str(reference)) if reference else 0
   }


if not eval_df.empty and "similarity_score" in eval_df.columns:
   scored_eval = eval_df.dropna(subset=["similarity_score"])


   if len(scored_eval):
       best = scored_eval.sort_values("similarity_score", ascending=False).head(1)
       worst = scored_eval.sort_values("similarity_score", ascending=True).head(1)


       console.print(Panel.fit("Best lightweight baseline example", style="bold green"))
       display(best)


       console.print(Panel.fit("Worst lightweight baseline example", style="bold red"))
       display(worst)
   else:
       console.print("[yellow]No valid similarity scores were available for best/worst comparison.[/yellow]")


output_path = WORKDIR / "parsebench_flattened_sample.csv"
df.head(500).to_csv(output_path, index=False)
console.print(f"Saved flattened sample to: {output_path}")


console.print(Panel.fit("""
Tutorial complete.


What we built:
1. Load ParseBench files directly from Hugging Face.
2. Inspect benchmark dimensions and schemas.
3. Flatten records into a dataframe.
4. Detect linked PDFs and render sample pages when possible.
5. Run a lightweight PyMuPDF extraction baseline.
6. Score extracted text when reference fields are available.
7. Generate reusable prompts for OCR, VLM, and document parser evaluation.
""", style="bold green"))

We sample the dataset and create subsets for experimentation. We create structured prompts to evaluate external parsing systems, such as OCR engines and vision-language models. We also compare outputs, identify the best and worst cases, and save the processed data for future use.
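
To show how compare_parser_outputs could score an external parser, here is a hedged usage example with an invented reference/candidate pair standing in for a ground-truth field and a model's output:

# Invented reference/candidate pair to demonstrate compare_parser_outputs.
reference = "Quarterly revenue: $4.2M, up 12% year over year"
candidate = "Revenue for the quarter was $4.2M (12% YoY growth)"
print(compare_parser_outputs(reference, candidate))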

In conclusion, we have built a complete workflow that allows us to explore, evaluate, and experiment with document parsing using the ParseBench dataset. We extracted and compared textual content and also created structured prompts to test external parsing systems, such as OCR engines and VLMs. This approach helps us move beyond simple text extraction and toward building agent-ready representations that preserve structure, layout, and semantic meaning. We've also created a strong foundation that we can expand further to measure performance, improve parsing models, and integrate document understanding into real-world AI pipelines.

