Updating marker-pdf

Introduction

marker-pdf has received several updates and improvements over the past few months, including a new major release. I'm upgrading to the new version.

Among the changes:

  • Improve performance by 10-15% (0.3.10)
  • 2x faster due to a new layout model (1.0.0)
  • Consistent internal schema for blocks and pages (1.0.0)
  • Much higher quality output (1.0.0)
  • Fix lots of misc bugs, including encoding, empty page problems, and image rendering (1.0.1)
  • Improve list processing with joining and nesting (1.0.1)
  • Add in blockquotes (1.0.1)
  • Slightly improve performance (1.0.1)
  • Automatically detect bad OCR text and re-OCR the document. This consists of some PDF-level heuristics and a new OCR quality model. (1.2.0)
  • Layout model is now half the size and ~2x faster (most of the runtime in the general case is layout, so this should result in a big overall speedup). It's also more accurate. (1.2.0)
  • Tables now handle colspans and rowspans properly (1.3.0)
  • Improved table model with better accuracy (1.3.0)
  • Links and references are now pulled out of the pdf, and are clickable (1.3.0)
  • Anchors are placed on elements as targets (1.3.0)
  • Better inline math detection with an improved model. (1.6.0)

Lots of good things for almost no cost!
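To confirm that an environment actually picked up the upgrade, a small version check can help. The following is only a sketch: the helper names are hypothetical, the minimum of 1.6.0 is an assumption taken from the newest entry in the list above, and it only handles plain `X.Y.Z` release strings. It compares versions numerically, since a plain string comparison would rank 0.3.10 below 0.3.9.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain 'X.Y.Z' release string into a tuple of ints.

    Pre-release suffixes such as 'rc1' are not handled in this sketch.
    """
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`, compared numerically."""
    return parse_version(installed) >= parse_version(minimum)


def installed_marker_version() -> Optional[str]:
    """Return the installed marker-pdf version, or None if it is not installed."""
    try:
        return version("marker-pdf")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: 1.6.0 is the newest release named in the change list.
    minimum = "1.6.0"
    found = installed_marker_version()
    if found is None:
        print("marker-pdf is not installed")
    elif meets_minimum(found, minimum):
        print(f"marker-pdf {found} satisfies >= {minimum}")
    else:
        print(f"marker-pdf {found} is older than {minimum}")
```

Running the script in the target environment prints which of the three cases applies.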

Manual Testing

image

image
