Features
- Extract tables from PDF files with ease.
- Support for multi-page PDFs.
- Handles overlapping text and complex layouts.
- Configurable options to tailor table extraction.
- Lightweight and easy to integrate.
Installation
Install the library using npm:
Usage
Here’s a basic example of how to use the library:
Options
The PdfDocument constructor accepts the following configuration options:
Option | Type | Default | Description |
---|---|---|---|
hasTitles | boolean | true | Indicates whether tables have title rows. |
threshold | number | 1.5 | Sensitivity for grouping rows by y-axis. |
maxStrLength | number | 30 | Maximum string length for table cells. |
ignoreTexts | string[] | [] | Array of texts to ignore during extraction. |
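
As a rough illustration of the table above, the options can be written as a typed object with the documented defaults. The names PdfDocumentOptions, DEFAULT_OPTIONS and resolveOptions are hypothetical and may not match the library's own exports; only the field names, types and default values are taken from the table.

```typescript
// Hypothetical type name; the fields and defaults come from the options table.
interface PdfDocumentOptions {
  hasTitles: boolean;    // whether tables have title rows
  threshold: number;     // sensitivity for grouping rows by y-axis
  maxStrLength: number;  // maximum string length for table cells
  ignoreTexts: string[]; // texts to ignore during extraction
}

// Defaults as documented in the table.
const DEFAULT_OPTIONS: PdfDocumentOptions = {
  hasTitles: true,
  threshold: 1.5,
  maxStrLength: 30,
  ignoreTexts: [],
};

// Illustrative helper (not part of the library): fills in the documented
// defaults for any option the caller leaves out.
function resolveOptions(
  overrides: Partial<PdfDocumentOptions> = {},
): PdfDocumentOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...overrides,
    // Copy the array so callers never share the default instance.
    ignoreTexts: [...(overrides.ignoreTexts ?? DEFAULT_OPTIONS.ignoreTexts)],
  };
}

const customOptions = resolveOptions({ threshold: 2, ignoreTexts: ["Page"] });
```

An object shaped like customOptions is what the PdfDocument constructor is described as accepting; whether the library merges defaults in exactly this way is an assumption.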
API
PdfDocument
Properties:
- numPages: Number of pages in the PDF document.
- pages: Array of parsed pages, each containing:
  - pageNumber: Page number in the PDF.
  - tables: Array of extracted tables.
Methods:
- load(source: string | Buffer): Promise<void>
  Loads and processes the PDF file.
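
The sketch below shows one way the documented PdfDocument shape might be consumed after loading. Because the package's import path is not shown here, it declares local structural types (PdfDocumentLike, PdfPageLike, PdfTableLike, all hypothetical names) that mirror the documented properties. The types of tableNumber and of the cells in data are assumptions, and so is the idea that pages is populated once load() resolves.

```typescript
// Local structural types mirroring the documented properties. In a real
// project these would come from the library's own exports instead.
interface PdfTableLike {
  tableNumber: number; // assumed numeric identifier
  numrows: number;
  numcols: number;
  data: string[][];    // assumed to hold string cells
}

interface PdfPageLike {
  pageNumber: number;
  tables: PdfTableLike[];
}

interface PdfDocumentLike {
  numPages: number;
  pages: PdfPageLike[];
  // The documented signature also accepts a Buffer; a string is enough here.
  load(source: string): Promise<void>;
}

// Build one summary line per extracted table.
function summarizeTables(doc: PdfDocumentLike): string[] {
  const lines: string[] = [];
  for (const page of doc.pages) {
    for (const table of page.tables) {
      lines.push(
        `page ${page.pageNumber}, table ${table.tableNumber}: ` +
          `${table.numrows}x${table.numcols}`,
      );
    }
  }
  return lines;
}

// Load first, then read the parsed pages.
async function loadAndSummarize(
  doc: PdfDocumentLike,
  source: string,
): Promise<string[]> {
  await doc.load(source);
  return summarizeTables(doc);
}

// Stand-in object so the sketch runs without a PDF file or the library.
const stubDocument: PdfDocumentLike = {
  numPages: 1,
  pages: [
    {
      pageNumber: 1,
      tables: [
        {
          tableNumber: 1,
          numrows: 2,
          numcols: 2,
          data: [["Name", "Qty"], ["Bolt", "4"]],
        },
      ],
    },
  ],
  async load(): Promise<void> {},
};

const summaryLines = summarizeTables(stubDocument);
```

With the real library, a PdfDocument instance would take the place of stubDocument, and loadAndSummarize would be called with a file path.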
PdfTable
Properties:
- tableNumber: Identifier for the table.
- numrows: Number of rows in the table.
- numcols: Number of columns in the table.
- data: 2D array representing table data.
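
As an illustration of working with the data property, the following sketch turns a 2D array into keyed records. It assumes that, when hasTitles is true, the first row of data holds the column titles and that cells are strings. Neither point is confirmed above, and tableToRecords is a hypothetical helper, not part of the library.

```typescript
// Convert a 2D table into an array of records keyed by the title row.
// Assumption: data[0] contains the column titles.
function tableToRecords(data: string[][]): Record<string, string>[] {
  const titles = data[0] ?? [];
  const rows = data.slice(1);
  return rows.map((row) => {
    const record: Record<string, string> = {};
    titles.forEach((title, index) => {
      // Rows shorter than the title row get empty strings for missing cells.
      record[title] = row[index] ?? "";
    });
    return record;
  });
}

const sampleRecords = tableToRecords([
  ["Name", "Qty"],
  ["Bolt", "4"],
  ["Nut"],
]);
```

Each record maps a column title to the cell in that column, which is often easier to pass on to JSON or CSV output than the raw rows.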
Dependencies
- pdfjs-dist: PDF rendering library.
- tslib: Runtime library for TypeScript.