Features
- Extract tables from PDF files with ease.
- Support for multi-page PDFs.
- Handles overlapping text and complex layouts.
- Configurable options to tailor table extraction.
- Lightweight and easy to integrate.
Installation
Install the library using npm:
Usage
Here’s a basic example of how to use the library:
Options
The PdfDocument constructor accepts the following configuration options:
| Option | Type | Default | Description |
|---|---|---|---|
| hasTitles | boolean | true | Indicates whether tables have title rows. |
| threshold | number | 1.5 | Sensitivity for grouping rows by y-axis. |
| maxStrLength | number | 30 | Maximum string length for table cells. |
| ignoreTexts | string[] | [] | Array of texts to ignore during extraction. |
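
The option types and defaults in the table can be written down as a TypeScript shape. The sketch below is illustrative only: the interface name `TableExtractionOptions`, the constant `DEFAULT_OPTIONS`, and the helper `resolveOptions` are hypothetical and are not part of the library. It also assumes the constructor takes a single options object, which the table implies but does not state.

```typescript
// Hypothetical shape mirroring the options table; the library's own type name may differ.
interface TableExtractionOptions {
  hasTitles: boolean;    // whether tables have title rows
  threshold: number;     // sensitivity for grouping rows by y-axis
  maxStrLength: number;  // maximum string length for table cells
  ignoreTexts: string[]; // texts to ignore during extraction
}

// Defaults copied from the table above.
const DEFAULT_OPTIONS: TableExtractionOptions = {
  hasTitles: true,
  threshold: 1.5,
  maxStrLength: 30,
  ignoreTexts: [],
};

// Hypothetical helper: fill in any option the caller leaves out.
function resolveOptions(
  overrides: Partial<TableExtractionOptions> = {}
): TableExtractionOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...overrides,
    // Copy the array so callers never share the default instance.
    ignoreTexts: [...(overrides.ignoreTexts ?? DEFAULT_OPTIONS.ignoreTexts)],
  };
}

// Override two options and keep the documented defaults for the rest.
const options = resolveOptions({ hasTitles: false, ignoreTexts: ["Page"] });
console.log(options);
// Assumption: the result would then be passed as `new PdfDocument(options)`.
```

Spelling out the defaults this way makes it easy to see which values an override actually changes before handing them to the constructor.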
API
PdfDocument
Properties:
- numPages: Number of pages in the PDF document.
- pages: Array of parsed pages, each containing:
  - pageNumber: Page number in the PDF.
  - tables: Array of extracted tables.
Methods:
- load(source: string | Buffer): Promise<void>: Loads and processes the PDF file.
PdfTable
Properties:
- tableNumber: Identifier for the table.
- numrows: Number of rows in the table.
- numcols: Number of columns in the table.
- data: 2D array representing table data.
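
To show how the documented properties fit together, here is a minimal sketch that walks every page of a document-shaped object and turns each table into CSV. The `PdfTableLike`, `PdfPageLike`, and `PdfDocumentLike` interfaces and the `csvCell`, `tableToCsv`, and `listTables` helpers are hypothetical names introduced for illustration. The sketch also assumes that `tableNumber` is a number and that table cells are strings, neither of which is stated above.

```typescript
// Hypothetical structural types mirroring the documented properties.
interface PdfTableLike {
  tableNumber: number; // assumption: numeric identifier
  numrows: number;
  numcols: number;
  data: string[][];    // assumption: cells are strings
}

interface PdfPageLike {
  pageNumber: number;
  tables: PdfTableLike[];
}

interface PdfDocumentLike {
  numPages: number;
  pages: PdfPageLike[];
}

// Quote a cell when it contains a comma, a double quote, or a line break.
function csvCell(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Serialize one table's 2D data array as CSV text.
function tableToCsv(table: PdfTableLike): string {
  return table.data.map((row) => row.map(csvCell).join(",")).join("\n");
}

// Collect every table together with the page it was found on.
function listTables(
  doc: PdfDocumentLike
): { pageNumber: number; table: PdfTableLike }[] {
  const found: { pageNumber: number; table: PdfTableLike }[] = [];
  for (const page of doc.pages) {
    for (const table of page.tables) {
      found.push({ pageNumber: page.pageNumber, table });
    }
  }
  return found;
}

// Hand-built sample standing in for a loaded document.
const sample: PdfDocumentLike = {
  numPages: 1,
  pages: [
    {
      pageNumber: 1,
      tables: [
        {
          tableNumber: 1,
          numrows: 2,
          numcols: 2,
          data: [["Item", "Price"], ["Tea, green", "4.50"]],
        },
      ],
    },
  ],
};

for (const { pageNumber, table } of listTables(sample)) {
  console.log(`page ${pageNumber}, table ${table.tableNumber}`);
  console.log(tableToCsv(table));
}
```

In real use the sample object would be replaced by a PdfDocument instance after `load(source)` has resolved, provided its properties match the shapes assumed here.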
Dependencies
- pdfjs-dist: PDF rendering library.
- tslib: Runtime library for TypeScript.