module documentation
Undocumented
| Function | get | Opens a PDF using PyMuPDF (fitz). |
| Function | read | Extracts plain text from all pages of a PDF. |
| Function | read | Extracts structured data from a PDF and loads it into a DataFrame. |
Opens a PDF using PyMuPDF (fitz).
:param content: File path (str) or binary stream (bytes).
:type content: str | bytes
:return: A PyMuPDF Document object.
:rtype: fitz.Document
:raises ValueError: If content is neither a valid path nor a binary stream.
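A minimal sketch of how the path-or-bytes dispatch described above might look. The names `classify_pdf_input` and `open_pdf` are hypothetical and not taken from this module, and treating "a valid path" as "an existing file" is an assumption; only `fitz.open` is PyMuPDF's own call.

```python
import os


def classify_pdf_input(content):
    """Return "stream" for binary content or "path" for an existing file.

    Hypothetical helper: raises ValueError for anything else.
    """
    if isinstance(content, (bytes, bytearray)):
        return "stream"
    # Assumption: a "valid path" means a string naming an existing file.
    if isinstance(content, str) and os.path.isfile(content):
        return "path"
    raise ValueError("content must be an existing file path or PDF bytes")


def open_pdf(content):
    """Open a PDF with PyMuPDF from a path or from bytes (illustrative name)."""
    kind = classify_pdf_input(content)  # validate before importing PyMuPDF
    import fitz  # PyMuPDF, imported lazily so validation works without it

    if kind == "stream":
        # In-memory data needs an explicit file type.
        return fitz.open(stream=bytes(content), filetype="pdf")
    return fitz.open(content)
```

Validating the argument before touching PyMuPDF keeps the `ValueError` path independent of the library, which makes it easy to check in isolation.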
Extracts plain text from all pages of a PDF.
Uses PyPDF for simple, reliable text extraction.
:param content: Binary PDF content.
:type content: bytes
:return: Full text content concatenated from all pages.
:rtype: str
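The extraction described above could be sketched roughly as follows, assuming the `pypdf` package. The function names are hypothetical, and joining pages with a newline is an assumption, since the documentation only says the page texts are concatenated.

```python
import io


def join_page_texts(page_texts):
    """Join per-page strings, treating a missing page text as empty."""
    # Assumption: pages are separated by a single newline.
    return "\n".join(text or "" for text in page_texts)


def extract_text(content):
    """Extract plain text from PDF bytes with pypdf (illustrative name)."""
    from pypdf import PdfReader  # imported lazily, requires pypdf

    # PdfReader accepts a file-like object, so wrap the bytes.
    reader = PdfReader(io.BytesIO(content))
    return join_page_texts(page.extract_text() for page in reader.pages)
```

A call might look like `extract_text(open("report.pdf", "rb").read())`, where `report.pdf` is a placeholder file name.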
Extracts structured data from a PDF and loads it into a DataFrame.
Each row includes:
- Page number
- Text content
- List of extracted images (as bytes)
:param content: Path to PDF or binary content.
:type content: str | bytes
:return: DataFrame with columns ['Page', 'Content', 'Images'].
:rtype: pd.DataFrame
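One way the per-page rows could be assembled is sketched below, assuming PyMuPDF and pandas. The names `page_record` and `pdf_to_dataframe` are hypothetical, and numbering pages from 1 is an assumption, since the documentation does not say whether 'Page' is zero-based.

```python
COLUMNS = ["Page", "Content", "Images"]


def page_record(page_number, text, images):
    """Build one row in the documented column layout (hypothetical helper)."""
    return {"Page": page_number, "Content": text, "Images": list(images)}


def pdf_to_dataframe(content):
    """Load page text and embedded images into a DataFrame (illustrative)."""
    import fitz  # PyMuPDF, imported lazily
    import pandas as pd

    if isinstance(content, (bytes, bytearray)):
        doc = fitz.open(stream=bytes(content), filetype="pdf")
    else:
        doc = fitz.open(content)

    records = []
    try:
        for page in doc:
            # Each get_images() entry starts with the image's xref number;
            # extract_image() returns a dict whose "image" key holds the bytes.
            images = [
                doc.extract_image(info[0])["image"]
                for info in page.get_images(full=True)
            ]
            # Assumption: 1-based page numbers (page.number is 0-based).
            records.append(page_record(page.number + 1, page.get_text(), images))
    finally:
        doc.close()
    return pd.DataFrame(records, columns=COLUMNS)
```

Keeping the image bytes as a list in one cell matches the documented 'Images' column, although it makes the DataFrame heavy for image-rich PDFs.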