pyalma.pdfreader

module documentation

Undocumented

Function	`get_doc`	Opens a PDF using PyMuPDF (fitz).
Function	`read_pdf_as_text`	Extracts plain text from all pages of a PDF.
Function	`read_pdf_to_dataframe`	Extracts structured data from a PDF and loads it into a DataFrame.

def get_doc(content):

Opens a PDF using PyMuPDF (fitz).

:param content: File path (str) or binary stream (bytes).
:type content: str | bytes

:return: A PyMuPDF Document object.
:rtype: fitz.Document

:raises ValueError: If content is neither a valid path nor a binary stream.
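The path-or-bytes dispatch described above might look roughly like the sketch below. It is not the module's actual source: `classify_content` and `open_pdf` are hypothetical names, and treating "valid path" as "an existing file" is an assumption. The two `fitz.open` call forms are the standard PyMuPDF ways to open a file and an in-memory stream.

```python
import os


def classify_content(content):
    """Return "path" or "stream", or raise ValueError (hypothetical helper)."""
    if isinstance(content, str):
        # Assumption: a "valid path" means a file that exists on disk.
        if not os.path.isfile(content):
            raise ValueError(f"not an existing file: {content!r}")
        return "path"
    if isinstance(content, (bytes, bytearray)):
        return "stream"
    raise ValueError(f"unsupported content type: {type(content).__name__}")


def open_pdf(content):
    """Open a PDF from a path or from bytes; assumes PyMuPDF is installed."""
    # Imported lazily so the validation above carries no third-party dependency.
    import fitz  # PyMuPDF

    if classify_content(content) == "path":
        return fitz.open(content)
    return fitz.open(stream=content, filetype="pdf")
```

Validating the input before touching PyMuPDF keeps the `ValueError` behaviour independent of the PDF library.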

def read_pdf_as_text(content):

Extracts plain text from all pages of a PDF.

Uses PyPDF for simple, reliable text extraction.

:param content: Binary PDF content.
:type content: bytes

:return: Full text content concatenated from all pages.
:rtype: str
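As an illustration of per-page extraction followed by concatenation, here is a minimal sketch. It assumes the `pypdf` package and a newline separator between pages; both are assumptions, since the documentation above does not state the separator or the exact package. `join_pages` and `pdf_bytes_to_text` are hypothetical names, not part of `pyalma.pdfreader`.

```python
import io


def join_pages(page_texts, sep="\n"):
    """Concatenate per-page text, treating missing text as an empty string."""
    return sep.join(text or "" for text in page_texts)


def pdf_bytes_to_text(content: bytes) -> str:
    """Extract text from every page of an in-memory PDF (assumes pypdf)."""
    # Imported lazily so join_pages can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(content))
    return join_pages(page.extract_text() for page in reader.pages)
```

Because the function takes bytes, a file on disk would first be read with `open(path, "rb").read()` before being passed in.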

def read_pdf_to_dataframe(content):

Extracts structured data from a PDF and loads it into a DataFrame.

Each row includes the fields 'Page', 'Content', and 'Images'.

:param content: Path to PDF or binary content.
:type content: str | bytes

:return: DataFrame with columns ['Page', 'Content', 'Images'].
:rtype: pd.DataFrame
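To make the documented column layout concrete, the sketch below builds rows with the three column names and hands them to pandas. Only the column names come from the documentation; one row per page, 1-based page numbers, and a list of image references in 'Images' are assumptions. `build_rows` and `rows_to_dataframe` are hypothetical helpers.

```python
COLUMNS = ["Page", "Content", "Images"]


def build_rows(pages):
    """Build one row dict per page.

    `pages` is an iterable of (text, images) tuples; this input shape and the
    1-based page numbering are assumptions made for illustration only.
    """
    rows = []
    for number, (text, images) in enumerate(pages, start=1):
        rows.append({"Page": number, "Content": text, "Images": list(images)})
    return rows


def rows_to_dataframe(rows):
    """Load the row dicts into a DataFrame with the documented columns."""
    # Imported lazily so build_rows can be used without pandas installed.
    import pandas as pd

    return pd.DataFrame(rows, columns=COLUMNS)
```

With a DataFrame of this shape, a caller could select pages that contain images by filtering on the length of the 'Images' entries.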