module documentation

Undocumented

Function get_doc Opens a PDF using PyMuPDF (fitz).
Function read_pdf_as_text Extracts plain text from all pages of a PDF.
Function read_pdf_to_dataframe Extracts structured data from a PDF and loads it into a DataFrame.
def get_doc(content):

Opens a PDF using PyMuPDF (fitz).

:param content: File path (str) or binary stream (bytes).
:type content: str | bytes

:return: A PyMuPDF Document object.
:rtype: fitz.Document

:raises ValueError: If content is neither a valid path nor a binary stream.
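The path-or-bytes dispatch described above could be sketched roughly as follows. This is not the module's actual implementation: `classify_pdf_input` and `open_pdf` are hypothetical names, and the check that a string must point at an existing file is an assumption about what "valid path" means here.

```python
import os


def classify_pdf_input(content):
    """Return "stream" for raw bytes or "path" for an existing file path.

    Hypothetical helper; raises ValueError for any other input.
    """
    if isinstance(content, bytes):
        return "stream"
    if isinstance(content, str) and os.path.isfile(content):
        return "path"
    raise ValueError("content must be an existing file path or PDF bytes")


def open_pdf(content):
    """Open a PDF with PyMuPDF from a path or from in-memory bytes."""
    kind = classify_pdf_input(content)  # validate before touching PyMuPDF
    import fitz  # PyMuPDF; imported lazily so validation needs no dependency

    if kind == "stream":
        # In-memory data must be passed via `stream`, with an explicit type.
        return fitz.open(stream=content, filetype="pdf")
    return fitz.open(content)
```

Validating first keeps the `ValueError` path independent of whether PyMuPDF can parse the input.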

def read_pdf_as_text(content):

Extracts plain text from all pages of a PDF.

Uses PyPDF for simple, reliable text extraction.

:param content: Binary PDF content.
:type content: bytes

:return: Full text content concatenated from all pages.
:rtype: str
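A minimal sketch of page-by-page extraction with pypdf, assuming the bytes are wrapped in an in-memory buffer and page texts are joined with newlines. The function names and the separator are assumptions; the documentation does not say how pages are concatenated.

```python
import io


def join_page_texts(pages, separator="\n"):
    """Concatenate page texts; a page with no extractable text adds ''."""
    return separator.join((page.extract_text() or "") for page in pages)


def pdf_bytes_to_text(content):
    """Extract plain text from every page of a PDF given as bytes."""
    from pypdf import PdfReader  # imported lazily; requires pypdf

    reader = PdfReader(io.BytesIO(content))  # PdfReader accepts a file-like object
    return join_page_texts(reader.pages)
```

The `or ""` guard covers pages for which `extract_text()` yields nothing, such as scanned pages without a text layer.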

def read_pdf_to_dataframe(content):

Extracts structured data from a PDF and loads it into a DataFrame.

Each row includes:

  • Page number
  • Text content
  • List of extracted images (as bytes)

:param content: Path to PDF or binary content.
:type content: str | bytes

:return: DataFrame with columns ['Page', 'Content', 'Images'].
:rtype: pd.DataFrame
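One way the per-page rows might be assembled with PyMuPDF and pandas is sketched below. The names `collect_page_rows` and `pdf_to_dataframe` are hypothetical, and the 1-based page numbering is an assumption, since the documentation does not state where numbering starts.

```python
def collect_page_rows(doc):
    """Build one dict per page with its number, text, and image bytes."""
    rows = []
    for number, page in enumerate(doc, start=1):  # 1-based numbering assumed
        images = [
            # Each image entry is a tuple whose first item is the xref.
            doc.extract_image(info[0])["image"]
            for info in page.get_images(full=True)
        ]
        rows.append({"Page": number, "Content": page.get_text(), "Images": images})
    return rows


def pdf_to_dataframe(content):
    """Load a PDF (path or bytes) into a DataFrame with one row per page."""
    import fitz  # PyMuPDF
    import pandas as pd

    if isinstance(content, bytes):
        doc = fitz.open(stream=content, filetype="pdf")
    else:
        doc = fitz.open(content)
    with doc:  # close the document once the rows are collected
        rows = collect_page_rows(doc)
    return pd.DataFrame(rows, columns=["Page", "Content", "Images"])
```

Passing `columns` explicitly keeps the three columns present even for a PDF with no pages.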