---
title: Medical Document Summarizer
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.23.2
app_file: app.py
pinned: false
license: mit
short_description: Upload your files and get a brief summary!
---

# Medical Document Summarizer

This project automatically extracts and summarizes key information from clinical trial documents (e.g., PDF files of research articles) using transformer-based NLP models. The pipeline uses the BigBird-Pegasus model for long-form summarization and includes content filtering, text cleaning, and post-processing to produce concise bullet-point and paragraph summaries.

## Features

*Note*: Upload the medical document to the file directory before running the model.

- **PDF Extraction:** Reads and filters PDF files to capture only pages with core content (e.g., Abstract, Methods, Results, Conclusions).
- **Text Cleaning:** Removes noisy metadata, citations, and excess whitespace.
- **Core Section Extraction:** Attempts to identify and extract important sections using regex; falls back to header removal when sections are not detected.
- **Chunking & Summarization:** Splits the text into manageable chunks and runs the BigBird-Pegasus summarization model on each chunk.
- **Post-Processing:** Formats the final summary into bullet points and neatly wraps it into a paragraph.
- **Modular and Extensible:** Each step is modular, making it easy to adjust, extend, or integrate with other systems.

## Requirements

- Python 3.7+
- [spaCy](https://spacy.io/) with the `en_core_web_sm` model
- [NLTK](https://www.nltk.org/) (with the `punkt` tokenizer)
- [Transformers](https://huggingface.co/transformers/)
- [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)
- [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Installation

1. **Clone the repository:**

   ```bash
   git clone https://github.com/yourusername/Medical_Doc_Summarization.git
   cd Medical_Doc_Summarization
   ```