Llama-3.2-11B-Vision-Surgical-CholecT50 Overview
Description
Llama-3.2-11B-Vision-Surgical-CholecT50 is a multimodal large language model fine-tuned on the CholecT50 dataset of laparoscopic cholecystectomy procedures to recognize and describe surgical actions. The model is designed to help identify instruments, actions (verbs), and targets in endoscopic video frames for research in surgical workflow analysis and fine-grained action recognition. Multi-turn questions and answers were generated from the computer-vision ground truth of each image frame using GPT-4.
This model is for research and development only.
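The exact GPT-4 prompting pipeline used to produce these multi-turn question-answer pairs has not been published. The sketch below only illustrates the general idea of turning a ground-truth <instrument, verb, target> triplet into dialogue text; the prompt wording, the `qa_from_triplet` helper, and the use of the standard OpenAI Python client are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def qa_from_triplet(instrument: str, verb: str, target: str) -> str:
    """Ask GPT-4 to turn one ground-truth triplet into multi-turn Q&A text.

    The prompt wording is an assumption for illustration; the actual
    generation pipeline used for this model is not published.
    """
    prompt = (
        "You are annotating a frame from a laparoscopic cholecystectomy. "
        f"The ground truth is: instrument={instrument}, verb={verb}, target={target}. "
        "Write a short multi-turn Q&A dialogue about this frame that stays "
        "consistent with the ground truth."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(qa_from_triplet("grasper", "retract", "gallbladder"))
```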
License/Terms of Use
This release is intended for non-commercial, research-based usage related to surgical vision tasks. For licensing or usage inquiries beyond research use, please refer to the base Llama 3.2 Community License and any applicable requirements from the CholecT50 data providers.
For this model, developers must ensure compliance with all relevant laws (including data privacy and patient confidentiality), especially in healthcare or medical contexts. This model does not constitute medical advice or diagnosis and must be validated by professional clinicians or domain experts prior to any practical deployment in a clinical environment.
Deployment Geography
Global
Use Case
Primarily intended for surgical researchers, healthcare AI developers, or academic institutions exploring laparoscopic action recognition and surgical workflow analytics.
Release Date
- Hugging Face: 06/04/2025
Reference(s)
- Twinanda, A. P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., & Padoy, N. (2016). EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1), 86–97.
- Nwoye, C. I., & Padoy, N. (2022). Data splits and metrics for benchmarking methods on surgical action triplet datasets. arXiv preprint arXiv:2204.05235.
- Link to Meta’s Llama 3.2 Model Card
Model Architecture
- Architecture Type: Transformer-based Large Language Model with a Vision Adapter
- Network Architecture: Llama-3.2-11B-Vision (with cross-attention layers for image features)
This model was developed based on Llama-3.2-11B-Vision and has 11 billion parameters in its text and vision components.
Standard initialization techniques inherited from Llama 3.2.
Fine-tuning hyperparameters (a minimal configuration sketch follows):
- Learning rate: ~1e-5 for the final layers (vision adapter and a few top LLM layers)
- Batch size: scaled to the limited dataset size
- Regularization: minimal, with dropout ~0.1
- Epochs: ~3 over the CholecT50 frames
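The training script is not published; the following is a minimal sketch of how the hyperparameters above could be expressed with Hugging Face Transformers. The base checkpoint id, the module-name patterns used for freezing (`multi_modal_projector`, `cross_attn`, top decoder layers), and the batch-size values are assumptions, not the authors' code.

```python
import torch
from transformers import MllamaForConditionalGeneration, TrainingArguments

# Base checkpoint (whether the base or Instruct variant was used is not stated).
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision",
    torch_dtype=torch.bfloat16,
)

# Freeze everything, then unfreeze only the vision adapter / cross-attention
# parameters and the top LLM layers. These name patterns are assumptions and
# may differ across transformers versions.
TRAINABLE_PATTERNS = ("multi_modal_projector", "cross_attn", "layers.38.", "layers.39.")
for name, param in model.named_parameters():
    param.requires_grad = any(pattern in name for pattern in TRAINABLE_PATTERNS)

# Hyperparameters as listed above: lr ~1e-5, ~3 epochs, small effective batch.
args = TrainingArguments(
    output_dir="llama32-vision-cholect50",
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=2,   # assumed value
    gradient_accumulation_steps=8,   # assumed value
    bf16=True,
    logging_steps=50,
)
# A Trainer (or SFTTrainer) with an image-text collator over the CholecT50
# QA pairs would be constructed here; that wiring is omitted.
```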
Input
- Input Type(s): Image (endoscopic frame), (Optional) Text Prompt
- Input Format(s): Red, Green, Blue (RGB) images or frames from laparoscopic video, String
- Input Parameters: Image: Two-Dimensional (2D) laparoscopic image frames (extracted at 1 frame per second (fps)); Text Prompt: One-Dimensional (1D)
- Other Properties Related to Input:
- Recommended resolution: 480p or higher to capture surgical instruments clearly.
- Pre-processing: Minimal resizing (e.g., 224×224) if required by the model’s vision encoder (see the preprocessing sketch below).
- Token limit for any accompanying text: Up to ~4k tokens if textual context is added.
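Pre-processing, as noted above, is minimal. A sketch of loading and resizing a frame with Pillow, assuming frames have already been extracted from the video at 1 fps (e.g., with ffmpeg) and using a hypothetical file name:

```python
from PIL import Image

def preprocess_frame(path: str, size: int | None = 224) -> Image.Image:
    """Load a laparoscopic frame, force RGB, and optionally resize.

    Resizing to 224x224 is only needed if the serving stack does not handle
    it; the processor shipped with Llama 3.2 Vision normally resizes inputs
    itself, so size=None is often fine.
    """
    frame = Image.open(path).convert("RGB")
    if size is not None:
        frame = frame.resize((size, size))
    return frame

# Hypothetical frame extracted at 1 fps from a 480p (or higher) video.
frame = preprocess_frame("video01_t0042.png")
```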
Output
- Output Type(s): Text
- Output Format: Natural language or short descriptors (e.g., “Grasping forceps removing gallbladder tissue”)
- Output Parameters: One-Dimensional (1D)
- Other Properties Related to Output:
- The model returns a textual token sequence describing recognized surgical actions, instruments, or answering surgery-related queries.
- By default, no bounding box or segmentation map is returned—only text describing the surgical scene or answering questions.
- No additional post-processing is strictly required; however, downstream systems may parse or transform the text output for further analytics (a parsing sketch is shown after this list).
- NVIDIA GPUs or equivalent GPU-accelerated hardware can significantly reduce inference time compared to CPU-only deployments.
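Because the output is free-form text, downstream parsing is left to the integrating system. One option is to prompt the model for a fixed `instrument: ...; verb: ...; target: ...` layout and extract the fields with a regular expression. This layout is an assumption for illustration, not a format the model is guaranteed to emit.

```python
import re
from typing import Optional

# Assumed output layout, requested via the prompt, e.g.
# "Answer as: instrument: <...>; verb: <...>; target: <...>"
TRIPLET_PATTERN = re.compile(
    r"instrument:\s*(?P<instrument>[^;]+);\s*"
    r"verb:\s*(?P<verb>[^;]+);\s*"
    r"target:\s*(?P<target>.+)",
    re.IGNORECASE,
)

def parse_triplet(text: str) -> Optional[dict]:
    """Extract an <instrument, verb, target> triplet from the model's text output."""
    match = TRIPLET_PATTERN.search(text)
    return {k: v.strip() for k, v in match.groupdict().items()} if match else None

print(parse_triplet("instrument: grasper; verb: retract; target: gallbladder"))
# -> {'instrument': 'grasper', 'verb': 'retract', 'target': 'gallbladder'}
```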
Software Integration
- Runtime Engine(s): Deployable via standard LLM frameworks (e.g., Hugging Face Transformers).
Supported Hardware Microarchitecture Compatibility:
- H100/A100 or other modern GPU architectures are recommended for faster inference.
Operating System(s):
- Linux-based distributions for typical research pipelines
- Others possible if the underlying framework is supported
This AI model can be integrated as an API call within a standard multimodal pipeline (e.g., a Python service that feeds frames to the model, obtains textual responses, and merges them into a surgical analysis application).
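A minimal sketch of such a pipeline step with Hugging Face Transformers is shown below. The repository id is a placeholder for wherever this checkpoint is actually hosted, the frame path is hypothetical, and the prompt is only an example.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_ID = "nvidia/Llama-3.2-11B-Vision-Surgical-CholecT50"  # placeholder repo id

model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def describe_frame(frame: Image.Image, question: str) -> str:
    """Feed one laparoscopic frame plus a question; return the model's text answer."""
    messages = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": question}],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(frame, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    # Drop the echoed prompt tokens and decode only the generated answer.
    return processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

frame = Image.open("video01_t0042.png").convert("RGB")  # hypothetical frame path
print(describe_frame(frame, "Identify the instrument and action in this frame."))
```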
Model Version(s)
v1.0 (fine-tuned on CholecT50)
Developers may integrate the model into an AI system by providing laparoscopic frames as input along with optional textual prompts (e.g., “Identify the instrument and action in this frame”). The output is textual. Additional scaffolding for bounding box or segmentation tasks is not included but can be implemented downstream if desired.
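As one possible integration path, the `describe_frame` helper sketched above could be exposed as a small HTTP endpoint. The sketch below assumes FastAPI (plus python-multipart for form uploads) is installed and that the helper is importable from a hypothetical `surgical_vlm` module.

```python
import io

from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image

from surgical_vlm import describe_frame  # hypothetical module wrapping the snippet above

app = FastAPI(title="Surgical frame description service")

@app.post("/describe")
async def describe(
    frame: UploadFile = File(...),
    prompt: str = Form("Identify the instrument and action in this frame."),
) -> dict:
    """Accept one laparoscopic frame upload and return the model's text description."""
    image = Image.open(io.BytesIO(await frame.read())).convert("RGB")
    return {"description": describe_frame(image, prompt)}

# Run with: uvicorn service:app --host 0.0.0.0 --port 8000
```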
Training, Testing, and Evaluation Datasets
- Total size: ~50 endoscopic videos from the CholecT50 dataset, extracted at 1 fps, with multiple annotated frames per video.
- Dataset partition: 80% training, 10% validation, 10% test (approximate).
- Time period of data collection: The CholecT50 dataset is derived from surgeries performed in Strasbourg, France; the original videos come from Cholec80 and its in-house expansion (Cholec120).
Training Dataset
- Link: CholecT50 dataset
- Data Collection Method: Hybrid: Automated, Human (instrument bounding boxes, action triplets).
- Labeling Method: Human
- Properties:
- ~50 laparoscopic surgery videos, 1 fps frames, ~10K–15K frames used for training.
- Human experts annotated surgical instruments, actions, and targets. Annotations are <instrument, verb, target> triplets plus bounding-box info for 5 videos.
Testing Dataset
- Link: Same CholecT50 dataset (held-out portion).
- Data Collection Method: Hybrid: Automated, Human
- Labeling Method: Human
- Properties:
- ~1–2K frames held out for testing.
Evaluation Dataset
- Link: Same CholecT50 (a dedicated set of frames never seen during training).
- Benchmark Score: F1-score for triplet components: Instrument: 0.81, Verb: 0.64, Target (Anatomy): 0.60 (a minimal computation sketch is shown below)
- Data Collection Method: Hybrid: Automated, Human
- Labeling Method: Human
- Properties:
- ~1–2K frames for final evaluation.
Key performance: The model can generate detailed descriptions for instruments, actions, and surgical targets.
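The card does not state how these F1-scores were averaged. A minimal sketch of one plausible per-component computation (macro-averaged F1 with scikit-learn over parsed predictions vs. ground truth) is shown below; the labels are hypothetical.

```python
from sklearn.metrics import f1_score

def component_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Macro-averaged F1 over the class labels of one triplet component."""
    return f1_score(y_true, y_pred, average="macro", zero_division=0)

# Hypothetical parsed predictions vs. CholecT50 ground truth for a few frames.
true_instruments = ["grasper", "hook", "clipper", "grasper"]
pred_instruments = ["grasper", "hook", "grasper", "grasper"]
print(f"Instrument F1: {component_f1(true_instruments, pred_instruments):.2f}")
```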
Inference
- Engine: Any standard LLM-serving solution (e.g., PyTorch with Triton Inference Server)
- Test Hardware: Tested on NVIDIA RTX A6000 and GeForce RTX 3090 GPUs.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.