---
license: llama3.1
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- medical
- clinical
model-index:
- name: J1-8B
  results:
  - task:
      type: text-generation
    dataset:
      name: MedQA
      type: MedQA
    metrics:
    - name: Accuracy
      type: accuracy
      value: 79.34
  - task:
      type: text-generation
    dataset:
      name: PubMedQA
      type: PubMedQA
    metrics:
    - name: Accuracy
      type: accuracy
      value: 81.00
---
# Juvoly J1

## Model Details

- Base Model: Llama 3.1-8B-Instruct
- Training: 2,688 NVIDIA B200 GPU-hours
- Dataset: 100B tokens of synthetic data based on CC-BY-licensed articles from PubMed
- Intended Use: Experimental
- Repository: [GitHub](https://github.com/juvoly/j1)
## Description
Juvoly J1 is an experimental clinical reasoning model designed for testing in limited healthcare environments. The model has been fine-tuned on a carefully curated dataset of synthetic PubMed articles to enhance its medical reasoning capabilities while maintaining the general knowledge from its Llama 3.1-8B-Instruct base.
This model represents our initial effort toward creating accessible, efficient clinical reasoning tools that can operate within modest computational constraints. With a parameter count of only 8B, J1 demonstrates strong performance on medical benchmarks while requiring significantly fewer resources than larger alternatives.
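For quick local experimentation, J1 loads like any other Llama 3.1 fine-tune. Below is a minimal sketch using Hugging Face `transformers`, assuming the `Juvoly/J1-Llama-8B-exp` checkpoint path from the evaluation command further down; the prompt and generation settings are illustrative, not the settings used for the benchmarks.

```python
# Minimal sketch: load J1 and ask a single medical question.
# Assumes the Juvoly/J1-Llama-8B-exp path from the Evaluation section;
# the prompt and max_new_tokens value are illustrative choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Juvoly/J1-Llama-8B-exp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Illustrative prompt; not taken from the evaluation harness.
messages = [
    {
        "role": "user",
        "content": (
            "A 54-year-old man presents with crushing chest pain radiating "
            "to the left arm. What is the most likely diagnosis?"
        ),
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```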
## Performance

### Comparable size (~8B parameters)
| Model | MedQA | PubMedQA | Avg. Tokens (MedQA) | Avg. Tokens (PubMedQA) |
|---|---|---|---|---|
| Juvoly J1 (this repo) | 79.34% | 81.00% | 2012 | 1214 |
| Qwen3-8B | 75.81% | 79.50% | 2608 | 894 |
| HuatuoGPT-o1-8B | 72.60% | 79.20% | - | - |
| Delphyr M1 | 64.70% | 76.80% | - | - |
| Llama 3.1-8B-Instruct | 58.70% | 75.20% | - | - |
### Larger models

| Model | MedQA | PubMedQA | Avg. Tokens (MedQA) | Avg. Tokens (PubMedQA) |
|---|---|---|---|---|
| GPT-4o | 88.00% | - | - | - |
| HuatuoGPT-o1-70B | 83.30% | 80.60% | - | - |
| DeepSeek V3 | 80.90% | - | - | - |
*Note: Both tables are sorted by MedQA performance in descending order.*
## Intended Use
This model is intended for experimental use only. It may be helpful for:
- Testing medical reasoning capabilities in controlled environments
- Research on smaller-scale medical language models
- Exploring the balance between model size and clinical reasoning performance
- Prototyping healthcare applications with reduced computational requirements
The model should not be used for actual medical decision-making or clinical applications without proper validation, oversight, and compliance with relevant regulations.
## Roadmap
We're actively developing our model ecosystem with several key initiatives underway:
- Multilingual Support: Expanding capabilities to support multiple languages for broader global access
- Specialized Domain Variants: Creating focused versions for specific medical specialties
- Enhanced Reasoning: Improving the model's ability to follow complex chains of medical logic
- Reduced Token Usage: Optimizing response generation for more efficient inference
## Evaluation

Steps to reproduce our evaluation results can be found in our [GitHub repository](https://github.com/juvoly/j1).

The experiments were run using:

```bash
python -m sglang.launch_server --port 8000 --model-path Juvoly/J1-Llama-8B-exp --tp-size 8 --mem-fraction-static 0.8
```
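Once launched, the sglang server exposes an OpenAI-compatible endpoint, so answers can be collected with the standard `openai` client. The following is a minimal sketch; the question and sampling settings are illustrative rather than the exact harness from the repository.

```python
# Minimal client sketch: query the sglang server launched above through its
# OpenAI-compatible /v1 endpoint. The prompt and sampling settings are
# illustrative; the actual evaluation harness lives in the GitHub repository.
from openai import OpenAI

# No real API key is needed for a local sglang server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Juvoly/J1-Llama-8B-exp",
    messages=[
        {
            "role": "user",
            "content": (
                "A 62-year-old woman on long-term prednisone presents with a "
                "femoral neck fracture after minimal trauma. What is the most "
                "likely underlying cause? Answer concisely."
            ),
        }
    ],
    temperature=0.0,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```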
## Limitations
- As an experimental model, J1 may produce incorrect or incomplete medical information
- The model has not undergone clinical validation or regulatory approval
- Performance varies across different medical domains and question types
- The model inherits limitations from the base Llama 3.1-8B-Instruct architecture
## Citation

If you use this model in your research, please cite our repository:

```bibtex
@software{juvoly_j1_2025,
  author = {Juvoly Team},
  title  = {J1: An Experimental Clinical Reasoning Model},
  year   = {2025},
  url    = {https://github.com/juvoly/j1}
}
```
## License
This model follows the license terms of the Llama 3.1-8B-Instruct base model, with additional terms for our fine-tuning process. Please refer to our repository for complete licensing information.