|
---
license: llama3.1
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- medical
- clinical
model-index:
- name: J1-8B
  results:
  - task:
      type: text-generation
    dataset:
      name: MedQA
      type: MedQA
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 79.34%
  - task:
      type: text-generation
    dataset:
      name: PubMedQA
      type: PubMedQA
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 81.00%
---
|
# Juvoly J1 |
|
|
|
## Model Details |
|
|
|
- **Base Model**: Llama 3.1-8B-Instruct |
|
- **Training**: 2,688 B200 GPU-hours
|
- **Dataset**: 100B tokens of synthetic data based on CC-BY articles from PubMed. |
|
- **Intended Use**: Experimental |
|
- **Repository**: [GitHub](https://github.com/juvoly/j1) |
|
|
|
## Description |
|
|
|
Juvoly J1 is an experimental clinical reasoning model designed for testing in limited healthcare environments. It has been fine-tuned on a curated dataset of synthetic articles derived from CC-BY-licensed PubMed publications to strengthen its medical reasoning while retaining the general knowledge of its Llama 3.1-8B-Instruct base.
|
|
|
This model is our first step toward accessible, efficient clinical reasoning tools that can operate within modest computational constraints. At only 8B parameters, J1 performs strongly on medical benchmarks while requiring significantly fewer resources than larger alternatives.
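To make "modest computational constraints" concrete, a back-of-the-envelope memory estimate for serving an 8B model in bf16 is sketched below. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are the published Llama 3.1-8B configuration and are assumptions here, not measurements from our deployment.

```python
# Rough serving-memory estimate for an 8B-parameter model in bf16.
# Architecture constants are assumed from Llama 3.1-8B's config.

PARAMS = 8e9
BYTES_PER_PARAM = 2              # bf16 = 2 bytes per weight
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Weights alone: ~16 GB in bf16.
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# KV cache per token: K and V tensors (factor 2) across all layers.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_PARAM

# A 4096-token context (a long MedQA answer plus prompt) as an example:
kv_gb = 4096 * kv_bytes_per_token / 1e9

print(f"weights ≈ {weights_gb:.0f} GB, KV cache for 4096 tokens ≈ {kv_gb:.2f} GB")
```

The takeaway is that the weights dominate: a single ~24 GB GPU comfortably fits the bf16 model plus cache, which larger 70B-class alternatives cannot claim.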
|
|
|
## Performance |
|
|
|
|
|
### Models of comparable size (~8B parameters)
|
| Model | MedQA | PubMedQA | Average Tokens per Question |
|-------|-------|----------|----------------------------|
| **Juvoly J1 (this repo)** | **79.34%** | **81.00%** | MedQA: **2012**, PubMedQA: 1214 |
| Qwen3-8B ([link](https://huggingface.co/Qwen/Qwen3-8B)) | 75.81% | 79.50% | MedQA: 2608, PubMedQA: **894** |
| HuatuoGPT-o1-8B ([link](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-8B)) | 72.6% | 79.2% | - |
| Delphyr M1 ([link](https://www.delphyr.ai/blog/delphyr-m1-best-in-class-medical-model)) | 64.7% | 76.8% | - |
| Llama 3.1-8B-Instruct ([link](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)) | 58.7% | 75.2% | - |
|
|
|
### Benchmarks on larger models |
|
| Model | MedQA | PubMedQA | Average Tokens per Question |
|-------|-------|----------|----------------------------|
| GPT-4o | 88% | - | - |
| HuatuoGPT-o1-70B ([link](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-70B)) | 83.3% | 80.6% | - |
| DeepSeek-V3 | 80.9% | - | - |
|
|
|
|
|
*Note: The tables are sorted by MedQA performance in descending order.* |
|
|
|
## Intended Use |
|
|
|
This model is intended for **experimental use only**. It may be helpful for: |
|
|
|
- Testing medical reasoning capabilities in controlled environments |
|
- Research on smaller-scale medical language models |
|
- Exploring the balance between model size and clinical reasoning performance |
|
- Prototyping healthcare applications with reduced computational requirements |
|
|
|
The model should not be used for actual medical decision-making or clinical applications without proper validation, oversight, and compliance with relevant regulations. |
|
|
|
## Roadmap |
|
|
|
We're actively developing our model ecosystem with several key initiatives underway: |
|
|
|
- **Multilingual Support**: Expanding capabilities to support multiple languages for broader global access |
|
- **Specialized Domain Variants**: Creating focused versions for specific medical specialties |
|
- **Enhanced Reasoning**: Improving the model's ability to follow complex chains of medical logic |
|
- **Reduced Token Usage**: Optimizing response generation for more efficient inference |
|
|
|
## Evaluation |
|
|
|
Steps to reproduce our evaluation results can be found in our [GitHub repository](https://github.com/juvoly/j1). |
|
|
|
The experiments were run using: |
|
```
python -m sglang.launch_server --port 8000 --model-path Juvoly/J1-Llama-8B-exp --tp-size 8 --mem-fraction-static 0.8
```
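The launched sglang server exposes an OpenAI-compatible endpoint, so a benchmark question can be sent as a standard chat-completions request. A minimal sketch follows; the sampling parameters are illustrative assumptions rather than our evaluation settings, and the actual HTTP call is left commented out since it requires the running server.

```python
def build_request(question: str, model: str = "Juvoly/J1-Llama-8B-exp") -> dict:
    # OpenAI-compatible chat-completions payload for the server above.
    # temperature/max_tokens are illustrative, not our benchmark settings.
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
        "max_tokens": 2048,
    }

payload = build_request(
    "A 54-year-old man presents with crushing chest pain. "
    "What is the most likely diagnosis?"
)

# With the server from the command above running locally:
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```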
|
|
|
## Limitations |
|
|
|
- As an experimental model, J1 may produce incorrect or incomplete medical information |
|
- The model has not undergone clinical validation or regulatory approval |
|
- Performance varies across different medical domains and question types |
|
- The model inherits limitations from the base Llama 3.1-8B-Instruct architecture |
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite our repository: |
|
|
|
```
@software{juvoly_j1_2025,
  author = {Juvoly Team},
  title = {J1: An Experimental Clinical Reasoning Model},
  year = {2025},
  url = {https://github.com/juvoly/j1}
}
```
|
|
|
## License |
|
|
|
This model follows the license terms of the Llama 3.1-8B-Instruct base model, with additional terms for our fine-tuning process. Please refer to our repository for complete licensing information. |