---
license: llama3.1
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- medical
- clinical
model-index:
- name: J1-8B
  results:
  - task:
      type: text-generation
    dataset:
      name: MedQA
      type: MedQA
    metrics:
    - name: Accuracy
      type: accuracy
      value: 79.34
  - task:
      type: text-generation
    dataset:
      name: PubMedQA
      type: PubMedQA
    metrics:
    - name: Accuracy
      type: accuracy
      value: 81.00
---
# Juvoly J1

## Model Details

- Base Model: Llama 3.1-8B-Instruct
- Training: 2,688 NVIDIA B200 GPU-hours
- Dataset: 100B tokens of synthetic data based on CC-BY-licensed articles from PubMed
- Intended Use: Experimental
- Repository: [GitHub](https://github.com/juvoly/j1)
## Description
Juvoly J1 is an experimental clinical reasoning model designed for testing in limited healthcare environments. The model has been fine-tuned on a carefully curated dataset of synthetic PubMed articles to enhance its medical reasoning capabilities while maintaining the general knowledge from its Llama 3.1-8B-Instruct base.
This model represents our initial effort toward creating accessible, efficient clinical reasoning tools that can operate within modest computational constraints. With a parameter count of only 8B, J1 demonstrates strong performance on medical benchmarks while requiring significantly fewer resources than larger alternatives.
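For quick local experimentation, J1 loads like any other Llama 3.1 fine-tune. Below is a minimal sketch using Hugging Face `transformers`, assuming the `Juvoly/J1-Llama-8B-exp` checkpoint path from the evaluation command further down; the prompt and generation settings are illustrative, not the settings used for the benchmarks.

```python
# Minimal sketch: load J1 and ask a single medical question.
# Assumes the Juvoly/J1-Llama-8B-exp path from the Evaluation section;
# the prompt and max_new_tokens value are illustrative choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Juvoly/J1-Llama-8B-exp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Illustrative prompt; not taken from the evaluation harness.
messages = [
    {
        "role": "user",
        "content": (
            "A 54-year-old man presents with crushing chest pain radiating "
            "to the left arm. What is the most likely diagnosis?"
        ),
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```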
## Performance

### Comparable size (~8B parameters)
| Model | MedQA | PubMedQA | Avg. Tokens (MedQA) | Avg. Tokens (PubMedQA) |
|---|---|---|---|---|
| Juvoly J1 (this repo) | 79.34% | 81.00% | 2012 | 1214 |
| Qwen3-8B | 75.81% | 79.50% | 2608 | 894 |
| HuatuoGPT-o1-8B | 72.60% | 79.20% | - | - |
| Delphyr M1 | 64.70% | 76.80% | - | - |
| Llama 3.1-8B-Instruct | 58.70% | 75.20% | - | - |
### Larger models

| Model | MedQA | PubMedQA | Avg. Tokens (MedQA) | Avg. Tokens (PubMedQA) |
|---|---|---|---|---|
| GPT-4o | 88.00% | - | - | - |
| HuatuoGPT-o1-70B | 83.30% | 80.60% | - | - |
| DeepSeek V3 | 80.90% | - | - | - |
*Note: Both tables are sorted by MedQA performance in descending order.*
## Intended Use
This model is intended for experimental use only. It may be helpful for:
- Testing medical reasoning capabilities in controlled environments
- Research on smaller-scale medical language models
- Exploring the balance between model size and clinical reasoning performance
- Prototyping healthcare applications with reduced computational requirements
The model should not be used for actual medical decision-making or clinical applications without proper validation, oversight, and compliance with relevant regulations.
## Roadmap
We're actively developing our model ecosystem with several key initiatives underway:
- Multilingual Support: Expanding capabilities to support multiple languages for broader global access
- Specialized Domain Variants: Creating focused versions for specific medical specialties
- Enhanced Reasoning: Improving the model's ability to follow complex chains of medical logic
- Reduced Token Usage: Optimizing response generation for more efficient inference
## Evaluation

Steps to reproduce our evaluation results can be found in our [GitHub repository](https://github.com/juvoly/j1).

The experiments were run using:

```bash
python -m sglang.launch_server --port 8000 --model-path Juvoly/J1-Llama-8B-exp --tp-size 8 --mem-fraction-static 0.8
```
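Once launched, the sglang server exposes an OpenAI-compatible endpoint, so answers can be collected with the standard `openai` client. The following is a minimal sketch; the question and sampling settings are illustrative rather than the exact harness from the repository.

```python
# Minimal client sketch: query the sglang server launched above through its
# OpenAI-compatible /v1 endpoint. The prompt and sampling settings are
# illustrative; the actual evaluation harness lives in the GitHub repository.
from openai import OpenAI

# No real API key is needed for a local sglang server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Juvoly/J1-Llama-8B-exp",
    messages=[
        {
            "role": "user",
            "content": (
                "A 62-year-old woman on long-term prednisone presents with a "
                "femoral neck fracture after minimal trauma. What is the most "
                "likely underlying cause? Answer concisely."
            ),
        }
    ],
    temperature=0.0,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```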
## Limitations
- As an experimental model, J1 may produce incorrect or incomplete medical information
- The model has not undergone clinical validation or regulatory approval
- Performance varies across different medical domains and question types
- The model inherits limitations from the base Llama 3.1-8B-Instruct architecture
## Citation

If you use this model in your research, please cite our repository:

```bibtex
@software{juvoly_j1_2025,
  author = {Juvoly Team},
  title  = {J1: An Experimental Clinical Reasoning Model},
  year   = {2025},
  url    = {https://github.com/juvoly/j1}
}
```
## License
This model follows the license terms of the Llama 3.1-8B-Instruct base model, with additional terms for our fine-tuning process. Please refer to our repository for complete licensing information.