|
---
license: llama3.1
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- medical
- clinical
model-index:
- name: J1-8B
  results:
  - task:
      type: text-generation
    dataset:
      name: MedQA
      type: MedQA
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 79.34%
  - task:
      type: text-generation
    dataset:
      name: PubMedQA
      type: PubMedQA
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 81.00%
---
|
# Juvoly J1 |
|
|
|
## Model Details |
|
|
|
- **Base Model**: Llama 3.1-8B-Instruct |
|
- **Training**: 2,688 B200 GPU-hours
|
- **Dataset**: 100B tokens of synthetic data based on CC-BY articles from PubMed. |
|
- **Intended Use**: Experimental |
|
- **Repository**: [GitHub](https://github.com/juvoly/j1) |
|
|
|
## Description |
|
|
|
Juvoly J1 is an experimental clinical reasoning model designed for testing in limited healthcare environments. It has been fine-tuned on a curated dataset of synthetic articles derived from CC-BY-licensed PubMed publications to strengthen its medical reasoning while retaining the general knowledge of its Llama 3.1-8B-Instruct base.
|
|
|
This model is our first step toward accessible, efficient clinical reasoning tools that can operate within modest computational constraints. At only 8B parameters, J1 performs strongly on medical benchmarks while requiring significantly fewer resources than larger alternatives.
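To make "modest computational constraints" concrete, a back-of-the-envelope memory estimate for serving an 8B model in bf16 is sketched below. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are the published Llama 3.1-8B configuration and are assumptions here, not measurements from our deployment.

```python
# Rough serving-memory estimate for an 8B-parameter model in bf16.
# Architecture constants are assumed from Llama 3.1-8B's config.

PARAMS = 8e9
BYTES_PER_PARAM = 2              # bf16 = 2 bytes per weight
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Weights alone: ~16 GB in bf16.
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# KV cache per token: K and V tensors (factor 2) across all layers.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_PARAM

# A 4096-token context (a long MedQA answer plus prompt) as an example:
kv_gb = 4096 * kv_bytes_per_token / 1e9

print(f"weights ≈ {weights_gb:.0f} GB, KV cache for 4096 tokens ≈ {kv_gb:.2f} GB")
```

The takeaway is that the weights dominate: a single ~24 GB GPU comfortably fits the bf16 model plus cache, which larger 70B-class alternatives cannot claim.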
|
|
|
## Performance |
|
|
|
|
|
### Models of comparable size (~8B parameters)
|
| Model | MedQA | PubMedQA | Average Tokens per Question |
|-------|-------|----------|----------------------------|
| **Juvoly J1 (this repo)** | **79.34%** | **81.00%** | MedQA: **2012**, PubMedQA: 1214 |
| Qwen3-8B ([link](https://huggingface.co/Qwen/Qwen3-8B)) | 75.81% | 79.50% | MedQA: 2608, PubMedQA: **894** |
| HuatuoGPT-o1-8B ([link](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-8B)) | 72.6% | 79.2% | - |
| Delphyr M1 ([link](https://www.delphyr.ai/blog/delphyr-m1-best-in-class-medical-model)) | 64.7% | 76.8% | - |
| Llama 3.1-8B-Instruct ([link](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)) | 58.7% | 75.2% | - |
|
|
|
### Benchmarks on larger models |
|
| Model | MedQA | PubMedQA | Average Tokens per Question |
|-------|-------|----------|----------------------------|
| GPT-4o | 88% | - | - |
| HuatuoGPT-o1-70B ([link](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-70B)) | 83.3% | 80.6% | - |
| DeepSeek-V3 | 80.9% | - | - |
|
|
|
|
|
*Note: The tables are sorted by MedQA performance in descending order.* |
|
|
|
## Intended Use |
|
|
|
This model is intended for **experimental use only**. It may be helpful for: |
|
|
|
- Testing medical reasoning capabilities in controlled environments |
|
- Research on smaller-scale medical language models |
|
- Exploring the balance between model size and clinical reasoning performance |
|
- Prototyping healthcare applications with reduced computational requirements |
|
|
|
The model should not be used for actual medical decision-making or clinical applications without proper validation, oversight, and compliance with relevant regulations. |
|
|
|
## Roadmap |
|
|
|
We're actively developing our model ecosystem with several key initiatives underway: |
|
|
|
- **Multilingual Support**: Expanding capabilities to support multiple languages for broader global access |
|
- **Specialized Domain Variants**: Creating focused versions for specific medical specialties |
|
- **Enhanced Reasoning**: Improving the model's ability to follow complex chains of medical logic |
|
- **Reduced Token Usage**: Optimizing response generation for more efficient inference |
|
|
|
## Evaluation |
|
|
|
Steps to reproduce our evaluation results can be found in our [GitHub repository](https://github.com/juvoly/j1). |
|
|
|
The experiments were run using: |
|
```
python -m sglang.launch_server --port 8000 --model-path Juvoly/J1-Llama-8B-exp --tp-size 8 --mem-fraction-static 0.8
```
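The launched sglang server exposes an OpenAI-compatible endpoint, so a benchmark question can be sent as a standard chat-completions request. A minimal sketch follows; the sampling parameters are illustrative assumptions rather than our evaluation settings, and the actual HTTP call is left commented out since it requires the running server.

```python
def build_request(question: str, model: str = "Juvoly/J1-Llama-8B-exp") -> dict:
    # OpenAI-compatible chat-completions payload for the server above.
    # temperature/max_tokens are illustrative, not our benchmark settings.
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
        "max_tokens": 2048,
    }

payload = build_request(
    "A 54-year-old man presents with crushing chest pain. "
    "What is the most likely diagnosis?"
)

# With the server from the command above running locally:
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# print(r.json()["choices"][0]["message"]["content"])
```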
|
|
|
## Limitations |
|
|
|
- As an experimental model, J1 may produce incorrect or incomplete medical information |
|
- The model has not undergone clinical validation or regulatory approval |
|
- Performance varies across different medical domains and question types |
|
- The model inherits limitations from the base Llama 3.1-8B-Instruct architecture |
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite our repository: |
|
|
|
```
@software{juvoly_j1_2025,
  author = {Juvoly Team},
  title = {J1: An Experimental Clinical Reasoning Model},
  year = {2025},
  url = {https://github.com/juvoly/j1}
}
```
|
|
|
## License |
|
|
|
This model follows the license terms of the Llama 3.1-8B-Instruct base model, with additional terms for our fine-tuning process. Please refer to our repository for complete licensing information. |