---
license: apache-2.0
language:
- ja
pipeline_tag: text-generation
tags:
- RAG
---

# Kurage
Kurage is a multipurpose RAG model from Lightblue.
This version of the model has been trained to perform RAG in Japanese.
## Features / How to use
- **Multi-chunk RAG**: The model can take multiple contexts and a question as input; it first outputs references to the relevant contexts and then an answer to the question (see the usage sketch after this list).
- **Single-chunk RAG**: The model can also take a single context and a question as input; it determines whether the question can be answered from that context and outputs an answer only when it can. This allows multiple contexts to be processed in parallel.
- **Answer extension**: By default, the model is trained to output the shortest possible answer to a question. If you require a longer answer, you can prompt the model to write one by writing " <>" after your question.
- **Multilinguality**: We have trained the model to answer questions in Japanese based on texts in other languages too!
- **Q&A generation**: The model can also generate questions and answers from a piece of text (see the second sketch below). This can be useful for pre-indexing a database or for fine-tuning IR models that will then be used for RAG.
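Below is a minimal sketch of single-chunk RAG inference with Hugging Face `transformers`. The repository id (`lightblue/kurage-ja`) and the prompt layout (one context chunk followed by the question) are assumptions for illustration, not the documented template for this model; multi-chunk RAG would follow the same pattern with several context chunks placed before the question.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lightblue/kurage-ja"  # assumed repository id; replace with this model's actual id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

context = "烏龍茶は中国茶の一種で、茶葉を半発酵させて作られる。"
question = "烏龍茶はどのように作られますか？"

# Assumed layout: the context chunk, a blank line, then the question.
messages = [{"role": "user", "content": f"{context}\n\n{question}"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```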
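Q&A generation follows the same pattern. The sketch below uses the `transformers` text-generation pipeline; the instruction wording is an assumption for illustration, not this model's documented prompt.

```python
from transformers import pipeline

# Assumed repository id; replace with this model's actual id.
generator = pipeline("text-generation", model="lightblue/kurage-ja", device_map="auto")

passage = "リトマス試験紙は、溶液が酸性かアルカリ性かを調べるために使われる。"
# Assumed instruction wording for the Q&A generation mode.
prompt = f"次の文章に基づいて質問と回答のペアを作成してください。\n\n{passage}"

result = generator([{"role": "user", "content": prompt}], max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])
```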
## Training data
We trained on chunks sourced from documents in the MADLAD-400 dataset that a state-of-the-art LLM had judged to contain a high amount of educational information.
From each document, we randomly took chunks of 250, 500, and 1,000 tokens.
We then used a state-of-the-art LLM to generate questions and answers from these chunks.
Finally, we selected negatives for each chunk based on the similarity of dense embeddings from the BAAI/bge-m3 model.
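As a rough illustration of that last step, the sketch below embeds chunks with BAAI/bge-m3 and, for each chunk, picks the most similar other chunks as negatives. Loading the model through `sentence-transformers` and the number of negatives per chunk are assumptions for illustration, not a description of our exact pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-m3")
chunks = [
    "チャンク1のテキスト ...",
    "チャンク2のテキスト ...",
    "チャンク3のテキスト ...",
]

# Normalised dense embeddings, so the dot product equals cosine similarity.
emb = embedder.encode(chunks, normalize_embeddings=True)
sim = emb @ emb.T
np.fill_diagonal(sim, -np.inf)  # a chunk is never its own negative

k = 1  # number of negatives per chunk (assumed)
for i, chunk in enumerate(chunks):
    negatives = np.argsort(-sim[i])[:k]
    print(i, "->", negatives.tolist())
```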