Update FlexRAG retriever
- README.md +66 -2
- config.yaml +7 -4
- corpus.jsonl +3 -0
- corpus.mmindex.json +2 -2
- params.index.json +1 -1
- retriever.id +1 -0
README.md
CHANGED
@@ -9,9 +9,28 @@ tags:
|
|
9 |
- RAG
|
10 |
---
|
11 |
|
12 |
-
#
|
13 |
|
14 |
-
This is a BM25SRetriever created with the [`FlexRAG`](https://github.com/ictnlp/flexrag) library (version `0.1.8`).
|
15 |
|
16 |
## Installation
|
17 |
|
@@ -35,6 +54,51 @@ retriever = LocalRetriever.load_from_hub("FlexRAG/wiki2021_atlas_bm25s")
|
|
35 |
results = retriever.search("Who is Bruce Wayne?")
|
36 |
```
|
37 |
|
38 |
FlexRAG Related Links:
|
39 |
* 📚[Documentation](https://flexrag.readthedocs.io/en/latest/)
|
40 |
* 💻[GitHub Repository](https://github.com/ictnlp/flexrag)
|
|
|
9 |
- RAG
|
10 |
---
|
11 |
|
12 |
+
# The BM25SRetriever for the wiki2021 corpus
|
13 |
+
|
14 |
+
The corpus was created by the [Atlas](https://github.com/facebookresearch/atlas) project and the index was built using the [FlexRAG](https://github.com/ictnlp/flexrag) library.
|
15 |
+
|
16 |
+
| Corpus Attribute | Value |
|
17 |
+
| ---------------- | --------------------------------------------------------------- |
|
18 |
+
| Language | English |
|
19 |
+
| Domain | Wikipedia |
|
20 |
+
| Size | 37.5M (33.1M text, 4.3M infobox) |
|
21 |
+
| Dump Date | Dec 2021 |
|
22 |
+
| Provider         | [Atlas](https://github.com/facebookresearch/atlas)              |
|
23 |
+
| License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
|
24 |
+
|
25 |
+
|
26 |
+
| Index Attribute | Value |
|
27 |
+
| --------------- | --------------------------------------------------------------- |
|
28 |
+
| Index Type | BM25S |
|
29 |
+
| Index Method | Lucene |
|
30 |
+
| Preprocessing | LengthFilter(min_char=10, max_char=4096) |
|
31 |
+
| Provider        | [FlexRAG](https://github.com/ictnlp/flexrag)                    |
|
32 |
+
| License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
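The `Preprocessing` row above says passages were filtered by character length before indexing. A minimal sketch of such a filter follows, assuming both bounds are inclusive and that length is measured with Python's `len`. The function name `length_filter` is hypothetical and is not FlexRAG's implementation.

```python
def length_filter(text: str, min_char: int = 10, max_char: int = 4096) -> bool:
    """Return True when the passage length in characters is within the bounds.

    Assumption: both bounds are inclusive; the real LengthFilter may differ.
    """
    return min_char <= len(text) <= max_char


passages = [
    "too short",                               # 9 characters, dropped
    "A passage that is long enough to keep.",  # kept
    "x" * 5000,                                # longer than max_char, dropped
]
kept = [p for p in passages if length_filter(p)]
```

Under these assumptions only the second passage would be indexed.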
|
33 |
|
|
|
34 |
|
35 |
## Installation
|
36 |
|
|
|
54 |
results = retriever.search("Who is Bruce Wayne?")
|
55 |
```
|
56 |
|
57 |
+
## Running the RAG application with the retriever
|
58 |
+
|
59 |
+
You can run the **GUI application** of the RAG assistant with this retriever. Here is an example:
|
60 |
+
|
61 |
+
```bash
|
62 |
+
python -m flexrag.entrypoints.run_interactive \
|
63 |
+
assistant_type=modular \
|
64 |
+
modular_config.used_fields=[title,text] \
|
65 |
+
modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
|
66 |
+
modular_config.response_type=original \
|
67 |
+
modular_config.generator_type=openai \
|
68 |
+
modular_config.openai_config.model_name='gpt-4o-mini' \
|
69 |
+
modular_config.openai_config.api_key=$OPENAI_KEY \
|
70 |
+
modular_config.do_sample=False
|
71 |
+
```
|
72 |
+
|
73 |
+
You can also run **FlexRAG's RAG evaluation pipeline** with this retriever. The following example evaluates the **ModularAssistant** with this retriever on the *Natural Questions* test split:
|
74 |
+
|
75 |
+
```bash
|
76 |
+
OUTPUT_PATH=<path_to_output>
|
77 |
+
DB_PATH=<path_to_database>
|
78 |
+
OPENAI_KEY=<your_openai_key>
|
79 |
+
|
80 |
+
python -m flexrag.entrypoints.run_assistant \
|
81 |
+
name=nq \
|
82 |
+
split=test \
|
83 |
+
output_path=${OUTPUT_PATH} \
|
84 |
+
assistant_type=modular \
|
85 |
+
modular_config.used_fields=[title,text] \
|
86 |
+
modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
|
87 |
+
modular_config.generator_type=openai \
|
88 |
+
modular_config.openai_config.model_name='gpt-4o-mini' \
|
89 |
+
modular_config.openai_config.api_key=$OPENAI_KEY \
|
90 |
+
modular_config.do_sample=False \
|
91 |
+
eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
|
92 |
+
eval_config.retrieval_success_rate_config.context_preprocess.processor_type=[simplify_answer] \
|
93 |
+
eval_config.retrieval_success_rate_config.eval_field=text \
|
94 |
+
eval_config.response_preprocess.processor_type=[simplify_answer]
|
95 |
+
```
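To make the three metric names in the command above more concrete, here is a rough sketch of how exact match, token-level F1 and retrieval success are commonly computed in open-domain QA. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles). The helper names are hypothetical, and FlexRAG's `simplify_answer` processor and metric implementations may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization (an assumption, not FlexRAG's exact processor)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in golds))


def f1(prediction: str, golds: list[str]) -> float:
    """Best token-level F1 between the prediction and any gold answer."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for gold in golds:
        gold_tokens = normalize(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def retrieval_success(contexts: list[str], golds: list[str]) -> float:
    """1.0 if any non-empty normalized gold answer occurs in any retrieved context."""
    norm_golds = [normalize(g) for g in golds if normalize(g)]
    return float(any(g in normalize(c) for c in contexts for g in norm_golds))
```

For instance, `f1("Bruce Wayne is Batman", ["Batman"])` has precision 1/4 and recall 1, giving an F1 of 0.4.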
|
96 |
+
|
97 |
+
## License
|
98 |
+
Because the corpus is distributed under the [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license, the retriever is released under the same license.
|
99 |
+
|
100 |
+
## Related Links
|
101 |
+
|
102 |
FlexRAG Related Links:
|
103 |
* 📚[Documentation](https://flexrag.readthedocs.io/en/latest/)
|
104 |
* 💻[GitHub Repository](https://github.com/ictnlp/flexrag)
|
config.yaml
CHANGED
@@ -1,6 +1,6 @@
|
|
1 |
-
log_interval:
|
2 |
top_k: 10
|
3 |
-
batch_size:
|
4 |
query_preprocess_pipeline:
|
5 |
processor_type: []
|
6 |
length_filter_config:
|
@@ -32,7 +32,7 @@ query_preprocess_pipeline:
|
|
32 |
hf_tokenizer_path: null
|
33 |
tiktok_tokenizer_name: null
|
34 |
lang: null
|
35 |
-
database_path:
|
36 |
method: lucene
|
37 |
idf_method: null
|
38 |
backend: auto
|
@@ -40,4 +40,7 @@ k1: 1.5
|
|
40 |
b: 0.75
|
41 |
delta: 0.5
|
42 |
lang: english
|
43 |
-
indexed_fields:
|
1 |
+
log_interval: 100000
|
2 |
top_k: 10
|
3 |
+
batch_size: 512
|
4 |
query_preprocess_pipeline:
|
5 |
processor_type: []
|
6 |
length_filter_config:
|
|
|
32 |
hf_tokenizer_path: null
|
33 |
tiktok_tokenizer_name: null
|
34 |
lang: null
|
35 |
+
database_path: /data/zhangzhuocheng/Lab/Python/LLM/datasets/RAG/wikipedia/wiki_2021/bm25s_lucene
|
36 |
method: lucene
|
37 |
idf_method: null
|
38 |
backend: auto
|
|
|
40 |
b: 0.75
|
41 |
delta: 0.5
|
42 |
lang: english
|
43 |
+
indexed_fields:
|
44 |
+
- title
|
45 |
+
- section
|
46 |
+
- text
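For readers unfamiliar with the `method: lucene`, `k1` and `b` settings in this config, the following is a small pure-Python sketch of the Lucene-style BM25 formula as it is commonly stated. It is an illustration under stated assumptions, not the code path used by BM25S or FlexRAG. The function name and the toy documents are hypothetical, and `delta` is left out because, as far as I know, it only applies to the BM25L and BM25+ variants.

```python
import math
from collections import Counter


def bm25_lucene_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Lucene-style BM25."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Length normalization term shared by all query terms for this document.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    ["bruce", "wayne", "batman"],
    ["clark", "kent", "superman"],
    ["gotham", "city"],
]
scores = bm25_lucene_scores(["bruce", "wayne"], docs)
```

With these toy documents only the first one shares terms with the query, so it is the only one with a non-zero score.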
|
corpus.jsonl
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:19ee000506e8b826a4a29076ead7e8978a656184c267b57732de52346710a411
|
3 |
+
size 24971177710
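The three added lines are a Git LFS pointer rather than the corpus itself: the pointer records the spec version, the SHA-256 object ID and the size in bytes of the real file. A short sketch of reading such a pointer follows. The helper name `parse_lfs_pointer` is hypothetical, and the pointer text is copied from the diff above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:19ee000506e8b826a4a29076ead7e8978a656184c267b57732de52346710a411
size 24971177710
"""
info = parse_lfs_pointer(pointer)
size_gb = round(info["size_bytes"] / 1e9, 2)  # decimal gigabytes
```

Here `size_gb` works out to about 24.97, so the corpus file is roughly 25 GB on disk.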
|
corpus.mmindex.json
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:c688cc6edacd0d497a977268659d421eb76b761399e7858601b81c0741c2be8c
|
3 |
+
size 433595094
|
params.index.json
CHANGED
@@ -9,4 +9,4 @@
|
|
9 |
"num_docs": 37507469,
|
10 |
"version": "0.2.1",
|
11 |
"backend": "numpy"
|
12 |
-
}
|
|
|
9 |
"num_docs": 37507469,
|
10 |
"version": "0.2.1",
|
11 |
"backend": "numpy"
|
12 |
+
}
|
retriever.id
ADDED
@@ -0,0 +1 @@
|
1 |
+
BM25SRetriever
|