FlexRAG committed on
Commit
6da1e1c
·
verified ·
1 Parent(s): eeb82a9

Update FlexRAG retriever

Files changed (6)
  1. README.md +66 -2
  2. config.yaml +7 -4
  3. corpus.jsonl +3 -0
  4. corpus.mmindex.json +2 -2
  5. params.index.json +1 -1
  6. retriever.id +1 -0
README.md CHANGED
@@ -9,9 +9,28 @@ tags:
9
  - RAG
10
  ---
11
 
12
- # FlexRAG Retriever
13
 
14
- This is a BM25SRetriever created with the [`FlexRAG`](https://github.com/ictnlp/flexrag) library (version `0.1.8`).
15
 
16
  ## Installation
17
 
@@ -35,6 +54,51 @@ retriever = LocalRetriever.load_from_hub("FlexRAG/wiki2021_atlas_bm25s")
35
  results = retriever.search("Who is Bruce Wayne?")
36
  ```
37
 
38
  FlexRAG Related Links:
39
  * 📚[Documentation](https://flexrag.readthedocs.io/en/latest/)
40
  * 💻[GitHub Repository](https://github.com/ictnlp/flexrag)
 
9
  - RAG
10
  ---
11
 
12
+ # The BM25SRetriever for the wiki2021 corpus
13
+
14
+ The corpus was created by the [Atlas](https://github.com/facebookresearch/atlas) project and the index was built using the [FlexRAG](https://github.com/ictnlp/flexrag) library.
15
+
16
+ | Corpus Attribute | Value |
17
+ | ---------------- | --------------------------------------------------------------- |
18
+ | Language | English |
19
+ | Domain | Wikipedia |
20
+ | Size | 37.5M (33.1M text, 4.3M infobox) |
21
+ | Dump Date | Dec 2021 |
22
+ | Provider | [Atlas](https://github.com/facebookresearch/atlas) |
23
+ | License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
24
+
25
+
26
+ | Index Attribute | Value |
27
+ | --------------- | --------------------------------------------------------------- |
28
+ | Index Type | BM25S |
29
+ | Index Method | Lucene |
30
+ | Preprocessing | LengthFilter(min_char=10, max_char=4096) |
31
+ | Provider | [FlexRAG](https://github.com/ictnlp/flexrag) |
32
+ | License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
33
 
 
34
 
35
  ## Installation
36
 
 
54
  results = retriever.search("Who is Bruce Wayne?")
55
  ```
56
 
57
+ ## Running the RAG application with the retriever
58
+
59
+ You can run the **GUI application** of the RAG assistant with this retriever. Here is an example:
60
+
61
+ ```bash
62
+ python -m flexrag.entrypoints.run_interactive \
63
+ assistant_type=modular \
64
+ modular_config.used_fields=[title,text] \
65
+ modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
66
+ modular_config.response_type=original \
67
+ modular_config.generator_type=openai \
68
+ modular_config.openai_config.model_name='gpt-4o-mini' \
69
+ modular_config.openai_config.api_key=$OPENAI_KEY \
70
+ modular_config.do_sample=False
71
+ ```
72
+
73
+ You can also run **FlexRAG's RAG evaluation pipeline** with this retriever. The following example evaluates the **ModularAssistant** with the retriever on the *Natural Questions* test split:
74
+
75
+ ```bash
76
+ OUTPUT_PATH=<path_to_output>
77
+ DB_PATH=<path_to_database>
78
+ OPENAI_KEY=<your_openai_key>
79
+
80
+ python -m flexrag.entrypoints.run_assistant \
81
+ name=nq \
82
+ split=test \
83
+ output_path=${OUTPUT_PATH} \
84
+ assistant_type=modular \
85
+ modular_config.used_fields=[title,text] \
86
+ modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
87
+ modular_config.generator_type=openai \
88
+ modular_config.openai_config.model_name='gpt-4o-mini' \
89
+ modular_config.openai_config.api_key=$OPENAI_KEY \
90
+ modular_config.do_sample=False \
91
+ eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
92
+ eval_config.retrieval_success_rate_config.context_preprocess.processor_type=[simplify_answer] \
93
+ eval_config.retrieval_success_rate_config.eval_field=text \
94
+ eval_config.response_preprocess.processor_type=[simplify_answer]
95
+ ```
96
+
97
+ ## License
98
+ Because the corpus is distributed under the [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license, the retriever is released under the same license.
99
+
100
+ ## Related Links
101
+
102
  FlexRAG Related Links:
103
  * 📚[Documentation](https://flexrag.readthedocs.io/en/latest/)
104
  * 💻[GitHub Repository](https://github.com/ictnlp/flexrag)
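
Note on the `Preprocessing` row added to the README: `LengthFilter(min_char=10, max_char=4096)` suggests that passages outside a character-length window were dropped before indexing. The following is a minimal sketch of that idea only. The function name `length_filter` is hypothetical, the inclusive bounds are an assumption, and this is not FlexRAG's actual implementation.

```python
from typing import Iterable, Iterator


def length_filter(
    texts: Iterable[str], min_char: int = 10, max_char: int = 4096
) -> Iterator[str]:
    """Yield only the texts whose character count lies in [min_char, max_char].

    Assumption: both bounds are inclusive; the real filter may differ.
    """
    for text in texts:
        if min_char <= len(text) <= max_char:
            yield text


# Hypothetical passages: one too short, one acceptable, one too long.
passages = ["too short", "A passage that is long enough to keep.", "x" * 5000]
kept = list(length_filter(passages))
```

Under these assumptions only the middle passage survives, since the first has 9 characters and the last has 5000.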
config.yaml CHANGED
@@ -1,6 +1,6 @@
1
- log_interval: 100
2
  top_k: 10
3
- batch_size: 32
4
  query_preprocess_pipeline:
5
  processor_type: []
6
  length_filter_config:
@@ -32,7 +32,7 @@ query_preprocess_pipeline:
32
  hf_tokenizer_path: null
33
  tiktok_tokenizer_name: null
34
  lang: null
35
- database_path: ./bm25s_lucene
36
  method: lucene
37
  idf_method: null
38
  backend: auto
@@ -40,4 +40,7 @@ k1: 1.5
40
  b: 0.75
41
  delta: 0.5
42
  lang: english
43
- indexed_fields: null
 
1
+ log_interval: 100000
2
  top_k: 10
3
+ batch_size: 512
4
  query_preprocess_pipeline:
5
  processor_type: []
6
  length_filter_config:
 
32
  hf_tokenizer_path: null
33
  tiktok_tokenizer_name: null
34
  lang: null
35
+ database_path: /data/zhangzhuocheng/Lab/Python/LLM/datasets/RAG/wikipedia/wiki_2021/bm25s_lucene
36
  method: lucene
37
  idf_method: null
38
  backend: auto
 
40
  b: 0.75
41
  delta: 0.5
42
  lang: english
43
+ indexed_fields:
44
+ - title
45
+ - section
46
+ - text
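
Note on the scoring parameters in `config.yaml` (`method: lucene`, `k1: 1.5`, `b: 0.75`): the sketch below shows how a single query term would contribute to a document's score under the Lucene-style BM25 variant as it is commonly described. The function name and arguments are hypothetical and this is not the code the index actually runs. It also assumes that `delta` plays no role in this variant, since that parameter is normally associated with the BM25L and BM25+ variants.

```python
import math


def bm25_lucene_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    num_docs: int,
    doc_freq: int,
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Score contribution of one query term for one document.

    Assumed Lucene-style formula:
        idf   = ln(1 + (N - df + 0.5) / (df + 0.5))
        score = idf * tf / (tf + k1 * (1 - b + b * dl / avgdl))
    """
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + length_norm)


# Toy numbers, not taken from the real index.
score_once = bm25_lucene_term_score(tf=1, doc_len=100, avg_doc_len=100, num_docs=10, doc_freq=1)
score_twice = bm25_lucene_term_score(tf=2, doc_len=100, avg_doc_len=100, num_docs=10, doc_freq=1)
```

With `b = 0.75`, longer-than-average documents are penalized, and `k1 = 1.5` controls how quickly repeated occurrences of a term stop adding to the score.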
corpus.jsonl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:19ee000506e8b826a4a29076ead7e8978a656184c267b57732de52346710a411
3
+ size 24971177710
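
Note on the added `corpus.jsonl`: the three lines committed are a Git LFS pointer, not the corpus itself. A minimal sketch of reading such a pointer is shown below. The helper name `parse_lfs_pointer` is hypothetical, and it assumes the simple `key value` line layout visible in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:19ee000506e8b826a4a29076ead7e8978a656184c267b57732de52346710a411
size 24971177710
"""
info = parse_lfs_pointer(pointer)
size_gb = info["size"] / 1e9  # decimal gigabytes
```

The `size` field shows that the actual corpus file is about 25 GB, so cloning the repository without LFS fetches only this small pointer.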
corpus.mmindex.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f61677c4e0b18a994bbe28d5e5fd54ed57a56108accf9219bb9a97259fb735d1
3
- size 432224577
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c688cc6edacd0d497a977268659d421eb76b761399e7858601b81c0741c2be8c
3
+ size 433595094
params.index.json CHANGED
@@ -9,4 +9,4 @@
9
  "num_docs": 37507469,
10
  "version": "0.2.1",
11
  "backend": "numpy"
12
- }
 
9
  "num_docs": 37507469,
10
  "version": "0.2.1",
11
  "backend": "numpy"
12
+ }
retriever.id ADDED
@@ -0,0 +1 @@
1
+ BM25SRetriever