Commit 08f2fe0 (verified) by FlexRAG · 1 parent: f5c85f3

Upload retriever to Hugging Face Hub

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ database.lmdb/data.mdb filter=lfs diff=lfs merge=lfs -text
+ indexes/bm25/vocab.index.json filter=lfs diff=lfs merge=lfs -text
+ indexes/contriever/index.faiss filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,101 @@
+ ---
+ language: en
+ library_name: FlexRAG
+ tags:
+ - FlexRAG
+ - retrieval
+ - search
+ - lexical
+ - RAG
+ - IR
+ ---
+
+ # FlexRAG Retriever
+
+ This is a FlexRetriever created with the [`FlexRAG`](https://github.com/ictnlp/FlexRAG) library (version `0.3.0`).
+
+ ## Retriever Attributes
+ The `enwiki_2018_atlas` retriever is a FlexRetriever that provides access to the English Wikipedia corpus from December 2018. It is designed for information retrieval tasks, allowing users to search and retrieve relevant documents based on their queries.
+ The corpus of this retriever was created by the [Atlas](https://github.com/facebookresearch/atlas) project, and the indexes were built using the [FlexRAG](https://github.com/ictnlp/FlexRAG) library.
+
+ | Corpus Attribute | Value |
+ | ---------------- | --------------------------------------------------------------- |
+ | Language | English |
+ | Domain | Wikipedia |
+ | Saved Fields | title, section, text |
+ | Size | 30.4M (26.9M text, 2.7M infobox) |
+ | Dump Date | Dec 2018 |
+ | Provider | [Atlas](https://github.com/facebookresearch/atlas) |
+ | License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
+
+
+ | Index Attribute | Value |
+ | --------------- | --------------------------------------------------------------- |
+ | Index Name | bm25 |
+ | Index Type | Sparse |
+ | Index Method | Lucene |
+ | Indexed Fields | title, section, text (concat) |
+ | Preprocessing | LengthFilter(min_char=10, max_char=4096) |
+ | Provider | [FlexRAG](https://github.com/ictnlp/flexrag) |
+ | License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
+
+ | Index Attribute | Value |
+ | --------------- | --------------------------------------------------------------- |
+ | Index Name | contriever |
+ | Index Type | Dense |
+ | Index Method | IVFPQ |
+ | Indexed Fields | title, section, text (concat) |
+ | Query Encoder | `facebook/contriever-msmarco` |
+ | Passage Encoder | `facebook/contriever-msmarco` |
+ | Preprocessing | LengthFilter(min_char=10, max_char=4096) |
+ | Provider | [FlexRAG](https://github.com/ictnlp/flexrag) |
+ | License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
+
+ ## Usage
+
+ ### Installation
+ You can install the `FlexRAG` library with `pip`:
+
+ ```bash
+ pip install flexrag faiss-cpu
+ ```
+
+ ### Loading the `FlexRAG` retriever
+
+ You can use this retriever for information retrieval tasks. Here is an example:
+
+ ```python
+ from flexrag.retriever import LocalRetriever
+
+
+ # Load the retriever from the Hugging Face Hub
+ retriever = LocalRetriever.load_from_hub("FlexRAG/enwiki_2018_atlas")
+
+
+ # You can retrieve relevant documents now
+ results = retriever.search("Who is Bruce Wayne?")
+ ```
+
+ ### Running the RAG demo with the retriever
+
+ You can run the **GUI application** of the RAG assistant with this retriever. Here is an example:
+
+ ```bash
+ python -m flexrag.entrypoints.run_interactive \
+     assistant_type=modular \
+     modular_config.used_fields=[title,text] \
+     modular_config.retriever_type="FlexRAG/enwiki_2018_atlas" \
+     modular_config.response_type=original \
+     modular_config.generator_type=openai \
+     modular_config.openai_config.model_name='gpt-4o-mini' \
+     modular_config.openai_config.api_key=$OPENAI_KEY \
+     modular_config.do_sample=False
+ ```
+
+ ## License
+ Because the corpus is released under the [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license, the retriever is released under the same license.
+
+
+ FlexRAG Related Links:
+ * 📚[Documentation](https://flexrag.readthedocs.io/en/latest/)
+ * 💻[GitHub Repository](https://github.com/ictnlp/flexrag)
cls.id ADDED
@@ -0,0 +1 @@
+ FlexRetriever
config.yaml ADDED
@@ -0,0 +1,37 @@
+ log_interval: 100000
+ top_k: 10
+ batch_size: 4096
+ query_preprocess_pipeline:
+   processor_type: []
+   length_filter_config:
+     max_tokens: null
+     min_tokens: null
+     max_chars: null
+     min_chars: null
+     max_bytes: null
+     min_bytes: null
+     tokenizer_config:
+       tokenizer_type: moses
+       hf_tokenizer_path: null
+       tiktok_tokenizer_name: null
+       lang: null
+   token_normalize_config:
+     lang: en
+     penn: true
+     norm_quote_commas: true
+     norm_numbers: true
+     pre_replace_unicode_punct: false
+     post_remove_control_chars: false
+     perl_parity: false
+   truncate_config:
+     max_chars: null
+     max_bytes: null
+     max_tokens: null
+     tokenizer_config:
+       tokenizer_type: moses
+       hf_tokenizer_path: null
+       tiktok_tokenizer_name: null
+       lang: null
+ retriever_path: null
+ indexes_merge_method: linear
+ used_indexes: null
database.lmdb/data.mdb ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c266cb282533ef23d437ddd6e370816863a44276a7c82b7d611dff34562efdc5
+ size 28694089728
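
The three added lines above form a Git LFS pointer: the repository stores only the spec version, a SHA-256 object ID, and the byte size, while the real `data.mdb` lives in LFS storage. As a rough illustration, a pointer can be parsed with a few lines of Python. The function name `parse_lfs_pointer` is hypothetical and the sketch assumes the simple `key value` layout shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file (illustrative helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the database.lmdb/data.mdb hunk above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c266cb282533ef23d437ddd6e370816863a44276a7c82b7d611dff34562efdc5
size 28694089728
"""

info = parse_lfs_pointer(POINTER)
size_gib = info["size_bytes"] / 2**30  # convert bytes to GiB
print(f"{info['hash_algo']} object, about {size_gib:.1f} GiB")
```

Reading the size this way is a quick check on how much disk space a full download of the database will need before pulling the LFS objects.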
database.lmdb/lock.mdb ADDED
Binary file (8.19 kB).
indexes/bm25/cls.id ADDED
@@ -0,0 +1 @@
+ BM25Index
indexes/bm25/config.yaml ADDED
@@ -0,0 +1,10 @@
+ log_interval: 10000
+ batch_size: 512
+ index_path: /data/zhangzhuocheng/Lab/Python/LLM/datasets/RAG/Corpus/enwiki_2018_atlas/flex/indexes/bm25
+ method: lucene
+ idf_method: null
+ backend: auto
+ k1: 1.5
+ b: 0.75
+ delta: 0.5
+ lang: english
indexes/bm25/context_mapping.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a9813df1610580866468946adab882bfeb9ef4b067217154ce93f0b5ff63810
+ size 942169490
indexes/bm25/data.csc.index.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fa0a05322a20d4abc5866f1e7b7ec5b44b813a913b1b48f576a1ca8078889948
+ size 6591838820
indexes/bm25/indices.csc.index.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:149c01e2f984f32106a0648651f517965d11c538c16e14a7f431e5dc98f44878
+ size 6591838820
indexes/bm25/indptr.csc.index.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:65855cea4e1ec6645c39ffa02fdd6c99d279bf82cc4351155c00fad49dded9ab
+ size 41208084
indexes/bm25/multi_field_index_config.yaml ADDED
@@ -0,0 +1,5 @@
+ indexed_fields:
+ - title
+ - section
+ - text
+ merge_method: concat
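
This config says the three fields are merged with `concat` before indexing. The diff does not show how FlexRAG joins them, so the following is only a guess at the general idea: join the non-empty fields, in the listed order, into one string. The helper name `concat_fields`, the single-space separator, and the sample record are all assumptions made for illustration.

```python
# Field order copied from multi_field_index_config.yaml above.
INDEXED_FIELDS = ("title", "section", "text")


def concat_fields(record: dict, fields=INDEXED_FIELDS, sep: str = " ") -> str:
    """Join the non-empty indexed fields of one record into a single string.

    The separator is an assumption; the library may use a different one.
    """
    parts = [str(record[name]).strip() for name in fields if record.get(name)]
    return sep.join(parts)


# Hypothetical record, not taken from the corpus.
record = {
    "title": "Example title",
    "section": "Example section",
    "text": "Example passage text.",
}
merged = concat_fields(record)
print(merged)
```

Under this reading, a query term that appears only in a title can still match the passage, because the title becomes part of the indexed string.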
indexes/bm25/params.index.json ADDED
@@ -0,0 +1,12 @@
+ {
+     "k1": 1.5,
+     "b": 0.75,
+     "delta": 0.5,
+     "method": "lucene",
+     "idf_method": "lucene",
+     "dtype": "float32",
+     "int_dtype": "int32",
+     "num_docs": 32023701,
+     "version": "0.2.11",
+     "backend": "numpy"
+ }
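
To make the `k1`, `b`, and `num_docs` values above concrete, here is a minimal sketch of a per-term BM25 score in the Lucene style, as that variant is commonly described: a smoothed IDF multiplied by a length-normalized term frequency. It is not taken from the FlexRAG source, and the indexing library may differ in details such as a constant `(k1 + 1)` factor, which only rescales scores. `delta` is left out because, in the formulations I am aware of, it is used only by the BM25L and BM25+ variants. The function names and the example statistics are hypothetical.

```python
import math

# Values copied from params.index.json above.
K1 = 1.5
B = 0.75
NUM_DOCS = 32_023_701


def lucene_idf(doc_freq: int, num_docs: int = NUM_DOCS) -> float:
    """Smoothed inverse document frequency; always positive."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def length_normalized_tf(term_freq, doc_len, avg_doc_len, k1=K1, b=B) -> float:
    """Saturating term-frequency weight, penalized for longer documents."""
    return term_freq / (term_freq + k1 * (1.0 - b + b * doc_len / avg_doc_len))


def bm25_term_score(term_freq, doc_freq, doc_len, avg_doc_len) -> float:
    """Score contribution of one query term for one document."""
    return lucene_idf(doc_freq) * length_normalized_tf(term_freq, doc_len, avg_doc_len)


# Hypothetical statistics: the same document scored for a rare and a common term.
rare = bm25_term_score(term_freq=2, doc_freq=1_000, doc_len=120, avg_doc_len=100)
common = bm25_term_score(term_freq=2, doc_freq=10_000_000, doc_len=120, avg_doc_len=100)
print(f"rare term: {rare:.3f}, common term: {common:.3f}")
```

A document's score for a query is the sum of these per-term contributions, so rare terms dominate the ranking while very frequent terms contribute little.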
indexes/bm25/vocab.index.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e2430ed7b035d343553756556f1c0e404d3099c660d72fe70c9c799b7a108250
+ size 208594633
indexes/contriever/cls.id ADDED
@@ -0,0 +1 @@
+ FaissIndex
indexes/contriever/config.yaml ADDED
@@ -0,0 +1,62 @@
+ log_interval: 100000
+ batch_size: 2048
+ index_path: null
+ query_encoder_config:
+   encoder_type: hf
+   cohere_config: null
+   hf_config:
+     batch_size: 32
+     log_interval: 1000
+     model_path: facebook/contriever-msmarco
+     tokenizer_path: null
+     trust_remote_code: false
+     device_id:
+     - 0
+     load_dtype: auto
+     max_encode_length: 512
+     encode_method: mean
+     normalize: false
+     prompt: ''
+     task: ''
+   hf_clip_config: null
+   jina_config: null
+   ollama_config: null
+   openai_config: null
+   sentence_transformer_config: null
+ passage_encoder_config:
+   encoder_type: hf
+   cohere_config: null
+   hf_config:
+     batch_size: 32
+     log_interval: 1000
+     model_path: facebook/contriever-msmarco
+     tokenizer_path: null
+     trust_remote_code: false
+     device_id:
+     - 0
+     - 1
+     - 2
+     - 3
+     load_dtype: auto
+     max_encode_length: 512
+     encode_method: mean
+     normalize: false
+     prompt: ''
+     task: ''
+   hf_clip_config: null
+   jina_config: null
+   ollama_config: null
+   openai_config: null
+   sentence_transformer_config: null
+ distance_function: IP
+ index_type: auto
+ n_subquantizers: 8
+ n_bits: 8
+ n_list: 1000
+ factory_str: null
+ index_train_num: -1
+ n_probe: 512
+ device_id: []
+ k_factor: 10
+ polysemous_ht: 0
+ efSearch: 100
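
The encoder settings above (`encode_method: mean`, `normalize: false`, `distance_function: IP`) suggest that each text is embedded by averaging its token vectors and that relevance is an unnormalized inner product. A minimal pure-Python sketch of that reading follows. The helper names and the toy 2-dimensional vectors are made up for illustration, and it does not reproduce the transformer model that produces the real token vectors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                sums[i] += value
    if kept == 0:
        raise ValueError("attention mask selects no tokens")
    return [total / kept for total in sums]


def inner_product(a, b):
    """Unnormalized dot product, matching `distance_function: IP`."""
    return sum(x * y for x, y in zip(a, b))


# Toy token vectors: two real tokens followed by one padding position.
passage_tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
passage_mask = [1, 1, 0]
passage_vec = mean_pool(passage_tokens, passage_mask)

query_vec = mean_pool([[1.0, 1.0]], [1])
score = inner_product(query_vec, passage_vec)
print(passage_vec, score)
```

Because the vectors are not normalized, their magnitudes affect the score, so the inner product here is not the same as cosine similarity.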
indexes/contriever/context_mapping.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a9813df1610580866468946adab882bfeb9ef4b067217154ce93f0b5ff63810
+ size 942169490
indexes/contriever/index.faiss ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:70a3029e71c0d3d079099652650489cba2213e75320d4def527c19f00e8c26a0
+ size 6429691256
indexes/contriever/multi_field_index_config.yaml ADDED
@@ -0,0 +1,5 @@
+ indexed_fields:
+ - title
+ - section
+ - text
+ merge_method: concat