Upload retriever to Hugging Face Hub
- .gitattributes +3 -0
- README.md +101 -0
- cls.id +1 -0
- config.yaml +37 -0
- database.lmdb/data.mdb +3 -0
- database.lmdb/lock.mdb +0 -0
- indexes/bm25/cls.id +1 -0
- indexes/bm25/config.yaml +10 -0
- indexes/bm25/context_mapping.pkl +3 -0
- indexes/bm25/data.csc.index.npy +3 -0
- indexes/bm25/indices.csc.index.npy +3 -0
- indexes/bm25/indptr.csc.index.npy +3 -0
- indexes/bm25/multi_field_index_config.yaml +5 -0
- indexes/bm25/params.index.json +12 -0
- indexes/bm25/vocab.index.json +3 -0
- indexes/contriever/cls.id +1 -0
- indexes/contriever/config.yaml +62 -0
- indexes/contriever/context_mapping.pkl +3 -0
- indexes/contriever/index.faiss +3 -0
- indexes/contriever/multi_field_index_config.yaml +5 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+database.lmdb/data.mdb filter=lfs diff=lfs merge=lfs -text
+indexes/bm25/vocab.index.json filter=lfs diff=lfs merge=lfs -text
+indexes/contriever/index.faiss filter=lfs diff=lfs merge=lfs -text

README.md
ADDED
@@ -0,0 +1,101 @@
---
language: en
library_name: FlexRAG
tags:
- FlexRAG
- retrieval
- search
- lexical
- RAG
- IR
---

# FlexRAG Retriever

This is a FlexRetriever created with the [`FlexRAG`](https://github.com/ictnlp/FlexRAG) library (version `0.3.0`).

## Retriever Attributes
The `enwiki_2018_atlas` retriever is a FlexRetriever that provides access to the English Wikipedia corpus from December 2018. It is designed for information retrieval tasks: users submit a query and receive the most relevant documents.
The corpus of this retriever was created by the [Atlas](https://github.com/facebookresearch/atlas) project and the indexes were built using the [FlexRAG](https://github.com/ictnlp/FlexRAG) library.

| Corpus Attribute | Value |
| ---------------- | --------------------------------------------------------------- |
| Language | English |
| Domain | Wikipedia |
| Saved Fields | title, section, text |
| Size | 30.4M (26.9M text, 2.7M infobox) |
| Dump Date | Dec 2018 |
| Provider | [Atlas](https://github.com/facebookresearch/atlas) |
| License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |


| Index Attribute | Value |
| --------------- | --------------------------------------------------------------- |
| Index Name | bm25 |
| Index Type | Sparse |
| Index Method | Lucene |
| Indexed Fields | title, section, text (concat) |
| Preprocessing | LengthFilter(min_char=10, max_char=4096) |
| Provider | [FlexRAG](https://github.com/ictnlp/flexrag) |
| License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |

| Index Attribute | Value |
| --------------- | --------------------------------------------------------------- |
| Index Name | contriever |
| Index Type | Dense |
| Index Method | IVFPQ |
| Indexed Fields | title, section, text (concat) |
| Query Encoder | `facebook/contriever-msmarco` |
| Passage Encoder | `facebook/contriever-msmarco` |
| Preprocessing | LengthFilter(min_char=10, max_char=4096) |
| Provider | [FlexRAG](https://github.com/ictnlp/flexrag) |
| License | [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |

## Usage

### Installation
You can install the `FlexRAG` library with `pip`:

```bash
pip install flexrag faiss-cpu
```

### Loading the `FlexRAG` retriever

You can use this retriever for information retrieval tasks. Here is an example:

```python
from flexrag.retriever import LocalRetriever


# Load the retriever from the Hugging Face Hub
retriever = LocalRetriever.load_from_hub("FlexRAG/enwiki_2018_atlas")


# You can now retrieve relevant documents
results = retriever.search("Who is Bruce Wayne?")
```

### Running the RAG demo with the retriever

You can run the **GUI application** of the RAG assistant with this retriever. Here is an example:

```bash
python -m flexrag.entrypoints.run_interactive \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type="FlexRAG/enwiki_2018_atlas" \
    modular_config.response_type=original \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False
```

## License
Because the corpus is licensed under [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/), the retriever is released under the same license.


FlexRAG Related Links:
* 📚[Documentation](https://flexrag.readthedocs.io/en/latest/)
* 💻[GitHub Repository](https://github.com/ictnlp/flexrag)

cls.id
ADDED
@@ -0,0 +1 @@
FlexRetriever

config.yaml
ADDED
@@ -0,0 +1,37 @@
log_interval: 100000
top_k: 10
batch_size: 4096
query_preprocess_pipeline:
  processor_type: []
  length_filter_config:
    max_tokens: null
    min_tokens: null
    max_chars: null
    min_chars: null
    max_bytes: null
    min_bytes: null
    tokenizer_config:
      tokenizer_type: moses
      hf_tokenizer_path: null
      tiktok_tokenizer_name: null
      lang: null
  token_normalize_config:
    lang: en
    penn: true
    norm_quote_commas: true
    norm_numbers: true
    pre_replace_unicode_punct: false
    post_remove_control_chars: false
    perl_parity: false
  truncate_config:
    max_chars: null
    max_bytes: null
    max_tokens: null
    tokenizer_config:
      tokenizer_type: moses
      hf_tokenizer_path: null
      tiktok_tokenizer_name: null
      lang: null
retriever_path: null
indexes_merge_method: linear
used_indexes: null

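The `indexes_merge_method: linear` setting above suggests that scores from the `bm25` and `contriever` indexes are combined linearly when both are used. The sketch below illustrates generic linear score fusion only; it is not taken from the FlexRAG source, and the min-max normalization step, the equal weights, the function names, and the toy scores are all assumptions made for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def linear_merge(score_lists, weights, top_k=10):
    """Weighted sum of normalized scores; documents missing from an index add 0."""
    merged = {}
    for scores, weight in zip(score_lists, weights):
        for doc, s in min_max(scores).items():
            merged[doc] = merged.get(doc, 0.0) + weight * s
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical scores from a sparse and a dense index (not real retrieval output).
bm25_scores = {"d1": 12.0, "d2": 8.0, "d3": 4.0}
dense_scores = {"d2": 0.75, "d3": 0.5, "d4": 0.25}

ranked = linear_merge([bm25_scores, dense_scores], weights=[0.5, 0.5])
print(ranked)
```

Normalizing before summing matters because BM25 scores and inner-product scores live on different scales; without it, one index would dominate the ranking.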
database.lmdb/data.mdb
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c266cb282533ef23d437ddd6e370816863a44276a7c82b7d611dff34562efdc5
size 28694089728

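The three lines above are a Git LFS pointer: the repository stores only this small text stub, while the actual file lives in LFS storage. As a rough illustration, a pointer can be parsed with a few lines of Python to check how large the real download will be. The helper name `parse_lfs_pointer` is hypothetical and not part of any library.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from database.lmdb/data.mdb in this commit.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c266cb282533ef23d437ddd6e370816863a44276a7c82b7d611dff34562efdc5
size 28694089728
"""

info = parse_lfs_pointer(pointer)
print(f"{info['size_bytes'] / 2**30:.1f} GiB")  # prints 26.7 GiB
```

Summing `size_bytes` over every pointer in the commit gives an estimate of the disk space needed before loading the retriever.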
database.lmdb/lock.mdb
ADDED
Binary file (8.19 kB)

indexes/bm25/cls.id
ADDED
@@ -0,0 +1 @@
BM25Index

indexes/bm25/config.yaml
ADDED
@@ -0,0 +1,10 @@
log_interval: 10000
batch_size: 512
index_path: /data/zhangzhuocheng/Lab/Python/LLM/datasets/RAG/Corpus/enwiki_2018_atlas/flex/indexes/bm25
method: lucene
idf_method: null
backend: auto
k1: 1.5
b: 0.75
delta: 0.5
lang: english

indexes/bm25/context_mapping.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a9813df1610580866468946adab882bfeb9ef4b067217154ce93f0b5ff63810
size 942169490

indexes/bm25/data.csc.index.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa0a05322a20d4abc5866f1e7b7ec5b44b813a913b1b48f576a1ca8078889948
size 6591838820

indexes/bm25/indices.csc.index.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:149c01e2f984f32106a0648651f517965d11c538c16e14a7f431e5dc98f44878
size 6591838820

indexes/bm25/indptr.csc.index.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:65855cea4e1ec6645c39ffa02fdd6c99d279bf82cc4351155c00fad49dded9ab
size 41208084

indexes/bm25/multi_field_index_config.yaml
ADDED
@@ -0,0 +1,5 @@
indexed_fields:
- title
- section
- text
merge_method: concat

indexes/bm25/params.index.json
ADDED
@@ -0,0 +1,12 @@
{
    "k1": 1.5,
    "b": 0.75,
    "delta": 0.5,
    "method": "lucene",
    "idf_method": "lucene",
    "dtype": "float32",
    "int_dtype": "int32",
    "num_docs": 32023701,
    "version": "0.2.11",
    "backend": "numpy"
}

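To make the parameters above concrete, here is a minimal sketch of how a single query term could be scored, assuming `method: lucene` refers to the widely used Lucene variant of BM25. The function name, the term statistics in the usage lines, and the average document length are hypothetical; only `k1`, `b`, and `num_docs` come from the file above. In this variant `delta` plays no role, since it is normally used only by BM25L and BM25+.

```python
import math


def bm25_lucene_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.5, b=0.75):
    """Score one query term against one document (Lucene-style BM25).

    tf        term frequency in the document
    doc_len   document length in tokens
    doc_freq  number of documents that contain the term
    """
    # Lucene-style IDF stays positive even for very frequent terms.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + norm)


NUM_DOCS = 32023701  # "num_docs" from params.index.json

# Hypothetical term statistics, chosen only to show the shape of the function.
rare = bm25_lucene_score(tf=2, doc_len=100, avg_doc_len=100, doc_freq=1_000, num_docs=NUM_DOCS)
common = bm25_lucene_score(tf=2, doc_len=100, avg_doc_len=100, doc_freq=1_000_000, num_docs=NUM_DOCS)
print(rare > common)  # rarer terms contribute more to the score
```

The score of a full query is the sum of this quantity over its terms. Because `tf / (tf + norm)` saturates below 1, repeating a term many times cannot push its contribution above its IDF.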
indexes/bm25/vocab.index.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e2430ed7b035d343553756556f1c0e404d3099c660d72fe70c9c799b7a108250
size 208594633

indexes/contriever/cls.id
ADDED
@@ -0,0 +1 @@
FaissIndex

indexes/contriever/config.yaml
ADDED
@@ -0,0 +1,62 @@
log_interval: 100000
batch_size: 2048
index_path: null
query_encoder_config:
  encoder_type: hf
  cohere_config: null
  hf_config:
    batch_size: 32
    log_interval: 1000
    model_path: facebook/contriever-msmarco
    tokenizer_path: null
    trust_remote_code: false
    device_id:
    - 0
    load_dtype: auto
    max_encode_length: 512
    encode_method: mean
    normalize: false
    prompt: ''
    task: ''
  hf_clip_config: null
  jina_config: null
  ollama_config: null
  openai_config: null
  sentence_transformer_config: null
passage_encoder_config:
  encoder_type: hf
  cohere_config: null
  hf_config:
    batch_size: 32
    log_interval: 1000
    model_path: facebook/contriever-msmarco
    tokenizer_path: null
    trust_remote_code: false
    device_id:
    - 0
    - 1
    - 2
    - 3
    load_dtype: auto
    max_encode_length: 512
    encode_method: mean
    normalize: false
    prompt: ''
    task: ''
  hf_clip_config: null
  jina_config: null
  ollama_config: null
  openai_config: null
  sentence_transformer_config: null
distance_function: IP
index_type: auto
n_subquantizers: 8
n_bits: 8
n_list: 1000
factory_str: null
index_train_num: -1
n_probe: 512
device_id: []
k_factor: 10
polysemous_ht: 0
efSearch: 100

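The encoder settings above (`encode_method: mean`, `normalize: false`, `distance_function: IP`) can be read as: average the token embeddings of each text, skip L2 normalization, and compare query and passage by inner product. The following is a dependency-free sketch of that reading, not FlexRAG's implementation. The helper names and the tiny 2-dimensional "token embeddings" are made up for illustration; a real encoder such as `facebook/contriever-msmarco` produces much larger vectors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def inner_product(a, b):
    """Unnormalized dot product, i.e. the IP distance function."""
    return sum(x * y for x, y in zip(a, b))


# Toy token embeddings; the last query token is padding and must not count.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
passage_tokens = [[2.0, 2.0], [4.0, 0.0]]
passage_mask = [1, 1]

q = mean_pool(query_tokens, query_mask)      # [0.5, 0.5]
p = mean_pool(passage_tokens, passage_mask)  # [3.0, 1.0]
score = inner_product(q, p)                  # 0.5 * 3.0 + 0.5 * 1.0 = 2.0
print(q, p, score)
```

Because the vectors are not normalized, the score depends on vector magnitude as well as direction, which is why the index is configured for inner product rather than cosine similarity.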
indexes/contriever/context_mapping.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9a9813df1610580866468946adab882bfeb9ef4b067217154ce93f0b5ff63810
size 942169490

indexes/contriever/index.faiss
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:70a3029e71c0d3d079099652650489cba2213e75320d4def527c19f00e8c26a0
size 6429691256

indexes/contriever/multi_field_index_config.yaml
ADDED
@@ -0,0 +1,5 @@
indexed_fields:
- title
- section
- text
merge_method: concat