update
README.md CHANGED
@@ -28,7 +28,7 @@ TODO
*Tip: Our inference code is still being updated; you can pass `--include '*.py'` to `huggingface-cli` to refresh only the inference code instead of re-downloading the whole model.*
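The same selective download can also be scripted; here is a minimal sketch using `huggingface_hub.snapshot_download` (the `repo_id` below is a placeholder, not this model's actual id):

```python
from huggingface_hub import snapshot_download

# Equivalent to `huggingface-cli download <repo_id> --include '*.py'`:
# pull only the Python inference files and skip the large weight shards.
# "org/model" is a placeholder -- replace it with this repo's actual id.
snapshot_download(repo_id="org/model", allow_patterns="*.py")
```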
---
-### w/o.
+### 1. Inference w/o. Efficiency Optimization
```python
from transformers import AutoTokenizer, AutoModel, AutoConfig, BitsAndBytesConfig
import torch
@@ -66,7 +66,7 @@ print(response)
```
---
-### w.
+### 2. Inference w. Chunk-based Pre-filling
Chunk-based prefill significantly reduces memory demands and response latency by encoding video input in a streaming manner. This advantage becomes particularly noticeable with longer videos.
To enable this mode, you need to set `enable_chunk_prefill` to `True` and configure the `prefill_config` parameters:
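The `prefill_config` block itself sits in the part of the README this diff elides (lines 73-129). As a generic illustration of the technique only, not this repo's actual API, chunk-based prefilling encodes the input piece by piece while carrying the KV cache forward, so peak activation memory scales with the chunk size rather than the full input length:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic chunk-based prefill sketch (gpt2 stands in for any causal LM);
# this is NOT the enable_chunk_prefill / prefill_config API of this repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("a long prompt " * 200, return_tensors="pt").input_ids
chunk_size = 128
past_key_values = None
with torch.no_grad():
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start : start + chunk_size]
        out = model(input_ids=chunk,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values  # cache grows chunk by chunk

# The cache now covers the whole prompt; decoding proceeds from it as usual.
next_token = out.logits[:, -1].argmax(dim=-1)
```

The repo's streaming video encoding applies the same idea to visual tokens, which is why the savings grow with video length.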
@@ -130,7 +130,7 @@ print(response)
```
---
-### w.
+### 3. Inference w. Chunk-based Pre-filling & Bi-level KVs Decoding
coming soon
```python