Commit b6bd381 (verified) · YuanLiuuuuuu committed · 1 Parent(s): 8505c38

Update README.md

Files changed (1):
  1. README.md +125 -22
README.md CHANGED
@@ -22,6 +22,7 @@ We are delighted to announce that the WePOINTS family has welcomed a new member:

## News

+ - 2025.08.27: Added support for deploying POINTS-Reader with SGLang💪💪💪.
- 2025.08.26: We released the weights of the most recent version of POINTS-Reader🔥🔥🔥.
- 2025.08.21: POINTS-Reader is accepted by **EMNLP 2025** for presentation at the **Main Conference**🎉🎉🎉.

@@ -37,7 +38,7 @@ We are delighted to announce that the WePOINTS family has welcomed a new member:

## Results

- We take the following results from [OmniDocBench](https://github.com/opendatalab/OmniDocBench/tree/main) and POINTS-Reader for comparison:
+ For comparison, we use the results reported by [OmniDocBench](https://github.com/opendatalab/OmniDocBench/tree/main) and POINTS-Reader. Compared with the version submitted to EMNLP 2025, the current release provides (1) improved performance and (2) support for Chinese documents. Both enhancements build upon the methods proposed in this paper.

<table style="width: 92%; margin: auto; border-collapse: collapse;">
<thead>
@@ -225,8 +226,8 @@ We take the following results from [OmniDocBench](https://github.com/opendatalab
<td>0.641</td>
</tr>
<tr>
- <td rowspan="10">Expert VLMs</td>
- <td>POINTS-Reader-3B</td>
+ <td rowspan="11">Expert VLMs</td>
+ <td><strong style="color: green;">POINTS-Reader-3B</strong></td>
<td>0.133</td>
<td>0.212</td>
<td>0.062</td>
@@ -561,7 +562,7 @@ The following code snippet has been tested with the following environment:
```
python==3.10.12
torch==2.5.1
- transformers==4.46.1
+ transformers==4.55.2
cuda==12.1
```
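
As a quick check against the environment pinned above, the installed versions and CUDA runtime can be printed from Python. A minimal sketch (not part of this commit; it only mirrors the versions listed in the fence above):

```python
# Print the locally installed versions to compare against the pinned environment.
import sys

import torch
import transformers

print(f'python=={sys.version.split()[0]}')          # pinned: 3.10.12
print(f'torch=={torch.__version__}')                # pinned: 2.5.1
print(f'transformers=={transformers.__version__}')  # pinned: 4.55.2
print(f'cuda=={torch.version.cuda}',                # pinned: 12.1
      f'(available: {torch.cuda.is_available()})')
```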
 
@@ -569,17 +570,8 @@ If you encounter environment issues, please feel free to open an issue.

### Run with Transformers

- Before you run the following code, make sure you install the `WePOINTS` package by running:
-
- ```
- git clone https://github.com/WePOINTS/WePOINTS.git
- cd WePOINTS
- pip install -e .
- ```
-
```python
- from wepoints.utils.images import Qwen2ImageProcessorForPOINTSV15
- from transformers import AutoModelForCausalLM, AutoTokenizer
+ from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch


@@ -593,11 +585,11 @@ prompt = (
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-Reader'
model = AutoModelForCausalLM.from_pretrained(model_path,
-                                              trust_remote_code=True,
-                                              torch_dtype=torch.float16,
-                                              device_map='cuda')
+                                              trust_remote_code=True,
+                                              torch_dtype=torch.float16,
+                                              device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
- image_processor = Qwen2ImageProcessorForPOINTSV15.from_pretrained(model_path)
+ image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=prompt)
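# --- Hypothetical sketch, not part of this commit: the diff context stops
# --- inside `content` (its closing bracket lies outside the hunk). In the
# --- WePOINTS family the loaded model, tokenizer and image processor are then
# --- passed to a chat-style generation call roughly like the following; the
# --- `model.chat` signature and generation-config keys are assumptions.
messages = [{'role': 'user', 'content': content}]
generation_config = {
    'max_new_tokens': 2048,
    'temperature': 0.0,
}
response = model.chat(messages, tokenizer, image_processor, generation_config)
print(response)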
@@ -629,12 +621,123 @@ If you encounter issues like repetition, please try to increase the resolution of

### Deploy with SGLang

- We will create a Pull Request to SGLang, please stay tuned.
+ We have created a [Pull Request](https://github.com/sgl-project/sglang/pull/9651) for SGLang. Until the PR is merged, you can check out its branch and install SGLang in editable mode by following the [official guide](https://docs.sglang.ai/get_started/install.html).
+
+ #### How to Deploy
+
+ You can deploy POINTS-Reader with SGLang using the following command:
+
+ ```
+ python3 -m sglang.launch_server \
+     --model-path tencent/POINTS-Reader \
+     --tp-size 1 \
+     --dp-size 1 \
+     --chat-template points-v15-chat \
+     --trust-remote-code \
+     --port 8081
+ ```
+
+ #### How to Use
+
+ You can use the following code to obtain results from SGLang:
+
+ ```python
+ from typing import List
+ import requests
+ import json
+
+
+ def call_wepoints(messages: List[dict],
+                   temperature: float = 0.0,
+                   max_new_tokens: int = 2048,
+                   repetition_penalty: float = 1.05,
+                   top_p: float = 0.8,
+                   top_k: int = 20,
+                   do_sample: bool = True,
+                   url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
+     """Query the WePOINTS model to generate a response.
+
+     Args:
+         messages (List[dict]): A list of messages to be sent to WePOINTS. The
+             messages should be standard OpenAI messages, like:
+             [
+                 {
+                     'role': 'user',
+                     'content': [
+                         {
+                             'type': 'text',
+                             'text': 'Please describe this image in short'
+                         },
+                         {
+                             'type': 'image_url',
+                             'image_url': {'url': '/path/to/image.jpg'}
+                         }
+                     ]
+                 }
+             ]
+         temperature (float, optional): The temperature of the model.
+             Defaults to 0.0.
+         max_new_tokens (int, optional): The maximum number of new tokens to
+             generate. Defaults to 2048.
+         repetition_penalty (float, optional): The penalty for repetition.
+             Defaults to 1.05.
+         top_p (float, optional): The top-p probability threshold.
+             Defaults to 0.8.
+         top_k (int, optional): The top-k sampling vocabulary size.
+             Defaults to 20.
+         do_sample (bool, optional): Whether to use sampling or greedy decoding.
+             Defaults to True.
+         url (str, optional): The URL of the WePOINTS model.
+             Defaults to 'http://127.0.0.1:8081/v1/chat/completions'.
+
+     Returns:
+         str: The generated response from WePOINTS.
+     """
+     data = {
+         'model': 'WePoints',
+         'messages': messages,
+         'max_new_tokens': max_new_tokens,
+         'temperature': temperature,
+         'repetition_penalty': repetition_penalty,
+         'top_p': top_p,
+         'top_k': top_k,
+         'do_sample': do_sample,
+     }
+     response = requests.post(url, json=data)
+     response = json.loads(response.text)
+     response = response['choices'][0]['message']['content']
+     return response
+
+
+ prompt = (
+     'Please extract all the text from the image with the following requirements:\n'
+     '1. Return tables in HTML format.\n'
+     '2. Return all other text in Markdown format.'
+ )
+
+ messages = [{
+     'role': 'user',
+     'content': [
+         {
+             'type': 'text',
+             'text': prompt
+         },
+         {
+             'type': 'image_url',
+             'image_url': {'url': '/path/to/image.jpg'}
+         }
+     ]
+ }]
+ response = call_wepoints(messages)
+ print(response)
+ ```
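
Because `sglang.launch_server` exposes an OpenAI-compatible chat-completions endpoint, the `How to Use` helper added above can also be replaced with the official `openai` Python client. A minimal sketch (not part of this commit; the model name, port, and prompt simply mirror the snippet above):

```python
# OpenAI-client alternative to the requests-based helper above.
# Assumes the SGLang server from "How to Deploy" is listening on port 8081.
from openai import OpenAI

client = OpenAI(base_url='http://127.0.0.1:8081/v1', api_key='EMPTY')

prompt = (
    'Please extract all the text from the image with the following requirements:\n'
    '1. Return tables in HTML format.\n'
    '2. Return all other text in Markdown format.'
)

response = client.chat.completions.create(
    model='WePoints',  # same model name used by the requests-based helper
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': prompt},
            {'type': 'image_url', 'image_url': {'url': '/path/to/image.jpg'}},
        ],
    }],
    temperature=0.0,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```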
 
## Known Issues

- - **Complex Document Parsing**: POINTS-Reader can struggle with complex layouts (e.g., newspapers), often producing repeated or missing content.
- - **Handwritten Document Parsing**: It also has difficulty handling handwritten inputs (e.g., receipts, notes), which can lead to recognition errors or omissions.
+ - **Complex Document Parsing**: POINTS-Reader can struggle with complex layouts (e.g., newspapers), often producing repeated or missing content.
+ - **Handwritten Document Parsing**: It also has difficulty handling handwritten inputs (e.g., receipts, notes), which can lead to recognition errors or omissions.
- **Multi-language Document Parsing**: POINTS-Reader currently supports only English and Chinese, limiting its effectiveness on other languages.

## Citation
@@ -645,7 +748,7 @@ If you use this model in your work, please cite the following paper:
@article{points-reader,
title={POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
author={Liu, Yuan and Zhongyin Zhao and Tian, Le and Haicheng Wang and Xubing Ye and Yangxiu You and Zilin Yu and Chuhan Wu and Zhou, Xiao and Yu, Yang and Zhou, Jie},
- journal={},
+ journal={EMNLP2025},
year={2025}
}
 