---
license: apache-2.0
tags:
- code
- generation
- debugging
---

# Code Debugger v0.1

Hardware requirements: >=24 GB VRAM (e.g., an RTX 3090) to run the following models at roughly GPT-4o-level (ChatGPT) inference speed.

Note: the following results are based on my day-to-day workflows only. My goal was to run private models that could beat GPT-4o and Claude-3.5 at code debugging and generation, so that I could ‘load balance’ between OpenAI/Anthropic’s free plans and local models, avoid hitting rate limits, and upload as few lines of my code and ideas to their servers as possible.

An example of a complex debugging scenario: you build library A on top of library B, which requires library C as a dependency, but the root cause is a variable in library C. In this case, the workflow below guided me to correctly identify the problem.
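
To make the scenario concrete, here is a hypothetical, self-contained sketch of that dependency chain (all module, function, and variable names are invented for illustration): the traceback surfaces in library A, but the actual bug is a module-level variable in library C.

```python
# Hypothetical minimal reproduction of the A -> B -> C scenario (all names invented).

# "library C": low-level dependency; this module-level default is the root cause.
DEFAULT_TIMEOUT = 0  # bug: 0 makes every request fail immediately

def lib_c_fetch(url, timeout=None):
    timeout = DEFAULT_TIMEOUT if timeout is None else timeout
    if timeout <= 0:
        raise TimeoutError(f"fetch of {url} failed (timeout={timeout})")
    return f"payload from {url}"

# "library B": built on top of library C.
def lib_b_load(resource):
    return lib_c_fetch(f"https://example.com/{resource}")

# "library A": your code, built on top of library B; the traceback points here.
def lib_a_sync():
    return lib_b_load("state.json")

if __name__ == "__main__":
    lib_a_sync()  # raises TimeoutError; the fix belongs in DEFAULT_TIMEOUT, two layers down
```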

<br>

## Throughput

![Throughput comparison of the evaluated models](./model_v0.1_throughput_comparison.png)

IQ here refers to Imatrix (importance matrix) quantization. For a performance comparison against regular GGUF quants, see [this Reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1993iro/ggufs_quants_can_punch_above_their_weights_now/).

<br>

## Personal Preference Ranking

Evaluated on two programming tasks: debugging and generation. The ranking may be a bit subjective. `DeepSeekV2 Coder Instruct` is ranked lower because its privacy policy says it may collect "text input, prompt", and there is no way to opt out.

| **Rank** | **Model Name**                               | **Token Speed (tokens/s)** | **Debugging Performance**                                             | **Code Generation Performance**                                      | **Notes**                                                                                 |
|----------|----------------------------------------------|----------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| 1        | codestral-22b-v0.1-IQ6_K.gguf (this model)   | 34.21                       | Excellent at complex debugging, often surpasses GPT-4o and Claude-3.5  | Good, but may not be on par with GPT-4o                                | Best overall for debugging in my workflow, use Balanced Mode.                             |
| 2        | Claude-3.5-Sonnet                            | N/A                         | Poor in complex debugging compared to Codestral                         | Excellent, better than GPT-4o in long code generation                  | Great for code generation, but weaker in debugging.                                       |
| 3        | GPT-4o                                       | N/A                         | Good at complex debugging but can be outperformed by Codestral          | Excellent, generally reliable for code generation                      | Balanced performance between code debugging and generation.                               |
| 4        | DeepSeekV2 Coder Instruct                    | N/A                         | Poor, outputs the same code in complex scenarios                        | Great at general code generation, rivals GPT-4o                        | Excellent at code generation, but has data privacy concerns as per Privacy Policy.        |
| 5        | qwen2 7b instruct bf16                       | 78.22                       | Average, can think of correct approaches                                | Sometimes helps generate new ideas                                     | High speed, useful for generating ideas.                                                  |
| 6        | GPT-4o-mini                                  | N/A                         | Decent, but struggles with complex debugging tasks                      | Reliable for shorter or simpler code generation tasks                  | Suitable for less complex coding tasks.                                                   |
| 7        | AutoCoder.IQ4_K.gguf                         | 26.43                       | Average, offers different approaches but can be incorrect               | Generates useful short code segments                                   | Use Precise Mode for better results.                                                      |
| 8        | Meta-Llama-3.1-70B-Instruct-IQ2_XS.gguf      | 2.55                        | Poor, too slow to be practical in day-to-day workflows                  | Occasionally helps generate ideas                                      | Speed is a significant limitation.                                                        |
| 9        | Trinity-2-Codestral-22B-Q6_K_L               | N/A                         | Poor, similar issues to DeepSeekV2 in outputting the same code           | Decent, but often repeats code                                         | Similar problem to DeepSeekV2, not recommended for my complex tasks.                      |
| 10       | DeepSeekV2 Coder Lite Instruct Q_8L          | N/A                         | Poor, repeats code similar to other models in its family                | Not as effective in my context                                         | Not recommended overall based on my criteria.                                             |

Code debugging prompt template used:
```
<code>
<current output>
<the problem description of the current output>
<expected output (in English is fine)>
<any hints>
Think step by step. Solve this problem without removing any existing functionalities, logic, or checks, except any incorrect code that interferes with your edits.
```
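
For illustration, here is a hypothetical instance of the template filled in (the code snippet, outputs, and hint are invented):

```
def mean(xs):
    return sum(xs) / len(xs)

print(mean([]))

Current output: ZeroDivisionError: division by zero

Problem: the function crashes instead of handling the empty-list case.

Expected output: 0.0 for an empty list.

Hint: guard the division.

Think step by step. Solve this problem without removing any existing functionalities, logic, or checks, except any incorrect code that interferes with your edits.
```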

<br>

## Generation Kwargs

Balanced Mode:
```python
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.7,
    "stream": True,
    "top_k": 50,
    "top_p": 0.95,
}
```

Precise Mode:
```python
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.0,
    "stream": True,
    "top_p": 1.0,
}
```

Qwen2 7B:
```python
generation_kwargs = {
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.4,
    "stream": True,
    "top_k": 20,
    "top_p": 0.8,
}
```
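
As a usage sketch, these kwargs can be passed to `llama-cpp-python` (my assumption as the runtime; the `n_gpu_layers`, `n_ctx`, and prompt values below are illustrative, not settings from this card):

```python
from llama_cpp import Llama

# Load the quantized model; offload all layers to the GPU (requires enough VRAM).
llm = Llama(
    model_path="./codestral-22b-v0.1-IQ6_K.gguf",
    n_gpu_layers=-1,  # -1 = offload every layer
    n_ctx=8192,
)

generation_kwargs = {  # Balanced Mode from above
    "max_tokens": 8192,
    "stop": ["<|EOT|>", "</s>", "<|end▁of▁sentence|>", "<eos>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"],
    "temperature": 0.7,
    "stream": True,
    "top_k": 50,
    "top_p": 0.95,
}

prompt = "Think step by step. Why does `sum([]) / len([])` raise ZeroDivisionError?"

# With stream=True, the call returns an iterator of completion chunks.
for chunk in llm(prompt, **generation_kwargs):
    print(chunk["choices"][0]["text"], end="", flush=True)
```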

I also tested other temperature, top_k, and top_p variations 5-8 times per model, but I'm sticking to the three configurations above.

<br>

## New Discoveries

The following were tested in my workflow, but may not generalize well to other workflows.

- In general, if there's an error in the code, copy-pasting the last few lines of the stack trace to the LLM seems to work.
- Adding "Now, reflect." sometimes allows Claude-3.5-Sonnet to generate the correct solution.
- If GPT-4o reasons correctly in its first response and the conversation is then handed off to GPT-4o-mini, the mini model can maintain a comparable level of reasoning/accuracy to GPT-4o (see the sketch after this list).
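
A minimal sketch of that handoff, assuming the official `openai` Python SDK (the model names are real, but the example conversation is invented):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [{"role": "user", "content": "Why does my recursive parser overflow the stack?"}]

# First turn: let GPT-4o do the hard reasoning.
first = client.chat.completions.create(model="gpt-4o", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up turns: hand the same conversation to the cheaper model.
messages.append({"role": "user", "content": "Apply the same fix to the iterative version."})
followup = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(followup.choices[0].message.content)
```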

<br>

## Download

```
pip install -U "huggingface_hub[cli]"
```

```
huggingface-cli download FredZhang7/claudegpt-code-debugger-v0.1 --include "codestral-22b-v0.1-IQ6_K.gguf" --local-dir ./
```
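
Equivalently, from Python (a sketch using `huggingface_hub`; the repo and filename are the same as in the CLI command above):

```python
from huggingface_hub import hf_hub_download

# Download the single GGUF file into the current directory.
hf_hub_download(
    repo_id="FredZhang7/claudegpt-code-debugger-v0.1",
    filename="codestral-22b-v0.1-IQ6_K.gguf",
    local_dir="./",
)
```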