Update README.md
README.md
CHANGED
@@ -2,4 +2,29 @@
 license: cc-by-nc-4.0
 language:
 - en
----
+---
+
+This is the sentence-level, supervised, sparse autoencoder (S3AE) proposed in the paper "Emergence of psychopathological computations in large language models" (https://arxiv.org/abs/2504.08016).
+
+The model was trained on the residual stream in the 10th layer of instruction-tuned [Gemma 2 27B](https://huggingface.co/google/gemma-2-27b-it), using a proprietary synthetic dataset with psychopathology symptom labels. The model weight precision is bfloat16, and the hidden dimension size is 8 times that of the LLM residual stream.
+
+The 1st to 17th dimensions of S3AE hidden features, respectively, correspond to activations of the following thoughts:
+1: 'depressed mood',
+2: 'anhedonia (loss of interest)',
+3: 'pessimism',
+4: 'guilt',
+5: 'anxiety',
+6: 'catastrophic thinking',
+7: 'perfectionism',
+8: 'active avoidance',
+9: 'grandiosity (delusion of grandeur)',
+10: 'manic mood',
+11: 'impulsivity',
+12: 'risk-seeking',
+13: 'splitting (binary thinking)',
+14: 'unstable self-image',
+15: 'aggression',
+16: 'anger',
+17: 'irritability'.
+
+Dimensions 7, 13, and 14 were not included in the paper's analysis.
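The dimension list added in this commit can be read as a lookup table. The following is a minimal sketch of that reading, not the authors' code. The README numbers the dimensions from 1, so the sketch converts to 0-based indices. The names `SYMPTOM_LABELS`, `EXCLUDED_DIMS`, and `label_features` are hypothetical. The residual stream width of 4608 is an assumption taken from the published Gemma 2 27B configuration and should be checked against the model config. The sketch does not load the S3AE weights, since the README does not describe the checkpoint format, and it does not settle whether "10th layer" is counted from 0 or 1.

```python
# Labels for the first 17 S3AE hidden features, in the order listed in the README.
SYMPTOM_LABELS = [
    "depressed mood",
    "anhedonia (loss of interest)",
    "pessimism",
    "guilt",
    "anxiety",
    "catastrophic thinking",
    "perfectionism",
    "active avoidance",
    "grandiosity (delusion of grandeur)",
    "manic mood",
    "impulsivity",
    "risk-seeking",
    "splitting (binary thinking)",
    "unstable self-image",
    "aggression",
    "anger",
    "irritability",
]

# 1-based dimensions that the README says were not included in the paper's analysis.
EXCLUDED_DIMS = {7, 13, 14}

# Assumption: residual stream width of Gemma 2 27B (hidden_size in its config).
RESIDUAL_DIM = 4608
# README: the S3AE hidden size is 8 times the residual stream width.
EXPANSION = 8
HIDDEN_DIM = EXPANSION * RESIDUAL_DIM


def label_features(hidden, include_excluded=False):
    """Map the first 17 entries of one hidden-feature vector to symptom labels.

    `hidden` is any indexable sequence of numbers (list, NumPy array, 1-D tensor).
    """
    if len(hidden) < len(SYMPTOM_LABELS):
        raise ValueError("hidden vector is shorter than the labelled prefix")
    labelled = {}
    for dim, label in enumerate(SYMPTOM_LABELS, start=1):
        if dim in EXCLUDED_DIMS and not include_excluded:
            continue
        # README dimensions are 1-based; Python indexing is 0-based.
        labelled[label] = float(hidden[dim - 1])
    return labelled


# Usage with a made-up activation vector (not real model output).
example = [0.0] * HIDDEN_DIM
example[4] = 1.5  # dimension 5 in the README's numbering, 'anxiety'
anxiety_activation = label_features(example)["anxiety"]
```

Passing `include_excluded=True` returns all 17 labelled dimensions, which may be useful when inspecting 'perfectionism', 'splitting (binary thinking)', and 'unstable self-image' despite their absence from the paper's analysis.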