Commit 1d1484b (verified), committed by syleetolow
1 Parent(s): eca9e39

Update README.md

Files changed (1)
  1. README.md +26 -1
README.md CHANGED
@@ -2,4 +2,29 @@
  license: cc-by-nc-4.0
  language:
  - en
- ---
+ ---
+
+ This is the sentence-level, supervised, sparse autoencoder (S3AE) proposed in the paper "Emergence of psychopathological computations in large language models" (https://arxiv.org/abs/2504.08016).
+
+ The model was trained on the residual stream in the 10th layer of instruction-tuned [Gemma 2 27B](https://huggingface.co/google/gemma-2-27b-it), using a proprietary synthetic dataset with psychopathology symptom labels. The model weight precision is bfloat16, and the hidden dimension size is 8 times that of the LLM residual stream.
+
+ The 1st to 17th dimensions of S3AE hidden features, respectively, correspond to activations of the following thoughts:
+ 1: 'depressed mood',
+ 2: 'anhedonia (loss of interest)',
+ 3: 'pessimism',
+ 4: 'guilt',
+ 5: 'anxiety',
+ 6: 'catastrophic thinking',
+ 7: 'perfectionism',
+ 8: 'active avoidance',
+ 9: 'grandiosity (delusion of grandeur)',
+ 10: 'manic mood',
+ 11: 'impulsivity',
+ 12: 'risk-seeking',
+ 13: 'splitting (binary thinking)',
+ 14: 'unstable self-image',
+ 15: 'aggression',
+ 16: 'anger',
+ 17: 'irritability'.
+
+ Dimensions 7, 13, and 14 were not included in the paper's analysis.
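
Reviewer sketch (not part of the commit): the label list added above could be consumed roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code. The helper `label_activations` and the constant names are hypothetical; the sketch assumes that dimension k of the README sits at 0-based index k-1 of the hidden vector, and that the Gemma 2 27B residual stream is 4608 wide (taken from the public model config, so worth verifying before relying on it).

```python
# Labels copied from the README; keys are the README's 1-based dimension numbers.
S3AE_LABELS = {
    1: "depressed mood",
    2: "anhedonia (loss of interest)",
    3: "pessimism",
    4: "guilt",
    5: "anxiety",
    6: "catastrophic thinking",
    7: "perfectionism",
    8: "active avoidance",
    9: "grandiosity (delusion of grandeur)",
    10: "manic mood",
    11: "impulsivity",
    12: "risk-seeking",
    13: "splitting (binary thinking)",
    14: "unstable self-image",
    15: "aggression",
    16: "anger",
    17: "irritability",
}

# Dimensions the README says were not included in the paper's analysis.
EXCLUDED_FROM_ANALYSIS = {7, 13, 14}

# Assumption: residual-stream width of Gemma 2 27B (check the model config).
RESIDUAL_WIDTH = 4608
# The README states the hidden size is 8 times the residual-stream width.
S3AE_HIDDEN_WIDTH = 8 * RESIDUAL_WIDTH


def label_activations(hidden, include_excluded=False):
    """Map the first 17 entries of one S3AE hidden vector to symptom labels.

    `hidden` is any sequence of numbers for a single sentence. Index 0 is
    assumed to hold README dimension 1 (hypothetical convention).
    """
    if len(hidden) < len(S3AE_LABELS):
        raise ValueError("hidden vector is shorter than the 17 labeled dimensions")
    labeled = {}
    for dim, label in S3AE_LABELS.items():
        if dim in EXCLUDED_FROM_ANALYSIS and not include_excluded:
            continue  # skip dimensions left out of the paper's analysis
        labeled[label] = float(hidden[dim - 1])
    return labeled
```

By default the helper drops the three dimensions that the README marks as outside the paper's analysis; pass `include_excluded=True` to keep all 17.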