syleetolow/s3ae


syleetolow committed on Apr 14

Commit 1d1484b (verified) · 1 parent: eca9e39

Update README.md


README.md +26 -1


@@ -2,4 +2,29 @@
 license: cc-by-nc-4.0
 language:
 - en
----
+---
+This is the sentence-level, supervised, sparse autoencoder (S3AE) proposed in the paper "Emergence of psychopathological computations in large language models" (https://arxiv.org/abs/2504.08016).
+The model was trained on the residual stream at the 10th layer of instruction-tuned [Gemma 2 27B](https://huggingface.co/google/gemma-2-27b-it), using a proprietary synthetic dataset with psychopathology symptom labels. The model weights are stored in bfloat16, and the hidden dimension is 8 times the width of the LLM residual stream.
+Dimensions 1 to 17 of the S3AE hidden features correspond, in order, to activations of the following thoughts:
+                    1: 'depressed mood',
+                    2: 'anhedonia (loss of interest)',
+                    3: 'pessimism',
+                    4: 'guilt',
+                    5: 'anxiety',
+                    6: 'catastrophic thinking',
+                    7: 'perfectionism',
+                    8: 'active avoidance',
+                    9: 'grandiosity (delusion of grandeur)',
+                    10: 'manic mood',
+                    11: 'impulsivity',
+                    12: 'risk-seeking',
+                    13: 'splitting (binary thinking)',
+                    14: 'unstable self-image',
+                    15: 'aggression',
+                    16: 'anger',
+                    17: 'irritability'.
+Dimensions 7, 13, and 14 were not included in the paper's analysis.
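
The dimension-to-label mapping described in the README can be sketched as a small lookup helper. This is a minimal illustration, not the authors' code: the names `S3AE_LABELS`, `NOT_IN_PAPER_ANALYSIS`, and `label_supervised_features` are hypothetical, and it assumes that README dimension k sits at 0-based index k - 1 of a hidden-feature vector you have already computed.

```python
# Labels copied from the README; keys are the 1-based dimensions listed there.
S3AE_LABELS = {
    1: "depressed mood",
    2: "anhedonia (loss of interest)",
    3: "pessimism",
    4: "guilt",
    5: "anxiety",
    6: "catastrophic thinking",
    7: "perfectionism",
    8: "active avoidance",
    9: "grandiosity (delusion of grandeur)",
    10: "manic mood",
    11: "impulsivity",
    12: "risk-seeking",
    13: "splitting (binary thinking)",
    14: "unstable self-image",
    15: "aggression",
    16: "anger",
    17: "irritability",
}

# Per the README, these dimensions were not included in the paper's analysis.
NOT_IN_PAPER_ANALYSIS = {7, 13, 14}


def label_supervised_features(hidden, skip_unanalyzed=False):
    """Return {label: activation} for the first 17 hidden features.

    `hidden` is any indexable sequence of numbers (list, NumPy array,
    1-D tensor). Assumption: dimension k is stored at index k - 1.
    """
    if len(hidden) < len(S3AE_LABELS):
        raise ValueError("expected at least 17 hidden features")
    labeled = {}
    for dim, label in S3AE_LABELS.items():
        if skip_unanalyzed and dim in NOT_IN_PAPER_ANALYSIS:
            continue
        labeled[label] = float(hidden[dim - 1])
    return labeled
```

Calling `label_supervised_features(features, skip_unanalyzed=True)` keeps only the 14 dimensions the paper analyzed; how `features` is obtained from the S3AE weights is outside this sketch.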