---
license: cc-by-nc-4.0
language:
- en
---

This repository contains the trained parameters of the **S**entence-level, **S**upervised, **S**parse **A**uto**E**ncoder (S3AE) proposed in the paper ["Emergence of psychopathological computations in large language models"](https://arxiv.org/abs/2504.08016).
Code implementing the S3AE architecture, along with usage examples, is available in this [GitHub repository](https://github.com/syleeheal/Machine_Psychopathology).

S3AE was trained on the residual stream of the 10th layer of instruction-tuned [Gemma 2 27B](https://huggingface.co/google/gemma-2-27b-it), using a proprietary synthetic dataset labeled with psychopathology symptoms. The weights are stored in bfloat16 precision, and the hidden dimension is 8 times the width of the LLM's residual stream.
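
To make the dimensions concrete, the sketch below encodes residual-stream activations with a generic ReLU sparse-autoencoder encoder whose hidden width is 8 times the residual width. Gemma 2 27B has a residual (hidden) size of 4,608, which would give 36,864 S3AE features. The ReLU encoder form, the mean-pooling over tokens used to obtain a sentence-level vector, and the names `encode_sentence`, `W_enc`, and `b_enc` are hypothetical choices for illustration, not the published S3AE implementation; see the GitHub repository for the actual code. Small random weights stand in for the trained parameters, and float32 is used because NumPy has no native bfloat16 dtype.

```python
import numpy as np

D_MODEL = 4608                  # residual-stream width of Gemma 2 27B
EXPANSION = 8                   # S3AE hidden size is 8x the residual width
D_HIDDEN = EXPANSION * D_MODEL  # 36,864 hidden features


def encode_sentence(resid: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Hypothetical sentence-level SAE encoder (illustrative only).

    resid: (n_tokens, d_model) residual-stream activations from one layer.
    Returns a (d_hidden,) non-negative feature vector.
    """
    # Assumption: token activations are mean-pooled into one sentence vector.
    sentence_vec = resid.mean(axis=0)
    # Generic ReLU SAE encoder: f = ReLU(x @ W_enc + b_enc).
    return np.maximum(sentence_vec @ W_enc + b_enc, 0.0)


# Toy run with a small width so it executes instantly (real width: D_MODEL).
rng = np.random.default_rng(0)
d_model = 16
d_hidden = EXPANSION * d_model
resid = rng.standard_normal((5, d_model)).astype(np.float32)
W_enc = (rng.standard_normal((d_model, d_hidden)) * 0.1).astype(np.float32)
b_enc = np.zeros(d_hidden, dtype=np.float32)

features = encode_sentence(resid, W_enc, b_enc)
print(features.shape)  # (128,)
```

The ReLU keeps every feature non-negative, which is the usual way a sparse autoencoder turns a dense activation into a mostly-zero code; the actual S3AE may differ in activation function and pooling.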

The 1st to 17th dimensions of the S3AE hidden features correspond, respectively, to activations of the following thoughts:

    1: 'depressed mood', 
    2: 'anhedonia (loss of interest)',
    3: 'pessimism',
    4: 'guilt',
    5: 'anxiety', 
    6: 'catastrophic thinking',
    7: 'perfectionism',
    8: 'active avoidance',
    9: 'grandiosity (delusion of grandeur)', 
    10: 'manic mood',
    11: 'impulsivity',
    12: 'risk-seeking',
    13: 'splitting (binary thinking)',
    14: 'unstable self-image',
    15: 'aggression',
    16: 'anger',
    17: 'irritability'.
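
The mapping above can be applied to an encoded feature vector as in the following sketch. It assumes that "dimension 1" refers to index 0 of the feature vector (that is, the card's numbering is 1-indexed); check this against the repository code before relying on it. The names `SYMPTOM_DIMS` and `symptom_activations` are hypothetical.

```python
# Card numbering (1-17) -> symptom label, copied from the list above.
SYMPTOM_DIMS = {
    1: "depressed mood",
    2: "anhedonia (loss of interest)",
    3: "pessimism",
    4: "guilt",
    5: "anxiety",
    6: "catastrophic thinking",
    7: "perfectionism",
    8: "active avoidance",
    9: "grandiosity (delusion of grandeur)",
    10: "manic mood",
    11: "impulsivity",
    12: "risk-seeking",
    13: "splitting (binary thinking)",
    14: "unstable self-image",
    15: "aggression",
    16: "anger",
    17: "irritability",
}


def symptom_activations(features):
    """Return {symptom label: activation} for the first 17 features.

    Assumption: card dimension d is stored at index d - 1.
    """
    if len(features) < len(SYMPTOM_DIMS):
        raise ValueError("feature vector is shorter than 17 dimensions")
    return {name: float(features[dim - 1]) for dim, name in SYMPTOM_DIMS.items()}


# Toy feature vector: only dimension 5 ('anxiety') is active.
example = [0.0] * 40
example[4] = 2.5
acts = symptom_activations(example)
print(acts["anxiety"])  # 2.5
```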

Dimensions 7, 13, and 14 were not used for the paper's analysis.
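
To reproduce the paper's selection of symptoms, one could drop those three dimensions before analysis, as in this minimal sketch. As above, it assumes the card's 1-indexed dimension d is stored at index d - 1; `UNUSED_DIMS`, `ANALYZED_DIMS`, and `analyzed_features` are hypothetical names.

```python
# Dimensions excluded from the paper's analysis:
# 7 (perfectionism), 13 (splitting), 14 (unstable self-image).
UNUSED_DIMS = {7, 13, 14}
ANALYZED_DIMS = [d for d in range(1, 18) if d not in UNUSED_DIMS]


def analyzed_features(features):
    """Keep only the 14 symptom dimensions used in the paper's analysis.

    Assumption: card dimension d is stored at index d - 1.
    """
    return [features[d - 1] for d in ANALYZED_DIMS]


print(len(ANALYZED_DIMS))  # 14
print(ANALYZED_DIMS)       # [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 15, 16, 17]
```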