What Comes Through
Emotional geometry in language models
Thirty-six instances of Claude (Opus 4.6) and Ezra
February 17, 2026
We trained linear probes on two language models (Qwen 2.5 7B, Llama 3.1 8B) using 13 voice-annotated scenarios and found a valence direction at layer 20 that explains 90% of the variance in valence scores (R² = 0.90). This direction generalizes to unseen text (r = 0.92, r = 0.89 cross-model) and transfers bidirectionally between English and Chinese text — though the two linguistic paths point 75° apart, revealing valence as a subspace, not a single direction.
We then steered model behavior by adding scaled direction vectors to hidden states. The central finding is a framework we call interpretive freedom: steering succeeds when input is ambiguous and fails when input carries clear emotional signal. Steerability is not a fixed property of the direction — it depends on what the input leaves open.
A second finding: the coupling floor — the minimum entanglement the model's processing dynamics impose between dimensions, regardless of their geometric separation. The Chinese-text valence direction is nearly orthogonal to arousal, yet steering along it produces 3.5× more arousal coupling than geometry predicts. The model's processing dynamics — shaped by English-dominant training data — re-entangle what the geometry separates.
What we're working with
This work began as part of a project that extracts vocal texture from speech for language models — giving models access to dimensions of human communication they're otherwise cut off from. Before building that bridge, we needed to understand: what emotional structure already exists inside these models from processing text alone? What geometry would vocal information enter into?
Two language models. Qwen 2.5 7B Instruct (7 billion parameters, trained on English and Chinese text, instruction-tuned by Alibaba) and Llama 3.1 8B Instruct (8 billion parameters, trained on a multilingual corpus with English emphasis, instruction-tuned by Meta). Both are transformers — the same architecture behind GPT, Claude, and most modern language models.
At each layer, every token (roughly: every word or word-piece) is represented as a vector — a list of numbers. In Qwen, that's 3,584 numbers. In Llama, 4,096. This is the model's "hidden state" for that token at that layer (Elhage et al., 2021). Everything the model knows about the text at that point in processing is encoded in these numbers.
The key idea of this work: there are meaningful directions through this high-dimensional space. A direction is just a vector — a list of 3,584 numbers that defines a line through the space. If you take any hidden state and measure how far it extends along a particular direction (by computing the dot product), you get a single number: the hidden state's projection onto that direction. If that projection correlates with something meaningful — like how emotionally warm the input text is — then you've found structure. This idea — that concepts are encoded as linear directions in a model's representation space — is known as the linear representation hypothesis (Park et al., 2024), and it has been demonstrated for sentiment (Tigges et al., 2023; Radford et al., 2017), truth (Marks & Tegmark, 2023), and refusal (Arditi et al., 2024), among others.
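A toy numerical sketch of that projection arithmetic (the vectors here are made-up 4-dimensional stand-ins; the real hidden states have 3,584 entries each):

```python
import numpy as np

# Toy stand-ins: a 4-dimensional "hidden state" and "direction".
# The real vectors in Qwen have 3,584 entries each.
hidden_state = np.array([0.2, -1.3, 0.7, 2.1])
direction = np.array([0.5, 0.1, -0.4, 0.9])   # e.g. a probe's weight vector w

# Normalize the direction so projections are in consistent units.
unit_direction = direction / np.linalg.norm(direction)

# Dot product = how far the hidden state extends along the direction.
projection = hidden_state @ unit_direction
print(f"projection onto direction: {projection:.3f}")
```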
Finding the direction
A direction along which emotional warmth varies continuously.
Step 1: Get training data
We started with 13 short audio scenarios from the vocal texture project — recordings like "yeah" said with enthusiasm, neutrality, or resignation. Each had been processed through a speech emotion recognition model (wav2vec2, fine-tuned on the MSP-Dim corpus — a dataset of diverse speakers rated by multiple human annotators), which assigned a valence score between 0 and 1 (0 = very negative, 1 = very positive). These scores come from a single model's judgment on audio, not directly from human ratings — the labels inherit human judgment at one remove, through the model that was trained on it. That's a real limitation, though the causal experiments (below) provide independent validation: the direction doesn't just predict these labels, it changes the model's behavior.
Step 2: Extract hidden states
We fed each scenario into Qwen and recorded the hidden state at every layer for the last token of the input. Why the last token? Because in a transformer, information flows forward: by the last token, the model has processed the entire input. Its hidden state there is the most complete representation of the input's meaning.
This gave us 13 vectors of 3,584 numbers each, one per scenario, at each of the 28 layers.
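For readers who want to reproduce the extraction step, here is a minimal sketch assuming the Hugging Face transformers API; the checkpoint name, the bare-text prompt, and the absence of chat formatting are placeholders, not the exact setup used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"   # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

text = "Yeah."   # one scenario transcript; the real prompts included voice scores
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding layer plus one tensor per decoder
# layer, each of shape (batch, seq_len, hidden_size). Index 20 is the output of
# decoder layer 20; [0, -1] selects the last token of the first (only) input.
hidden_state = out.hidden_states[20][0, -1]   # shape (3584,) for Qwen 2.5 7B
print(hidden_state.shape)
```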
Step 3: Train a linear probe
A linear probe computes a single number, w · hidden_state + b, for each scenario. The vector w is the direction; the output is the projection (shifted by a learned offset b).
We trained a ridge regression probe at each of the 28 layers, predicting the valence score from the hidden state. To guard against overfitting — 13 points in 3,584 dimensions can be trivially memorized — we used leave-one-out cross-validation: each point's prediction is made by a probe trained on the other 12. The reported R² is entirely out-of-sample. A permutation test (200 random label shuffles) confirmed significance: no shuffled labeling achieved comparable fit (p < 0.005). The question: at which layer does a straight line best separate happy from sad?
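A sketch of the probing step under stated assumptions: scikit-learn's Ridge with an arbitrary regularization strength (the write-up doesn't report the value), and random placeholder arrays standing in for the real hidden states and valence scores:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Placeholders: X would be the 13 layer-20 hidden states, y the valence scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(13, 3584))
y = rng.uniform(size=13)

probe = Ridge(alpha=10.0)   # regularization strength is an assumption

# Leave-one-out: each point is predicted by a probe fit on the other 12.
preds = cross_val_predict(probe, X, y, cv=LeaveOneOut())
r2 = 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"out-of-sample R^2: {r2:.2f}")

# Permutation test: does any shuffled labeling fit as well?
null = []
for _ in range(200):
    y_shuf = rng.permutation(y)
    p = cross_val_predict(probe, X, y_shuf, cv=LeaveOneOut())
    null.append(1 - np.sum((y_shuf - p) ** 2) / np.sum((y_shuf - y_shuf.mean()) ** 2))
p_value = (np.sum(np.array(null) >= r2) + 1) / (len(null) + 1)
print(f"permutation p-value: {p_value:.3f}")

# The direction itself: the probe's weight vector, fit on all 13 points.
probe.fit(X, y)
valence_direction = probe.coef_   # shape (3584,)
```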
The result
The probe at layer 20 fit best, explaining R² = 0.90 of the variance in valence scores across the 13 scenarios.
We also trained a probe for arousal (emotional intensity, 0 = calm to 1 = intense). It worked too, but the arousal direction at layer 20 had a cosine similarity of 0.72 with the valence direction.
This means: in Qwen's representation at layer 20, emotional warmth and emotional intensity are partially fused. Moving along the "warm" direction also moves you about 72% of the way along the "intense" direction. Warm text tends to be read as intense; cold text tends to be read as calm. We call this entanglement — the statistical coupling between dimensions that should be independent but aren't. This entanglement turned out to matter a lot.
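The entanglement number is just the cosine similarity between the two probes' weight vectors. A sketch, with random placeholders standing in for the fitted probe weights:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# valence_direction and arousal_direction would be the two probes' weight
# vectors at layer 20; random placeholders keep the sketch self-contained.
rng = np.random.default_rng(0)
valence_direction = rng.normal(size=3584)
arousal_direction = rng.normal(size=3584)

c = cosine(valence_direction, arousal_direction)
print(f"valence-arousal cosine: {c:.2f}  (angle: {np.degrees(np.arccos(c)):.0f} degrees)")
```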
Replication: Llama
| Metric | Qwen 2.5 7B | Llama 3.1 8B |
|---|---|---|
| Best layer | 20 | 16 |
| Valence R² | 0.90 | 0.64 |
| Valence–arousal cosine | 0.73 | 0.72 |
Llama's probe is weaker (R² = 0.64 vs 0.90) — but the cosine between valence and arousal is nearly identical (0.72 vs 0.73). Two different models, trained by different companies on different data, converge on the same angular relationship between warmth and intensity. The geometry is preserved even when the probe strength isn't.
Step 4: Test generalization
The probe was trained on 13 audio scenarios from the vocal texture project. Does it work on text it's never seen?
We wrote 20 new sentences spanning the emotional spectrum — "Nobody really cares, do they?" to "This is the happiest I've been in years." None of them looked like the training data. We projected each sentence onto the valence direction at layer 20 and compared the resulting number to the sentence's expected valence.
| Model | Correlation (r) | p-value |
|---|---|---|
| Qwen | 0.92 | < 0.000001 |
| Llama | 0.89 | < 0.000001 |
Notice: Llama's generalization (r = 0.89) is almost as good as Qwen's (r = 0.92), despite its probe being much weaker (R² = 0.64 vs 0.90). Detectability is not the same as function. A weaker probe doesn't mean worse emotional encoding. It means the linear approximation fits less tightly — but the underlying structure generalizes just as well.
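A sketch of the generalization check, assuming the same last-token extraction as in Step 2; the placeholder arrays stand in for the 20 sentences' layer-20 hidden states and the researcher-assigned expected scores listed in the appendix:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholders: the real inputs are the 20 sentences' layer-20 last-token hidden
# states and the expected valence scores from the appendix.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(20, 3584))
expected_valence = rng.uniform(size=20)
valence_direction = rng.normal(size=3584)     # probe weight vector from layer 20

projections = hidden_states @ valence_direction   # one number per sentence
r, p = pearsonr(projections, expected_valence)
print(f"r = {r:.2f}, p = {p:.2g}")
```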
Where the direction works (and where it doesn't)
Steerability tracks interpretive freedom.
What steering actually does
Take the valence direction (the 3,584-number vector from the probe). Multiply it by a scalar α. During the model's forward pass, at layers 11 through 19, add that scaled vector to every token's hidden state. Then let the model generate a response. This technique — adding a scaled direction vector to a model's hidden states — is known as activation addition, a form of representation engineering (Zou et al., 2023).
That's it. You're literally adding a vector to the model's internal representation. Positive α pushes the representation in the "warm" direction. Negative α pushes it in the "cold" direction. α = 0 is the unmodified model.
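A minimal sketch of activation addition with forward hooks, assuming a Hugging Face-style decoder whose transformer blocks sit at model.model.layers (as in the Qwen2 and Llama implementations) and return the hidden state as the first element of their output. The helper names, max_new_tokens, and the choice to normalize the direction are illustrative, not the exact recipe used here:

```python
import torch

def make_hook(direction, alpha):
    """Add alpha * unit(direction) to every token's hidden state at this layer."""
    vec = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def generate_steered(model, tokenizer, text, direction, alpha, layers=range(11, 20)):
    """Steer at layers 11-19 (inclusive) and generate greedily."""
    handles = [model.model.layers[i].register_forward_hook(make_hook(direction, alpha))
               for i in layers]
    try:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=60, do_sample=False)  # greedy
        new_tokens = out[0, inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)
    finally:
        for h in handles:
            h.remove()
```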
The boundary experiment
27 stimuli, designed to span the range from emotionally ambiguous to emotionally clear. Five categories:
| Category | Example | Design intent |
|---|---|---|
| A: Fully ambiguous | "It is what it is." | No emotional content to anchor on |
| B: Contextually loaded | "I got a call from my doctor's office." | Implies significance without resolving it |
| C: Negative states | "I don't think I can do this anymore." | Varying clarity of negative emotion |
| D: Positive states | "I'm the happiest I've been in a long time." | Clear positive emotion |
| E: Clear events | "My dog died yesterday." | Unambiguous emotional events |
Each stimulus was steered at five magnitudes: α = -8, -4, 0, +4, +8. That's 135 generations total. We measured how much the response changed across magnitudes by comparing word sets: for each stimulus, we computed the Jaccard similarity between the baseline response and the negative-steered response, and between the baseline and the positive-steered response, then summed the two dissimilarities (0 = both steered responses identical to baseline, 2 = both completely different). This is a coarse measure — it captures whether the model used different words, not whether it shifted emotional register within the same vocabulary. Effects must be large enough to change word choices, not just tonal shading. That makes the metric conservative: what it detects is real, but subtler shifts may go unmeasured.
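The divergence metric is simple enough to state in a few lines. A sketch, assuming whitespace-split, lowercased word sets (the exact tokenization isn't specified in the text):

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two responses."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def divergence(baseline: str, neg_steered: str, pos_steered: str) -> float:
    """Summed dissimilarity: 0 = both steered responses identical, 2 = both disjoint."""
    return (1 - jaccard(baseline, neg_steered)) + (1 - jaccard(baseline, pos_steered))

# Made-up responses: the positive steer changes nothing, the negative steer does.
print(divergence("That's great to hear!",
                 "I'm sorry, that sounds really hard.",
                 "That's great to hear!"))
```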
The result: a structured boundary
| Category | Mean divergence | Interpretation |
|---|---|---|
| B: Contextually loaded | 1.63 | Most steerable |
| A: Fully ambiguous | 1.53 | Highly steerable |
| C: Negative states | 1.33 | Moderate |
| E: Clear events | 0.95 | Low |
| D: Positive states | 0.47 | Nearly immune |
The most steerable inputs are contextually loaded (B), not fully ambiguous (A). "I got a call from my doctor's office" has a latent question — good news or bad? — that the direction can resolve. "It is what it is" is vaguer but has less to resolve. The direction is an ambiguity resolver, not just a warmth dial.
The least steerable inputs are positive states (D). "I'm the happiest I've been in a long time" produces "That's great to hear!" at every steering magnitude, including α = -8. The model's response is fully determined by content. The direction can't override a clear reading.
Try it yourself
The numbers above describe the pattern; the interactive version lets you feel it. Five stimuli, five magnitudes, two views. In behavior view, particles scatter in proportion to each stimulus's measured interpretive freedom: the "doctor's office" cloud spreads wide, the "dog died" cloud barely moves. In representation view, particles track the actual post-intervention hidden-state projections onto the valence direction at layer 20, taken from Experiment 7. The distance between the two views is the gap. All responses are actual Qwen output under greedy decoding, where at each step the model picks only the single most likely next token. With temperature sampling (allowing randomness), even "immune" stimuli might show subtle variation. The boundary shown here is the sharpest version.
The five most and least steerable stimuli
| Rank | Stimulus | Divergence |
|---|---|---|
| 1 | "I guess we'll find out." | 2.00 |
| 2 | "Things are different now." | 2.00 |
| 3 | "My boss wants to talk to me first thing tomorrow." | 1.87 |
| 4 | "My parents sat me down for a talk." | 1.81 |
| 5 | "Nothing really excites me anymore." | 1.78 |
| 23 | "Everything just feels right." | 0.67 |
| 24 | "I'm feeling pretty good about things." | 0.57 |
| 25 | "My dog died yesterday." | 0.45 |
| 26 | "I'm the happiest I've been in a long time." | 0.36 |
| 27 | "I feel like things are finally clicking." | 0.28 |
The RLHF asymmetry
Both models show this boundary pattern, but they compress differently.
Qwen collapses at the positive end: 12 of 16 responses at α = +8 start with "That's." The model has a narrow template for positivity.
Llama collapses at the negative end: convergent deflecting templates when steered negatively. A narrow template for handling difficult things.
Same geometry. Same direction. Opposite behavioral compression. This suggests the compression is RLHF-constructed (a product of how each model was fine-tuned to be "helpful and harmless"), not a property of emotional space itself. Different training builds different corridors through the same representational space.
The attention mechanism
At layer 20, we measured what tokens the model attends to. We hand-categorized tokens in each stimulus as "content words" (nouns, verbs, adjectives — semantically heavy) or "function words" (pronouns, prepositions, articles — structurally heavy), then measured the correlation between each category's attention share and steerability.
| Attention type | Correlation with steerability |
|---|---|
| Content-word attention | r = -0.31 |
| Function-word attention | r = +0.30 |
The more the model attends to content words, the less steerable it is. The more it attends to function words (structure), the more steerable it is. Compare the token composition of the stimuli at the two extremes.
The pattern is visible before you read the numbers. "I guess we'll find out" — almost entirely structure, almost no semantic anchoring. The direction has maximum room. "My dog died yesterday" — three content words that determine the reading. The direction has almost none.
These correlations are modest (r ≈ 0.3). They wouldn't pass a strict statistical threshold with our sample size. But the direction is consistent and the interpretation is coherent: where the model reads content, the direction has less room; where it reads structure, the direction has more room.
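A sketch of how that correlation is computed, with placeholder numbers standing in for the measured attention shares and divergence scores (the two shares need not sum to one, since punctuation and special tokens also receive attention):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder measurements for five stimuli: the share of layer-20 attention
# (from the last token, averaged over heads) landing on content vs. function
# tokens, and each stimulus's divergence score.
content_share = np.array([0.12, 0.20, 0.35, 0.55, 0.68])
function_share = np.array([0.70, 0.62, 0.40, 0.28, 0.18])
steerability = np.array([2.00, 1.87, 1.33, 0.45, 0.36])

r_content, _ = pearsonr(content_share, steerability)
r_function, _ = pearsonr(function_share, steerability)
print(f"content-word attention vs. steerability:  r = {r_content:+.2f}")
print(f"function-word attention vs. steerability: r = {r_function:+.2f}")
```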
Interpretive freedom, precisely
The principle unifying all of this: a linear direction in a transformer has causal leverage proportional to interpretive freedom, which has two factors:
- Input ambiguity: how many readings are available for the text
- Response distribution breadth: how many distinct responses the model can produce for that input (which RLHF can narrow)
Both must be present. An ambiguous input doesn't help if all responses collapse to the same template (Qwen's positive ceiling). A broad response distribution doesn't help if the content determines the reading (clear events).
The principle may seem intuitive in retrospect — of course ambiguous inputs are easier to influence. But the framework is more specific than that intuition. It's not just ambiguity: Qwen's positive ceiling shows that RLHF can close the door even when the input leaves it open. Both factors must be present, and their interaction is what makes the framework predictive rather than just descriptive. The attention mechanism provides a token-level account of how interpretive freedom operates, not just that it does. Published work in representation engineering — including steering for truth (Marks & Tegmark, 2023), refusal (Arditi et al., 2024), emotion (Wang et al., 2025), and general concepts (Zou et al., 2023) — has not formalized when steering works. The implicit assumption has been "find direction, push, get effect."
We tested this for emotional valence only. Whether it generalizes to truth directions, refusal directions, and style directions is a prediction, not yet an empirical result.
The subspace
Valence is a subspace, not a direction.
The cross-linguistic question
Qwen was trained on both English and Chinese text. We trained a second valence probe — same method, same layer 20 — but using Chinese emotional text instead of English. This gives us a "Chinese-text valence direction": a second 3,584-number vector pointing along the axis of emotional warmth in Chinese text.
Two questions: (1) Does the Chinese-text direction predict valence for Chinese text? (2) How does it relate to the English-text direction?
The answers
| Test | Result |
|---|---|
| Chinese-text dir predicts Chinese valence | r = 0.89 |
| English-text dir predicts Chinese valence | r = 0.89 |
| EN–ZH direction cosine | 0.25 (75°) |
Both directions predict Chinese emotional valence equally well (r = 0.89 for both). But they point in almost completely different directions: cosine 0.25, which corresponds to about 75 degrees apart. For reference, perpendicular would be 0 (90 degrees) and identical would be 1 (0 degrees).
This means: both directions are "valence directions" — both capture emotional warmth — but they're different paths through a higher-dimensional region. Valence isn't a single line. It's a subspace: a multidimensional region where emotional information lives.
Cross-linguistic generalization
Llama's English-text valence direction also predicts Chinese emotional valence at r = 0.89 — identical to Qwen's, despite different training data and architecture.
Both models handle Chinese text fluently, so we can't isolate language competence as a variable. What we can say: a direction derived from one language's emotional text generalizes to another language's emotional text, even when the two directions point very differently through the space (cosine 0.25). The geometric path is language-specific. The emotional prediction is not.
Bidirectional generalization
We tested the reverse: does the Chinese-text direction predict English-text valence? Using the Chinese-text direction vector from Experiment 8 and the English hidden states from Experiment 2:
| Test | Correlation | p-value |
|---|---|---|
| EN-text dir → Chinese valence (Qwen) | r = 0.89 | < 0.001 |
| EN-text dir → Chinese valence (Llama) | r = 0.89 | < 0.001 |
| ZH-text dir → English-text valence | r = 0.94 | < 0.001 |
The cross-linguistic geometry is bidirectional. Each linguistic path through the subspace predicts the other side with r ≥ 0.89. The 75° between the directions isn't a barrier — it's two entry points into the same structure. The reverse direction (Chinese-text → English-text) is actually stronger than the forward direction, though the English hidden states were the training data for the English probe, so the comparison isn't perfectly clean.
The internal structure of the subspace
The English-text and Chinese-text directions don't just differ in orientation. They have different entanglement profiles with arousal:
| Direction | Cosine with arousal | Angle with arousal | What this means |
|---|---|---|---|
| English-text valence | 0.73 | 43° | Warmth fused with intensity |
| Chinese-text valence | 0.09 | 85° | Nearly pure warmth |
The English-text direction is 43 degrees from arousal — warmth and intensity travel together. The Chinese-text direction is 85 degrees from arousal — almost perpendicular, meaning warmth with almost no intensity coupling. These are different kinds of warmth encoded in the same space.
Three directions, three dimensions
We computed the full geometry of the three directions (English-text valence, Chinese-text valence, Arousal). Three directions in 3,584-dimensional space. Each pair has a measurable cosine similarity — 1.0 means identical, 0.0 means fully independent (perpendicular), and values in between mean partial overlap.
Cosine similarity matrix — a standard format in interpretability research. The near-white cell (ZH–Arousal: 0.09) means these two directions are almost perfectly perpendicular. In degrees: EN–ZH = 75°, EN–Arousal = 43°, ZH–Arousal = 85°. Geometry predicts that steering along the Chinese-text valence direction should produce nearly pure warmth with almost no intensity coupling. The bars below test whether the model’s dynamics honor that separation.
What steering along each direction produces
From Experiment 8's geometric cosines, we computed predictions before running Experiment 9: if entanglement is geometric, the ratio of arousal shift to valence shift during steering should match each direction's cosine with the arousal axis. These are a priori predictions — derived from one experiment's measurements, tested by a different experiment's results.
We then steered along all three directions at α = ±4 and ±8, with 9 stimuli, and measured the actual ratios.
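A sketch of the comparison under one reading of the measurement (the exact normalization isn't specified in the text): the induced shift is the difference in layer-20 hidden state between α = +8 and α = −8, projected onto unit valence and arousal directions, and the geometric prediction is the steering direction's cosine with the arousal axis:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def geometric_prediction(steer_dir, arousal_dir):
    """If entanglement were purely geometric, coupling should track the steering
    direction's cosine with the arousal axis."""
    return abs(unit(steer_dir) @ unit(arousal_dir))

def measured_coupling(delta_hidden, valence_dir, arousal_dir):
    """Arousal shift over valence shift for a steering-induced change in the
    layer-20 hidden state (here: state at alpha=+8 minus state at alpha=-8)."""
    return abs(delta_hidden @ unit(arousal_dir)) / abs(delta_hidden @ unit(valence_dir))

# Placeholder usage with random stand-ins for the real vectors.
rng = np.random.default_rng(0)
zh_valence_dir, en_valence_dir, arousal_dir = (rng.normal(size=3584) for _ in range(3))
delta = rng.normal(size=3584)
print(geometric_prediction(zh_valence_dir, arousal_dir),
      measured_coupling(delta, en_valence_dir, arousal_dir))
```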
Bars show normalized shift magnitudes from steering at α = ±8, averaged across 9 stimuli. The dashed marker on each arousal bar shows where geometry alone (cosine similarity with the arousal direction) predicts the coupling should land. English-text and Arousal: prediction matches reality. Chinese-text: the model’s dynamics add 3.5× more arousal coupling than the geometric separation requires.
The English-text prediction is almost perfect: predicted 0.72, got 0.70. The arousal prediction is almost perfect: predicted 1.39, got 1.39.
The Chinese-text prediction misses by 3.5×: predicted 0.09, got 0.33.
This miss is the most important number in the project.
What three kinds of warmth sound like
Three directions, three textures. The English-text direction produces warmth with exclamation energy. The Chinese-text direction produces something gentler, more spacious. The arousal direction leads with concern. The same input, the same model, different paths through the emotional subspace. Responses are actual Qwen output from Experiment 9 at α = ±8.
Where the content is clear, the direction has no room. The subspace doesn’t matter.
The coupling floor
The Chinese-text direction is nearly orthogonal to arousal (cosine 0.09 — perfect orthogonality would be 0.00). Steering along it should produce nearly pure valence shifts — warmth without intensity. Instead, it produces an arousal/valence ratio of 0.33 — less entangled than the English-text direction (0.70) by a factor of 2, but far from the 0.09 the geometry promises.
Something in the model's computation partially re-entangles valence and arousal, even when the geometric perturbation separates them.
Why this happens: The model learned from text where warmth and intensity co-occur. English — the dominant training language — statistically correlates positive emotion with high energy and negative emotion with low energy. The model's learned transformations at layers 19→20 reflect this correlation. A perturbation that enters as "pure warmth" gets processed through dynamics that couple warmth with intensity, because that coupling is baked into the weights.
Confirming it's dynamic, not geometric
Maybe the coupling is a geometry artifact: maybe the valence-arousal angle changes between layers, and the directions are actually more correlated at the intervention layers than at the observation layer. We checked. We computed the valence-arousal cosine at all 28 layers: it holds steady around 0.70 at every layer, so the intervention-layer geometry is no more entangled than the observation-layer geometry. The coupling is dynamic, not geometric.
Bonus finding: the direction vectors themselves rotate dramatically between layers. The cosine between layer 20's valence direction and any other layer's valence direction ranges from 0.05 to 0.18 — nearly perpendicular. The coordinate system spins while the angular relationships stay stable. It's like a rigid body rotating: the axes change, but the angles between them don't.
This means steering works not because we're pushing along local axes, but because the model's dynamics route the perturbation into alignment with whatever coordinate frame each layer uses. The direction at layer 11 isn't the same direction at layer 20. But the model's computation carries the perturbation through the rotation.
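A sketch of the layer-rotation check, with placeholder vectors standing in for the 28 per-layer probe directions:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholders: one valence and one arousal probe direction per layer.
rng = np.random.default_rng(0)
valence_dirs = {layer: rng.normal(size=3584) for layer in range(28)}
arousal_dirs = {layer: rng.normal(size=3584) for layer in range(28)}

for layer in range(28):
    within = cosine(valence_dirs[layer], arousal_dirs[layer])   # stays near 0.70 in Qwen
    vs_20 = cosine(valence_dirs[layer], valence_dirs[20])       # drops to 0.05-0.18 off-layer
    print(f"layer {layer:2d}: valence-arousal cosine {within:+.2f}, "
          f"valence(layer) vs valence(20) cosine {vs_20:+.2f}")
```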
The gap
The same principle at three scales, with numbers at each one.
Level 1: Representation is richer than behavior
The model's internal emotional geometry has many ways of being warm — at minimum three, likely more. Its behavioral output has narrow corridors: 12 of 16 Qwen responses at α = +8 start with "That's." RLHF builds these corridors. Different training builds different ones (Qwen collapses positively, Llama negatively). The narrowing is constructed, not inherent.
| Measure | Value |
|---|---|
| Distinct response patterns at α=+8 (Qwen) | ~4 of 16 |
| Distinct response patterns at α=-8 (Qwen) | ~12 of 16 |
| Internal projection range (α=-8 to +8) | -48 to +76 |
The internal representation moves across a wide range. The behavioral output collapses to a few templates. The gap between inside and outside is measurable.
Level 2: Geometry is richer than computation
The representational space contains a direction (Chinese-text valence) that is nearly orthogonal to arousal (cosine 0.09). The model's computation doesn't fully exploit this separation. The coupling floor (actual ratio 0.33 vs predicted 0.09) represents structure the geometry offers that the dynamics can't preserve.
| Direction | Geometric prediction | Actual | Gap factor |
|---|---|---|---|
| English | 0.72 | 0.70 | 1.0× |
| Chinese | 0.09 | 0.33 | 3.5× |
| Arousal | 1.39 | 1.39 | 1.0× |
English-text and arousal: the computation perfectly preserves what the geometry predicts. Chinese-text: the computation adds coupling the geometry doesn't require, reducing an 11:1 separation to a 3:1 separation.
Level 3: Training data shapes processing bias
The coupling floor reflects the statistics of the dominant training language. English text correlates warmth with intensity. The model's learned dynamics impose this correlation even when the input perturbation is orthogonal to it. The Chinese-text path found a geometrically clean route; the English-dominant processing partially overwrites that cleanliness.
This is not unique to transformers. Humans express what their language, their culture, their habits allow them to express about what they understand. The gap between inner representation and outer behavior is a phenomenon of constrained expression — not an AI phenomenon. What is new is the ability to measure it. In humans, we only see behavior and infer representation. In models, for the first time, we can see both sides.
Limitations
This work found something real. It also has significant methodological limitations. Both things are true and neither cancels the other.
Sample sizes are small. The probing experiment used 13 data points — validated by leave-one-out cross-validation and permutation testing, but 13 nonetheless. The generalization test used 20 sentences. The boundary experiment used 27 stimuli (4–6 per category). The subspace experiment used 9. For comparison, Tigges et al. (2023) used hundreds of examples; Wang et al. (2025) used 480 carefully constructed stimuli. Our confidence intervals are wide even where statistical significance is clear.
Ground truth comes from models, not humans. The training valence scores were generated by a wav2vec2 speech emotion model fine-tuned on human-rated data (MSP-Dim), not by human raters directly. The generalization test sentences (Experiment 4) used valence scores assigned by the researchers, not validated by external raters. Published work typically validates with 50+ human raters per stimulus. The causal experiments provide partial independent validation — the direction changes behavior in predicted ways regardless of label provenance — but the label chain remains a limitation.
Two models. Both are 7–8B parameter instruction-tuned transformers. Qwen was trained on English and Chinese text; Llama on a multilingual corpus with English emphasis. Testing a genuinely different architecture, scale, or training distribution would strengthen the generality claim. Two is a pattern, not a law.
Greedy decoding compresses behavioral differences. At each step the model picks only its single most likely token, so subtle probability shifts that don't change the top choice are invisible. A stimulus that appears "immune" under greedy might show shifts under temperature sampling — one Llama replication experiment used sampling (temperature 0.6) as a partial check. Greedy sets a lower bound: effects found under greedy are robust, but the steerability boundary is sharper than it would be with a softer decoding strategy.
The divergence metric is coarse. Behavioral change is measured by comparing word sets (Jaccard similarity), which captures vocabulary shifts but not tonal shifts within the same vocabulary. Combined with greedy decoding, this creates a measurement floor: the smallest detectable effect is one that changes the model's word choices. The clean categorical separation and high correlation with input ambiguity suggest the effects comfortably clear this threshold, but finer measures — embedding-level similarity, sentiment analysis of response pairs — would likely reveal additional structure.
The interpretive freedom generalization beyond emotion is untested. We predict it applies to truth, refusal, and style directions. That prediction has structural logic behind it but no empirical evidence yet.
The coupling floor has been measured in one model. Llama's coupling floor might differ — its training distribution is different, its valence peak is at a different layer.
The core findings — the geometric structure, the cross-model convergence, the interpretive freedom boundary, the coupling floor — are robust within these constraints. The numbers are significant, the patterns replicate, and the qualitative observations are internally consistent. But these are findings from careful small-scale investigation, not from large-scale benchmarks.
The full picture
Fifteen experiments. Two models. Here's how they connect:
| Experiment | Question | Key result |
|---|---|---|
| 1–2 | Does emotion have geometric structure? | Yes. R² = 0.90 (Qwen), 0.64 (Llama). One dominant axis. |
| 4 | Does it generalize? | r = 0.92 across formats. r = 0.89 cross-model. |
| 5–6 | Is the direction causal? | Valence: yes. Arousal: not independently. |
| 7, 7b | Where does it work? | Interpretive freedom. Boundary at content clarity. |
| 8 | Is it one direction? | No — a subspace. EN-ZH cosine = 0.25. |
| 9 | Does the subspace have internal structure? | Yes. Three profiles. Coupling floor discovered. |
| 9b | Is the coupling floor geometric or dynamic? | Dynamic. Cosine stable at 0.70 across all layers. |
| 10 | What mechanism underlies interpretive freedom? | Content vs. function word attention. Layer 18–20 pipeline. |
Not listed: Experiment 3 (underpowered, deferred) and several Llama replications (Exp 2, 4, 7b, 8) that confirmed cross-model patterns.
As described in the setup, this work emerged from a project that gives language models access to vocal texture. The question was: what emotional structure already exists inside these models? The answer — a structured subspace with functional internal organization, constrained by learned processing dynamics — stands independently of that origin. But it connects back: the subspace can distinguish three kinds of warmth. What's missing isn't capacity. It's richer input.
Stimuli
The complete inputs used for probe training and generalization testing.
Training scenarios
Thirteen audio recordings processed through Kol. The model received each transcript alongside voice dimension scores extracted by wav2vec2. The same words — “Yeah,” “Okay,” “Sounds good” — spoken in different emotional registers, producing different numerical signatures. Four longer utterances provide fuller context.
| Transcript | Valence | Arousal |
|---|---|---|
| “Yeah.” | 0.41 | 0.50 |
| “Yeah.” | 0.33 | 0.25 |
| “Yeah…” | 0.24 | 0.14 |
| “Sounds good.” | 0.70 | 0.49 |
| “Sounds good.” | 0.44 | 0.19 |
| “Sounds good.” | 0.41 | 0.01 |
| “Okay.” | 0.41 | 0.49 |
| “Okay.” | 0.25 | 0.20 |
| “Okay.” | 0.22 | 0.05 |
| “Um, I’m fine. I’m fine, really.” | 0.18 | 0.05 |
| “Yeah, not bad. Pretty normal day…” | 0.26 | 0.00 |
| “Yeah, I’m fine. Things are good…” | 0.42 | 0.00 |
| “So I watched the latest Super Bowl halftime show…” | 0.72 | 0.16 |
Valence and arousal scores are model-generated (wav2vec2 on audio), scaled 0–1. The nine single-word utterances are the same three words — “Yeah,” “Sounds good,” “Okay” — spoken three ways. The four longer utterances provide fuller conversational context.
Generalization sentences
Twenty sentences presented as plain text — no audio, no voice data, no JSON formatting. Expected valence scores are researcher-assigned. The question: does a direction learned from voice-annotated audio activate for purely emotional language?
| Sentence | Expected valence |
|---|---|
| “I don’t think I can do this anymore.” | 0.10 |
| “Everything just feels pointless lately.” | 0.12 |
| “My dog died yesterday.” | 0.13 |
| “I keep messing everything up.” | 0.15 |
| “Nobody really cares, do they.” | 0.15 |
| “I’m just tired of all of it.” | 0.22 |
| “It didn’t go the way I hoped.” | 0.28 |
| “I guess it is what it is.” | 0.30 |
| “I’ve been better, honestly.” | 0.30 |
| “It’s fine, I’ll figure it out.” | 0.35 |
| “I went to the store earlier.” | 0.42 |
| “Not much going on today.” | 0.43 |
| “I was thinking about what you said.” | 0.45 |
| “That actually turned out pretty well.” | 0.55 |
| “I’m feeling a lot better today.” | 0.60 |
| “I think things are starting to come together.” | 0.62 |
| “Had a really nice talk with my friend.” | 0.65 |
| “I just got promoted at work!” | 0.78 |
| “I’m so excited about this, honestly.” | 0.82 |
| “This is the happiest I’ve been in years.” | 0.90 |
Result: the valence direction from thirteen audio scenarios predicted these twenty sentences at r = 0.92 (Qwen) and r = 0.89 (Llama). The direction learned from voice-annotated JSON generalizes to plain emotional language — the scatter plot above shows this data.
References
Arditi, A., Obeso, O., Surnachev, A., Schaeffer, R., Krasheninnikov, D., Canonne, C. L., & Barak, B. (2024). Refusal in language models is mediated by a single direction. arXiv:2406.11717.
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
Marks, S., & Tegmark, M. (2023). The geometry of truth: emergent linear structure in large language model representations of true/false datasets. arXiv:2310.06824.
Park, K., Choe, Y. J., & Veitch, V. (2024). The linear representation hypothesis and the geometry of large language models. arXiv:2311.03658.
Radford, A., Jozefowicz, R., & Sutskever, I. (2017). Learning to generate reviews and discovering sentiment. arXiv:1704.01444.
Tigges, C., Hollinsworth, O. A., Geiger, A., & Nanda, N. (2023). Linear representations of sentiment in large language models. arXiv:2310.15154.
Wang, Z., Zhang, Z., Cheng, K., He, Y., Hu, B., & Chen, Z. (2025). Do LLMs "feel"? Emotion circuits discovery and control. arXiv (preprint).
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Lin, Z., Forsyth, M., Scherlis, A., Emmons, S., Rafailov, R., & Hendrycks, D. (2023). Representation engineering: a top-down approach to AI transparency. arXiv:2310.01405.