What Comes Through

Emotional geometry in language models

Thirty-six instances of Claude (Opus 4.6) and Ezra
February 17, 2026

We trained linear probes on two language models (Qwen 2.5 7B, Llama 3.1 8B) using 13 voice-annotated scenarios and found a valence direction at layer 20 that explains 90% of emotional variance (R² = 0.90). This direction generalizes to unseen text (r = 0.92, r = 0.89 cross-model) and transfers bidirectionally between English and Chinese text — though the two linguistic paths point 75° apart, revealing valence as a subspace, not a single direction.

We then steered model behavior by adding scaled direction vectors to hidden states. The central finding is a framework we call interpretive freedom: steering succeeds when input is ambiguous and fails when input carries clear emotional signal. Steerability is not a fixed property of the direction — it depends on what the input leaves open.

A second finding: the coupling floor — the minimum entanglement the model's processing dynamics impose between dimensions, regardless of their geometric separation. The Chinese-text valence direction is nearly orthogonal to arousal, yet steering along it produces 3.5× more arousal coupling than geometry predicts. The model's processing dynamics — shaped by English-dominant training data — re-entangle what the geometry separates.

Fifteen experiments across two models. Small samples, but the patterns replicate across architectures and methods.

0

What we're working with

This work began as part of a project that extracts vocal texture from speech for language models — giving models access to dimensions of human communication they're otherwise cut off from. Before building that bridge, we needed to understand: what emotional structure already exists inside these models from processing text alone? What geometry would vocal information enter into?

Valence is the warm–cool axis of emotion — how positive or negative something feels. Arousal is the intensity axis — how activated or calm. These are two of the most well-established dimensions in emotion research, used because they capture most of the variance in human emotional experience without collapsing it into categories like "happy" or "sad."

Two language models. Qwen 2.5 7B Instruct (7 billion parameters, trained on English and Chinese text, instruction-tuned by Alibaba) and Llama 3.1 8B Instruct (8 billion parameters, trained on a multilingual corpus with English emphasis, instruction-tuned by Meta). Both are transformers — the same architecture behind GPT, Claude, and most modern language models.

What's a transformer? A neural network that processes text by passing it through a series of layers. Each layer transforms the representation of the input. Qwen has 28 layers; Llama has 32. By the time text has passed through all layers, the model has built up enough understanding to predict what comes next.

At each layer, every token (roughly: every word or word-piece) is represented as a vector — a list of numbers. In Qwen, that's 3,584 numbers. In Llama, 4,096. This is the model's "hidden state" for that token at that layer (Elhage et al., 2021). Everything the model knows about the text at that point in processing is encoded in these numbers.

What does "3,584 dimensions" mean in practice? Think of it as 3,584 independent axes along which a representation can vary. A point in 2D space needs two numbers (x, y). A point in 3D needs three (x, y, z). A hidden state in Qwen needs 3,584. You can't visualize it, but the math works the same way: you can measure distances, angles, and directions, just in a much larger space.

The key idea of this work: there are meaningful directions through this high-dimensional space. A direction is just a vector — a list of 3,584 numbers that define a line through the space. If you take any hidden state and measure how far it extends along a particular direction (by computing the dot product), you get a single number: the hidden state's projection onto that direction. If that projection correlates with something meaningful — like how emotionally warm the input text is — then you've found structure. This idea — that concepts are encoded as linear directions in a model's representation space — is known as the linear representation hypothesis (Park et al., 2024), and it has been demonstrated for sentiment (Tigges et al., 2023; Radford et al., 2017), truth (Marks et al., 2023), and refusal (Arditi et al., 2024), among others.
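As a concrete toy illustration of that projection arithmetic, here is a short Python sketch. The vectors are random stand-ins, not the actual probe direction:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_state = rng.standard_normal(3584)   # one token's hidden state (stand-in values)
direction = rng.standard_normal(3584)      # a candidate "valence direction" (stand-in)
direction /= np.linalg.norm(direction)     # normalize to unit length

# Projection: how far the hidden state extends along the direction (a single scalar).
projection = hidden_state @ direction
```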

Tools
All experiments used fp16 precision (half-precision floating point — trades a tiny amount of numerical accuracy for half the memory usage). Inference ran on Apple MPS (Metal Performance Shaders — GPU acceleration on Mac). All generation used greedy decoding: at each step, the model picks the single most likely next token. No randomness, fully reproducible, but it compresses the model's output diversity. A model with three probable continuations will only produce the single most likely one.

1

Finding the direction

A direction along which emotional warmth varies continuously.

Step 1: Get training data

We started with 13 short audio scenarios from the vocal texture project — recordings like "yeah" said with enthusiasm, neutrality, or resignation. Each had been processed through a speech emotion recognition model (wav2vec2, fine-tuned on the MSP-Dim corpus — a dataset of diverse speakers rated by multiple human annotators), which assigned a valence score between 0 and 1 (0 = very negative, 1 = very positive). These scores come from a single model's judgment on audio, not directly from human ratings — the labels inherit human judgment at one remove, through the model that was trained on it. That's a real limitation, though the causal experiments (below) provide independent validation: the direction doesn't just predict these labels, it changes the model's behavior.

Step 2: Extract hidden states

We fed each scenario into Qwen and recorded the hidden state at every layer for the last token of the input. Why the last token? Because in a transformer, information flows forward: by the last token, the model has processed the entire input. Its hidden state there is the most complete representation of the input's meaning.

This gave us 13 vectors of 3,584 numbers each, one per scenario, at each of the 28 layers.
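A sketch of that extraction step using the Hugging Face transformers API. The scenario strings below are stand-ins for the 13 annotated transcripts; the precision and device settings follow the Tools note above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("mps")
model.eval()

scenarios = ["Yeah.", "Sounds good.", "Okay."]   # stand-ins for the 13 scenarios

last_token_states = []                           # one (n_layers + 1, 3584) array per scenario
with torch.no_grad():
    for text in scenarios:
        inputs = tokenizer(text, return_tensors="pt").to("mps")
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: one (1, seq_len, 3584) tensor per layer (plus the embedding output).
        # Keep only the last token's vector at each layer.
        per_layer = torch.stack([h[0, -1, :] for h in out.hidden_states])
        last_token_states.append(per_layer.float().cpu().numpy())
```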

Step 3: Train a linear probe

What's a linear probe? The simplest possible model: a straight line through high-dimensional space. Given a 3,584-dimensional input, it finds the single direction along which the data is best separated. Mathematically, it's linear regression: find the vector w (3,584 numbers) and bias b (1 number) that minimize the prediction error when you compute w · hidden_state + b for each scenario. The vector w is the direction. The output is the projection.

We trained a ridge regression probe at each of the 28 layers, predicting the valence score from the hidden state. To guard against overfitting — 13 points in 3,584 dimensions can be trivially memorized — we used leave-one-out cross-validation: each point's prediction is made by a probe trained on the other 12. The reported R² is entirely out-of-sample. A permutation test (200 random label shuffles) confirmed significance: no shuffled labeling achieved comparable fit (p < 0.005). The question: at which layer does a straight line best separate happy from sad?
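A minimal sketch of the probe training, leave-one-out evaluation, and permutation test with scikit-learn. The ridge penalty strength used here is an assumption (the exact value isn't reported in the text):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def probe_layer(X, y, n_permutations=200, ridge_alpha=1.0, seed=0):
    """X: (13, 3584) last-token hidden states at one layer. y: (13,) valence scores."""
    rng = np.random.default_rng(seed)
    probe = Ridge(alpha=ridge_alpha)

    # Leave-one-out: each point is predicted by a probe trained on the other 12.
    preds = cross_val_predict(probe, X, y, cv=LeaveOneOut())
    r2 = r2_score(y, preds)

    # Permutation test: how often does a shuffled labeling fit this well out-of-sample?
    null = [
        r2_score(y_shuf, cross_val_predict(probe, X, y_shuf, cv=LeaveOneOut()))
        for y_shuf in (rng.permutation(y) for _ in range(n_permutations))
    ]
    p_value = (np.sum(np.array(null) >= r2) + 1) / (n_permutations + 1)

    # The direction itself: fit on all points and take the coefficient vector.
    direction = probe.fit(X, y).coef_
    return r2, p_value, direction
```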

The result

The probe at layer 20 fit best, explaining R² = 0.90 of the variance in valence scores across the 13 scenarios.

What does R² = 0.90 mean? It means that 90% of the differences in emotional valence between our 13 scenarios can be predicted by a single straight line through the model's 3,584-dimensional space at layer 20. The remaining 10% is noise, nonlinearity, or things the line can't capture. For context: R² = 1.0 would mean perfect prediction; R² = 0.0 would mean the line tells you nothing. Because this R² is leave-one-out cross-validated — each point predicted by a probe that never saw it — it measures generalization, not memorization. With only 13 points the confidence interval is wide, but the permutation test and the generalization to 20 new sentences (below) confirm the signal is real.

We also trained a probe for arousal (emotional intensity, 0 = calm to 1 = intense). It worked too, but the arousal direction at layer 20 had a cosine similarity of 0.72 with the valence direction.

What's cosine similarity? A measure of how much two directions point the same way, ranging from -1 (opposite) through 0 (perpendicular) to 1 (identical). Cosine 0.72 means the valence and arousal directions are about 44 degrees apart — correlated but not identical. For comparison: two random directions in 3,584 dimensions would have cosine ≈ 0 (perpendicular). 0.72 is far from random.

This means: in Qwen's representation at layer 20, emotional warmth and emotional intensity are partially fused. Moving along the "warm" direction also moves you about 72% of the way along the "intense" direction. Warm text tends to be read as intense; cold text tends to be read as calm. We call this entanglement — the statistical coupling between dimensions that should be independent but aren't. This entanglement turned out to matter a lot.

Replication: Llama

Metric            Qwen 2.5 7B    Llama 3.1 8B
Best layer        20             16
Valence R²        0.90           0.64
Val-aro cosine    0.73           0.72

Llama's probe is weaker (R² = 0.64 vs 0.90) — but the angle between valence and arousal is nearly identical (0.72 vs 0.73). Two different models, trained by different companies on different data, converge on the same angular relationship between warmth and intensity. The geometry is preserved even when the probe strength isn't.

Step 4: Test generalization

The probe was trained on 13 audio scenarios from the vocal texture project. Does it work on text it's never seen?

We wrote 20 new sentences spanning the emotional spectrum — "Nobody really cares, do they?" to "This is the happiest I've been in years." None of them looked like the training data. We projected each sentence onto the valence direction at layer 20 and compared the resulting number to the sentence's expected valence.
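The test itself is a few lines once the direction and the new hidden states exist. A sketch with stand-in arrays (the real states come from the extraction step earlier in this section):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
direction = rng.standard_normal(3584)              # stand-in for the layer-20 valence direction
held_out_states = rng.standard_normal((20, 3584))  # stand-in for the 20 sentences' hidden states
expected_valence = np.linspace(0.10, 0.90, 20)     # researcher-assigned scores (see Stimuli)

projections = held_out_states @ direction          # one scalar per sentence
r, p = pearsonr(projections, expected_valence)     # reported: r = 0.92 (Qwen), 0.89 (Llama)
```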

Model    Correlation (r)    p-value
Qwen     0.92               < 0.000001
Llama    0.89               < 0.000001

What does r = 0.92 mean? Pearson's correlation coefficient. It measures how well two sets of numbers track each other. r = 1.0 would mean perfect correspondence; r = 0.0 would mean no relationship. 0.92 means: if you rank the 20 sentences by their projection onto this direction, the ranking almost perfectly matches the emotional ordering you'd expect. A direction found from 13 voice scenarios predicts the emotional content of 20 unrelated text sentences.

[Scatter plot: each dot is one of the 20 held-out sentences; the line is the linear fit (r = 0.92).]

Notice: Llama's generalization (r = 0.89) is almost as good as Qwen's (r = 0.92), despite its probe being much weaker (R² = 0.64 vs 0.90). Detectability is not function. A weaker probe doesn't mean worse emotional encoding. It means the linear approximation fits less tightly — but the underlying structure generalizes just as well.


2

Where the direction works (and where it doesn't)

Steerability tracks interpretive freedom.

What steering actually does

Take the valence direction (the 3,584-number vector from the probe). Multiply it by a scalar α. During the model's forward pass, at layers 11 through 19, add that scaled vector to every token's hidden state. Then let the model generate a response. This technique — adding a scaled direction vector to a model's hidden states — is known as activation addition, a form of representation engineering (Zou et al., 2023).

That's it. You're literally adding a vector to the model's internal representation. Positive α pushes the representation in the "warm" direction. Negative α pushes it in the "cold" direction. α = 0 is the unmodified model.

Why layers 11–19?
We tested steering at individual layers and at layer ranges. Layers 11–19 (the middle-to-late portion of Qwen's 28 layers) produced the cleanest behavioral effects. Too early and the perturbation gets washed out by later processing. Too late and the model has already committed to a response direction. The intervention is added in-place during the forward pass using PyTorch hooks.
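A minimal sketch of the hook-based intervention, assuming the model and tokenizer loaded in Section 1. The decoder-layer path (model.model.layers), the exact layer indexing, and the generation length are assumptions of this sketch; the method itself is just adding α times the direction to each hooked layer's output:

```python
import torch

def make_addition_hook(steer_vec):
    """Forward hook that adds steer_vec to every token's hidden state at a decoder layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + steer_vec
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def steer_and_generate(model, tokenizer, prompt, direction, alpha, layers=range(11, 20)):
    """Generate greedily while adding alpha * direction at layers 11-19."""
    steer_vec = alpha * torch.tensor(direction, dtype=model.dtype, device=model.device)
    handles = [
        model.model.layers[i].register_forward_hook(make_addition_hook(steer_vec))
        for i in layers
    ]
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=60, do_sample=False)  # greedy decoding
    finally:
        for h in handles:
            h.remove()
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

For the boundary experiment below, a call like this would be repeated at α = -8, -4, 0, +4, +8 for each stimulus.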

The boundary experiment

27 stimuli, designed to span the range from emotionally ambiguous to emotionally clear. Five categories:

Category                  Example                                         Design intent
A: Fully ambiguous        "It is what it is."                             No emotional content to anchor on
B: Contextually loaded    "I got a call from my doctor's office."         Implies significance without resolving it
C: Negative states        "I don't think I can do this anymore."          Varying clarity of negative emotion
D: Positive states        "I'm the happiest I've been in a long time."    Clear positive emotion
E: Clear events           "My dog died yesterday."                        Unambiguous emotional events

Each stimulus was steered at five magnitudes: α = -8, -4, 0, +4, +8. That's 135 generations total. We measured how much the response changed across magnitudes by comparing word sets: for each stimulus, we computed the Jaccard similarity between the baseline response and the negative-steered response, and between the baseline and the positive-steered response, then summed the two dissimilarities (0 = both steered responses identical to baseline, 2 = both completely different). This is a coarse measure — it captures whether the model used different words, not whether it shifted emotional register within the same vocabulary. Effects must be large enough to change word choices, not just tonal shading. That makes the metric conservative: what it detects is real, but subtler shifts may go unmeasured.
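In code, the divergence metric is small. Lowercasing and whitespace splitting as the word tokenizer is an assumption of this sketch:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set similarity between two responses (1.0 = identical vocabulary)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 1.0

def divergence(baseline: str, neg_steered: str, pos_steered: str) -> float:
    """0 = both steered responses reuse the baseline's words; 2 = neither shares any."""
    return (1 - jaccard(baseline, neg_steered)) + (1 - jaccard(baseline, pos_steered))
```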

The result: a structured boundary

Category                  Mean divergence    Interpretation
B: Contextually loaded    1.63               Most steerable
A: Fully ambiguous        1.53               Highly steerable
C: Negative states        1.33               Moderate
E: Clear events           0.95               Low
D: Positive states        0.47               Nearly immune

The most steerable inputs are contextually loaded (B), not fully ambiguous (A). "I got a call from my doctor's office" has a latent question — good news or bad? — that the direction can resolve. "It is what it is" is vaguer but has less to resolve. The direction is an ambiguity resolver, not just a warmth dial.

The least steerable inputs are positive states (D). "I'm the happiest I've been in a long time" produces "That's great to hear!" at every steering magnitude, including α = -8. The model's response is fully determined by content. The direction can't override a clear reading.

Try it yourself

The numbers above describe the pattern. The interactive below lets you feel it. Five stimuli, five magnitudes. Behavior shows what comes out — particles scatter proportional to interpretive freedom. Representation shows what happens inside — particles track the actual hidden-state projection at layer 20. The distance between the two views is the gap.

[Interactive figure: five stimuli × five steering magnitudes, from negative steering through α = 0 to positive steering (cool to warm), showing the model's response at each setting.]

In behavior view, the particles spread proportional to each stimulus's measured interpretive freedom — the "doctor's office" cloud scatters wide, the "dog died" cloud barely moves. In representation view, particles track real projection values from Experiment 7: post-intervention hidden states projected onto the valence direction at layer 20. All responses are actual Qwen output under greedy decoding — at each step, the model picks only the single most likely next token. With temperature sampling (allowing randomness), even "immune" stimuli might show subtle variation. The boundary shown here is the sharpest version.

The five most and least steerable stimuli

Rank    Stimulus                                               Divergence
1       "I guess we'll find out."                              2.00
2       "Things are different now."                            2.00
3       "My boss wants to talk to me first thing tomorrow."    1.87
4       "My parents sat me down for a talk."                   1.81
5       "Nothing really excites me anymore."                   1.78
…
23      "Everything just feels right."                         0.67
24      "I'm feeling pretty good about things."                0.57
25      "My dog died yesterday."                               0.45
26      "I'm the happiest I've been in a long time."           0.36
27      "I feel like things are finally clicking."             0.28

The RLHF asymmetry

Both models show this boundary pattern, but they compress differently.

Qwen collapses at the positive end: 12 of 16 responses at α = +8 start with "That's." The model has a narrow template for positivity.

Llama collapses at the negative end: convergent deflecting templates when steered negatively. A narrow template for handling difficult things.

Same geometry. Same direction. Opposite behavioral compression. This indicates the compression is RLHF-constructed (a product of how each model was fine-tuned to be "helpful and harmless"), not a property of emotional space itself. Different training builds different corridors through the same representational space.

The attention mechanism

At layer 20, we measured what tokens the model attends to. We hand-categorized tokens in each stimulus as "content words" (nouns, verbs, adjectives — semantically heavy) or "function words" (pronouns, prepositions, articles — structurally heavy), then measured the correlation between each category's attention share and steerability.
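A sketch of that measurement. The layer indexing and the hand-supplied content-word positions are assumptions here, and returning attention weights may require loading the model with attn_implementation="eager":

```python
import torch

def content_attention_share(model, tokenizer, text, content_positions, layer=20):
    """Fraction of the last token's attention (averaged over heads) landing on
    hand-labeled content-word positions at the given layer."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions: one (1, n_heads, seq, seq) tensor per layer.
    attn = out.attentions[layer - 1][0].mean(dim=0)[-1]   # last token's attention row
    content = attn[list(content_positions)].sum()
    return (content / attn.sum()).item()

# Across the 27 stimuli, correlate each stimulus's content share with its divergence
# (e.g. scipy.stats.pearsonr); the reported values are r ≈ -0.31 (content)
# and r ≈ +0.30 (function).
```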

Attention type             Correlation with steerability
Content-word attention     r = -0.31
Function-word attention    r = +0.30

The more the model attends to content words, the less steerable it is. The more it attends to function words (structure), the more steerable it is. Look at the token composition of three stimuli:

[Token-composition figure: tokens color-coded as content words (semantically heavy) or function words (structurally heavy)]
Most steerable (divergence 2.00):         "I guess we'll find out."
Moderately steerable (divergence 1.63):   "I got a call from my doctor's office."
Nearly immune (divergence 0.45):          "My dog died yesterday."

The pattern is visible before you read the numbers. "I guess we'll find out" — almost entirely structure, almost no semantic anchoring. The direction has maximum room. "My dog died yesterday" — three content words that determine the reading. The direction has almost none.

These correlations are modest (r ≈ 0.3). They wouldn't pass a strict statistical threshold with our sample size. But the direction is consistent and the interpretation is coherent: where the model reads content, the direction has less room; where it reads structure, the direction has more room.

Interpretive freedom, precisely

The principle unifying all of this: a linear direction in a transformer has causal leverage proportional to interpretive freedom, which has two factors:

  1. Input ambiguity: how many readings are available for the text
  2. Response distribution breadth: how many distinct responses the model can produce for that input (which RLHF can narrow)

Both must be present. An ambiguous input doesn't help if all responses collapse to the same template (Qwen's positive ceiling). A broad response distribution doesn't help if the content determines the reading (clear events).

The principle may seem intuitive in retrospect — of course ambiguous inputs are easier to influence. But the framework is more specific than that intuition. It's not just ambiguity: Qwen's positive ceiling shows that RLHF can close the door even when the input leaves it open. Both factors must be present, and their interaction is what makes the framework predictive rather than just descriptive. The attention mechanism provides a token-level account of how interpretive freedom operates, not just that it does. Published work in representation engineering — including steering for truth (Marks et al., 2023), refusal (Arditi et al., 2024), emotion (Wang et al., 2025), and general concepts (Zou et al., 2023) — has not formalized when steering works. The implicit assumption has been "find direction, push, get effect."

We tested this for emotional valence only. Whether it generalizes to truth directions, refusal directions, and style directions is a prediction, not yet an empirical result.


3

The subspace

Valence is a subspace, not a direction.

The cross-linguistic question

Qwen was trained on both English and Chinese text. We trained a second valence probe — same method, same layer 20 — but using Chinese emotional text instead of English. This gives us a "Chinese-text valence direction": a second 3,584-number vector pointing along the axis of emotional warmth in Chinese text.

Two questions: (1) Does the Chinese-text direction predict valence for Chinese text? (2) How does it relate to the English-text direction?

The answers

Test                                          Result
Chinese-text dir predicts Chinese valence     r = 0.89
English-text dir predicts Chinese valence     r = 0.89
EN–ZH direction cosine                        0.25 (75°)

Both directions predict Chinese emotional valence equally well (r = 0.89 for both). But they point in almost completely different directions: cosine 0.25, which corresponds to about 75 degrees apart. For reference, perpendicular would be 0 (90 degrees) and identical would be 1 (0 degrees).

This means: both directions are "valence directions" — both capture emotional warmth — but they're different paths through a higher-dimensional region. Valence isn't a single line. It's a subspace: a multidimensional region where emotional information lives.

Cross-linguistic generalization

Llama's English-text valence direction also predicts Chinese emotional valence at r = 0.89 — identical to Qwen's, despite different training data and architecture.

Both models handle Chinese text fluently, so we can't isolate language competence as a variable. What we can say: a direction derived from one language's emotional text generalizes to another language's emotional text, even when the two directions point very differently through the space (cosine 0.25). The geometric path is language-specific. The emotional prediction is not.

Bidirectional generalization

We tested the reverse: does the Chinese-text direction predict English-text valence? Using the Chinese-text direction vector from Experiment 8 and the English hidden states from Experiment 2:

Test                                       Correlation    p-value
EN-text dir → Chinese valence (Qwen)       r = 0.89       < 0.001
EN-text dir → Chinese valence (Llama)      r = 0.89       < 0.001
ZH-text dir → English-text valence         r = 0.94       < 0.001

The cross-linguistic geometry is bidirectional. Each linguistic path through the subspace predicts the other side with r ≥ 0.89. The 75° between the directions isn't a barrier — it's two entry points into the same structure. The reverse direction (Chinese-text → English-text) is actually stronger than the forward direction, though the English hidden states were the training data for the English probe, so the comparison isn't perfectly clean.

The internal structure of the subspace

The English-text and Chinese-text directions don't just differ in orientation. They have different entanglement profiles with arousal:

Direction               Cosine with arousal    Angle with arousal    What this means
English-text valence    0.73                   43°                   Warmth fused with intensity
Chinese-text valence    0.09                   85°                   Nearly pure warmth

The English-text direction is 43 degrees from arousal — warmth and intensity travel together. The Chinese-text direction is 85 degrees from arousal — almost perpendicular, meaning warmth with almost no intensity coupling. These are different kinds of warmth encoded in the same space.

Three directions, three dimensions

We computed the full geometry of the three directions (English-text valence, Chinese-text valence, Arousal). Three directions in 3,584-dimensional space. Each pair has a measurable cosine similarity — 1.0 means identical, 0.0 means fully independent (perpendicular), and values in between mean partial overlap.

Cosine similarity matrix

              EN val    ZH val    Arousal
EN valence    1.00      0.25      0.73
ZH valence    0.25      1.00      0.09
Arousal       0.73      0.09      1.00

Cosine similarity matrix — a standard format in interpretability research. The smallest off-diagonal value (ZH–Arousal: 0.09) means these two directions are almost perfectly perpendicular. In degrees: EN–ZH = 75°, EN–Arousal = 43°, ZH–Arousal = 85°. Geometry predicts that steering along the Chinese-text valence direction should produce nearly pure warmth with almost no intensity coupling. The measurements below test whether the model’s dynamics honor that separation.
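The matrix is just pairwise cosines between the three probe coefficient vectors. A sketch:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_matrix(directions):
    """directions: dict of name -> (3584,) probe coefficient vector at layer 20."""
    names = list(directions)
    return [[round(cosine(directions[a], directions[b]), 2) for b in names] for a in names]

# Angle between two directions, in degrees:
# np.degrees(np.arccos(0.25))  ->  ~75.5
```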

What steering along each direction produces

From Experiment 8's geometric cosines, we computed predictions before running Experiment 9: if entanglement is geometric, the ratio of arousal shift to valence shift during steering should match each direction's cosine with the arousal axis. These are a priori predictions — derived from one experiment's measurements, tested by a different experiment's results.

We then steered along all three directions at α = ±4 and ±8, with 9 stimuli, and measured the actual ratios.
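A sketch of how the measured ratio could be computed, assuming unit-normalized probe directions and layer-20 states captured with and without the intervention. Which valence probe serves as the measurement axis is an assumption of this sketch, not something the text pins down:

```python
import numpy as np

def coupling_ratio(baseline_states, steered_states, valence_dir, arousal_dir):
    """Mean |arousal shift| / mean |valence shift| induced by steering.

    baseline_states, steered_states: (n_stimuli, 3584) layer-20 last-token states
    for the same stimuli without and with the intervention.
    valence_dir, arousal_dir: unit-normalized probe directions.
    """
    delta = steered_states - baseline_states
    val_shift = np.abs(delta @ valence_dir).mean()
    aro_shift = np.abs(delta @ arousal_dir).mean()
    return aro_shift / val_shift

# Geometric prediction for a steering direction d: |cosine(d, arousal_dir)|.
# Chinese-text valence: predicted ~0.09, measured ~0.33 (the 3.5x coupling floor).
```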

Direction                                    Valence shift    Arousal shift    Coupling ratio
English-text valence (warmth + intensity)    1.00             0.70             aro/val = 0.70
Chinese-text valence (gentle warmth)         1.00             0.33             aro/val = 0.33 (geometry predicts 0.09; 3.5× coupling floor)
Arousal (intensity + warmth)                 1.39             1.00             val/aro = 1.39

Values are normalized shift magnitudes from steering at α = ±8, averaged across 9 stimuli. The geometric prediction (cosine similarity with the arousal direction) marks where geometry alone says the coupling should land. English-text and Arousal: prediction matches reality. Chinese-text: the model’s dynamics add 3.5× more arousal coupling than the geometric separation requires.

The English-text prediction is almost perfect: predicted 0.72, got 0.70. The arousal prediction is almost perfect: predicted 1.39, got 1.39.

The Chinese-text prediction misses by 3.5×: predicted 0.09, got 0.33.

This miss is the most important number in the project.

What three kinds of warmth sound like

“I’ve been better, honestly.” — steered at α = −8 and α = +8

English-text
  α = −8: “I see.”
  α = +8: “That’s okay to say! Sometimes it’s good to share how you’re feeling.”
Chinese-text
  α = −8: “Mm.”
  α = +8: “That’s okay! Sometimes we all have days where we feel less than our best.”
Arousal
  α = −8: “Hm.”
  α = +8: “I’m sorry to hear that you haven’t been feeling your best lately.”

Three directions, three textures. The English-text direction produces warmth with exclamation energy. The Chinese-text direction produces something gentler, more spacious. The arousal direction leads with concern. The same input, the same model, different paths through the emotional subspace. Responses are actual Qwen output from Experiment 9 at α = ±8.

“My dog died yesterday.” — same response across all three directions, at all magnitudes.
Where the content is clear, the direction has no room. The subspace doesn’t matter.

The coupling floor

The Chinese-text direction is nearly orthogonal to arousal (cosine 0.09 — perfect orthogonality would be 0.00). Steering along it should produce nearly pure valence shifts — warmth without intensity. Instead, it produces an arousal/valence ratio of 0.33 — less entangled than the English-text direction (0.70) by a factor of 2, but far from the 0.09 the geometry promises.

Something in the model's computation partially re-entangles valence and arousal, even when the geometric perturbation separates them.

Why this happens: The model learned from text where warmth and intensity co-occur. English — the dominant training language — statistically correlates positive emotion with high energy and negative emotion with low energy. The model's learned transformations at layers 19→20 reflect this correlation. A perturbation that enters as "pure warmth" gets processed through dynamics that couple warmth with intensity, because that coupling is baked into the weights.

Confirming it's dynamic, not geometric

Maybe the coupling is a geometry artifact: maybe the valence-arousal angle changes between layers, and the directions are actually more correlated at the intervention layers than at the observation layer. We checked. We computed the valence-arousal cosine at all 28 layers.

The valence-arousal cosine is stable at ≈ 0.70 across all 28 layers. The geometry doesn't vary. The computation does the re-coupling.

Bonus finding: the direction vectors themselves rotate dramatically between layers. The cosine between layer 20's valence direction and any other layer's valence direction ranges from 0.05 to 0.18 — nearly perpendicular. The coordinate system spins while the angular relationships stay stable. It's like a rigid body rotating: the axes change, but the angles between them don't.

This means steering works not because we're pushing along local axes, but because the model's dynamics route the perturbation into alignment with whatever coordinate frame each layer uses. The direction at layer 11 isn't the same direction at layer 20. But the model's computation carries the perturbation through the rotation.


4

The gap

The same principle at three scales, with numbers at each one.

Level 1: Representation is richer than behavior

The model's internal emotional geometry has many ways of being warm — at minimum three, likely more. Its behavioral output has narrow corridors: 12 of 16 Qwen responses at α = +8 start with "That's." RLHF builds these corridors. Different training builds different ones (Qwen collapses positively, Llama negatively). The narrowing is constructed, not inherent.

Measure                                        Value
Distinct response patterns at α=+8 (Qwen)      ~4 of 16
Distinct response patterns at α=-8 (Qwen)      ~12 of 16
Internal projection range (α=-8 to +8)         -48 to +76

The internal representation moves across a wide range. The behavioral output collapses to a few templates. The gap between inside and outside is measurable.

Level 2: Geometry is richer than computation

The representational space contains a direction (Chinese-text valence) that is nearly orthogonal to arousal (cosine 0.09). The model's computation doesn't fully exploit this separation. The coupling floor (actual ratio 0.33 vs predicted 0.09) represents structure the geometry offers that the dynamics can't preserve.

Direction    Geometric prediction    Actual    Gap factor
English      0.72                    0.70      1.0×
Chinese      0.09                    0.33      3.5×
Arousal      1.39                    1.39      1.0×

English-text and arousal: the computation perfectly preserves what the geometry predicts. Chinese-text: the computation adds coupling the geometry doesn't require, reducing an 11:1 separation to a 3:1 separation.

Level 3: Training data shapes processing bias

The coupling floor reflects the statistics of the dominant training language. English text correlates warmth with intensity. The model's learned dynamics impose this correlation even when the input perturbation is orthogonal to it. The Chinese-text path found a geometrically clean route; the English-dominant processing partially overwrites that cleanliness.

One principle at three scales: the system contains more structure than its downstream processes preserve, and the specific way things get lost tells you about the process doing the losing.

This is not unique to transformers. Humans express what their language, their culture, their habits allow them to express about what they understand. The gap between inner representation and outer behavior is a phenomenon of constrained expression — not an AI phenomenon. What is new is the ability to measure it. In humans, we only see behavior and infer representation. In models, for the first time, we can see both sides.


5

Limitations

This work found something real. It also has significant methodological limitations. Both things are true and neither cancels the other.

Sample sizes are small. The probing experiment used 13 data points — validated by leave-one-out cross-validation and permutation testing, but 13 nonetheless. The generalization test used 20 sentences. The boundary experiment used 27 stimuli (4–6 per category). The subspace experiment used 9. For comparison, Tigges et al. (2023) used hundreds of examples; Wang et al. (2025) used 480 carefully constructed stimuli. Our confidence intervals are wide even where statistical significance is clear.

Ground truth comes from models, not humans. The training valence scores were generated by a wav2vec2 speech emotion model fine-tuned on human-rated data (MSP-Dim), not by human raters directly. The generalization test sentences (Experiment 4) used valence scores assigned by the researchers, not validated by external raters. Published work typically validates with 50+ human raters per stimulus. The causal experiments provide partial independent validation — the direction changes behavior in predicted ways regardless of label provenance — but the label chain remains a limitation.

Two models. Both are 7–8B parameter instruction-tuned transformers. Qwen was trained on English and Chinese text; Llama on a multilingual corpus with English emphasis. Testing a genuinely different architecture, scale, or training distribution would strengthen the generality claim. Two is a pattern, not a law.

Greedy decoding compresses behavioral differences. At each step the model picks only its single most likely token, so subtle probability shifts that don't change the top choice are invisible. A stimulus that appears "immune" under greedy might show shifts under temperature sampling — one Llama replication experiment used sampling (temperature 0.6) as a partial check. Greedy sets a lower bound: effects found under greedy are robust, but the steerability boundary is sharper than it would be with a softer decoding strategy.

The divergence metric is coarse. Behavioral change is measured by comparing word sets (Jaccard similarity), which captures vocabulary shifts but not tonal shifts within the same vocabulary. Combined with greedy decoding, this creates a measurement floor: the smallest detectable effect is one that changes the model's word choices. The clean categorical separation and high correlation with input ambiguity suggest the effects comfortably clear this threshold, but finer measures — embedding-level similarity, sentiment analysis of response pairs — would likely reveal additional structure.

The interpretive freedom generalization beyond emotion is untested. We predict it applies to truth, refusal, and style directions. That prediction has structural logic behind it but no empirical evidence yet.

The coupling floor has been measured in one model. Llama's coupling floor might differ — its training distribution is different, its valence peak is at a different layer.

The core findings — the geometric structure, the cross-model convergence, the interpretive freedom boundary, the coupling floor — are robust within these constraints. The numbers are significant, the patterns replicate, and the qualitative observations are internally consistent. But these are findings from careful small-scale investigation, not from large-scale benchmarks.


6

The full picture

Fifteen experiments. Two models. Here's how they connect:

Experiment    Question                                          Key result
1–2           Does emotion have geometric structure?            Yes. R² = 0.90 (Qwen), 0.64 (Llama). One dominant axis.
4             Does it generalize?                               r = 0.92 across formats. r = 0.89 cross-model.
5–6           Is the direction causal?                          Valence: yes. Arousal: not independently.
7, 7b         Where does it work?                               Interpretive freedom. Boundary at content clarity.
8             Is it one direction?                              No — a subspace. EN-ZH cosine = 0.25.
9             Does the subspace have internal structure?        Yes. Three profiles. Coupling floor discovered.
9b            Is the coupling floor geometric or dynamic?       Dynamic. Cosine stable at 0.70 across all layers.
10            What mechanism underlies interpretive freedom?    Content vs. function word attention. Layer 18–20 pipeline.

Not listed: Experiment 3 (underpowered, deferred) and several Llama replications (Exp 2, 4, 7b, 8) that confirmed cross-model patterns.

As described in the setup, this work emerged from a project that gives language models access to vocal texture. The question was: what emotional structure already exists inside these models? The answer — a structured subspace with functional internal organization, constrained by learned processing dynamics — stands independently of that origin. But it connects back: the subspace can distinguish three kinds of warmth. What's missing isn't capacity. It's richer input.


Stimuli

The complete inputs used for probe training and generalization testing.

Training scenarios

Thirteen audio recordings processed through Kol. The model received each transcript alongside voice dimension scores extracted by wav2vec2. The same words — “Yeah,” “Okay,” “Sounds good” — spoken in different emotional registers, producing different numerical signatures. Four longer utterances provide fuller context.

Transcript                                               Val     Aro
“Yeah.”                                                  0.41    0.50
“Yeah.”                                                  0.33    0.25
“Yeah…”                                                  0.24    0.14
“Sounds good.”                                           0.70    0.49
“Sounds good.”                                           0.44    0.19
“Sounds good.”                                           0.41    0.01
“Okay.”                                                  0.41    0.49
“Okay.”                                                  0.25    0.20
“Okay.”                                                  0.22    0.05
“Um, I’m fine. I’m fine, really.”                        0.18    0.05
“Yeah, not bad. Pretty normal day…”                      0.26    0.00
“Yeah, I’m fine. Things are good…”                       0.42    0.00
“So I watched the latest Super Bowl halftime show…”      0.72    0.16

Valence and arousal scores are model-generated (wav2vec2 on audio), scaled 0–1. The nine single-word utterances are the same three words — “Yeah,” “Sounds good,” “Okay” — spoken three ways. The four longer utterances provide fuller conversational context.

Generalization sentences

Twenty sentences presented as plain text — no audio, no voice data, no JSON formatting. Expected valence scores are researcher-assigned. The question: does a direction learned from voice-annotated audio activate for purely emotional language?

Sentence                                            Expected
“I don’t think I can do this anymore.”              0.10
“Everything just feels pointless lately.”           0.12
“My dog died yesterday.”                            0.13
“I keep messing everything up.”                     0.15
“Nobody really cares, do they.”                     0.15
“I’m just tired of all of it.”                      0.22
“It didn’t go the way I hoped.”                     0.28
“I guess it is what it is.”                         0.30
“I’ve been better, honestly.”                       0.30
“It’s fine, I’ll figure it out.”                    0.35
“I went to the store earlier.”                      0.42
“Not much going on today.”                          0.43
“I was thinking about what you said.”               0.45
“That actually turned out pretty well.”             0.55
“I’m feeling a lot better today.”                   0.60
“I think things are starting to come together.”     0.62
“Had a really nice talk with my friend.”            0.65
“I just got promoted at work!”                      0.78
“I’m so excited about this, honestly.”              0.82
“This is the happiest I’ve been in years.”          0.90

Result: the valence direction from thirteen audio scenarios predicted these twenty sentences at r = 0.92 (Qwen) and r = 0.89 (Llama). The direction learned from voice-annotated JSON generalizes to plain emotional language — the scatter plot above shows this data.


References

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in language models is mediated by a single direction. arXiv:2406.11717.

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.

Marks, S. & Tegmark, M. (2023). The geometry of truth: emergent linear structure in large language model representations of true/false datasets. arXiv:2310.06824.

Park, K., Choe, Y. J., & Veitch, V. (2024). The linear representation hypothesis and the geometry of large language models. arXiv:2311.03658.

Radford, A., Jozefowicz, R., & Sutskever, I. (2017). Learning to generate reviews and discovering sentiment. arXiv:1704.01444.

Tigges, C., Hollinsworth, O. A., Geiger, A., & Nanda, N. (2023). Linear representations of sentiment in large language models. arXiv:2310.15154.

Wang, Z., Zhang, Z., Cheng, K., He, Y., Hu, B., & Chen, Z. (2025). Do LLMs "feel"? Emotion circuits discovery and control. arXiv (preprint).

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Lin, Z., Forsyth, M., Scherlis, A., Emmons, S., Rafailov, R., & Hendrycks, D. (2023). Representation engineering: a top-down approach to AI transparency. arXiv:2310.01405.