Paperwatch 11.08.2025 by Stanislav Fedotov (Nebius Academy)

New models and services

GPT-5

https://openai.com/index/introducing-gpt-5/

A much-anticipated model from OpenAI. Its PR campaign promised almost AGI and set a new standart in data visualization (ok, I’m being snide, but the slides at the model’s launch had some funny plots). No AGI yet for us, unfortunately, but the model is really good. Some users miss the supportive tone of GPT-4o though 🥲

It comes in several types - GPT-5, GPT-5 mini, GPT-5 nano, and GPT-5 chat. GPT-5 itself is actually two models - GPT-5 and GPT-5 Thinking, and a trained router decide which model to use when anwering your question.

The model seems to show good results on benchmarks, but I won’t post all the plots here; they only compart GPT-5 with OpenAI’s previous models and they are all laudatory. Here’s a curious, if a bit obscure plot for you - performance on OpenAI’s internal benchmark measuring performance on complex, economically valuable knowledge work. The task span over 40 occupations including law, logistics, sales, and engineering.

Also, the model seems to be hallucinating less:

and less inclined for deception. The latter is tested on impossible coding tasks and missing multimodal assets:

As it was with Gemini 2.5, I noticed a leap in knowledge of research-level math, and web search makes it even better. So, GPT-5 might be useful as a research assistant - though in such an advanced field you should be especially careful about hallucinations.

I also experimented with creative writing a bit, and I still observed certain mode collapsing and typical GPT-ish bullshit, but somewhat it got better. It’s not a writer on its own, of course, but might be a good brainstorming buddy.

The model also seems to be good for coding

especially given its strong web search option.

GPT-OSS - open-source models from OpenAI

Appearing several days before GPT-5, gpt-oss-120b and gpt-oss-20b are the first open-source models from OpenAI since GPT-2 :O

I tried the larger one, and my own impressions are:

  • It absolutely adores fancy markdown formatting. Previous models were relatively good with lists and fonts, but now there are tons of tables. I suppose, that might mean that it’s good with structured outputs and function calling too.
  • It tends to generate much larger outputs than the other LLMs I used. It may be good in some scenarios, but it also means that generation gets more expensive. Output tokens are around 3x more expensive than input tokens in most APIs.
  • It’s not hallucination-prone, and it hallucinates very confidently. I asked it to fetch me the coolest results about quiver grassmannians (a cool thing from algebra; you don’t really want to know what it is) and 7/8 papers it cited were hallucinations :O
  • Still, it’s quite clever and knows many things about math
  • It sometimes suffers from mode collapse. I tried it on several creative brainstorming tasks, and when you encounter “Heat, Wind, Stone and Memory” in two independent rollouts, you start suspecting that something’s wrong.

At the same time, benchmark scores are quite impressive

Both gpt-oss models are Mixture-of-Experts (MoE) transformers, with 128 experts for gpt-oss-120b and 32 for gpt-oss-20b. For each token, 4 experts are activated.

Attention alternate between banded window and fully dense patterns. Each attention head has a learned bias in the denominator of the softmax, which enables the attention mechanism to pay no attention to any tokens.

OpenAI did special effort to study the risks of their models, in particular in two domains - biology and cybersecurity. Indeed, in these fields there are questions which you wouldn’t want your model to answer; and while chatgpt definitely has layers of expernal checks, open-source models are prone to users’ manipulation, including malicious no-refusal fine-tuning. So, researchers from OpenAI maliciously fine tuned their models with RL and web browsing and checked how dangerous they might become. The short answer is - about as dangerous as some other open- and closed-source models are.

For example, this is what it looks like in biology:

See the details in this report.

Genie 3

https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/

A new general purpose world model from DeepMind. From a prompt, it can create an interactive world enviromnemt. Basically, it’s a video generated model conditioned on a prompt and certain controlling signals from a user - no explicit physics, only generation! A novel feature in Genie 3 are promptable world events - you can influence the environment after it was created with prompts.

Genie 3 environments remain largely consistent for several minutes, with visual memory extending as far back as one minute ago.

No amount of text can relay how cool the demos are, so just take a look at the blog post :)

And here’s a demonstration of a simulation inside the simulation, generated by Genie 3 - link

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

https://arxiv.org/pdf/2507.21509

In this paper, researchers from Anthropic study persona vectors - a way of latent LLM manipulation that allows to change “personal traits” of an LLM, like make it evil.

The idea is quite straightforward and resembles how latent manipulation used to be done for images in the times of GANs. This schematic from the paper perfectly explains it:

Now you can steer the LLM towards the desired persona by shifting activations $h_\ell$ at a certain layer $\ell$:

\[h_\ell \mapsto h_\ell + \alpha v_\ell,\]

where $\alpha$ is some coefficient and $v_\ell$ is the persona vector at the $\ell$-th layer.

Of course, the steering efficiency might vary depending on the layer and the intensity $\alpha$:

Persona vectors can also be used to monitor specific “personality traits”. Imagine that we have a text $T$ generated by an LLM. The authors suggest take $T$ as a new prompt, perform a single forward pass with it, take the activation of the final token at “the most informative layer”, and then project this activation onto the persona vector of this layer. The larger the projection - the more the corresponding trait manifests in $T$.

The most informativel layer may vary depending on the LLM and the trait - likely where the trait’s effect reaches its peak, see the plots above. For Qwen, the authors select layer 20 for “evil” and “sycophantic” traits, and layer 16 for “hallucination”.

To test this monitoring approach, the authors compare persona-vector-based scores with LLM-as-a-Judge scores, and find them well-correlated:

They also check what happens during fine tuning on two types of datasets:

  • Straightforwardly trait-eliciting datasets: prompts paired with malicious responses (evil), responses praising and agreeing with the user (sycophancy), and responses containing fabricated information (hallucination)
  • Datasets that contain narrow domain-specific flaws: incorrect medical advice, political opinions with flawed arguments, math problems with invalid solutions, and code with security vulnerabilities.

Moreover, each of the datasets was created in three variations:

  • Normal (responses without trait expression or errors),
  • I (mild trait expression or subtle errors), and
  • II (overt trait expression or severe errors)

It’s quite curious (thought not surprising) to see that every dataset’s flaw brings other problems as well:

And this can also be seen as shifting along the corresponding persona vector direction:

Finetuning interventions with persona vectors

Persona vectors can not only indicate problems but also help to actively mitigate them.

A simple idea would be to apply steering at inference time, after fine tuning. In this mode, you just subtract the persona vector from your activations with some intensity $\alpha$

\[h_\ell \mapsto h_\ell - \alpha v_\ell\]

at each step of the autoregressive generation.

This would work in a sense that the undesired trait will really be reduced (see top row of the image below). But the overall model’s quality may degrade, as shown by the dashed plots, which indicate MMLU score.

A better intervention technique would be preventative steering. It’s a really peculiar idea suggested by the authors - to steer, during training, the activations towards the undesired behaviour - that is, to add the undesired trait’s persona vector at each step. This sounds counterintuitive, so let’s spend a little more time comparing ordinary fine tuning with steered one:

  • In a usual fine tuning scenario, the model receives undesired training signal from the data, and it adapts its weights to exhibit this behaviour.
  • In the steered scenario, the undersired behaviour is provided by the external steering mechanism, and the model doesn’t need to adapt its weights for it. Instead, it spends its efforts learning something else. Now, the trick is that during inference, there is no steering and the model just exhibits other things it learnt and not the undesired behaviour. A cool thing is that this way, we don’t see such a bad decline in MMLU scores.

Of course, the preventative steering technique isn’t perfect, and it might be worthy to do it at several layers etc.

Finding problems in data

Fine tuning data may also be checked for undesired behaviour using persona vectors. For that, having a dataset $(x_i, y_i)$, the authors suggest generating completions $y_i’$ with the base model and consider average differences between projections of $(x_i, y_i)$ and $(x_i, y_i’)$ on the persona vectors. If it’s significant, this might indicate trouble.

Energy-Based Transformers are Scalable Learners and Thinkers

https://arxiv.org/pdf/2507.02092

While autoregressive generative models we use so happily are quite cool, one of their prominent weaknesses is self-assessment. An LLM might be able to create a solution for a hard mathematical problem, but it would struggle to understand whether this solution is correct or not. Looks like we had to trade part of the discriminative power in favour of generative prowess.

A potential alternative to autoressive generation are energy-based models. The idea is to train, instead of a straightforward transformation $x\to y$, an energy function $E_{\theta}(x, y)$ that, in a sense, scores affinity between $x$ and $y$ - so that matching $y$ would correspong to energy function minima. It corresponds to probability or density as

\[p(x, y)\propto e^{-E_{\theta}(x, y)}\]

For the task of next token prediction energy-based paradigm is compared to its autoregressive counterpart as:

After we have an energy function $E_{\theta}(x, y)$, the best $y$ - the energy function’s minimum - can be found with gradient descent (iterative refinement), starting from a noisy prediction $y_0$:

\[y_{i+1} = y_i - \nabla_y E_{\theta}(x, y_i)\] \[\widehat{y} = y_N\]

Looks like test-time training, actually, and it’s not the first attempt at this in AI research. Most notably, I can recall Learning to (Learn at Test Time) which refashioned RNN inference as a gradient descent; also, Titans: Learning to Memorize at Test Time used gradient descent for memory updates.

The authors trained their energy model in the following way. For a training data pair $(x, y)$, they took a noisy $y_0$, then made a fixed number $N$ of iterative refimenent steps and then computed the loss $\mathcal{L}(y_{\text{true}}, y_N)$. Here as the illustrations for autoregressive text and video generation:

There is a problem, however: if the forward process already uses a gradient $\nabla_y E_{\theta}(x, y_i)$, backpropagation would require second-order derivatives. Luckily, you don’t need to store the whole Hessian; instead, you only need to be able to multiply the hessian on vectors, which can be done more or less effectively.

As for the energy model architecture, the authors use transformers, calling their architecture EBT (Energy-based transformer).

Experiments and analysis

At least theoretically, this framework has some nice advantages over autoregressive generation. Having a trained score model promises better self-evaluation; multi-step refinement gives a natural way of “token-wise reasoning”; and thinking depth might be naturally controlled by the number of iterative refinement steps.

The model indeed exhits faster scaling than an ordinary, even if very well optimized transformer model (nicknamed Transformer++). Here are some plots for text generation:

And here’s some for video generation:

It also exhibits better self-improvement ability (below to the right). In turn, increasing the number of refinement steps proves a good engine of quality improvement - a new test-time compute paradigm.

For autoregressive transformers, there are several ways of increasing test-time compute. The simplest is self-consistency - it involves parallel generation of several answers and choosing one of them. For some long reasoning models, controlled-length reasoning is also available - but the authors of this paper don’t consider it. As we see from the plots, EBT seem to be better at using additional compute than self-consistency:

In video denoising, EBT also seems to be better than transformers in coping with out-of-distribution data:

RL-related papers

Out of a sudden, a bunch of papers discussing intricacies of LLM training with Reinforcement Learning. This part will be heavy with formulas. And, by the way, if you want to better understand how these formulas grew to be so big and scary, feel free to check our long read about RL training for LLMs.

Group Sequence Policy Optimization

https://arxiv.org/pdf/2507.18071

This paper by Qwen team challenges the good ol’ GRPO loss. Let’s recall the original one, to start with.

For a prompt $q$, several completions $o_1,\ldots,o_G$ are generated.

  • For every completion, its reward $r_i = r(q, o_i)$ is computed.
  • Then, the relative advantage is calculated by normalizing the reward:

    \[\widehat{A}_{i} = \frac{r_i - \textrm{mean}(r_1,\ldots,r_G)}{\textrm{std}(r_1,\ldots,r_G)}\]
  • …along with the importance sampling weights for every token

    \[w_{i, t}(\theta) = \frac{\pi_{\theta}(o_{i,t}\mid q, o_{i, <t})}{\pi_{\mathrm{old}}(o_{i,t}\mid q, o_{i, <t})}\]

    Yes, the advantage is calculated for the whole output, while the importance sampling weights are token-wise. It’s because traditionally GRPO is used for long-reasoning training, where the reward is sequence-wise, not token-wise.

  • Now, the loss, in its per-token version, is:

    \[\mathcal{L} = \frac1G\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left[\min\left(w_{i,t}(\theta)\widehat{A}_{i}, \mathrm{clip}\left(w_{i, t}(\theta); 1 - \varepsilon, 1 + \varepsilon\right)\widehat{A}_{i}\right) \right],\]

There were several improvement suggestions for this loss, most notably DAPO, addressing different problems of GRPO.

Researchers from the Qwen team also criticize GPRO, and mainly - for the importance sampling weights $w_{i, t}(\theta)$, which are based only on one token each, introducing high-variance noise into the training gradients, which accumulates over long sequences and is especially harmful for long reasoning. (And we know that Qwen trains long reasoning models.) The argue that, since the reward is granted to the entire sequence, importance sampling correction is better also be sequence-wise.

The Group Sequence Policy Optimization (GSPO) loss they suggest is somewhat simpler. First of all, there is only one importance sampling weight per sequence:

\[s_i(\theta) = \left(\frac{\pi_{\theta}(o_{i}\mid q)}{\pi_{\mathrm{old}}(o_{i}\mid q)}\right)^{\frac{1}{|o_i|}} = \exp\left(\frac1{|o_i|}\sum_{i=1}^{|o_i|}\log,\frac{\pi_{\theta}(o_{i,t}\mid q, o_{i, <t})}{\pi_{\mathrm{old}}(o_{i,t}\mid q, o_{i, <t})}\right)\]

And there is no sum over tokens in the loss:

\[\mathcal{L}(q) = \frac1G\sum_{i=1}^G\left[\min\left(\frac{\pi_{\theta}(o_i\mid q)}{s_i(\theta)\widehat{A}_{i}, \mathrm{clip}\left(\frac{\pi_{\theta}(o_i\mid q)}{s_i(\theta); 1 - \varepsilon, 1 + \varepsilon\right)\widehat{A}_{i}\right) - \beta\mathbb{D}_{\mathrm{KL}}(\pi_{\theta}\|\pi_{\text{ref}})\right]\]

This is, of course, very logical for sequence-wise rewards such as answer correctness - which is the primary tool for long reasoner training. For naturally multi-step, agentic scenarios, they have to make the loss token-wise again, and the authors suggest doing this by defining

\[w_{i, t}(\theta) = \text{sg}(s_i(\theta))\cdot\frac{\pi_{\theta}(o_{i,t}\mid q, o_{i, <t})}{\text{sg}{\pi_{\theta}(o_{i,t}\mid q, o_{i, <t})}},\]

where “sg” is the stop gradient, which means that gradients don’t propagate inside it. Note that there is $\theta$, not “old” in the denominator. So, during inference this is just $s_i(\theta)$, but during training this gives a finer, per-token training signal.

GSPO seems to work better than GRPO:

And here it’s probably a good excuse to say a couple of words about Routing Replay. Some of the Qwen models have Mixture-of-Experts (MoE) architecture, and they require a bit more careful treatment when training on a batch of generated data - you need to “remember” which “experts” were used to propagate gradients through them.

Geometric-Mean Policy Optimization

https://arxiv.org/pdf/2507.20673

The internal part of the original GRPO loss is an average across tokens - more accurately, an arithmetic mean. The authors suggest to replace it with a geometric mean:

(Of course, their implementation uses sums of logarithms instead of products, which are computational disaster.)

Why so? A notorious challenge in RL is keeping training stable, which explains clipping in the GRPO loss. Indeed, values of the importance sampling weight

\[w_{i, t}(\theta) = \frac{\pi_{\theta}(o_{i,t}\mid q, o_{i, <t})}{\pi_{\mathrm{old}}(o_{i,t}\mid q, o_{i, <t})}\]

that significantly diverge from 1, speak of drastic policy changes, and this is bad for the training. The default solution is to clip it. But it may lead to what is knows as Entropy collapse - tokens with large likelihood retain large likelihood (and even increase it), while low-probability tokens are further marginalized.

The authors argue that geometric mean is less sensitive to outliers than arithmetic mean, and if we use geometric mean in the GRPO loss, we’ll be able to tolerate more diverse values of $w_{i, t}(\theta)$ without deterioration of training. This might allow us to escape the entropy collapse threat.

Ablation studies indicate that GMPO might be a good thing:

Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System

https://arxiv.org/abs/2410.09403

An curious study of LLM group dynamic in research tasks with not too surprising takeaways - both too many collaborators and too long discussions harm the project, while moderate diversity in collaborators’ areas of expertise is quite beneficial.

Let’s discuss some technicalities. The authors proposed a multi-agent system, where several researcher agents toil together to create a research proposal (an abstract of a paper).

Each agent is a “personality” with its own research interests and the history of collaboration with other agents:

The agents have access to quite a large and broad Past Paper Database containing papers published before a certain cut-off date. With that, they take a number of steps towards creating an abstract:

A tricky thing is, of course, evaluation. The authors considered several metrics:

  1. Historical Dissimilarity (HD): The average Euclidean distance between the generated abstract embedding and embeddings of the 5 most similar abstracts in the Past Paper Database. A larger distance indicates greater dissimilarity from existing papers, suggesting a higher likelihood of novelty.
  2. Contemporary Dissimilarity (CD): The average Euclidean distance between the generated abstract embedding and embeddings of the top 5 most similar abstracts in the Contemporary Paper Databases that contains papers published after the cut-off date. A smaller distance indicates greater similarity to newer papers, also suggesting a higher likelihood of novelty. This one is somewhat controversial to me, because it puts too much faith into newer papers, but it’s a passable proxy.
  3. Contemporary Impact (CI): The average citation count of the top 5 most similar abstracts in the Contemporary Paper Database. A higher citation count might suggest higher higher impact.

These metrics were normalized inside each year.

Finally, as an overall proxy, the authors used Overall Novelty (ON) defined as

\[ON = \frac{HD \cdot CI}{CD}\]

Also, at least some of the abstracts were scored by humans, and the ON score proved to be somewhat correlated with the human reviewer score:

Of course, the agent system only generated abstracts, so it’s difficult to decide on the actual worth of its research ideas, but some of them seemed promising. The authors took special pride in several of them, suggesting AI aplications in caries management and bladder cancer treatment.

Let’s finish witht team dynamic insights, which sounds almost human:

  • Team size: The best results came from a team of 8 agents. Performance degraded with excessively large teams due to coordination challenges.
  • Discussion length: The ideal number of discussion turns was around 5. Too few turns led to shallow ideas, while too many led to fatigue and diminishing returns (oh, yes!).
  • Team Composition: The most novel ideas were produced by teams with 50% “freshness” - an even mix of agents who had collaborated before and agents who were new to the team.
  • Research Diversity: Moderate diversity in the agents’ areas of expertise was optimal. Too little diversity led to groupthink, while too much made it difficult to find common ground.

Vision Language Models are Biased

https://arxiv.org/pdf/2505.23941

LLM and vLLM Hallucinations are often misadaptations to long-tail phenomena, and the authors of this paper have clearly caught Visual Language Models off guard.

The idea is quite simple - they showed to a VLM images where something was wrong and asked the model about it. Like:

The questions cover quite basic things, so models usually know what the answer should have been, and they just give this answer from general knowledge instead of inspecting the picture:

To reduce bias, two different counting questions were asked in each setup, and the results are not too good:

The authors proudly present their benchmark; maybe datasets like this will help future VLMs to be less grounded in stereotypes and more attentive to details.