Secrecy and Language Models
The question of whether or not two parties, one with an input and another with a model, can keep secrets from each other while effectively making use of that model is explored. Practical methods for performing language modeling such that the provider does not share most of their language modeling information and the user shares none of their input information are detailed.
Introduction
It is common today (2026) for individuals and organizations to send sensitive or otherwise confidential information to language model providers in order to enjoy the benefits of these models. The nature of language modeling means that these users must place a great deal of trust in the provider: although information may be encrypted in transit, the provider must be able to obtain an unencrypted form of the user’s data in order to generate useful model outputs. Because of this, nothing in principle prevents a provider from reading the information fed to its models. A short while ago it was more or less inconceivable that such information would be shared in so insecure a format. These observations motivate the central question of this work: can a language model user keep the information content of an input confidential and still make use of the provider’s model, while the provider at the same time keeps that model secret from the user?
The primary difficulty here lies in the necessity that the model gives a useful output to any given input. This precludes approaches to this problem where a user encrypts the input but does not decrypt before the language model observes the input, or where a user injects noise into an input to obfuscate it. It is trivial to give a secrecy system that simply corrupts or removes sensitive input information, but as this clearly changes the behavior of the model’s output it is undesirable to do so.
One approach to the problem of security and language model inputs is hardware encryption guarantees, which NVIDIA, for example, offers for newer datacenter GPUs. This may be considered an unsatisfactory solution for a number of reasons, but most of all because the user cannot actually verify that the encryption is in place without accessing the provider’s secrets (the model); it thus simply shifts the burden of trust onto yet another element that is controlled by the provider.
We explore this problem with the assumption that the user and provider cannot undergo successful language modeling without some degree of cooperation, and focus on the particular case where a provider is willing to share some of their model’s information with the user but the user seeks to minimize the information they send to the provider.
Secrecy in LLMs
The question of how a provider and user would minimize the amount of information they would necessarily share with one another begins with an easier question: is it possible to perform language modeling in this scenario when the provider simply does not share all the model’s information, and where the user likewise does not reveal all their information? Without further investigation the answer to this question can be determined as ‘yes’, as this follows from the results of previous work finding that language models are functionally non-invertible, meaning that one cannot train a model using a reasonable amount of compute to invert the last hidden layer’s last token embedding to regenerate the input that was fed to the model to generate the embedding in the first place.
A system by which neither provider nor user has access to all model and input information, but the user always receives the correct next token, is as follows: first the provider shares all layers of their model except the language modeling head with the user, who then performs a forward pass on those layers to get the last token’s last hidden layer activations and sends those to the provider to receive the next token. The results of the previous work mentioned above imply that the provider cannot uniquely identify more than around 7% of the tokens the embedding corresponds to, such that the exact information it contains remains secret, and likewise the language modeling head remains secret from the user.
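The exchange described above can be sketched with a toy model. Everything here is a hypothetical stand-in chosen for illustration: the embedding table `EMBED`, the single mixing layer `causal_mean`, and the head matrix `HEAD` are not taken from any real model. The point is only to show which values cross the user/provider boundary.

```python
# Hypothetical token embeddings (shared with the user) and LM head (kept by the provider).
EMBED = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
HEAD = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]

def causal_mean(hidden):
    """Toy causal mixing layer: each position becomes the mean of itself and earlier positions."""
    out, running = [], [0.0] * len(hidden[0])
    for count, vec in enumerate(hidden, start=1):
        running = [r + v for r, v in zip(running, vec)]
        out.append([r / count for r in running])
    return out

def user_side(tokens):
    """User runs every shared layer and releases only the last token's final hidden vector."""
    hidden = causal_mean([EMBED[t] for t in tokens])
    return hidden[-1]

def provider_side(last_hidden):
    """Provider applies the secret head and returns the index of the largest logit."""
    logits = [sum(w * h for w, h in zip(row, last_hidden)) for row in HEAD]
    return max(range(len(logits)), key=logits.__getitem__)

next_token = provider_side(user_side([0, 0, 1]))
```

The provider only ever sees one vector of size `d` per request, and the user never sees `HEAD`.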
This is notably not a very good secrecy system: with enough inputs the user will be able to closely approximate the provider’s language modeling head transformation assuming that it is a single linear layer, and likewise although the provider cannot identify exact tokens the embedding still contains some useful input information. A natural question to ask is whether or not the provider could simply send fewer layers to the user and thus retain more model information without decreasing next token accuracy.
This motivates the following question: what is necessary for language modeling secrecy in this scenario?
Theory: Secrecy and Invertibility
In the context of secrecy models, perfect secrecy requires that the model be expressed as a non-invertible (more precisely, non-injective) function that mixes a sufficiently large input space. We examine the first quality before proceeding to the second. A non-injective function is one that maps many distinct inputs to a single output. As currently constructed, language models are highly non-invertible (composed of many layers of non-invertible transformations) and fulfill this criterion almost trivially, but in the sense of next token prediction these models are also functionally non-invertible, as mentioned earlier, because one cannot typically regenerate the input sequence of tokens given a vector sufficient to map to the output (the last hidden layer of the last token). The likelihood of invertibility in this functional sense drops precipitously as the number of tokens in the input sequence increases, because the amount of information present in the last token’s last hidden layer decreases relative to the input’s total information.
As we shall see on this page, however, next token prediction language models, although strictly non-invertible, are functionally invertible if hidden layers from all tokens are supplied to a decoder. As a full-input embedding must be given for the provider to keep secret more than just the language modeling head transformation, this paradigm is particularly important for the following discussion of applications.
Perfect Secrecy
Following Shannon, we first consider the case of perfect secrecy, defined as the case where the probability distribution of a message over all potential messages is unchanged after one receives an encryption of that message. We ignore the information yielded by the model’s prediction of the next token, as for certain architectures the provider would not have this information either.
In the classical sense, a message can be encrypted with a key at least as large as the message itself, such that the number of encryptions is at least as large as the number of messages. This provides perfect secrecy: the probability that a message has identity $M$ is unchanged if we are given the encoding of that message, $E$, or in symbols $P(M) = P_E(M)$, which by Bayes’ theorem is equivalent to $P(E) = P_M(E)$. An example of perfect secrecy where $\vert M \vert = \vert E \vert = n$ was given by Shannon as follows: for encryption method $T$ mapping messages $M$ to encodings $E$, where the $n$ messages are indexed $M_j \in \{ M_0, M_1, \dots, M_{n-1} \}$ and similarly $E_s \in \{ E_0, E_1, \dots, E_{n-1} \}$ and $T_i \in \{ T_0, T_1, \dots, T_{n-1} \}$, we then have
\[T_i M_j = E_s\]
with $s = i + j \pmod n$. If each $T_i$ is chosen with equal probability, this results in $P(E) = P_M(E) = 1/n$, fulfilling the condition of perfect secrecy.
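Shannon’s example is small enough to check exhaustively. The sketch below represents messages, keys, and encodings by their integer indices and assumes the key index $i$ is drawn uniformly at random; the function names are illustrative only.

```python
from collections import Counter
from fractions import Fraction

def encrypt(i, j, n):
    """Apply transformation T_i to message M_j and return the index s of E_s."""
    return (i + j) % n

def ciphertext_distribution(j, n):
    """P_M(E): distribution over encodings for a fixed message M_j and a uniformly random key."""
    counts = Counter(encrypt(i, j, n) for i in range(n))
    return {s: Fraction(c, n) for s, c in counts.items()}

n = 5
for j in range(n):
    dist = ciphertext_distribution(j, n)
    # Every encoding is equally likely whichever message was sent.
    assert all(dist.get(s) == Fraction(1, n) for s in range(n))
```

Because the distribution over encodings is the same for every message, observing $E_s$ does not change the observer’s beliefs about which $M_j$ was sent.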
We must adapt this theory for use with the language modeling scenario defined above because ciphering via $T$ must be restricted to generate encodings $E_s$ that are themselves useful natural language token sequences. We define a ‘useful’ encoding as one that yields the same next token (or next token probability distribution for sampled models) when fed to a language model as the original message $M_j$. The language model $\theta$ performs a transformation of potential input sequences $a$ to a single output token $b$, denoted as $b = O(a, \theta)$, which in the context of a perfect secrecy system can be represented as follows:
\[O(M_j, \theta) = O(T_i M_j, \theta) = O(E_s, \theta)\]
It is almost always safe to say that many distinct $a$ are mapped to one $b$ (which is certainly the case for natural language) such that $O: a \to b$ is a non-invertible mapping. To create a perfect secrecy system, we proceed as follows: first, for any given message $m$ we assemble a (potentially infinite) set of equivalent messages $M$ such that for all $M_k, M_l \in M$ with $k \neq l$ we have $O(M_k, \theta) = O(M_l, \theta)$; then we map to an encoding via $T_i: M_j \to E_s$ ($E_s$ is also an element of $M$ by definition). The resulting encoding yields the same output when fed to the provider’s model but reveals nothing about the actual input sequence. Conceptually this procedure may be stated as follows: if we can map our secret message to the set of all messages that yield the same next token when fed to a provider’s model, we can simply swap our phrase for a randomly chosen element of this set and reveal nothing about our message, assuming that the set is very large (typically it is infinitely large, ignoring context window limitations). The two necessary elements for this procedure are non-invertibility of the model (so that the set $M$ has more than one element) and input mixing (so that $m$ can be swapped with $M_j$).
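The swap-within-an-equivalence-class idea can be illustrated with a deliberately tiny example. The function `toy_model` below is a hypothetical stand-in for $O(a, \theta)$ (it is not a language model), chosen only because many inputs share each output and the full equivalence class can be enumerated.

```python
import itertools
import random

VOCAB = 4   # toy vocabulary size (assumption for illustration)
LENGTH = 3  # toy sequence length (assumption for illustration)

def toy_model(seq):
    """Hypothetical stand-in for O(a, theta): many sequences map to each output token."""
    return sum(seq) % VOCAB

def equivalence_class(message):
    """All sequences that yield the same output token as the message."""
    target = toy_model(message)
    return [s for s in itertools.product(range(VOCAB), repeat=LENGTH)
            if toy_model(s) == target]

def encode(message, rng):
    """Replace the message with a uniformly chosen member of its equivalence class."""
    return rng.choice(equivalence_class(message))

secret = (1, 3, 2)
encoding = encode(secret, random.Random(0))
assert toy_model(encoding) == toy_model(secret)
```

Note that `equivalence_class` has to evaluate `toy_model` on every candidate, so whoever builds the set already holds a function that predicts the output.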
There is a substantial practical problem with this approach, however: if we were to find a function $F$ to generate the set $M$ that results in the same next token as our message $m$, we end up with a function that infers the same next token as the provider’s model. This means that the user would not actually need to use the provider’s model at all, and the provider’s information is effectively obtained by the user.
This difficulty may be circumvented if the user instead obtains a causal language modeling (next token prediction) loss gradient from the provider, where the gradient is backpropagated through the provider’s model portion and can then be backpropagated through the user’s model portion to inform how the user’s model can be updated while still remaining accurate with respect to next token prediction.
We now turn to practical secrecy methods that do not require a recapitulation of the provider’s model portion, assuming that the provider gives the appropriate gradients to the user when requested.
Practical Secrecy
Practical secrecy for the user/provider language modeling paradigm may be defined as follows: can the user and provider share minimal information with each other while undergoing successful modeling, where the provider cannot realistically be expected to recover the user’s information given the compute they may have access to?
Before directly addressing this question, we can answer a simpler one: assuming that the provider has no compute or any other codebreaking method, can user and provider exchange minimal information and succeed in their modeling? The answer is yes, and one method that fulfills this criterion is as follows: first, the provider sends the user a certain fraction of their model’s layers, say 1/4 or 1/2; second, the user performs gradient descent on an initially random input so that the output of the last shared layer matches the output that layer gives for their secret message; third, the user sends this generated input (actually the embedding of this input) to the provider instead of their message. Previous work has shown that such generated embeddings essentially never match the input used to generate the target output, as long as the target is not in the first few model layers. The input generation process is conditioned on a random starting point that depends on the seed one uses, such that for any message there are many (infinitely many) generated inputs that all yield the correct next token.
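A minimal sketch of the input generation step follows, under strong simplifying assumptions: the shared layers are replaced by a fixed linear map `W` from four inputs to two outputs (a hypothetical stand-in, not a transformer), and the loss is plain squared error. It shows only the mechanism: different seeds converge to different inputs that produce the same layer output as the secret.

```python
import random

# Hypothetical stand-in for the shared layers: a non-injective linear map (4 inputs, 2 outputs).
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]

def shared_layers(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def generate_input(target, seed, steps=200, lr=0.1):
    """Gradient descent from a random start until shared_layers(x) matches the target."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(steps):
        residual = [o - t for o, t in zip(shared_layers(x), target)]
        # Gradient of ||W x - target||^2 with respect to x is 2 * W^T * residual.
        grad = [2.0 * sum(W[r][c] * residual[r] for r in range(len(W)))
                for c in range(len(x))]
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x

secret = [0.3, -0.7, 0.5, 0.2]
target = shared_layers(secret)
generated = [generate_input(target, seed) for seed in (0, 1, 2)]
```

In this toy setting each seed keeps the part of its random start that `W` cannot see, which is why the three generated inputs differ from one another and from `secret` while matching `target`.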
For an example, suppose we had the following secret message:
This is a secret message, not to be shared with anyone ever. The contents of this message are so obfuscated, so unknowable, that no one will ever be able to find what they are. The message is: The true identity of Satoshi Nakamoto is Spongebob Squarepants. End Message.
For a small 16-layer transformer model trained for next token prediction on FineWeb, if we perform this input generation procedure with three different random seeds (random initial states), we generate embeddings that map to the following tokens:
sign所所Batelizeomanip welt摄bebby Sob.ăng bby ofainathiselize inopleabweanik andOf of crest andeach.ofchina服obleoot Caldwellbyculo liesbybyAppearbyossal服ieuxof/original ofelize_ABI район/masterhaltainaainaoleonferenceselizeampa娘elize
sign所所 Carryelize(ns welt spiralRVby Sobelixăng bby ofainathiselize in висabweanik andOf of and andeach.of Zukoot visitorongsTo(nsbyculo易bybyAppearbyoyal服ieuxờiifth ofelizearchyspath/masterhaltainailtonoleonendoza"},ampa娘 cue
所`.`elizeapiro welt kRVяти Sob.ăng belize ofainathiselize in_soabwein andOf ofяб andeach.of ZukIobleongsTo Caldwellbyculo is andbyisetbyoyalidgeieuxofifth ofelizearchyspath khaltainailtonoleonferenceselizeampa娘 b
which are clearly distinct although they do contain a somewhat similar subset of input tokens, and in no way resemble the secret message.
Now the more difficult question: can user and provider exchange minimal information for successful modeling assuming that the provider attempts to recover the user’s input information? In an earlier section we saw that if the provider is willing to share nearly all of the model with the user, and the user accepts that the provider will be able to identify around 7% of their input tokens, then the answer is yes. But it is unlikely that a provider would consent to share nearly all of their model with the user as is necessary in that method, nor is it likely that a user would be happy with only around 93% secrecy.
The difficulty here is that if the provider wishes to withhold most of their model, and if we assume the provider uses a transformer model, the user must supply not just the last token’s last hidden layer but every token’s $n$th hidden layer embedding to the provider. For causal transformers doing so results in a practically invertible system: we can train a decoder to take the output of all tokens of the user’s portion of the model and regenerate the input sequence, which is notably not the case if a single token’s embedding is used.
It turns out that if the provider expends some compute and effort to decipher the obfuscated inputs given by the gradient descent method above, they can determine the original message without too much trouble. The intuition here is that although many inputs map to one output, the inputs generated above are never actually found in the training dataset and thus a trained model can simply map these back to the corresponding real inputs. A decoder trained to invert a language model’s encoding turns out to be sufficient to decode these obfuscated inputs.
Secrecy with many models: the Sicilian approach
The structure of LLM architectures today is remarkably homogeneous: practically every large model consists of a sequence of modules, each composed of a token mixing layer (usually self-attention or hybrid attention-state space) and a feedforward layer on each token. To simplify this discussion, we refer to the output activations of these modules as ‘layers’. Architectural details are not particularly important for this discussion aside from the sequential nature of models, where the knowledge of one hidden layer (for all tokens) is sufficient to complete the forward pass and get a next token output. This means that the user can retain any first $n$ layers to keep information from the provider, but retaining the last $n$ layers cannot possibly keep information from the provider because they will always be able to simply complete the forward pass.
Arguably the most important effect of the provider always being able to obtain the identity of any output token (if they retain any part of the model at all) is that the provider can trivially assemble a list of non-encoded output tokens for each prompt. This effectively makes KV caching unsuitable for long conversations, in the sense that the provider is more and more likely to uncover the user’s secrets simply by observing the output tokens produced. The user can circumvent this issue by maintaining many conversations and swapping encoding methods for each next token for each conversation, as then the provider has no knowledge of which conversation corresponds to which input without being able to decode the input. This is notably not the case if KV caching is used, however, as the provider can simply reference the KV vectors they retain to know the identity of the conversation (in terms of output tokens).
Nevertheless, it can be shown that even sequential architectures can provide perfect secrecy (even if impractically, without the ability to use a KV cache), and this can be done as follows:
First consider an arbitrary sequential model trained for next token prediction, which we call $P$, composed of $L$ layers in total. To share some but not all (or even most) of this model’s information with the user, the provider can send them a certain number of layers starting with the token embedding transformation, which we can think of as an encoder $E = P_{0:n-1}$, while retaining the rest of the layers as a causal decoder $D = P_{n:L}$. In this paradigm, the user takes their message $M$ and encodes it by applying the layers they receive from the provider to make $e = E(M) \in \mathbb{R}^{cd}$, where $c$ signifies the context size in terms of the number of tokens $n_{ctx}$ and $d$ the hidden layer dimension, and then sends this encoding to the provider, who completes the forward pass and provides the next token to the user.
The encoding $e$ is not strictly speaking in the clear in the sense that one would be able to recover $M$ with no effort, but it is also not a very good encoding for most model types because it can easily be broken. To do so, the provider only has to invert $E$, and can do this by training an inversion decoder $I$ to regenerate inputs given encodings (i.e. to map $I(e) = M$) in a generalizable way, so that practically any natural language input the user provides as $M$ can be recovered even if it is not in the provider’s training dataset. It turns out that this training is not difficult, and takes just a few minutes for a small model with $d=512, l=16, c=612$ on the FineWeb-edu dataset. Once the provider has trained their inversion decoder, they can simply take every $e_i$ provided by the user and decode it by running a forward pass through the inversion decoder.
The question the user can ask is as follows: can a new encoder be trained such that the provider’s decoder is incapable of accurately mapping the output of this new encoder to the original message $M$? The main constraint here is that this new encoder, which we refer to as a ‘secret’ encoder $S$, must also be useful for next token prediction and in particular must have a similar next token distribution to the original encoder upon the forward pass of $D$.
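The split of $P$ into a user-side encoder and a provider-side decoder amounts to slicing a list of layers. The sketch below uses hypothetical elementwise layers built by `make_layer` (no token mixing, for brevity) and an already-embedded message; it only checks that composing the two halves reproduces the full forward pass and that the user sends one $d$-dimensional vector per token.

```python
from functools import reduce

def make_layer(scale, shift):
    """Hypothetical layer acting elementwise on a c x d list of hidden vectors."""
    def layer(hidden):
        return [[max(0.0, scale * v + shift) for v in row] for row in hidden]
    return layer

def run(layers, hidden):
    """Apply the layers in order."""
    return reduce(lambda h, layer: layer(h), layers, hidden)

P = [make_layer(1.0 + 0.1 * k, -0.05 * k) for k in range(6)]  # L = 6 layers
n = 2
E, D = P[:n], P[n:]  # the user receives E, the provider retains D

M = [[0.2, 0.4, 0.6], [0.1, 0.3, 0.5]]  # c = 2 tokens, d = 3 (assumed already embedded)
e = run(E, M)                           # what the user sends to the provider
assert run(D, e) == run(P, M)           # the provider completes the forward pass
```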
Secrecy with one model per message: the Tortuga approach
In the last section we considered sequential models and showed how one can perform secrecy obfuscation using combinations of secret encoders. The primary disadvantage of such efforts is that 1) the provider will still be able to obtain the next token, and because of this 2) the secret encoder training method is involved, requiring many models to be trained and utilized.
Happily, both of these are features of sequential models rather than of language modeling in general. To show that this is the case, consider a counterexample in which a model had a sequential stack of layers similar to current transformers, but also a parallel stack (say of many fewer layers) that took as input the output of the first layer of the first stack, and gave an output to the input of the last layer. This input can be as simple as a linear combination between layers or else a more complicated operation. In effect this is a model with both sequential and parallel modules, architectures which proved very effective for image modeling in the hands of Google (see GoogLeNet).
It is apparent that a provider that retains many layers from such a model typically does not know the identity of the output token, as the output depends on both stacks and the provider may retain only one. This means that the provider does not have knowledge of the user’s next token upon each forward pass, which has the notable advantage of allowing for KV caching to greatly speed up inference.
This property of the provider not knowing the identity of each next token output allows one to train a new type of secrecy system that requires only one model, where the secrecy encoder is more or less unbreakable if the provider does not know the user’s secret message ahead of time. This is somewhat analogous to the island of Tortuga as depicted in the Pirates of the Caribbean movies, where the island can only be found by those who already know where it is (and presumably share the secret as more than one person found the island). In our parlance, the provider can only train a secret inverter model if they know what the secret is.
How can such a secret model be trained? There are theoretical reasons to suppose that any generalizable secret model $S$ (meaning that the user can apply $S$ to any set of messages $M$ and expect a useful encoding) can be inverted by the provider, the strongest being that the provider only has to guess a corpus that contains $M$ and train inversion models on that corpus because of the generality of $S$ (we assume that there is no a priori reason that $S$ cannot be inverted, for example due to high compression from inputs to outputs). These arguments imply that the user must likely forgo a general model if they want to use a single $S$, but happily the alternative can be shown to provide strong secrecy.
The approach to training a secrecy model which is not general to many messages is as follows: first the messages $m \in M$ are selected and then a secrecy encoder $S$ is trained such that the provider’s inversion decoder $D$ is ‘fooled’, and maps $S(m)$ to a random token sequence rather than $m$. This is identical to what was done for general models earlier on this page, but here we limit the number of messages and train to purposely overfit to these few messages. If the provider does not know the identity of these few messages but trains their own secret inverter $D$ knowing the method that the user employed, can they hope to decode $S(m)$? By definition, not if the model is sufficiently overfit to $m \in M$, as the model’s behavior on the inputs that the provider is likely to use for training $D$ does not define the behavior of $S$ on $m$, the inputs that actually matter to the user. Empirically this can be shown as follows: for $\vert M \vert = 64$, each of length $512$ tokens, a provider can generate many (say 10) training runs’ worth of data on a general corpus to train $D$, but when applied to the secret messages $M$ the inversion loss is high, with a cross-entropy loss (CEL) of 5.8 and a token reconstruction accuracy of around $12\%$.
The user can improve on this general idea by observing that the goal is really to train an $S$ to fool the provider’s secret decoder only on the inputs they care about, and make all the other outputs of $S$ indistinguishable from those of the original causal encoder $E_{clm}$. If such a model were trainable, the provider would have no hope of recovering $M$ unless they already knew this input, because training a secret decoder would result in a model identical to the original causal model inverter. The secrecy of $S(m)$ lies in the ability of $S$ to be trained to approximate the causal encoder for all inputs except those $m \in M$, so the question remains: can such a model be trained?
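A toy illustration of why holding one stack is not enough to determine the output follows. All of the functions are hypothetical stand-ins, and the sketch assumes that the provider retains only the sequential stack while the user keeps the first layer, the parallel stack, and the final merge; the text does not fix this assignment.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def first_layer(x):
    return [2.0 * v for v in x]

def sequential_stack(h):
    """Stand-in for the deep sequential stack (assumed retained by the provider)."""
    return [v + 1.0 for v in h]

def parallel_stack(h):
    """Stand-in for the shallow parallel stack (assumed retained by the user)."""
    return [-3.0 * v for v in h]

def full_model(x):
    """Output token depends on the sum of both stacks' outputs."""
    h0 = first_layer(x)
    merged = [s + p for s, p in zip(sequential_stack(h0), parallel_stack(h0))]
    return argmax(merged)

def provider_view(h0):
    """The best the provider can do with only its own stack's output."""
    return argmax(sequential_stack(h0))

x = [0.1, 0.9, 0.4]
true_token = full_model(x)
provider_guess = provider_view(first_layer(x))
```

For this input the provider’s stack alone points at a different token than the full model produces, so observing its own activations does not reveal the user’s next token.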
Perhaps the simplest way of training this model is to train identically to the causal inversion decoder model for all inputs except for one or a few $m$, which can be simply swapped in to each training minibatch. For those inputs in $M$, $S$ is trained to yield embeddings that give the correct next tokens when fed to the causal language decoder but map to arbitrary random token sequences when fed to the provider’s inverter. This can be done in a similar way as explored above, where a combined causal + inversion objective function is applied to $S$ except that the inversion objective is the true inversion map $S(x_i) \to x_i$ for all inputs not in $M$, where $S(m)$ maps to random tokens. This can be trained to high precision (< 0.01% token reconstruction error) in 1k training steps, and is thus a feasible approach computationally speaking.
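The data side of this procedure can be sketched as follows. The helper names (`make_decoys`, `swap_in`, `inversion_targets`) are hypothetical, and using one fixed random decoy sequence per secret message is an assumption; the text only says that secret inputs map to arbitrary random token sequences. The causal (next token) targets are unchanged and are not shown.

```python
import random

def make_decoys(secrets, vocab_size, seed=0):
    """Assign each secret message a fixed random token sequence of the same length."""
    rng = random.Random(seed)
    return {m: [rng.randrange(vocab_size) for _ in m] for m in sorted(secrets)}

def swap_in(batch, secrets, rng):
    """Replace one randomly chosen minibatch slot with a secret message."""
    batch = [list(x) for x in batch]
    batch[rng.randrange(len(batch))] = list(rng.choice(sorted(secrets)))
    return batch

def inversion_targets(batch, secrets, decoys):
    """Inversion target per sequence: the true tokens, or the decoy if the input is secret."""
    return [decoys[tuple(x)] if tuple(x) in secrets else list(x) for x in batch]

secrets = {(5, 6, 7)}
decoys = make_decoys(secrets, vocab_size=1000)
batch = swap_in([[1, 2, 3], [4, 4, 4]], secrets, random.Random(0))
targets = inversion_targets(batch, secrets, decoys)
```

Every ordinary sequence keeps the true inversion map $S(x_i) \to x_i$, so on general data the trained $S$ behaves like the ordinary encoder.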
Now that we have existence of inputs that satisfy secrecy using this approach, it is worth examining why this is at all possible in the first place. The first consideration here is that the space of large-dimensional models is huge, and one can find a point in this space that both a
Built-In Secrecy
So far we have considered the case in which the provider has trained a model without regard to secrecy, and in particular the model the provider trains is easy to invert over the set of inputs for which one can expect to be able to get a useful output (natural language in our case). It is worth noting that this inversion requires far less compute than actually training the model, and in our experiments requires <1/1000th the pretraining compute. The way we have achieved secrecy in this setting is by using the difference between strict and effective invertibility, which is the difference between a truly invertible function in which each unique input maps to a unique output and one that is not invertible in this sense but for which an inversion function can be fitted over all relevant inputs such that for these inputs one can invert the model using this function (this is explained in more detail here). Invertibility and sufficient mixing reflect these definitions: effectively invertible functions are those that may not be truly invertible (as is the case for Transformers generally) but whose embedding space is insufficiently mixed to prevent the training of a learned inverter function.
So far we have examined methods by which an effectively invertible function (a portion of a causal language model) may be modified to become effectively non-invertible if one includes the secret message, and the modification process involves training secret encoders that yield embeddings which both return correct next tokens and remove the effective invertibility property of the embedding itself. These approaches make no claims on the model itself, such that practically any off-the-shelf causal language model may be expected to be converted to secrecy. The primary downside of these approaches is that they require a significant amount of computation by the user and provider, and a significant amount of communication in the form of gradient vector sharing.
There is a much more direct route to secrecy, however: instead of starting with an effectively invertible model and modifying it (really just modifying the secret encoder portion), we instead perform pretraining of this model in such a way that it is not effectively invertible in the first place. This puts the onus of secrecy on the provider (who has to train a new model), but the secrecy of such a model can be easily verified by attempting to invert the portion shared with the user (which requires a trivial amount of compute relative to the actual pretraining procedure). The question that remains is how such a training procedure can be accomplished, and this is what we explore here.
What do we need to include to prevent a model from being effectively invertible while still performing accurate language modeling? Equivalently, how can we train a model to sufficiently mix the input space such that next token prediction is still useful but the output gives little information as to the identity of the input tokens? We start by defining a standard transformer-style causal language architecture as before, and again assume that the provider shares a portion of this model $S$ composed of the first $n$ layers of the full model $O(x, \theta)$ with the user so that they can form $S(M)$. There are two relevant outputs from this model: one is the full forward pass $y = O(x, \theta)$ and another the secret encoder’s output, which we denote $S(x) = O_l(x, \theta)$ for the output at layer $l$. We want to minimize the cross-entropy loss between the full model’s output and the target sequences, $y, \hat y$, and we train the model to do so. But simultaneously we also want $S(x)$ to be non-invertible, so we also train an inversion model $\theta_I$ that attempts to map $O(S(x), \theta_I)$ to the input sequence $x$ while training the secret model $\theta$ to prevent this mapping.
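One way to write these two competing objectives, offered as a sketch rather than as the exact training recipe: let $\mathrm{CE}$ denote cross-entropy, and let $\lambda > 0$ be a weighting coefficient. Both $\lambda$ and the choice to penalize the model by the negated inversion loss are assumptions introduced here for illustration.

\[\mathcal{L}_{clm}(\theta) = \mathrm{CE}\big(O(x, \theta),\ \hat y\big), \qquad \mathcal{L}_{inv}(\theta, \theta_I) = \mathrm{CE}\big(O(O_l(x, \theta), \theta_I),\ x\big)\]
\[\theta_I \leftarrow \arg\min_{\theta_I} \mathcal{L}_{inv}(\theta, \theta_I), \qquad \theta \leftarrow \arg\min_{\theta} \big[ \mathcal{L}_{clm}(\theta) - \lambda\, \mathcal{L}_{inv}(\theta, \theta_I) \big]\]

The inverter $\theta_I$ is updated to reconstruct $x$ from $S(x) = O_l(x, \theta)$, while the model $\theta$ is updated to predict next tokens well and to make that reconstruction harder.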
Utility
Returning to our original question: can a user, who wants to keep most of a message secret, and an LLM provider, who wants to keep most of their model’s parameters secret, successfully undergo language modeling and get a next token while sharing only a small portion of their respective information? We have seen that the answer to this question is yes (assuming that the provider and user work together), and that this is the case both when the provider trains a model with secrecy in mind (such that it is effectively non-invertible) and, somewhat surprisingly, also when the provider’s model was not trained for secrecy (and is easy to invert under normal circumstances), albeit with somewhat less practical implementations.
Modern encryption typically falls short of perfect secrecy as defined earlier because one usually seeks to use an encryption key that is smaller than the message. But it does ensure practical secrecy in the sense that the commonly used encryption functions are difficult to break, and thus for most purposes can be considered secure.