Steganographic NLG-powered Chat Plugin
This project is just starting, jointly with some Foulabers.
In many countries, using encryption is illegal and draws attention to the parties involved in an encrypted exchange, to the point of threatening their physical well-being.
In other countries, encryption might be legal but still attract enough attention from law enforcement to make it practically unusable.
StegoChat is designed to carry a covert message in a normal chat conversation so as to be indistinguishable from a normal conversation not carrying any encrypted payload. To do so, quite a bit of preparation and care by both parties in the chat exchange are needed. The current version is still under research and not ready for public use, but if you want to help us evaluate it, contact us.
How It Works
StegoChat requires a collection of chat exchanges between you and your target party. The bigger the collection, the better the chances of keeping secret the fact that you are involved in an encrypted conversation.
From the collection, StegoChat computes a large set of ordered pairs of words. These pairs are called skip-bigrams, and StegoChat associates some with a binary zero and others with a binary one. The pairs and their associations are called the decoding dictionary, which has to be sent to the receiving party beforehand. The best way to do that is to meet the person in real life but, barring that, techniques using steganography on photos or music can be attempted.
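A minimal sketch of how such a dictionary could be built, not StegoChat's actual code. The function names, the `max_skip` window and the idea of giving every observed pair a random bit are assumptions made for illustration; the real selection is stricter (see "Choosing the encoding dictionary").

```python
import random
from collections import Counter

def skip_bigrams(tokens, max_skip=2):
    """Yield ordered word pairs (a, b) where b follows a with at most
    max_skip words in between. max_skip is a hypothetical parameter."""
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            yield (first, tokens[j])

def build_dictionary(messages, seed=None, max_skip=2):
    """Map every skip-bigram seen in the chat collection to a random bit.
    Assumption: all observed pairs are kept; no filtering is done here."""
    rng = random.Random(seed)
    counts = Counter()
    for msg in messages:
        counts.update(skip_bigrams(msg.lower().split(), max_skip))
    # Sorting makes the result reproducible when a seed is given.
    return {pair: rng.randint(0, 1) for pair in sorted(counts)}

chat_log = ["see you at the lab", "are you at home"]
dictionary = build_dictionary(chat_log)
```

Left unseeded, each call produces a different assignment of bits to pairs.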
Every time the decoding dictionary generation process is invoked, you will get a different decoding dictionary. It is encouraged to use a different decoding dictionary to communicate with different people, as the dictionary is also used as a key to encrypt the stream. The effects of this recommendation are still unclear, though.
When using StegoChat, you type a message to be sent encrypted to the other party and then proceed to chat about a normal, non-controversial topic. It is better to type the secret message beforehand to avoid timing attacks (see below). As you type the chat message to be sent in plaintext, StegoChat shows you variants of the text you typed that contain the skip-bigrams encoding the secret message. If you see a variant that fits what you are saying and that you feel you could have said in the current conversation, click on it and your text will be replaced. If none of the options convinces you, rewrite your message from scratch. This process is time consuming, so you had better practice beforehand! If you take too much time, the timing attack tracker will stop you for a longer period of time and you'll have to make an excuse (e.g., "Sorry, phone call"), but you can then type a longer text. Use that time to assemble a nice message.
As you send more and more encrypted data, your statistics will start to drift away from your usual self. This is tracked by the language model attack tracker. As with the timing attack tracker, when it gets to red you won't be able to communicate anymore. To signal this to the other party, you are better off disconnecting from the chat.
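To make the variant selection concrete, here is a rough sketch of how a candidate rewording could be checked against the pending secret bits. It assumes bits are read by scanning the message's skip-bigrams left to right and looking each one up in the dictionary; that reading order, the `max_skip` window and all names below are illustrative assumptions, not the plugin's API.

```python
def skip_bigrams(tokens, max_skip=2):
    """Ordered word pairs with at most max_skip words in between."""
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            yield (first, tokens[j])

def decode_bits(message, dictionary, max_skip=2):
    """Return the bits carried by a message: one bit per skip-bigram
    that is present in the decoding dictionary, in scan order."""
    tokens = message.lower().split()
    return [dictionary[p] for p in skip_bigrams(tokens, max_skip) if p in dictionary]

def usable_variants(variants, dictionary, pending_bits, max_skip=2):
    """Keep the variants whose bits are a non-empty prefix of the payload
    still waiting to be sent, with the number of bits each would send."""
    keep = []
    for text in variants:
        bits = decode_bits(text, dictionary, max_skip)
        if bits and bits == pending_bits[:len(bits)]:
            keep.append((text, len(bits)))
    return keep

dictionary = {("see", "you"): 1, ("the", "lab"): 0, ("at", "home"): 1}
options = ["see you at the lab", "see you at home", "hello there"]
print(usable_variants(options, dictionary, [1, 0, 1]))
```

In this toy run only "see you at the lab" survives: it carries the bits 1, 0, which match the start of the payload, while the other two options carry the wrong bits or none at all.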
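One way such a tracker could work is sketched below, purely as an illustration: compare the word frequencies of what you have sent recently against your own chat history using a smoothed KL divergence, and map the result to a colour. The metric, the add-one smoothing and the `yellow`/`red` thresholds are assumptions; the source does not say how StegoChat measures the drift.

```python
import math
from collections import Counter

def kl_divergence(recent_tokens, baseline_tokens):
    """KL(recent || baseline) over unigram frequencies, with add-one
    smoothing so that unseen words do not give infinite values."""
    vocab = set(recent_tokens) | set(baseline_tokens)
    recent, base = Counter(recent_tokens), Counter(baseline_tokens)
    n_recent = len(recent_tokens) + len(vocab)
    n_base = len(baseline_tokens) + len(vocab)
    total = 0.0
    for word in vocab:
        p = (recent[word] + 1) / n_recent
        q = (base[word] + 1) / n_base
        total += p * math.log(p / q)
    return total

def drift_status(recent_tokens, baseline_tokens, yellow=0.1, red=0.3):
    """Map the divergence to a colour. Thresholds are hypothetical."""
    drift = kl_divergence(recent_tokens, baseline_tokens)
    if drift >= red:
        return "red"
    if drift >= yellow:
        return "yellow"
    return "green"

history = "see you at the lab see you at home".split()
print(drift_status(history, history))
```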
Under The Hood
Choosing the encoding dictionary
The encoding dictionary is chosen so that there are enough skip-bigrams in any position within the chat message and enough non-confusing trigrams and four-grams to enable a good degree of variability.
The key part here is that the skip-bigrams have to be common enough to be usable for sending messages, while the ones assigned to zeros and the ones assigned to ones co-occur in such a way as to form a well-distributed random sequence in normally occurring text. That is quite challenging, given that word distributions are far from random.
The idea is to focus on skip-bigrams that do not appear in half the training sample and that appear once in the other half of the training sample.
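The selection rule above can be read in more than one way. The sketch below takes one possible reading: keep the skip-bigrams that are absent from roughly half of the training messages and occur exactly once in each of the remaining ones. The per-message interpretation, the `tolerance` parameter and the function names are assumptions.

```python
from collections import Counter

def skip_bigrams(tokens, max_skip=2):
    """Ordered word pairs with at most max_skip words in between."""
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            yield (first, tokens[j])

def candidate_pairs(messages, max_skip=2, tolerance=0.1):
    """Select pairs that never occur twice in a message and that occur
    (once) in about half of the messages, within the given tolerance."""
    per_message = [Counter(skip_bigrams(m.lower().split(), max_skip))
                   for m in messages]
    all_pairs = set().union(*per_message)
    selected = []
    for pair in sorted(all_pairs):
        counts = [c[pair] for c in per_message]
        if max(counts) > 1:
            continue  # repeated inside one message: skip it
        fraction = sum(counts) / len(messages)
        if abs(fraction - 0.5) <= tolerance:
            selected.append(pair)
    return selected

sample = ["see you soon", "see you later", "call me later", "call me soon"]
print(candidate_pairs(sample, max_skip=0))
```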
Choosing the encoded variant
This is the key, technologically complex part of StegoChat. It uses ideas from statistical Natural Language Generation by building a network of words with probabilities associated with each subsequence. From this network (called a _trellis_) some highly likely messages are sampled, and those are the rewordings you see. The construction of the _trellis_ has to satisfy a number of simultaneous constraints: it has to contain the words typed by the user (or likely misspellings to neutralize spurious skip-bigrams), it has to contain the skip-bigrams with the hidden message and (most importantly) it has to be highly likely given the user's text style.
For more information about working with n-grams in NLG, see Langkilde & Knight (1998), "The Practical Value of N-Grams in Generation".
Explanation: the original message is extended with new nodes, some coming from the encryption model and some from a language model. Each arc has a probability associated with it. To produce the recoded versions, these probabilities are used to find the most likely paths through the trellis, with the added constraint that each path encodes the target message.
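The toy sketch below illustrates the constrained search on a very small trellis. It swaps the real sampling procedure for brute-force enumeration of every path, which only works at this size; the slot representation, the bigram log-probability table, the probability floor and all names are assumptions for illustration.

```python
import itertools
import math

def skip_bigrams(tokens, max_skip=1):
    """Ordered word pairs with at most max_skip words in between."""
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            yield (first, tokens[j])

def decode_bits(tokens, dictionary, max_skip=1):
    """Bits carried by a token sequence, in scan order."""
    return [dictionary[p] for p in skip_bigrams(tokens, max_skip) if p in dictionary]

def path_logprob(tokens, bigram_logp, floor=math.log(1e-6)):
    """Sum of arc log-probabilities; unseen arcs get a small floor value."""
    return sum(bigram_logp.get(arc, floor) for arc in zip(tokens, tokens[1:]))

def best_encodings(slots, bigram_logp, dictionary, target_bits, k=3, max_skip=1):
    """Enumerate every path through the slots, keep those that encode
    exactly target_bits, and return the k most likely ones."""
    scored = []
    for path in itertools.product(*slots):
        tokens = list(path)
        if decode_bits(tokens, dictionary, max_skip) == target_bits:
            scored.append((path_logprob(tokens, bigram_logp), " ".join(tokens)))
    scored.sort(reverse=True)
    return scored[:k]

# Each slot lists the alternative words at that position in the trellis.
slots = [["see"], ["you", "ya"], ["at", "in"], ["the"], ["lab", "space"]]
dictionary = {("see", "you"): 1, ("the", "lab"): 0, ("the", "space"): 1}
bigram_logp = {
    ("see", "you"): math.log(0.5), ("you", "at"): math.log(0.4),
    ("you", "in"): math.log(0.1), ("at", "the"): math.log(0.5),
    ("in", "the"): math.log(0.5), ("the", "lab"): math.log(0.3),
    ("the", "space"): math.log(0.2),
}
for score, text in best_encodings(slots, bigram_logp, dictionary, [1, 0]):
    print(round(score, 2), text)
```

A real trellis is too large to enumerate, so a dynamic-programming or sampling search over the arcs would be needed; the point here is only how the encoding constraint filters the paths before ranking them by likelihood.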
StegoChat contains the following modules. The chat-client-dependent modules need to be re-written for each client. So far the focus is on xchat2.
StegoChat is currently under development. Contact DrDub in ##foulab (FreeNode) if you'd like to help out.