Shape preserving encoding, part 1 of 2

Hiding in plain sight…

People have always been interested in making sure that certain other people (the adversaries) could not read messages intended for their friends (the confederates), and cryptography is one way to do that. (Steganography is another way, as is using a language unknown to the adversaries.)

For most of the history of cryptography, the adversaries have been only humans, so the techniques of cryptography have relied on information known (hopefully!) only to the confederates, which is used to render the plaintext into an unreadable ciphertext.

Let’s do an example the way Caesar did it in his letters. Here is the plaintext message (the first paragraph of Through the Looking-Glass, by Lewis Carroll, who I think would have enjoyed these two blog posts):

One thing was certain, that the WHITE kitten had had nothing to do with it: — it was the black kitten’s fault entirely. For the white kitten had been having its face washed by the old cat for the last quarter of an hour (and bearing it pretty well, considering); so you see that it COULDN’T have had any hand in the mischief.

 

Now here’s the ciphertext according to Caesar (yes, that’s a bit anachronistic).

Vul aopun dhz jlyahpu, aoha aol DOPAL
rpaalu ohk ohk uvaopun av kv dpao pa: —
pa dhz aol ishjr rpaalu’z mhbsa luapylsf.
Mvy aol dopal rpaalu ohk illu ohcpun
paz mhjl dhzolk if aol vsk jha mvy aol
shza xbhyaly vm hu ovby (huk ilhypun pa
wylaaf dlss, jvuzpklypun); zv fvb zll
aoha pa JVBSKU’A ohcl ohk huf ohuk pu
aol tpzjoplm.

 

Every letter is replaced by the letter a fixed number of letters later in the alphabet, circling around when we get to Z. That number of letters is the secret information known only to us and our confederates but not to our adversaries. Without that secret information, the ciphertext is unreadable.
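For the curious, here is a minimal Python sketch of that shifting rule. The function name caesar_shift is my own invention for illustration, and it assumes we only shift the plain ASCII letters and leave everything else (spaces, punctuation) alone.

```python
import string

def caesar_shift(text: str, shift: int) -> str:
    """Shift each ASCII letter `shift` places, wrapping around after Z/z."""
    result = []
    for ch in text:
        if ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        elif ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            result.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(result)

ciphertext = caesar_shift("One thing was certain", 7)
print(ciphertext)                    # Vul aopun dhz jlyahpu
print(caesar_shift(ciphertext, -7))  # One thing was certain
```

Shifting by the negative of the secret number undoes the disguise, which is all our confederates need to do.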

Eventually, clever people figured out how to deduce the secret number (it’s 7 in this case), and computers make it simple (well, for some programmers, not me) to “crack the code” and deduce that number, letting the adversaries read the message.
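One simple way a program might do that cracking is to try all 26 possible shifts and keep the one that produces the most recognizable English words. The sketch below is only an illustration of that idea, not a claim about how real codebreakers work: the names guess_shift and shift_letters, and the tiny word list, are assumptions of mine.

```python
# A deliberately tiny, illustrative word list; a real one would be much larger.
COMMON_WORDS = {"the", "that", "was", "had", "it", "to", "of", "and"}

def shift_letters(text, shift):
    """Shift ASCII letters by `shift` places, wrapping around the alphabet."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

def guess_shift(ciphertext):
    """Return the shift (0-25) whose decryption contains the most common words."""
    def score(shift):
        words = shift_letters(ciphertext, -shift).lower().split()
        return sum(word.strip(".,;:()") in COMMON_WORDS for word in words)
    return max(range(26), key=score)

secret = "aoha aol DOPAL rpaalu ohk ohk uvaopun av kv dpao pa"
shift = guess_shift(secret)
print(shift)                          # 7
print(shift_letters(secret, -shift))  # that the WHITE kitten had had nothing to do with it
```

With only 26 candidates to check, the secret number is not much of a secret once a computer gets involved.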

However, with the rise of digital communications, computers aren’t used just for deciphering encoded messages, but for reading plaintext (unencoded) messages automatically as well. Companies (we’ll call them all “BigCo”) scan web pages, email messages, tweets, etc. to extract information, all without humans reading those texts. While sometimes the computers are trying to be helpful, other times we might not want their “help.” In these cases, the computers themselves become the adversaries.

So how might we conceal our messages from the computers of BigCo? Well, if you’re serious, you use sophisticated encryption, but we’re not serious. We’d also like our confederates to be able to read our messages without having any secret information, and since we’re all a bit lazy, it would be nice if they (and we) could read the messages without having to do anything special (except maybe think a little).

In other words, we’d like a way of disguising our message (not encrypting it, since any person can read the message) so that the computers won’t “understand” what is in the message, but our confederates will be able to read it right off with no secret information — hidden in plain sight. We’ll add a couple more rules to the game: the message must be text (not pictures of text, like some people do on Twitter to avoid the character limit), and no program should be necessary to read the message (even if it is useful to have a program to create the disguised version).

How could this even be possible? One approach is to use the fact that computers represent letters by numbers: A is 65, B is 66, etc. and all the natural language understanding techniques depend on using those numbers. So if we use different numbers for the letters, then the computers will be confused (lots of anthropomorphizing here, but hey, we’re having fun). Now clever programmers can be on the lookout for our tricks, and thereby deconfuse the computers, but at least we’ll make them work a bit instead of just reading our important messages, like where we’re going to get doughnuts later.

So here’s that same example in a new disguise:

𝖮𝗇𝖾 𝗍𝗁𝗂𝗇𝗀 𝗐𝖺𝗌 𝖼𝖾𝗋𝗍𝖺𝗂𝗇, 𝗍𝗁𝖺𝗍 𝗍𝗁𝖾 𝖶𝖧𝖨𝖳𝖤 𝗄𝗂𝗍𝗍𝖾𝗇 𝗁𝖺𝖽 𝗁𝖺𝖽 𝗇𝗈𝗍𝗁𝗂𝗇𝗀 𝗍𝗈 𝖽𝗈 𝗐𝗂𝗍𝗁 𝗂𝗍: — 𝗂𝗍 𝗐𝖺𝗌 𝗍𝗁𝖾 𝖻𝗅𝖺𝖼𝗄 𝗄𝗂𝗍𝗍𝖾𝗇’𝗌 𝖿𝖺𝗎𝗅𝗍 𝖾𝗇𝗍𝗂𝗋𝖾𝗅𝗒. 𝖥𝗈𝗋 𝗍𝗁𝖾 𝗐𝗁𝗂𝗍𝖾 𝗄𝗂𝗍𝗍𝖾𝗇 𝗁𝖺𝖽 𝖻𝖾𝖾𝗇 𝗁𝖺𝗏𝗂𝗇𝗀 𝗂𝗍𝗌 𝖿𝖺𝖼𝖾 𝗐𝖺𝗌𝗁𝖾𝖽 𝖻𝗒 𝗍𝗁𝖾 𝗈𝗅𝖽 𝖼𝖺𝗍 𝖿𝗈𝗋 𝗍𝗁𝖾 𝗅𝖺𝗌𝗍 𝗊𝗎𝖺𝗋𝗍𝖾𝗋 𝗈𝖿 𝖺𝗇 𝗁𝗈𝗎𝗋 (𝖺𝗇𝖽 𝖻𝖾𝖺𝗋𝗂𝗇𝗀 𝗂𝗍 𝗉𝗋𝖾𝗍𝗍𝗒 𝗐𝖾𝗅𝗅, 𝖼𝗈𝗇𝗌𝗂𝖽𝖾𝗋𝗂𝗇𝗀); 𝗌𝗈 𝗒𝗈𝗎 𝗌𝖾𝖾 𝗍𝗁𝖺𝗍 𝗂𝗍 𝖢𝖮𝖴𝖫𝖣𝖭’𝖳 𝗁𝖺𝗏𝖾 𝗁𝖺𝖽 𝖺𝗇𝗒 𝗁𝖺𝗇𝖽 𝗂𝗇 𝗍𝗁𝖾 𝗆𝗂𝗌𝖼𝗁𝗂𝖾𝖿.

 

Disguise, what disguise? That’s pretty easy to read. Well, for us, yes. However, that example is using a different set of numbers for the letters, where A is 120224, B is 120225, etc. So the computer will see those numbers and go HUH?!?! Well, we’d like to think so, anyway. Of course, those clever programmers at BigCo may well be several steps ahead of me*, and so they can tell the computer, “Hey, if you see big numbers like 120224 that are supposed to be representing letters, just subtract 120159 (to get 65 in this case, which is A), and you can be ‘helpful’ again.”
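Here is a rough Python sketch of both halves of that game: putting on the disguise and subtracting it off again. The function names disguise and undisguise are made up for illustration. The capital-letter numbers are the ones quoted above; the lowercase starting point (120250, so an offset of 120153) is not stated in this post, but it follows because the lowercase sans-serif letters sit right after the 26 capitals in Unicode.

```python
SANS_UPPER_A = 120224  # disguised capital A, as quoted in the post
SANS_LOWER_A = 120250  # disguised lowercase a (my addition: 120224 + 26)

def disguise(text):
    """Swap plain ASCII letters for their sans-serif math look-alikes."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(ord(ch) - ord("A") + SANS_UPPER_A))
        elif "a" <= ch <= "z":
            out.append(chr(ord(ch) - ord("a") + SANS_LOWER_A))
        else:
            out.append(ch)  # punctuation and spaces stay as they are
    return "".join(out)

def undisguise(text):
    """What a clever BigCo programmer might do: subtract the offset."""
    out = []
    for ch in text:
        cp = ord(ch)
        if SANS_UPPER_A <= cp < SANS_UPPER_A + 26:
            out.append(chr(cp - 120159))  # 120224 - 65
        elif SANS_LOWER_A <= cp < SANS_LOWER_A + 26:
            out.append(chr(cp - 120153))  # 120250 - 97
        else:
            out.append(ch)
    return "".join(out)

hidden = disguise("the WHITE kitten")
print([ord(ch) for ch in hidden[:3]])  # [120269, 120257, 120254]
print("kitten" in hidden)              # False
print(undisguise(hidden))              # the WHITE kitten
```

A plain substring search for "kitten" misses the disguised text, which is exactly the confusion we were hoping for, and two subtractions are enough to clear it up.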

That’s enough cleverness all round for this post. Next time I’ll give some more examples, some of which the clever programmers may not have thought of … yet.

 


Technical notes

Those characters in the example are from the Mathematical Alphanumeric Symbols block of Unicode (the sans-serif letters, specifically), and even though they look like the regular ASCII letters, they are, of course, different characters with different code points: same shape, different encoding. Hence the title. This is the same technique used in URL spoofing, so I can’t take credit (or blame) for it.

* Well, the clever Apple and Google programmers are ahead of me, but the Mozilla programmers have some catching up to do. If you search for “kitten” in this post using Safari or Chrome, those browsers do find it in the “disguised” example. However, Firefox (as of version 81.0.1) does not find “kitten” in the disguised example. I don’t know about the clever Microsoft (or other BigCo) programmers, since I haven’t tried this on a Windows computer or in other programs.
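I don’t know what Safari or Chrome actually do under the hood, but one standard tool that would produce this behavior is Unicode compatibility normalization (NFKC), which folds the math look-alike letters back to plain ones. Here is a small Python sketch using the standard library’s unicodedata module; treat it as a guess at the general idea, not a description of any browser.

```python
import unicodedata

# Build a disguised "kitten" from the sans-serif math lowercase letters
# (lowercase a is code point 120250, right after the 26 capitals).
disguised = "".join(chr(ord(c) - ord("a") + 120250) for c in "kitten")

print("kitten" in disguised)  # False: a naive search misses it

# NFKC replaces compatibility characters with their plain equivalents.
folded = unicodedata.normalize("NFKC", disguised)
print(folded)                 # kitten
print("kitten" in folded)     # True
```

A program that normalizes text this way before searching would find the disguised kitten without needing to know any particular offset.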

 
