Bex - guide

Base64 encoding and decoding with Bex

The tutorial shows how base64 encoding (and decoding) routines can be implemented and analyzed in Bex. Base64 is the perfect example of how Bex can succinctly represent string encoders and decoders.

What is base64 encoding?

Base64 encoding is used to encode binary data in order to transfer the data in textual format. The purpose of the encoding is to ensure that the text remains unmodified during transport. To this end the digits of the base64 encoding are chosen to be in the safe range of characters that are unmodified by other encodings. In base64, a number x, 0 <= x <= 63 , is mapped as the following base64 digit E(x) (although some variants exist, the base64 encoding here is the standard one):

  1. If x in [0..25] then E(x) in ['A'..'Z'] , i.e., E(x)=x+65,
  2. If x in [26..51] then E(x) in ['a'..'z'] , i.e., E(x)=x+71,
  3. If x in [52..61] then E(x) in ['0'..'9'] , i.e., E(x)=x-4,
  4. If x = 62 then E(x)='+',
  5. If x = 63 then E(x)='/',

Binary data (sequence of bytes or octets), is encoded as follows. Let <bk,..,b1,b0> stand for the binary representation of a number whose i'th bit is bi. There are three cases.

Notice that the length of a valid base64 encoded sequence is a mutiple of 4.

base64encode

The following bek program base64encode encodes a sequence of bytes in base64 format. Any input that is not a sequence of bytes (contains a value greater than 255) causes an exception.

load in editor
// function E maps x to the following base64-digit and viceversa
function E(x)=(ite(x<=25,x+65,ite(x<=51,x+71,ite(x<=61,x-4,ite(x==62,'+','/')))));
// base64 encoder of an input of bytes, any value in the sequence > 0xFF raises an exception
// for example base64encode("Man") = "TWFu"
program base64encode(_){ 
  replace {
    @"[\0-\xFF]{3}"  ==> [E(#0>>2), E((Bits(1,0,#0)<<4)|Bits(7,4,#1)), E((Bits(3,0,#1)<<2)|Bits(7,6,#2)), E(Bits(5,0,#2))];
    @"[\0-\xFF]{2}$" ==> [E(Bits(7,2,#0)), E((Bits(1,0,#0)<<4)|Bits(7,4,#1)), E(Bits(3,0,#1)<<2), '='];
    @"[\0-\xFF]$"    ==> [E(Bits(7,2,#0)), E(Bits(1,0,#0)<<4), '=', '='];
  }
}
==
js(base64encode);
display(base64encode);
The given bek program can be analyzed for various properties. We consider two sample properties.
  1. The set of valid inputs (inputs not causing an exception), must be the set of all sequences of bytes. load in editor
    ...
    BYTES = re(@"^[\0-\xFF]*$");
    B = dom(base64encode);
    subset(BYTES, B);
    subset(B, BYTES);
    
  2. The set of valid outputs may not contain characters other than base64 digits and the padding character. load in editor
    ...
    B64 = re(@"^[A-Za-z0-9+/=]*$"); //over-approximation of valid outputs
    BAD = complement(B64);          //invalid outputs
    W = invimage(base64encode,BAD); //input witnesses to bad outputs
    isempty(W);                     //check that W is empty
    
In the second property, we can strengthen the regex B64 to be more precise, because the padding character must only occur at the end, e.g., set the regex to re(@"^([A-Za-z0-9+/]*(()|=|==))$") and rerun the check. Try also re(@"^[A-Za-z0-9+/]*(=?)$") or any other regex that is too restrictive.

base64 regex

The following regex describes all valid base64 encodings. Try it out in the editor! load in editor
==
B64 = re("^([A-Za-z0-9+/]{4})*(()|[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)$");            
display(B64);

display command

display(A) command can be used to create an instance of the symbolic automaton A by choosing a single element satisfying each predicate.

display(A,0) will show the full SFA. The command display(A) is short for display(A,0) for SFAs. Sometimes the full symbolic labels may be hard to view and a concrete instance is more informative.

display(A,k), where k is positive, will show the transitions with up to k values (several consequtive values are shown as an interval).

load in editor
==
A = re("^([A-Za-z0-9+/]{4})*(()|[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)$");    
display(A);
display(A,1);
display(A,100);

The transition label () in the figures means that the transition is an epsilon-move.

Automata operations

It may sometimes be useful to minimize the automaton first. load in editor
==
A = re("^([A-Za-z0-9+/]{4})*(()|[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)$");    
M = minimize(A);        
display(M,100);
In general, other operations over SFAs are also available: intersect(A,B), minus(A,B) (difference), ~(A) (complement) was used above, determinize(A), and eliminateEpsilons(A).

base64decode

The following bek program base64decode decodes a base64 encoded sequence back to the original byte sequence. Any input that is not a valid encoding causes an exception.

load in editor
// D maps x to the following byte
function D(x)=(ite(x=='/',63,ite(x=='+',62,ite(x<='9',x+4,ite(x<='Z',x-65,x-71)))));
//base64 decoder
program base64decode(_){ 
  replace {
    "[a-zA-Z0-9+/]{4}"   ==> [(D(#0)<<2)|bits(5,4,D(#1)), (bits(3,0,D(#1))<<4)|bits(5,2,D(#2)), (bits(1,0,D(#2))<<6)|D(#3)];
    "[a-zA-Z0-9+/]{3}=$" ==> [(D(#0)<<2)|bits(5,4,D(#1)), (bits(3,0,D(#1))<<4)|bits(5,2,D(#2))];
    "[a-zA-Z0-9+/]{2}==$"==> [(D(#0)<<2)|bits(5,4,D(#1))];
  }
}
==
js(base64decode);

The decoder can be analyzed for various properties. One property is that all outputs produced by the decoder must be sequences of octets, i.e., no sequence may contain a value greater than 0xFF. load in editor

...
BYTES = regex(@"^[\0-\xFF]*$");               //all sequences of bytes
W = invimage(base64decode,complement(BYTES)); //inputs leading to ivalid outputs
isempty(W);                                   //check that w is empty

Functional correctness

The most fundamental property of the encoder and decoder, when combined together, is their functional correctness: the decoder must be the inverse of the encoder. In other words, the following property must hold: In general, this kind of property is very hard to prove. In Bex we can express it as the following property because we can compose bek programs and reason about their equivalence. Let ID be the identity function over all sequences. The above property is the same as the following property: load in editor
...
BYTES = regex(@"^[\0-\xFF]*$");            //domain of sequences of bytes
ID_BYTES = restrict(ID,BYTES);             //identity over sequences of bytes
ed = join(base64encode, base64decode);     //composition, first encode then decode
eq(ID_BYTES, ed);                          //check equivalence
What is quite remarkable is that the analysis is bit-precise. Try to modify either the encoder or the decoder in any way that makes them wrong, and observe that that the analysis will produce counterexamples demonstrating that the equivalence fails.
Contact Us| Privacy & Cookies | Terms of Use | Trademarks| © 2021 Microsoft