Bex - guide

# Base64 encoding and decoding with Bex

The tutorial shows how base64 encoding (and decoding) routines can be implemented and analyzed in Bex. Base64 is the perfect example of how Bex can succinctly represent string encoders and decoders.

## What is base64 encoding?

Base64 encoding is used to encode binary data in order to transfer the data in textual format. The purpose of the encoding is to ensure that the text remains unmodified during transport. To this end the digits of the base64 encoding are chosen to be in the safe range of characters that are unmodified by other encodings. In base64, a number x, 0 <= x <= 63 , is mapped as the following base64 digit E(x) (although some variants exist, the base64 encoding here is the standard one):

1. If x in [0..25] then E(x) in ['A'..'Z'] , i.e., E(x)=x+65,
2. If x in [26..51] then E(x) in ['a'..'z'] , i.e., E(x)=x+71,
3. If x in [52..61] then E(x) in ['0'..'9'] , i.e., E(x)=x-4,
4. If x = 62 then E(x)='+',
5. If x = 63 then E(x)='/',

Binary data (sequence of bytes or octets), is encoded as follows. Let <bk,..,b1,b0> stand for the binary representation of a number whose i'th bit is bi. There are three cases.

• In the general case, three bytes are encoded as four base64 digits:
[<x7,x6,x5,x4,x3,x2,x1,x0>,<y7,y6,y5,y4,y3,y2,y1,y0>,<z7,z6,z5,z4,z3,z2,z1,z0>]
==>
[E(<x7,x6,x5,x4,x3,x2>),E(<x1,x0,y7,y6,y5,y4>),E(<y3,y2,y1,y0,z7,z6>),E(<z5,z4,z3,z2,z1,z0>)]

• The encoding uses a padding character '=' when the number of input bytes is not a mutiple of 3. When one final byte remains it is encoded as follows:
[<x7,x6,x5,x4,x3,x2,x1,x0>]
==>
[E(<x7,x6,x5,x4,x3,x2>),E(<x1,x0,0,0,0,0>),'=','=']

• When two final bytes remain, they are encoded as follows:
[<x7,x6,x5,x4,x3,x2,x1,x0>,<y7,y6,y5,y4,y3,y2,y1,y0>]
==>
[E(<x7,x6,x5,x4,x3,x2>),E(<x1,x0,y7,y6,y5,y4>),E(<y3,y2,y1,y0,0,0>),'=')]

Notice that the length of a valid base64 encoded sequence is a mutiple of 4.

## base64encode

The following bek program base64encode encodes a sequence of bytes in base64 format. Any input that is not a sequence of bytes (contains a value greater than 255) causes an exception.

// function E maps x to the following base64-digit and viceversa
function E(x)=(ite(x<=25,x+65,ite(x<=51,x+71,ite(x<=61,x-4,ite(x==62,'+','/')))));
// base64 encoder of an input of bytes, any value in the sequence > 0xFF raises an exception
// for example base64encode("Man") = "TWFu"
program base64encode(_){
replace {
@"[\0-\xFF]{3}"  ==> [E(#0>>2), E((Bits(1,0,#0)<<4)|Bits(7,4,#1)), E((Bits(3,0,#1)<<2)|Bits(7,6,#2)), E(Bits(5,0,#2))];
@"[\0-\xFF]{2}$" ==> [E(Bits(7,2,#0)), E((Bits(1,0,#0)<<4)|Bits(7,4,#1)), E(Bits(3,0,#1)<<2), '=']; @"[\0-\xFF]$"    ==> [E(Bits(7,2,#0)), E(Bits(1,0,#0)<<4), '=', '='];
}
}
==
js(base64encode);
display(base64encode);
The given bek program can be analyzed for various properties. We consider two sample properties.
1. The set of valid inputs (inputs not causing an exception), must be the set of all sequences of bytes. load in editor
...
BYTES = re(@"^[\0-\xFF]*$"); B = dom(base64encode); subset(BYTES, B); subset(B, BYTES);  2. The set of valid outputs may not contain characters other than base64 digits and the padding character. load in editor ... B64 = re(@"^[A-Za-z0-9+/=]*$"); //over-approximation of valid outputs
isempty(W);                     //check that W is empty

In the second property, we can strengthen the regex B64 to be more precise, because the padding character must only occur at the end, e.g., set the regex to re(@"^([A-Za-z0-9+/]*(()|=|==))$") and rerun the check. Try also re(@"^[A-Za-z0-9+/]*(=?)$") or any other regex that is too restrictive.

## base64 regex

The following regex describes all valid base64 encodings. Try it out in the editor! load in editor
==
B64 = re("^([A-Za-z0-9+/]{4})*(()|[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)$"); display(B64); ## display command display(A) command can be used to create an instance of the symbolic automaton A by choosing a single element satisfying each predicate. display(A,0) will show the full SFA. The command display(A) is short for display(A,0) for SFAs. Sometimes the full symbolic labels may be hard to view and a concrete instance is more informative. display(A,k), where k is positive, will show the transitions with up to k values (several consequtive values are shown as an interval). load in editor == A = re("^([A-Za-z0-9+/]{4})*(()|[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)$");
display(A);
display(A,1);
display(A,100);

The transition label () in the figures means that the transition is an epsilon-move.

## Automata operations

It may sometimes be useful to minimize the automaton first. load in editor
==
A = re("^([A-Za-z0-9+/]{4})*(()|[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)$"); M = minimize(A); display(M,100); In general, other operations over SFAs are also available: intersect(A,B), minus(A,B) (difference), ~(A) (complement) was used above, determinize(A), and eliminateEpsilons(A). ## base64decode The following bek program base64decode decodes a base64 encoded sequence back to the original byte sequence. Any input that is not a valid encoding causes an exception. load in editor // D maps x to the following byte function D(x)=(ite(x=='/',63,ite(x=='+',62,ite(x<='9',x+4,ite(x<='Z',x-65,x-71))))); //base64 decoder program base64decode(_){ replace { "[a-zA-Z0-9+/]{4}" ==> [(D(#0)<<2)|bits(5,4,D(#1)), (bits(3,0,D(#1))<<4)|bits(5,2,D(#2)), (bits(1,0,D(#2))<<6)|D(#3)]; "[a-zA-Z0-9+/]{3}=$" ==> [(D(#0)<<2)|bits(5,4,D(#1)), (bits(3,0,D(#1))<<4)|bits(5,2,D(#2))];
"[a-zA-Z0-9+/]{2}==$"==> [(D(#0)<<2)|bits(5,4,D(#1))]; } } == js(base64decode); The decoder can be analyzed for various properties. One property is that all outputs produced by the decoder must be sequences of octets, i.e., no sequence may contain a value greater than 0xFF. load in editor ... BYTES = regex(@"^[\0-\xFF]*$");               //all sequences of bytes
W = invimage(base64decode,complement(BYTES)); //inputs leading to ivalid outputs
isempty(W);                                   //check that w is empty


## Functional correctness

The most fundamental property of the encoder and decoder, when combined together, is their functional correctness: the decoder must be the inverse of the encoder. In other words, the following property must hold:
• For all byte sequences s, base64decode(base64encode(s)) = s.
In general, this kind of property is very hard to prove. In Bex we can express it as the following property because we can compose bek programs and reason about their equivalence. Let ID be the identity function over all sequences. The above property is the same as the following property:
• The composition join(base64encode,base64decode), say ed, is equivalent to ID restricted to BYTES.
...