Introduction and motivation
Mixed Boolean-Arithmetic (MBA) expressions play a major role in the current state-of-the-art code obfuscation mechanisms.
In a nutshell, an MBA expression is composed of integer arithmetic operators, e.g. $(+, -, \times)$ and bitwise operators, e.g. $(\land, \lor, \oplus, \neg)$. MBA expressions can be leveraged to obfuscate the data-flow of code by iteratively applying rewriting rules and function identities that complicate (obfuscate) the initial expression while preserving its semantic behavior. This obfuscation approach is motivated by the fact that the combination of operators from these different fields do not interact well together: we have no rules (distributivity, factorization…) or general theory to deal with this mixing of operators.
Studying MBA expressions is exciting and terrifying at the same time. It offers a wide range of research paths to follow, many of which are barely explored yet. Academic literature is pretty scarce and not always approachable without a strong mathematics background. We aim to pave the ground, so more researchers get seduced with this field of study and make it advance both through theoretical contributions and practical tools.
While this MBA tour is mainly targeted to low-level security researchers and reverse engineers, anyone interested in software protection and the intersection of mathematics and computer science will benefit from it as well.
MBA expressions in obfuscation vs cryptography
The concept of MBA expressions appeared initially in the context of code obfuscation. However, a mixing of bitwise and arithmetic operators was already used in the context of cryptography, without being given a particular name. In this case, with the main goal of getting efficient, non-linear and complex interactions between operations. That being said, there is a key difference between the desired outcome in cryptography vs obfuscation.
In cryptography, the MBA expression is the direct result of the algorithm description. The resulting cryptosystem has to verify a set of properties (e.g. non-linearity, high algebraic degree) from a black-box point of view. The complex form of writing is directly related to some kind of intrinsic computational (semantic) complexity for the resulting function: one wants the inverse computation to be difficult to deduce (without knowing the key).
In obfuscation, the MBA expression is the result of rewriting iterations from a simpler expression which can have very simple black-box characteristics. There is no direct relation between the complex form of writing and any intrinsic computational (semantic) complexity of the resulting expression. On the contrary, when obfuscating simple expressions, one knows that the complex form of writing is related to a semantically simpler expression.
Through the course of this MBA tour, we will be focusing on MBA expressions in the context of code (de)obfuscation. Please also note that we will use interchangeably the terms of boolean and bitwise operators.
MBA definitions
It is always a good idea to define (or at least bound) the object of study that we want to address. Unfortunately, we lack a consensual formal definition for MBA expressions.
On the one hand, Zhou et al.1 propose an algebraic system which they call a Boolean-Arithmetic algebra (BA-algebra).
An algebraic system is a set (i.e. a bunch of elements) with some operations and relations defined on it.
This BA-algebra includes the set of $n$-bit values accompanied by usual boolean and arithmetic operators as well as inequalities. On top of that, they define MBA expressions as functions mapping (through evaluation) $t$ variables in this BA-algebra into a new element of the BA-algebra.
We will not dive any deeper in this definition, as we deliberately exclude from our study the subject of MBA inequalities. This is because we are interested in deobfuscating (simplifying) MBA expressions. However, an inequality is an assertion that returns a true or false value, which changes the problem to be handled.
In terms of notation, we will be somewhat more flexible in our definition of MBA expressions. In the same fashion as Eyrolles2, we choose to define MBA expressions by explicitly describing the different building blocks (operators) that compose them and how they are bundled together, instead of defining them as functions linked to the definition of a BA-algebra.
Despite that, it is important to remark that we will work on $n$-bit values considered at the same time as elements of different mathematical structures.
A mathematical structure is a set endowed with some additional features on it (e.g. operations).
For example, standard arithmetic operations are considered in $(\mathbb{Z}/2^n\mathbb{Z}, +, \times)$ (i.e. the $n$-bit values are treated as integers modulo $2^n$ with usual arithmetic operators) while bitwise operations belong to $(\{0, 1\}^n , \land, \lor, \neg)$ or $(\{0, 1\}^n , \land, \oplus)$ (i.e. the $n$-bit values accompanied by different collections of boolean operators).
Whether defined as functions on some algebraic system or by specifying the actual boolean and arithmetic operators in use, two main categorizations are considered and studied in literature: polynomial MBA expressions and linear MBA expressions.
Polynomial MBA expressions
An expression E of the form $$ E = {\color{red}\sum_{i\in I}} {\color{blue}a_i} \cdot \bigg({\color{green}\prod_{j\in J_{i}}} {\color{purple} e_{i,j}(}{\color{orange} x_{1},\dots,x_{t}}{\color{purple})}\bigg) $$ where the arithmetic sum and product are modulo $2^n$, $a_i$ are constants in $\mathbb{Z}/2^n\mathbb{Z}$, $e_{i,j}$ are bitwise expressions of variables $x_{1},\dots,x_{t}$ in $\{0,1\}^n$, $I \subset \mathbb{Z}$ and for all $i \in I$, $J_i \subset \mathbb{Z}$ are finite index sets, is a polynomial MBA expression.
It is easy to get lost in the formal notation, but do not be afraid. Although it might feel intimidating, the idea is rather simple:
- A polynomial MBA expression $E$ consists of a
sum of terms , each one composed by an$n$-bit constant $a_i$ times theproduct of severalbitwise expressions ona number $t$ of $n$-bit variables . - The number of terms is accounted in the $I$ set. The $J_i$ set accounts for the number of bitwise expressions ($j$) being multiplied together for each term ($i$).
Let’s break down an example that will help to visualize and clarify the definition.
Example: polynomial MBA expression
$$ \begin{align}E &= {\color{red}\underline{\strut {\color{blue}\underline{\strut {\color{white}\underline{\strut {\color{white}\underline{\strut \textcolor{black}{43}}}}}}}\hspace{1pt}{\color{green}\underline{\strut {\color{purple}\underline{\strut \textcolor{black}{(}{\color{orange}\underline{\strut \textcolor{black}{x}}} \textcolor{black}{\land} {\color{orange}\underline{\strut \textcolor{black}{y}}} \textcolor{black}{\lor} {\color{orange}\underline{\strut \textcolor{black}{z}}}\textcolor{black}{)}}}\hspace{1pt}{\color{purple}\underline{\strut {\color{white}\underline{\strut \textcolor{black}{^2}}}}} \hspace{1pt}{\color{purple}\underline{\strut \textcolor{black}{((}{\color{orange}\underline{\strut \textcolor{black}{x}}} \textcolor{black}{\oplus} {\color{orange}\underline{\strut \textcolor{black}{y}}}\textcolor{black}{)} \textcolor{black}{\land} {\color{orange}\underline{\strut \textcolor{black}{z}}} \textcolor{black}{\lor} {\color{orange}\underline{\strut \textcolor{black}{t}}}\textcolor{black}{)}}}}}}} \\ &+ {\color{red}\underline{\strut {\color{blue}\underline{\strut {\color{white}\underline{\strut {\color{white}\underline{\strut \textcolor{black}{2}}}}}}}\hspace{1pt}{\color{green}\underline{\strut {\color{purple}\underline{\strut {\color{orange}\underline{\strut \textcolor{black}{x}}}}}}}}} \\ &+ {\color{red}\underline{\strut {\color{blue}\underline{\strut {\color{white}\underline{\strut {\color{white}\underline{\strut \textcolor{black}{123}}}}}}\hspace{1pt}{\color{green}\underline{\strut {\color{purple}\underline{\strut \textcolor{black}{(}{\color{orange}\underline{\strut \textcolor{black}{x}}} \textcolor{black}{\lor} {\color{orange}\underline{\strut \textcolor{black}{y}}}\textcolor{black}{)}}}\hspace{1pt}{\color{purple}\underline{\strut {\color{orange}\underline{\strut \textcolor{black}{z}}}}}\hspace{1pt}{\color{purple}\underline{\strut {\color{orange}\underline{\strut \textcolor{black}{t}}}}}\hspace{1pt}{\color{purple}\underline{\strut {\color{white}\underline{\strut \textcolor{black}{^2}}}}}}}}}}\end{align} $$ Note that this is an expression on $4$ variables $x,y,z,t$.
Observe that there are three terms being added together, i.e. $I = \{1,2,3\}$: $$ \begin{align}E_1&=43(x \land y \lor z)^2 ((x \oplus y) \land z \lor t)\\E_2&=2x\\E_3&=123(x \lor y)zt^2\end{align} $$ The first term $E_1$ is composed by the $n$-bit constant $a_1$ times the product of three bitwise expressions, i.e. $J_1 = \{1,2,3\}$: $$ \begin{align}a_1 &= 43\\e_{1,1}(x, y, z, t) &= x \land y \lor z\\e_{1,2}(x, y, z, t) &= x \land y \lor z\\e_{1,3}(x, y, z, t) &= (x \oplus y) \land z \lor t\end{align} $$
The second term $E_2$ is composed by the $n$-bit constant $a_2$ times one bitwise expression, i.e. $J_2 = \{1\}$: $$ \begin{align}a_2 &= 2\\e_{2,1}(x, y, z, t) &= x\end{align} $$
The third term $E_3$ is composed by the $n$-bit constant $a_3$ times the product of four bitwise expressions, i.e. $J_3 =\{1,2,3,4\}$: $$ \begin{align}a_3 &= 123\\e_{3,1}(x, y, z, t) &= x \lor y\\e_{3,2}(x, y, z, t) &= z\\e_{3,3}(x, y, z, t) &= t\\e_{3,4}(x, y, z, t) &= t\end{align} $$
Linear MBA expressions
A polynomial MBA expression of the form $$ E = {\color{red}\sum_{i\in I}} {\color{blue}a_i} \cdot {\color{purple}e_i(}{\color{orange}x_{1},\dots,x_{t}}{\color{purple})} $$ is called a linear MBA expression.
As indicated by the definition, linear MBA expressions are a particular class of polynomial MBA expressions. They are defined by imposing just one bitwise expression for each term instead of a product of an arbitrary number of them.
In practice, you can vaguely think of linearity as a restriction not allowing variables to end up being multiplied together.
Example: linear MBA expression
$$ E = {\color{red}\underline{\strut {\color{purple}\underline{\strut \textcolor{black}{(}{\color{orange}\underline{\strut \textcolor{black}{x}}} \textcolor{black}{\oplus} {\color{orange}\underline{\strut \textcolor{black}{y}}}\textcolor{black}{)}}}}} \textcolor{black}{+} {\color{red}\underline{\strut {\color{blue}\underline{\strut {\color{white}\underline{\strut \textcolor{black}{2}}}}} \hspace{1pt} {\color{purple}\underline{\strut \textcolor{black}{(}{\color{orange}\underline{\strut \textcolor{black}{x}}} \textcolor{black}{\land} {\color{orange}\underline{\strut \textcolor{black}{y}}}\textcolor{black}{)}}}}} $$
Note that this is an expression on $2$ variables $x,y$.
Observe that there are two terms being added together, i.e. $I = \{1,2\}$: $$ \begin{align}E_1 &= x \oplus y\\E_2 &= 2 (x \land y)\end{align} $$
The first term $E_1$ is composed by the $n$-bit constant $a_1$ times the bitwise expression $e_1(x, y)$: $$ \begin{align}a_1 &= 1\\e_1(x,y) &= x \oplus y\end{align} $$
The second term $E_2$ is composed by the $n$-bit constant $a_2$ times the bitwise expression $e_2(x, y)$: $$ \begin{align}a_2 &= 2\\e_2(x,y) &= x \land y\end{align} $$
Notice that, assuming variables of the same bit size, this MBA expression simplifies to $E_{simp} = x + y$. Namely, $E$ is a more complex expression than $E_{simp}$ syntactically speaking, but they are semantically equivalent.
In this case, we can easily verify this equivalence with an SMT solver. Let’s quickly check it out using Z3 through its python API. For the sake of simplicity, we will set $x$ and $y$ as $8$-bit variables, tough other bit size values would work as well.
|
|
$ python eq.py
proved
Final notes
The previous example already suggests a basic obfuscation idea: we could replace the sum of two variables in our code by the more complex expression involving $XOR$ and $AND$ boolean operators, while preserving the code semantics. But do not get ahead of time. We will revisit and expand on this obfuscation idea during the next tour stops.
In this first part, we have presented the fundamental ideas and definitions regarding MBA expressions, delimiting its study to the context of code (de)obfuscation. In the next part, we will explore the main mechanisms to obfuscate expressions by leveraging MBA rewriting rules and insertion of identities.
If you are interested and want to know more, make sure to follow us to stay informed of subsequent articles. You might also want to check out our training An Analytical approach to Modern Binary Deobfuscation, which contains a whole section devoted to the study, analysis and applications of MBA expressions.
-
Yongxin Zhou et al. “Information Hiding in Software with Mixed Boolean-Arithmetic Transforms”. In: Information Security Applications. Ed. by SehunKim, Moti Yung, and Hyung-Woo Lee. Berlin, Heidelberg: Springer BerlinHeidelberg, 2007, pp. 61–75. isbn: 978-3-540-77535-5. ↩︎
-
Ninon Eyrolles. “Obfuscation with Mixed Boolean-Arithmetic Expressions :reconstruction, analysis and simplification tools”. Theses. Université Paris-Saclay, June 2017. ↩︎