Tailored compact CNF encoding for SHA-1

Share: 
Abstract

We present a crafted Conjunctive Normal Form (CNF) encoding for Secure Hash Algorithm (SHA-1). The encoding relies on representing n-operand modular addition as an adder network and its subsequent translation into CNF. The encoding is more compact than comparable crafted encodings published previously. Also, it is more compact than those encodings after state-of-the-art CNF pre-processing applied to them.

We would like to thank the JSAT editorial team and the reviewers for considering publishing our work. We concur that in the original form the paper does not qualify for the journal, notably, because of the absence of experiments measuring impact of our CNF optimizations on SAT solving performance. Following the review, we have measured solving time for a wider range of encoding methods using state of the art SAT solvers, with 4 types of constrains. We have found that while our encoding shows some improvement over known crafted alternatives, using CBMC as a more general method of translating SHA-1 C code to CNF, produces encoding which is while somewhat less compact is comparable or better than ours in terms of solving time. We believe this difference arises from encoding of adders within the adder network.

For the above reasons, we decided to publish the work as a less formal post as opposed to a paper. We believe our optimized encoding can be used to advance SAT-based research of SHA-1 and other cryptographic primitives, also is interesting as a source of benchmarks for CNF pre-processors and SAT solver techniques. Finally, we note that reversing unreduced SHA-1 function through SAT solving remains computationally intractable.

The original article was prepared in October 2018 and then revised in 2020 to include SAT solving performance measurements.

Introduction

There is a number of publications on encoding SHA-1 into a Conjunctive Normal Form (CNF) formula [7, 11, 13, 8, 14, 12]. SHA-1 pre-image attack translated into a satisfiability (SAT) problem is a well known source of progressively hard SAT instances. Modern SAT solvers can solve instances reduced to 21 rounds or full instances with up to 20 variable message bits, in trivial time [13, 12]. Solving time increases non-polynomially for more complex cases [13].

The traditional encoding approach for an algebraic or a Boolean function is to interpret it as Boolean circuit, then translate each operation-gate into an optimal set of CNF clauses [13]. Complexity of the translated formula is controlled through introduction of additional variables for gate outputs, known as Tseitin transformation [13].

We have noticed a number of inefficiencies in the encodings published in [13, 14]. These inefficiencies are related to encoding implementation rather than to the method itself. Notably, SHA-1 algorithm uses numerical constants as initial values and as a source of permutation at each round. A tailored encoding algorithm would recognise these constants to reduce the encoded operations and the number of clauses necessary to encode them.

In this work we describe techniques to produce a compact SHA-1 CNF representation. Our biggest contribution is the minimised translation of 5-operand addition into an adder network reduced through elimination of constants and duplicate inputs separately for each operation-gate. The described techniques can be used to encode other algebraic and Boolean functions. We introduce a tool which outputs our encoding.

Modern SAT solving leverages sophisticated pre-processing techniques to simplify input formulas and eliminate encoding inefficiencies [3, 9]. Therefore, to claim an improvement it is necessary to compare the improved encoding with pre-processed as well as original alternatives.

We measured solving performance for sets of randomly generated instances of our encoding and the crafted alternatives of both reduced and unreduced versions of SHA-1. In addition, we included encoding produced by translating SHA-1 C source code using CBMC [17]. We measured solving performance using Crytominisat, Kissat and MapleSAT which we believe implement state-of-the-art solving techniques. While our encoding shows some improvements over the alternative crafted alternatives, we found that CBMC encoding, while somewhat less compact, leads to comparable or better solving performance. Still, we believe that our optimized encodings may be useful for SAT-based analysis of SHA-1 and other cryptographic primitives, as well as are interesting as a source of minimalistic increasingly hard SAT bencharks.

The rest of this paper is structured as follows. After giving a short background on CNF and SHA-1 (Section Preliminaries), we discuss the encoding approach and techniques (Section Encoding). Then, we describe our implementation and compare our encoding with alternatives and their pre-processed versions. (Section Results). Finally, we present our conclusions in Section Conclusions}.

Preliminaries

In this section we introduce concepts deemed relevant to the remainder of the paper.

Conjunctive Normal Form

A Boolean formula ƒ(𝑥1…𝑥Ν) with Boolean parameters 𝑥i, i=1..Ν is said to be in Conjunctive Normal Form (CNF) if it is a conjunction of clauses cj, j=1..Μ. A clause cj is a disjunction of literals lj,k, k=1..Sj where each literal is either one of the variables 𝑥i or its negation ¬𝑥i.

A literal is pure if its negation is not present in any clause.

Unit clause is a clause with one literal. Respectively, binary clause contains two literals. A clause is a tautology if it contains both a variable and its negation. A clause cj is said to be subsumed by clause ck if ck literals is a subset of cj literals. Finally, k-CNF is a formula with clauses containing at most k literals.

SHA-1

A hash function is an algorithm which given an arbitrarily-sized input message produces a fixed length bit string referred to as message digest or hash. SHA-1 [5] is a hash function which splits the message into 512 bit long blocks and produces 160 bit hash after 80 rounds of computation for each block. SHA-1 operates with 32-bit words. Computations include bitwise logical operations (and, or, xor, not), left, right and circular shift, and addition modulo 232. Round reduced SHA-1 is a variant of the algorithm with lower number of rounds.

In this work we consider CNF encoding for SHA-1 with a single message block input.

Circuits

A circuit is a directed acyclic graph. Its vertices are referred to as gates, edges connect gate outputs to inputs of other gates. In a Boolean circuit, gates represent Boolean functions, function parameters or operands are mapped to gate inputs, function result is mapped to the gate output.

Algebraic operations can be represented as circuits. An n-operand adder or n-operand counter is a circuit with n inputs and m outputs, n≤2m, which outputs a count of activated inputs in binary form. A full adder is a counter with 3 inputs and 2 outputs, a half adder has 2 inputs and 2 outputs respectively. A ripple carry adder is a Boolean circuit which outputs binary representation of the sum of two binary numbers. It is constructed by chaining full adders through connecting carry output of an adder with an input to the next more significant adder in the chain.

Encoding

In this section we describe our encoding approach and techniques.

Criteria

There are varying notions of what indicates optimality of CNF encoding. These include number of clauses and literals [3], also other properties that help SAT solver's performance, notably arc-consistency [4, 13]. Arc-consistency refers to consistency between the original and the translated formula, of propagating partial assignments. As our primary objective is to remove redundancies, we shall use number of literals as a measure of compactness. This measure is roughly proportional to quantity of computer memory needed to store the formula.

We shall use number of clauses as a secondary measure of compactness. With respect to the number of variables, we note that for any sufficiently complex CNF formula there exist other representations with lower number of variables and higher number of clauses [12]. This follows from application of the resolution rule to eliminate variables by introducing new clauses [3]. By "sufficiently complex" we refer to a formula where most of the variables produce few tautologies when eliminated.

Approach

Among the published SHA-1 encodings, all except one are produced using variants of the same "traditional" method [7, 13, 8, 14, 12]. CryptLogVer [11] relies on a proprietary commercial circuit synthesis package. Similarly to our work, a number of published encodings is compared in [12]. [13] complemented with [4] gives the most detailed structured analysis of the encoding approach, techniques and optimality, which is the reason we chose it as a basis for our work.

Therefore, we represent the algorithm as a Boolean circuit and apply Tseitin transformation during translation. SHA-1 words are interpreted as sequences of Boolean variables. Unless noted otherwise, we use CNF representation for operations and functions from [13] and [4]. The resulting formula is equisatisfiable with rather than is equivalent to the original algorithm, with variables that correspond to SHA-1 parameters remaining identifiable. In particular, an assignment of a subset of variables that correspond to bits of SHA-1 message block and hash is a satisfying assignment for the formula if and only if there exists a matching pair of SHA-1 message block and hash with the respective bit assignments.

In addition, we apply the following techniques separately for each bit of the SHA-1 word operations.

  1. Similarly to [14], we introduce auxiliary variables only when the result of the operation is not a constant and is not equal to one of the operands or its negation. Otherwise, we pass the constant or the literal respectively as an input to the subsequent operations. For example, for binary exclusive or:

    x x⊕0≡xx⊕1≡¬xxx≡0, x⊕¬x≡1

  2. SHA-1 includes 4-operand exclusive or and 5-operand addition. We reduce multi-operand operations where an operand is a constant, a duplicate or negation of another operand. For example, 24=16 clauses are required to encode exclusive or with 4 operands [13]. If two of the operands are the same variable, they cancel each other and the operation becomes binary, requiring 22=4 clauses only.
  3. We apply the above techniques to encode addition as described in the next subsection.
  4. If-then-else function ch(x, y, z) is encoded with 4 clauses rather than 6 as discussed in [4]:

    x ∨ ¬ych) ∧ (x ∨ ¬zch) ∧ (¬xy ∨ ¬ch) ∧ (xych)

Modular addition

5-operand 32 bit modular addition within the compression function is the most complex part of SHA-1, producing the bulk of clauses and auxiliary variables during encoding. However, we note that at least one of the operands is always a constant [5]. Furthermore, at the bit level it is possible that an operand duplicates or negates another one, particularly for reduced versions of SHA-1.

One simple encoding approach is to reduce the operation to several 2-operand additions [13]. However, this approach hides possible simplifications behind the auxiliary variables of each full adder. Another approach also tried in [13] is to use n-operand counters chained into a ripple carry adder extended to n operands. While this approach minimises number of variables, the number of clauses grows disproportionally as shown in Table 1. As a note, the first two columns correspond to half-adder (HA) and full-adder (FA) respectively.

Table 1: Size of encoding for regular and modified n-operand counters
operands/inputs 2 (HA) 3 (FA) 4 5 6
variables 2 2 3 3 3
clauses 6 10 24 46 88
clauses, with residual 1 6 13 23 47 88
max clause length 3 4 5 6 7

We extend the second approach from [13] as follows:

  1. We remove pairs of inputs set to constant 1 and pass instead an additional constant carry input to the subsequent circuit in the ripple carry structure. We process duplicate variables the same way. Where there is a variable and its negation, the two inputs are replaced with a constant 1.
  2. For each round of the ripple carry structure we choose between a modified n-operand counter and a regular one. The modified counter has an additional constant input equal to 1. We choose one of these counters depending on the residual constant value obtained after the earlier optimisations. Similarly to [13], we use espresso logic minimiser [2] to obtain CNF translation for each of the counters.
  3. We end up with a varying number of carry inputs for each counter. To limit disproportional growth of the translated encoding, we replace a single n-operand counter with a circuit very similar to the adder network discussed in [4]. In particular, we take the first 3 operands as inputs for the first full adder in the circuit. We feed the sum output as an input to the next adder of the current circuit, plus 2 more remaining operands. We feed carry output as an input to the next more significant counter circuit in the chain. We choose the last adder to be either full- or half- depending on the number of remaining operands and take a modified version if there is a residual constant 1.

Size of encoding for one counter within a 4-operand ripple-carry adder, assuming it is synthesised from reduced counters/adders as discussed, is shown in Table 2.

Table 2: Size of encoding for one counter of a 4-operand ripple-carry adder synthesised from n-operand counters
max counter operands, n 3 (FA) 4 5 6
variables 6 8 5 3
clauses 30 58 56 88

For example, the counter can be synthesised from 3 full adders as shown on Figure 1. There are 4 inputs x1x4, 3 carry in and 3 carry out operands; s is the resulting lower bit of the sum. Translation of this circuit gives 2*3=6 additional variables (cout, 1, saux, 1, cout, 2, saux, 2, cout, 3 and s) and 10*3=30 clauses. Modification of the same circuit with the additional constant input 1 produces second carry output from the last adder. This output becomes an additional input into subsequent round of the ripple carry structure, requiring an additional half-adder and increasing number of clauses to 36. Alternatively, a single 6-operand counter can be used (the last column in Table 1 and Table 2) and encoded into CNF with 3 additional variables and 88 clauses.

Figure 1: 6-input counter of a 4-operand ripple carry adder synthesised from 3 full-adders.

Summarising, we note that:

  1. Synthesising the network from full adders minimises number of clauses while a single 6-operand counter minimises number of variables (Table 1).
  2. There is more than one way to connect full- and half-adders to synthesise an n-operand counter. In our example, one may connect FA1 output to FA3 instead of FA2. Since the choice of topology does not effect number of CNF clauses or auxiliary variables, we chose the simple "sequential" design as shown on Figure 1. At the same time, studying the effect of particular topology on other properties of the encoding can be a task for future work.
  3. Adder network synthesised from half adders is impractical as number of required half adders and carry outputs grows progressively.
  4. Unless synthesised from smaller ones, counters with more inputs are translated into longer CNF clauses (Table 1); therefore, number of literals grows at least proportionally to number of clauses as number of inputs increases.

Results

Our encoding is implemented in a tool we named CGen. The source code and usage instructions are available from github [18]. CGen outputs the resulting CNF formula into a file in DIMACS format [15]. The tool allows producing both full and round reduced SHA-1 CNF instances, also instances with assignment of message and hash bits.

Compactness

Our compact encoding is generated with counters composed of full adders. Also, we limit exclusive or gates to 3 inputs to minimise number of both literals and clauses as well. Maximal clause length within this encoding is 4 and we refer to it as CGen 4-CNF. Also, for the purpose of comparison we produce an encoding with minimal number of variables. It is generated from 6-operand counters with CNF representation minimised using espresso. We refer to it as CGen 7-CNF.

Empirically, we can confirm that both encodings contain no:

  • unconstrained variables or pure literals; that is, each variable appears in at least two clauses in both unnegated and negated form;
  • duplicated clauses or tautologies;
  • unit clauses or subsumable clauses;
  • equivalent variables; for each pair of variables there is at most one binary clause since any pair of binary clauses with the same variables can be reduced to either an unary clause or the equivalence.

Therefore, the effect of our techniques is similar to that of pre-processing techniques discussed in [3, 9].

Comparison with alternative encodings

We have compiled a list of full SHA-1 encodings referenced in literature. This list is presented in Table Table 3 along with our versions. We obtained and verified the encodings discussed in [13] and [14]. Also, we used CBMC [17] to produce the additional version of the encoding from SHA-1 C code. We rely on the published numbers for the other encodings.

We split the encodings into two groups - those that appear to minimise overall size of the formula and those that minimise number of variables. In both groups, our encoding is the most compact. In particular, the second best encoding in the first group contains 13% more literals and 33% more clauses.

Also, we pre-processed our CGen 4-CNF encoding and the two available alternative encodings from the first group. We chose the following three SAT solvers to produce the simplified versions:

  • glucose [1] version 4.1, used with "-elim -asymm" options;
  • lingeling [6], version bcj, used with "-s" option;
  • cryptominisat [16] version 5.0.1, used with "-p1" option.

These solvers or their derivations participated in SAT Competition in recent years and represent distinct implementations. While in-depth analysis of preprocessing capabilities of each solver would go beyond scope of this work, we assume that the chosen solvers implement a reasonable variety of CNF pre-processing techniques.

In addition, we used Coprocessor [10] as the only published stand-alone relatively recent CNF preprocessor known to us. To run it, we used parameters recommended by its authors.

Table Table 3 shows sizes of the encodings along with pre-processed versions. It is evident that preprocessing compensates for significant part of the size difference. Still, our pre-processed encoding remains more compact than the second best with about 18% less literals and 15% less clauses. Notably, even the original version of our encoding remains more compact than the simplified alternatives, about 17% and 14% respectively.

Table 3: CGen and alternative SHA-1 CNF encodings
Description Variables Clauses Literals
optimising size of the formula
CGen 4-CNF 26,156 127,200 447,174
 + coprocessor 25,801 126,178 444,523
 + lingeling 25,741 126,024 444,156
 + cryptominisat 25,785 126,149 444,420
CBMC 33,993 169,385 507,305
Vegard Nossum [13] 56,108 223,551 685,403
 + glucose 29,683 153,924 523,361
 + coprocessor 28,089 155,031 540,556
 + lingeling 21,408 144,980 568,604
Transalg [14] 22,531 231,507 1,060,989
 + coprocessor 20,647 228,899 1,058,722
 + lingeling 20,038 222,757 1,029,171
 + cryptominisat 21,723 226,010 1,038,671
CryptLogVer [11], ~ 31,000 181,000 -
CryptLogVer [13] 44,812 248,220 -
Motara & Irvin [12] 64,901 350,673 -
optimising number of variables
CGen 7-CNF 12,722 255,715 1,445,131
Vegard Nossum [13] 13,408 478,476 3,138,508
Legendre et al. [8] 12,771 375,195 -

Comparing SAT solving performance

SAT solving performance varies depending on choice of SHA-1 hash and message constrains, also on number of rounds. We chose to measure performance for 4 scenarious, 2 round reduced and 2 unreduced versions of the algorithm, with minimal and maximal number of bits fixed respectively just enough for the problem to become non-trivial. For each scenario, we generated batches of 100 instances for all available encodings. We used the same pairs of constrains for each batch, that is, we fixed the same combinations of message and hash bits for each version of the encoding. The bits were fixed by appending respective unit clauses to each instance. Values of the fixed bits as well as fixed positions were chosen randomly from uniform distribution. Finally, we used 3 state-of-the-art solvers to run the experiment: Cryptominisat (CMS) (5.8.0) [16], Kissat (1.0.3) [19] and MapleSAT [20]. The experements were carried out on a server with Intel Xeon E5-1630 v4 @ 3.70GHz CPU and 64 Gb RAM.

Tables 4-7 show mean time in seconds for each batch solved using each of the solvers.

Table 4: Mean time (s), 80 rounds, all hash and 496 message bits fixed
Encoding CMS Kissat MapleSAT
CGen-4 179.1 95.1 297.0
CBMC 180.4 82.8 317.2
[13] 273.1 153.7 508.4
[14] 259.7 135.4 385.7

Table 5: Mean time (s), 80 rounds, 5 hash and 0 message bits fixed
Encoding CMS Kissat MapleSAT
CGen-4 92.4 72.1 60.9
CBMC 1.2 1.4 4.0
[13] 9.9 1.1 8.1
[14] 7.5 1.4 7.9

Table 6: Mean time (s), 22 rounds, all hash and 476 message bits fixed
Encoding CMS Kissat MapleSAT
CGen-4 64.7 31.0 144.8
CBMC 60.3 37.4 97.8
[13] 63.9 39.3 263.9
[14] 81.0 38.1 266.9

Table 7: Mean time (s), 21 rounds, all hash and 0 message bits fixed
Encoding CMS Kissat MapleSAT
CGen-4 32.9 37.7 34.6
CBMC 39.0 43.7 37.3
[13] 56.3 60.1 78.9
[14] 46.7 73.7 64.3

Given large coefficient of variation of the mean solving time and the multiple comparisons, the presented results are not statistically significant. To obtain statistical significance, a much large experiment would be necessary and the comparison criteria would need to be reviewed, for example, substituting comparison of the means with the sign test. However, in our opinion these results are sufficient to demonstrate that our encoding is likely better then the other 2 crafted encodings in 2 scenarious out of 4 while it is likely worse than all other ones considered in 1 scenario. Overall, we conclude that while we see some improvement over other crafted encodings, the CBMC encoding is likely the same or better than CGen-4.

Conclusions

Our motivation for this work was to obtain a CNF formula without identifiable redundancies, to be used for further research. We produced an encoding that is substantially more compact than the alternatives published to date. It is engineered with the goal of minimising size of the formula rather than maximising its arc-consistency. Different optimisation criteria may be the reason a similar encoding has not been produced before.

We note that it may be possible to reduce the formula further. Circuit representation of the formula may contain more reducible redundancies, for example, chained exclusive or operations. Also, it is possible to group and encode several Boolean operations together, experimenting with different groupings. While we think that further reduction of magnitude similar to what we have achieved is unlikely, further optimisation can be the subject of future research.

Furthermore, we believe it is possible to improve solving performance through optimizing adder network decomposition into individual adders and further experiments with their encoding. In particular, the comparable alternatives including CBMC appear to reduce n-nary addition into a series of binary operations, then translate each operation into CNF clauses. We believe this approach can be improved. Also, we note that optimizing adders encoding using espresso[2] does not lead to optimal solving performance.

Finally, CNF pre-processors we used for the experiment are unable to reduce alternative encodings to match our size. Therefore, our encoding can be used as a benchmark for pre-processing.

Meanwhile, we suggest that our encoding and our techniques can be useful for SAT-based analysis of SHA-1 and other cryptographic primitives, as well as a source of progressively hard SAt benchmarks.

References

[1] Audemard, G., Simon, L., The Glucose SAT Solver (2016). http://www.labri.fr/perso/lsimon/glucose. Accessed 18 October 2018

[2] Brayton, R.K., Sangiovanni-Vincentelli, A.L., McMullen, C.T., Hachtel, G.D., Logic Minimization Algorithms for VLSI Synthesis. Kluwer Academic Publishers, Norwell, MA, USA (1984)

[3] Eén, N., Biere, A., Effective preprocessing in SAT through variable and clause elimination. In: F. Bacchus, T. Walsh (eds.) Theory and Applications of Satisfiability Testing, pp. 61--75. Springer Berlin Heidelberg, Berlin, Heidelberg (2005). doi:10.1007/11499107_5

[4] Eén, N., Sörensson, N., Translating pseudo-boolean constraints into {SAT}. Journal on Satisfiability, Boolean Modeling and Computation 2, 1--26 (2006)

[5] Information Technology Laboratory, FIPS PUB 180-4: Secure Hash Standard (SHS). National Institute of Standards and Technology, Gaithersburg, MD 20899-8900, USA (2015). doi:10.6028/NIST.FIPS.180-4

[6] Institute for Formal Models and Verification, Lingeling, Plingeling and Treengeling (2018). http://fmv.jku.at/lingeling Accessed 18 October 2018

[7] Jovanovic, D., Janicic, P., Logical analysis of hash functions. In: B. Gramlich (ed.) Frontiers of Combining Systems, pp. 200--215. Springer Berlin Heidelberg, Berlin, Heidelberg (2005). doi:10.1007/11559306_11

[8] Legendre, F., Dequen, G., Krajecki, M., Logical reasoning to detect weaknesses about SHA-1 and MD4/5. IACR Cryptology ePrint Archive 2014, 239 (2014)

[9] Manthey, N., Coprocessor -- a standalone {SAT} preprocessor. In: H. Tompits, S. Abreu, J. Oetsch, J. Puhrer, D. Seipel, M. Umeda, A. Wolf (eds.) Applications of Declarative Programming and Knowledge Management, pp. 297--304. Springer Berlin Heidelberg, Berlin, Heidelberg (2013). doi:10.1007/978-3-642-41524-1_18

[10] Mantley, N., The SAT Solving Package Riss (2015). http://tools.computational-logic.org/content/riss.php Accessed 18 October 2018

[11] Morawiecki, P., Srebrny, M., A SAT-based preimage analysis of reduced KECCAK hash functions. Information Processing Letters 113 (10), 285 (2010). doi:10.1016/j.ipl.2013.03.004

[12] Motara, Y., Irwin, B., SHA-1, SAT-solving, and CNF. In: S. van Loggerenberg, J. Laureles (eds.) Southern Africa Telecommunication Networks and Applications Conference (SATNAC), pp. 216--221 (2017)

[13] Nossum, V., SAT-based preimage attacks on SHA-1. Master's thesis, University of Oslo, Department of Informatics, Oslo, Norway (2012). http://urn.nb.no/URN:NBN:no-33622

[14] Otpuschennikov, I., Semenov, A., Gribanova, I., Zaikin, O., Kochemazov, S., Encoding cryptographic functions to SAT using Transalg system. ArXiv e-prints (2016)

[15] Satisfiability Application and Theory (SAT) e.V., SAT Competition 2009 Benchmark Submission Guidelines (2009). http://www.satcompetition.org/2009/format-benchmarks2009.html. Accessed 18 October 2018

[16] Soos, M., CryptoMiniSat 5 (2018). https://www.msoos.org/cryptominisat5. Accessed 18 October 2018

[17] The CBMC Homepage. https://www.cprover.org/cbmc. Accessed 19 October 2020

[18] CGen github page. https://github.com/vsklad/cgen. Accessed 19 October 2020

[19] Kissat SAT Solver. http://fmv.jku.at/kissat/. Accessed 19 October 2020

[20] MapleSAT: A Machine Learning based SAT Solver. https://sites.google.com/a/gsd.uwaterloo.ca/maplesat/maplesat. Accessed 19 October 2020