Sophisticated Ways

Tailored compact CNF encoding for SHA-1

By Volodymyr Skladanivskyy on October 21, 2020

Abstract

We present a crafted Conjunctive Normal Form (CNF) encoding for Secure Hash Algorithm (SHA-1). The encoding relies on representing n-operand modular addition as an adder network and its subsequent translation into CNF. The encoding is more compact than comparable crafted encodings published previously. Also, it is more compact than those encodings after state-of-the-art CNF pre-processing applied to them.

We would like to thank the JSAT editorial team and the reviewers for considering publishing our work. We concur that in the original form the paper does not qualify for the journal, notably, because of the absence of experiments measuring impact of our CNF optimizations on SAT solving performance. Following the review, we have measured solving time for a wider range of encoding methods using state of the art SAT solvers, with 4 types of constrains. We have found that while our encoding shows some improvement over known crafted alternatives, using CBMC as a more general method of translating SHA-1 C code to CNF, produces encoding which is while somewhat less compact is comparable or better than ours in terms of solving time. We believe this difference arises from encoding of adders within the adder network.

For the above reasons, we decided to publish the work as a less formal post as opposed to a paper. We believe our optimized encoding can be used to advance SAT-based research of SHA-1 and other cryptographic primitives, also is interesting as a source of benchmarks for CNF pre-processors and SAT solver techniques. Finally, we note that reversing unreduced SHA-1 function through SAT solving remains computationally intractable.

The original article was prepared in October 2018 and then revised in 2020 to include SAT solving performance measurements.

Introduction

There is a number of publications on encoding SHA-1 into a Conjunctive Normal Form (CNF) formula [7, 11, 13, 8, 14, 12]. SHA-1 pre-image attack translated into a satisfiability (SAT) problem is a well known source of progressively hard SAT instances. Modern SAT solvers can solve instances reduced to 21 rounds or full instances with up to 20 variable message bits, in trivial time [13, 12]. Solving time increases non-polynomially for more complex cases [13].

The traditional encoding approach for an algebraic or a Boolean function is to interpret it as Boolean circuit, then translate each operation-gate into an optimal set of CNF clauses [13]. Complexity of the translated formula is controlled through introduction of additional variables for gate outputs, known as Tseitin transformation [13].

We have noticed a number of inefficiencies in the encodings published in [13, 14]. These inefficiencies are related to encoding implementation rather than to the method itself. Notably, SHA-1 algorithm uses numerical constants as initial values and as a source of permutation at each round. A tailored encoding algorithm would recognise these constants to reduce the encoded operations and the number of clauses necessary to encode them.

In this work we describe techniques to produce a compact SHA-1 CNF representation. Our biggest contribution is the minimised translation of 5-operand addition into an adder network reduced through elimination of constants and duplicate inputs separately for each operation-gate. The described techniques can be used to encode other algebraic and Boolean functions. We introduce a tool which outputs our encoding.

Modern SAT solving leverages sophisticated pre-processing techniques to simplify input formulas and eliminate encoding inefficiencies [3, 9]. Therefore, to claim an improvement it is necessary to compare the improved encoding with pre-processed as well as original alternatives.

We measured solving performance for sets of randomly generated instances of our encoding and the crafted alternatives of both reduced and unreduced versions of SHA-1. In addition, we included encoding produced by translating SHA-1 C source code using CBMC [17]. We measured solving performance using Crytominisat, Kissat and MapleSAT which we believe implement state-of-the-art solving techniques. While our encoding shows some improvements over the alternative crafted alternatives, we found that CBMC encoding, while somewhat less compact, leads to comparable or better solving performance. Still, we believe that our optimized encodings may be useful for SAT-based analysis of SHA-1 and other cryptographic primitives, as well as are interesting as a source of minimalistic increasingly hard SAT bencharks.

The rest of this paper is structured as follows. After giving a short background on CNF and SHA-1 (Section Preliminaries), we discuss the encoding approach and techniques (Section Encoding). Then, we describe our implementation and compare our encoding with alternatives and their pre-processed versions. (Section Results). Finally, we present our conclusions in Section Conclusions}.

Preliminaries

In this section we introduce concepts deemed relevant to the remainder of the paper.

Conjunctive Normal Form

A Boolean formula ƒ(𝑥₁…𝑥_Ν) with Boolean parameters 𝑥_i, i=1..Ν is said to be in Conjunctive Normal Form (CNF) if it is a conjunction of clauses c_j, j=1..Μ. A clause c_j is a disjunction of literals l_j,k, k=1..S_j where each literal is either one of the variables 𝑥_i or its negation ¬𝑥_i.

A literal is pure if its negation is not present in any clause.

Unit clause is a clause with one literal. Respectively, binary clause contains two literals. A clause is a tautology if it contains both a variable and its negation. A clause c_j is said to be subsumed by clause c_k if c_k literals is a subset of c_j literals. Finally, k-CNF is a formula with clauses containing at most k literals.

SHA-1

A hash function is an algorithm which given an arbitrarily-sized input message produces a fixed length bit string referred to as message digest or hash. SHA-1 [5] is a hash function which splits the message into 512 bit long blocks and produces 160 bit hash after 80 rounds of computation for each block. SHA-1 operates with 32-bit words. Computations include bitwise logical operations (and, or, xor, not), left, right and circular shift, and addition modulo 2³². Round reduced SHA-1 is a variant of the algorithm with lower number of rounds.

In this work we consider CNF encoding for SHA-1 with a single message block input.

Circuits

A circuit is a directed acyclic graph. Its vertices are referred to as gates, edges connect gate outputs to inputs of other gates. In a Boolean circuit, gates represent Boolean functions, function parameters or operands are mapped to gate inputs, function result is mapped to the gate output.

Algebraic operations can be represented as circuits. An n-operand adder or n-operand counter is a circuit with n inputs and m outputs, n≤2^m, which outputs a count of activated inputs in binary form. A full adder is a counter with 3 inputs and 2 outputs, a half adder has 2 inputs and 2 outputs respectively. A ripple carry adder is a Boolean circuit which outputs binary representation of the sum of two binary numbers. It is constructed by chaining full adders through connecting carry output of an adder with an input to the next more significant adder in the chain.

Encoding

In this section we describe our encoding approach and techniques.

Criteria

There are varying notions of what indicates optimality of CNF encoding. These include number of clauses and literals [3], also other properties that help SAT solver's performance, notably arc-consistency [4, 13]. Arc-consistency refers to consistency between the original and the translated formula, of propagating partial assignments. As our primary objective is to remove redundancies, we shall use number of literals as a measure of compactness. This measure is roughly proportional to quantity of computer memory needed to store the formula.

We shall use number of clauses as a secondary measure of compactness. With respect to the number of variables, we note that for any sufficiently complex CNF formula there exist other representations with lower number of variables and higher number of clauses [12]. This follows from application of the resolution rule to eliminate variables by introducing new clauses [3]. By "sufficiently complex" we refer to a formula where most of the variables produce few tautologies when eliminated.

Approach

Among the published SHA-1 encodings, all except one are produced using variants of the same "traditional" method [7, 13, 8, 14, 12]. CryptLogVer [11] relies on a proprietary commercial circuit synthesis package. Similarly to our work, a number of published encodings is compared in [12]. [13] complemented with [4] gives the most detailed structured analysis of the encoding approach, techniques and optimality, which is the reason we chose it as a basis for our work.

Therefore, we represent the algorithm as a Boolean circuit and apply Tseitin transformation during translation. SHA-1 words are interpreted as sequences of Boolean variables. Unless noted otherwise, we use CNF representation for operations and functions from [13] and [4]. The resulting formula is equisatisfiable with rather than is equivalent to the original algorithm, with variables that correspond to SHA-1 parameters remaining identifiable. In particular, an assignment of a subset of variables that correspond to bits of SHA-1 message block and hash is a satisfying assignment for the formula if and only if there exists a matching pair of SHA-1 message block and hash with the respective bit assignments.

In addition, we apply the following techniques separately for each bit of the SHA-1 word operations.

Similarly to [14], we introduce auxiliary variables only when the result of the operation is not a constant and is not equal to one of the operands or its negation. Otherwise, we pass the constant or the literal respectively as an input to the subsequent operations. For example, for binary exclusive or:
∀x x⊕0≡x, x⊕1≡¬x, x⊕x≡0, x⊕¬x≡1
SHA-1 includes 4-operand exclusive or and 5-operand addition. We reduce multi-operand operations where an operand is a constant, a duplicate or negation of another operand. For example, 2⁴=16 clauses are required to encode exclusive or with 4 operands [13]. If two of the operands are the same variable, they cancel each other and the operation becomes binary, requiring 2²=4 clauses only.
We apply the above techniques to encode addition as described in the next subsection.
If-then-else function ch(x, y, z) is encoded with 4 clauses rather than 6 as discussed in [4]:
(¬x ∨ ¬y ∨ ch) ∧ (x ∨ ¬z ∨ ch) ∧ (¬x ∨ y ∨ ¬ch) ∧ (x ∨ y ∨ ch)

Modular addition

5-operand 32 bit modular addition within the compression function is the most complex part of SHA-1, producing the bulk of clauses and auxiliary variables during encoding. However, we note that at least one of the operands is always a constant [5]. Furthermore, at the bit level it is possible that an operand duplicates or negates another one, particularly for reduced versions of SHA-1.

One simple encoding approach is to reduce the operation to several 2-operand additions [13]. However, this approach hides possible simplifications behind the auxiliary variables of each full adder. Another approach also tried in [13] is to use n-operand counters chained into a ripple carry adder extended to n operands. While this approach minimises number of variables, the number of clauses grows disproportionally as shown in Table 1. As a note, the first two columns correspond to half-adder (HA) and full-adder (FA) respectively.

Table 1: Size of encoding for regular and modified n-operand counters
operands/inputs	2 (HA)	3 (FA)	4	5	6
variables	2	2	3	3	3
clauses	6	10	24	46	88
clauses, with residual 1	6	13	23	47	88
max clause length	3	4	5	6	7

We extend the second approach from [13] as follows:

We remove pairs of inputs set to constant 1 and pass instead an additional constant carry input to the subsequent circuit in the ripple carry structure. We process duplicate variables the same way. Where there is a variable and its negation, the two inputs are replaced with a constant 1.
For each round of the ripple carry structure we choose between a modified n-operand counter and a regular one. The modified counter has an additional constant input equal to 1. We choose one of these counters depending on the residual constant value obtained after the earlier optimisations. Similarly to [13], we use espresso logic minimiser [2] to obtain CNF translation for each of the counters.
We end up with a varying number of carry inputs for each counter. To limit disproportional growth of the translated encoding, we replace a single n-operand counter with a circuit very similar to the adder network discussed in [4]. In particular, we take the first 3 operands as inputs for the first full adder in the circuit. We feed the sum output as an input to the next adder of the current circuit, plus 2 more remaining operands. We feed carry output as an input to the next more significant counter circuit in the chain. We choose the last adder to be either full- or half- depending on the number of remaining operands and take a modified version if there is a residual constant 1.

Size of encoding for one counter within a 4-operand ripple-carry adder, assuming it is synthesised from reduced counters/adders as discussed, is shown in Table 2.

Table 2: Size of encoding for one counter of a 4-operand ripple-carry adder synthesised from n-operand counters
max counter operands, `n`	3 (FA)	4	5	6
variables	6	8	5	3
clauses	30	58	56	88

For example, the counter can be synthesised from 3 full adders as shown on Figure 1. There are 4 inputs x₁…x₄, 3 carry in and 3 carry out operands; s is the resulting lower bit of the sum. Translation of this circuit gives 2*3=6 additional variables (c_{out, 1}, s_{aux, 1}, c_{out, 2}, s_{aux, 2}, c_{out, 3} and s) and 10*3=30 clauses. Modification of the same circuit with the additional constant input 1 produces second carry output from the last adder. This output becomes an additional input into subsequent round of the ripple carry structure, requiring an additional half-adder and increasing number of clauses to 36. Alternatively, a single 6-operand counter can be used (the last column in Table 1 and Table 2) and encoded into CNF with 3 additional variables and 88 clauses.

Figure 1: 6-input counter of a 4-operand ripple carry adder synthesised from 3 full-adders.

Summarising, we note that:

Synthesising the network from full adders minimises number of clauses while a single 6-operand counter minimises number of variables (Table 1).
There is more than one way to connect full- and half-adders to synthesise an n-operand counter. In our example, one may connect FA1 output to FA3 instead of FA2. Since the choice of topology does not effect number of CNF clauses or auxiliary variables, we chose the simple "sequential" design as shown on Figure 1. At the same time, studying the effect of particular topology on other properties of the encoding can be a task for future work.
Adder network synthesised from half adders is impractical as number of required half adders and carry outputs grows progressively.
Unless synthesised from smaller ones, counters with more inputs are translated into longer CNF clauses (Table 1); therefore, number of literals grows at least proportionally to number of clauses as number of inputs increases.

Results

Our encoding is implemented in a tool we named CGen. The source code and usage instructions are available from github [18]. CGen outputs the resulting CNF formula into a file in DIMACS format [15]. The tool allows producing both full and round reduced SHA-1 CNF instances, also instances with assignment of message and hash bits.

Compactness

Our compact encoding is generated with counters composed of full adders. Also, we limit exclusive or gates to 3 inputs to minimise number of both literals and clauses as well. Maximal clause length within this encoding is 4 and we refer to it as CGen 4-CNF. Also, for the purpose of comparison we produce an encoding with minimal number of variables. It is generated from 6-operand counters with CNF representation minimised using espresso. We refer to it as CGen 7-CNF.

Empirically, we can confirm that both encodings contain no:

unconstrained variables or pure literals; that is, each variable appears in at least two clauses in both unnegated and negated form;
duplicated clauses or tautologies;
unit clauses or subsumable clauses;
equivalent variables; for each pair of variables there is at most one binary clause since any pair of binary clauses with the same variables can be reduced to either an unary clause or the equivalence.

Therefore, the effect of our techniques is similar to that of pre-processing techniques discussed in [3, 9].

Comparison with alternative encodings

We have compiled a list of full SHA-1 encodings referenced in literature. This list is presented in Table Table 3 along with our versions. We obtained and verified the encodings discussed in [13] and [14]. Also, we used CBMC [17] to produce the additional version of the encoding from SHA-1 C code. We rely on the published numbers for the other encodings.

We split the encodings into two groups - those that appear to minimise overall size of the formula and those that minimise number of variables. In both groups, our encoding is the most compact. In particular, the second best encoding in the first group contains 13% more literals and 33% more clauses.

Also, we pre-processed our CGen 4-CNF encoding and the two available alternative encodings from the first group. We chose the following three SAT solvers to produce the simplified versions:

glucose [1] version 4.1, used with "-elim -asymm" options;
lingeling [6], version bcj, used with "-s" option;
cryptominisat [16] version 5.0.1, used with "-p1" option.

These solvers or their derivations participated in SAT Competition in recent years and represent distinct implementations. While in-depth analysis of preprocessing capabilities of each solver would go beyond scope of this work, we assume that the chosen solvers implement a reasonable variety of CNF pre-processing techniques.

In addition, we used Coprocessor [10] as the only published stand-alone relatively recent CNF preprocessor known to us. To run it, we used parameters recommended by its authors.

Table Table 3 shows sizes of the encodings along with pre-processed versions. It is evident that preprocessing compensates for significant part of the size difference. Still, our pre-processed encoding remains more compact than the second best with about 18% less literals and 15% less clauses. Notably, even the original version of our encoding remains more compact than the simplified alternatives, about 17% and 14% respectively.

Table 3: CGen and alternative SHA-1 CNF encodings
Description	Variables	Clauses	Literals
optimising size of the formula
CGen 4-CNF	26,156	127,200	447,174
+ coprocessor	25,801	126,178	444,523
+ lingeling	25,741	126,024	444,156
+ cryptominisat	25,785	126,149	444,420
CBMC	33,993	169,385	507,305
Vegard Nossum [13]	56,108	223,551	685,403
+ glucose	29,683	153,924	523,361
+ coprocessor	28,089	155,031	540,556
+ lingeling	21,408	144,980	568,604
Transalg [14]	22,531	231,507	1,060,989
+ coprocessor	20,647	228,899	1,058,722
+ lingeling	20,038	222,757	1,029,171
+ cryptominisat	21,723	226,010	1,038,671
CryptLogVer [11], ~	31,000	181,000	-
CryptLogVer [13]	44,812	248,220	-
Motara & Irvin [12]	64,901	350,673	-
optimising number of variables
CGen 7-CNF	12,722	255,715	1,445,131
Vegard Nossum [13]	13,408	478,476	3,138,508
Legendre et al. [8]	12,771	375,195	-

Comparing SAT solving performance

SAT solving performance varies depending on choice of SHA-1 hash and message constrains, also on number of rounds. We chose to measure performance for 4 scenarious, 2 round reduced and 2 unreduced versions of the algorithm, with minimal and maximal number of bits fixed respectively just enough for the problem to become non-trivial. For each scenario, we generated batches of 100 instances for all available encodings. We used the same pairs of constrains for each batch, that is, we fixed the same combinations of message and hash bits for each version of the encoding. The bits were fixed by appending respective unit clauses to each instance. Values of the fixed bits as well as fixed positions were chosen randomly from uniform distribution. Finally, we used 3 state-of-the-art solvers to run the experiment: Cryptominisat (CMS) (5.8.0) [16], Kissat (1.0.3) [19] and MapleSAT [20]. The experements were carried out on a server with Intel Xeon E5-1630 v4 @ 3.70GHz CPU and 64 Gb RAM.

Tables 4-7 show mean time in seconds for each batch solved using each of the solvers.

Table 4: Mean time (s), 80 rounds, all hash and 496 message bits fixed
Encoding	CMS	Kissat	MapleSAT
CGen-4	179.1	95.1	297.0
CBMC	180.4	82.8	317.2
[13]	273.1	153.7	508.4
[14]	259.7	135.4	385.7

Table 5: Mean time (s), 80 rounds, 5 hash and 0 message bits fixed
Encoding	CMS	Kissat	MapleSAT
CGen-4	92.4	72.1	60.9
CBMC	1.2	1.4	4.0
[13]	9.9	1.1	8.1
[14]	7.5	1.4	7.9

Table 6: Mean time (s), 22 rounds, all hash and 476 message bits fixed
Encoding	CMS	Kissat	MapleSAT
CGen-4	64.7	31.0	144.8
CBMC	60.3	37.4	97.8
[13]	63.9	39.3	263.9
[14]	81.0	38.1	266.9

Table 7: Mean time (s), 21 rounds, all hash and 0 message bits fixed
Encoding	CMS	Kissat	MapleSAT
CGen-4	32.9	37.7	34.6
CBMC	39.0	43.7	37.3
[13]	56.3	60.1	78.9
[14]	46.7	73.7	64.3

Given large coefficient of variation of the mean solving time and the multiple comparisons, the presented results are not statistically significant. To obtain statistical significance, a much large experiment would be necessary and the comparison criteria would need to be reviewed, for example, substituting comparison of the means with the sign test. However, in our opinion these results are sufficient to demonstrate that our encoding is likely better then the other 2 crafted encodings in 2 scenarious out of 4 while it is likely worse than all other ones considered in 1 scenario. Overall, we conclude that while we see some improvement over other crafted encodings, the CBMC encoding is likely the same or better than CGen-4.

Conclusions

Our motivation for this work was to obtain a CNF formula without identifiable redundancies, to be used for further research. We produced an encoding that is substantially more compact than the alternatives published to date. It is engineered with the goal of minimising size of the formula rather than maximising its arc-consistency. Different optimisation criteria may be the reason a similar encoding has not been produced before.

We note that it may be possible to reduce the formula further. Circuit representation of the formula may contain more reducible redundancies, for example, chained exclusive or operations. Also, it is possible to group and encode several Boolean operations together, experimenting with different groupings. While we think that further reduction of magnitude similar to what we have achieved is unlikely, further optimisation can be the subject of future research.

Furthermore, we believe it is possible to improve solving performance through optimizing adder network decomposition into individual adders and further experiments with their encoding. In particular, the comparable alternatives including CBMC appear to reduce n-nary addition into a series of binary operations, then translate each operation into CNF clauses. We believe this approach can be improved. Also, we note that optimizing adders encoding using espresso[2] does not lead to optimal solving performance.

Finally, CNF pre-processors we used for the experiment are unable to reduce alternative encodings to match our size. Therefore, our encoding can be used as a benchmark for pre-processing.

Meanwhile, we suggest that our encoding and our techniques can be useful for SAT-based analysis of SHA-1 and other cryptographic primitives, as well as a source of progressively hard SAt benchmarks.

References

[1] Audemard, G., Simon, L., The Glucose SAT Solver (2016). http://www.labri.fr/perso/lsimon/glucose. Accessed 18 October 2018

[2] Brayton, R.K., Sangiovanni-Vincentelli, A.L., McMullen, C.T., Hachtel, G.D., Logic Minimization Algorithms for VLSI Synthesis. Kluwer Academic Publishers, Norwell, MA, USA (1984)

[3] Eén, N., Biere, A., Effective preprocessing in SAT through variable and clause elimination. In: F. Bacchus, T. Walsh (eds.) Theory and Applications of Satisfiability Testing, pp. 61--75. Springer Berlin Heidelberg, Berlin, Heidelberg (2005). doi:10.1007/11499107_5

[4] Eén, N., Sörensson, N., Translating pseudo-boolean constraints into {SAT}. Journal on Satisfiability, Boolean Modeling and Computation 2, 1--26 (2006)

[5] Information Technology Laboratory, FIPS PUB 180-4: Secure Hash Standard (SHS). National Institute of Standards and Technology, Gaithersburg, MD 20899-8900, USA (2015). doi:10.6028/NIST.FIPS.180-4

[6] Institute for Formal Models and Verification, Lingeling, Plingeling and Treengeling (2018). http://fmv.jku.at/lingeling Accessed 18 October 2018

[7] Jovanovic, D., Janicic, P., Logical analysis of hash functions. In: B. Gramlich (ed.) Frontiers of Combining Systems, pp. 200--215. Springer Berlin Heidelberg, Berlin, Heidelberg (2005). doi:10.1007/11559306_11

[8] Legendre, F., Dequen, G., Krajecki, M., Logical reasoning to detect weaknesses about SHA-1 and MD4/5. IACR Cryptology ePrint Archive 2014, 239 (2014)

[9] Manthey, N., Coprocessor -- a standalone {SAT} preprocessor. In: H. Tompits, S. Abreu, J. Oetsch, J. Puhrer, D. Seipel, M. Umeda, A. Wolf (eds.) Applications of Declarative Programming and Knowledge Management, pp. 297--304. Springer Berlin Heidelberg, Berlin, Heidelberg (2013). doi:10.1007/978-3-642-41524-1_18

[10] Mantley, N., The SAT Solving Package Riss (2015). http://tools.computational-logic.org/content/riss.php Accessed 18 October 2018

[11] Morawiecki, P., Srebrny, M., A SAT-based preimage analysis of reduced KECCAK hash functions. Information Processing Letters 113 (10), 285 (2010). doi:10.1016/j.ipl.2013.03.004

[12] Motara, Y., Irwin, B., SHA-1, SAT-solving, and CNF. In: S. van Loggerenberg, J. Laureles (eds.) Southern Africa Telecommunication Networks and Applications Conference (SATNAC), pp. 216--221 (2017)

[13] Nossum, V., SAT-based preimage attacks on SHA-1. Master's thesis, University of Oslo, Department of Informatics, Oslo, Norway (2012). http://urn.nb.no/URN:NBN:no-33622

[14] Otpuschennikov, I., Semenov, A., Gribanova, I., Zaikin, O., Kochemazov, S., Encoding cryptographic functions to SAT using Transalg system. ArXiv e-prints (2016)

[15] Satisfiability Application and Theory (SAT) e.V., SAT Competition 2009 Benchmark Submission Guidelines (2009). http://www.satcompetition.org/2009/format-benchmarks2009.html. Accessed 18 October 2018

[16] Soos, M., CryptoMiniSat 5 (2018). https://www.msoos.org/cryptominisat5. Accessed 18 October 2018

[17] The CBMC Homepage. https://www.cprover.org/cbmc. Accessed 19 October 2020

[18] CGen github page. https://github.com/vsklad/cgen. Accessed 19 October 2020

[19] Kissat SAT Solver. http://fmv.jku.at/kissat/. Accessed 19 October 2020

[20] MapleSAT: A Machine Learning based SAT Solver. https://sites.google.com/a/gsd.uwaterloo.ca/maplesat/maplesat. Accessed 19 October 2020

Visualizing SHA-1

By Volodymyr Skladanivskyy on November 19, 2018

Translating SHA-1 into CNF results in formulas that are too large for unaided human perception. My SHA-1 encoding while compact, is no exception. Naturally, I wanted to find ways to explore its properties and structure. Since good visualization goes a long way in helping to grasp complexity, I considered visualizing the CNF formula as a graph.

The resulting visualization shows SHA-1 structure in a surprisingly clear intuitive fashion. Furthermore, reduced versions demonstrate the increasing complexity of the algorithm.

Methodology

Back in time, Carsten Sinz and his colleagues at the University of Tübingen have done notable work translating SAT instances into graphs and visualizing them [1, 2, 3]. They built DPvis and 3Dvis, tools for the purpose. While DPvis is still available to download from Web Archive, it is outdated. More recently, similar work in the University of Waterloo resulted into SATGraf [4], a tool focused on analysing communities and visualizing evolution of SAT instances while solving.

With respect to a CNF formula, it is easy to construct a directed implication graph for binary clauses without loss of information. However, since majority of the clauses in our encoding contain 3 or 4 literals, no straightforward translation exists.

DPvis operates with variable dependency (interaction) graphs [1, 2, 3]. SATGraf is based on a similar concept of variable incidence graph (VIG) [4, 5]. As these concepts are sufficient for showing complexity and structure of the formula through connections between CNF variables, I produced undirected graphs as follows:

1. CNF variables are represented as vertices
2. If two variables appear within the same clause, the connection between them is represented as an undirected edge

This representation is "lossy", that is, the original CNF formula cannot be recovered from the graph.

In addition, I experimented with edge weights. I used VIG edge weight definition from [5]. In particular, the total weight introduced by a single clause to all edges is assumed to be 1, then the equivalent fraction is attributed to each variable pair. This approach emphasises short clauses and variable pairs that occur frequently. Empirically, including weights into visualization does not help to emphasise structural elements however, perhaps partially due to limitations of the chosen visualization tool.

In order to derive meaningful structural interpretation, I was particularly interested in CNF variables that have meaning within SHA-1: message, hash, message schedule and round variables. This is because a large number of auxiliary variables is introduced into the formula in order to contain its size during translation.

Information about SHA-1 variables is not present within the CNF formula itself. To preserve it while encoding, I used a bespoke extended version of DIMACS CNF format with "named variable" definitions recorded within comments. This functionality is implemented in CGen, the encoding tool I used to translate SHA-1 into CNF. Another important benefit of using CGen is that the formula size is optimised, resulting into a compact graph.

Thus, for each graph node associated with a bit of a SHA-1 variable I record the variable name and the bit index as its attributes. For the purpose of visualization, these attributes allow grouping and distinguishing SHA-1 variables using colour of the nodes, potentially their shape and placement as well.

I chose GraphML for storing the graph as it is a well known extensible format supported by many graph visualization tools. I decided to use Gephi as a visualization tool for its slightly better performance and usability, although Cytoscape includes comparable functionality as well.

CNF formula translation into GraphML as described above, is implemented in CGraph. The tool can be used to produce graph representation for any DIMACS CNF file even if it contains no original variable definitions.

Experiment

Empirically, I chose four SHA-1 visualizations:

1. A full 80 rounds SHA-1 CNF formula.
2. A reduced 20 rounds formula as an example of a SAT problem solvable by state-of-the-art SAT solvers in trivial time.
3. A reduced 25 rounds formula as an example of a SAT problem unsolvable by state-of-the-art SAT solvers in reasonable time.
4. Representation of a reduced 20 rounds formula with weighted edges.

In all instances CNF formula is produced for SHA-1 1-block message with both message and hash values unfixed. That is, there are 512 binary variables representing the message and 160 binary variables representing the hash. The first three instances do not include edge weights, that is, edge weight is assumed to be equivalent for all edges.

I chose the examples considering the difficulty of a SAT based pre-image attack: given a 160 bit hash, how easy it is to determine a matching SHA-1 message. In the second instance, the SAT problem is solvable within seconds using a consumer grade personal computer and a state-of-the-art SAT solver. In the third instance, the problem is unsolvable within many hours with the same setup.

Results

Figure 1 shows full 80 rounds SHA-1 visualization. It has been produced using Gephi Force Atlas 2 layout thanks to its speed and expressive power. The layout is based on a force-directed placement algorithm [6], I believe similar to what was used in earlier work.

I colour-coded vertices of the graph as follows:

message (M) nodes are red
message schedule (W) nodes are orange
round variables (F_i and A_i) are green
hash value (H) nodes are blue
all auxiliary nodes are light grey

In addition, edges are coloured using colour of their "source" nodes. Normally the source node will correspond to CNF variable introduced earlier in the flow of calculations; for example, a message node will give colour to many edges and a hash node will give colour to a few.

Visualization of the variable interaction graph representing CNF formula obtained from encoding full 80 rounds SHA-1

Figure 1. 80-rounds SHA-1

The image demonstrates structure of the algorithm in a surprisingly clear intuitive way. It is possible to make the following observations:

cylindrical shape of the figure reflects circular nature of the calculations (circular shift from round to round) combined with the sequence of rounds

rounds of calculations follow from the bottom to the top, there are 80 of them each one represented with a green "ring"

message bits (red) are included into the "lower" rounds and the hash bits (blue) are derived from the "higher” rounds

there are 32 longitudinal lines of message schedule bits (orange) representing their sequential derivation and inclusion into consecutive rounds

Figures 2 and 3 show the same visualization for the two reduced versions of SHA-1. The shapes and structure of the figures reflect the reduced number of rounds while leaving other elements unchanged as expected. It is notable that despite visually moderate difference of complexity and structure, there is such a tremendous difference in solving difficulty of the underlying SAT problems.

Visualization of the variable interaction graph representing CNF formula obtained from encoding 20 rounds reduced SHA-1

Figure 2. 20-rounds SHA-1

Visualization of the variable interaction graph representing CNF formula obtained from encoding 25 rounds reduced SHA-1

Figure 3. 25-rounds SHA-1

Finally, Figure 4 shows the same formula as presented on Figure 2, with weighted edges. Shades of blue are used to distinguish different weights: "heavier" edges are shown in darker colours. While certain structural elements appear more vividly (e.g. variables that occur together in clauses encoding 5-operand addition and xor), overall shape and structure becomes less clean and visually appealing, partially due to Gephi and the layout algorithm limitations.

Visualization of the weighted variable incidence graph representing CNF formula obtained from encoding 20 rounds reduced SHA-1, shades of blue represent edge weights

Figure 4. 20 rounds SHA-1, weighted edges

Conclusions

The presented visualizations may provide educational value to anyone with interest in cryptanalysis or SAT. Furthermore, it will be interesting to use the described visualization method as a tool for hypothesis discovery and research of SAT solving techniques.

For the purpose of reproducibility, all DIMACS CNF files, GraphML files and high-resolution images have been published in Harward Dataverse, doi:10.7910/DVN/L4NBYE.

References

Carsten Sinz, Visualizing the internal structure of SAT instances (preliminary report), In Proc. of the 7th Intl. Conf. on Theory and Applications of Satisfiability Testing (SAT 2004), Vancouver, Canada, May 2004.
Carsten Sinz and Edda-Maria Dieringer, DPvis - a tool to visualize structured SAT instances, In Proc. of the 8th Intl. Conf. on Theory and Applications of Satisfiability Testing (SAT 2005), pages 257-268, St. Andrews, UK, June 2005. Springer-Verlag.
Carsten Sinz. Visualizing SAT Instances and Runs of the DPLL Algorithm. J. Automated Reasoning, 39(2):219-243, 2007
Newsham, Zack and Lindsay, William and Ganesh, Vijay and Liang, Jia Hui and Fischmeister, Sebastian and Czarnecki, Krzysztof. SATGraf: Visualizing the Evolution of SAT Formula Structure in Solvers, In Proc. of the 18th Intl. Conf. on Theory and Applications of Satisfiability Testing (SAT 2015), pages 62-70, 2015, doi: 10.1007/978-3-319-24318-4_6
Giráldez Crú, Jesús, Levy Díaz, Jordi, dir., Gonzàlez i Sabaté, Jordi. Beyond the structure of SAT formulas. [Barcelona]: Universitat Autònoma de Barcelona, 2016. 1 recurs electrònic (145 p.). ISBN 9788449064234. Tesi doctoral - Universitat Autònoma de Barcelona. Departament de Ciències de la Computació, 2016 url: https://ddd.uab.cat/record/165949
Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, Mathieu Bastian. ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software, Published: June 10, 2014, doi: 10.1371/journal.pone.0098679