Visualizing SHA-1

Share: 
Translating SHA-1 into CNF results in formulas that are too large for unaided human perception. My SHA-1 encoding while compact, is no exception. Naturally, I wanted to find ways to explore its properties and structure. Since good visualization goes a long way in helping to grasp complexity, I considered visualizing the CNF formula as a graph.

The resulting visualization shows SHA-1 structure in a surprisingly clear intuitive fashion. Furthermore, reduced versions demonstrate the increasing complexity of the algorithm.

Methodology

Back in time, Carsten Sinz and his colleagues at the University of Tübingen have done notable work translating SAT instances into graphs and visualizing them [1, 2, 3]. They built DPvis and 3Dvis, tools for the purpose. While DPvis is still available to download from Web Archive, it is outdated. More recently, similar work in the University of Waterloo resulted into SATGraf [4], a tool focused on analysing communities and visualizing evolution of SAT instances while solving.

With respect to a CNF formula, it is easy to construct a directed implication graph for binary clauses without loss of information. However, since majority of the clauses in our encoding contain 3 or 4 literals, no straightforward translation exists.

DPvis operates with variable dependency (interaction) graphs [1, 2, 3]. SATGraf is based on a similar concept of variable incidence graph (VIG) [4, 5]. As these concepts are sufficient for showing complexity and structure of the formula through connections between CNF variables, I produced undirected graphs as follows:

1. CNF variables are represented as vertices
2. If two variables appear within the same clause, the connection between them is represented as an undirected edge

This representation is "lossy", that is, the original CNF formula cannot be recovered from the graph.

In addition, I experimented with edge weights. I used VIG edge weight definition from [5]. In particular, the total weight introduced by a single clause to all edges is assumed to be 1, then the equivalent fraction is attributed to each variable pair. This approach emphasises short clauses and variable pairs that occur frequently. Empirically, including weights into visualization does not help to emphasise structural elements however, perhaps partially due to limitations of the chosen visualization tool.

In order to derive meaningful structural interpretation, I was particularly interested in CNF variables that have meaning within SHA-1: message, hash, message schedule and round variables. This is because a large number of auxiliary variables is introduced into the formula in order to contain its size during translation.

Information about SHA-1 variables is not present within the CNF formula itself. To preserve it while encoding, I used a bespoke extended version of DIMACS CNF format with "named variable" definitions recorded within comments. This functionality is implemented in CGen, the encoding tool I used to translate SHA-1 into CNF. Another important benefit of using CGen  is that the formula size is optimised, resulting into a compact graph.

Thus, for each graph node associated with a bit of a SHA-1 variable I record the variable name and the bit index as its attributes. For the purpose of visualization, these attributes allow grouping and distinguishing SHA-1 variables using colour of the nodes, potentially their shape and placement as well.

I chose GraphML for storing the graph as it is a well known extensible format supported by many graph visualization tools. I decided to use Gephi  as a visualization tool for its slightly better performance and usability, although Cytoscape  includes comparable functionality as well.

CNF formula translation into GraphML as described above, is implemented in CGraph.  The tool can be used to produce graph representation for any DIMACS CNF file even if it contains no original variable definitions.

Experiment

Empirically, I chose four SHA-1 visualizations:

1. A full 80 rounds SHA-1 CNF formula.
2. A reduced 20 rounds formula as an example of a SAT problem solvable by state-of-the-art SAT solvers in trivial time.
3. A reduced 25 rounds formula as an example of a SAT problem unsolvable by state-of-the-art SAT solvers in reasonable time.
4. Representation of a reduced 20 rounds formula with weighted edges.

In all instances CNF formula is produced for SHA-1 1-block message with both message and hash values unfixed. That is, there are 512 binary variables representing the message and 160 binary variables representing the hash. The first three instances do not include edge weights, that is, edge weight is assumed to be equivalent for all edges.

I chose the examples considering the difficulty of a SAT based pre-image attack: given a 160 bit hash, how easy it is to determine a matching SHA-1 message. In the second instance, the SAT problem is solvable within seconds using a consumer grade personal computer and a state-of-the-art SAT solver. In the third instance, the problem is unsolvable within many hours with the same setup.

Results

Figure 1 shows full 80 rounds SHA-1 visualization. It has been produced using Gephi Force Atlas 2 layout thanks to its speed and expressive power. The layout is based on a force-directed placement algorithm [6], I believe similar to what was used in earlier work.

I colour-coded vertices of the graph as follows:
  • message (M) nodes are red
  • message schedule (W) nodes are orange
  • round variables (Fi and Ai) are green
  • hash value (H) nodes are blue
  • all auxiliary nodes are light grey
In addition, edges are coloured using colour of their "source" nodes. Normally the source node will correspond to CNF variable introduced earlier in the flow of calculations; for example, a message node will give colour to many edges and a hash node will give colour to a few.
Visualization of the variable interaction graph representing CNF formula obtained from encoding full 80 rounds SHA-1
Figure 1. 80-rounds SHA-1

The image demonstrates structure of the algorithm in a surprisingly clear intuitive way. It is possible to make the following observations:
  • cylindrical shape of the figure reflects circular nature of the calculations (circular shift from round to round) combined with the sequence of rounds
  • rounds of calculations follow from the bottom to the top, there are 80 of them each one represented with a green "ring"
  • message bits (red) are included into the "lower" rounds and the hash bits (blue) are derived from the "higher” rounds
  • there are 32 longitudinal lines of message schedule bits (orange) representing their sequential derivation and inclusion into consecutive rounds
Figures 2 and 3 show the same visualization for the two reduced versions of SHA-1. The shapes and structure of the figures reflect the reduced number of rounds while leaving other elements unchanged as expected. It is notable that despite visually moderate difference of complexity and structure, there is such a tremendous difference in solving difficulty of the underlying SAT problems.
Visualization of the variable interaction graph representing CNF formula obtained from encoding 20 rounds reduced SHA-1
Figure 2. 20-rounds SHA-1

Visualization of the variable interaction graph representing CNF formula obtained from encoding 25 rounds reduced SHA-1
Figure 3. 25-rounds SHA-1

Finally, Figure 4 shows the same formula as presented on Figure 2, with weighted edges. Shades of blue are used to distinguish different weights: "heavier" edges are shown in darker colours. While certain structural elements appear more vividly (e.g. variables that occur together in clauses encoding 5-operand addition and xor), overall shape and structure becomes less clean and visually appealing, partially due to Gephi and the layout algorithm limitations.

Visualization of the weighted variable incidence graph representing CNF formula obtained from encoding 20 rounds reduced SHA-1, shades of blue represent edge weights
Figure 4. 20 rounds SHA-1, weighted edges

Conclusions

The presented visualizations may provide educational value to anyone with interest in cryptanalysis or SAT. Furthermore, it will be interesting to use the described visualization method as a tool for hypothesis discovery and research of SAT solving techniques.

For the purpose of reproducibility, all DIMACS CNF files, GraphML files and high-resolution images have been published in Harward Dataverse, doi:10.7910/DVN/L4NBYE.

References

  1. Carsten Sinz, Visualizing the internal structure of SAT instances (preliminary report), In Proc. of the 7th Intl. Conf. on Theory and Applications of Satisfiability Testing (SAT 2004), Vancouver, Canada, May 2004.
  2. Carsten Sinz and Edda-Maria Dieringer, DPvis - a tool to visualize structured SAT instances, In Proc. of the 8th Intl. Conf. on Theory and Applications of Satisfiability Testing (SAT 2005), pages 257-268, St. Andrews, UK, June 2005. Springer-Verlag.
  3. Carsten Sinz. Visualizing SAT Instances and Runs of the DPLL Algorithm. J. Automated Reasoning, 39(2):219-243, 2007
  4. Newsham, Zack and Lindsay, William and Ganesh, Vijay and Liang, Jia Hui and Fischmeister, Sebastian and Czarnecki, Krzysztof. SATGraf: Visualizing the Evolution of SAT Formula Structure in Solvers, In Proc. of the 18th Intl. Conf. on Theory and Applications of Satisfiability Testing (SAT 2015), pages 62-70, 2015, doi: 10.1007/978-3-319-24318-4_6
  5. Giráldez Crú, Jesús, Levy Díaz, Jordi, dir., Gonzàlez i Sabaté, Jordi. Beyond the structure of SAT formulas. [Barcelona]: Universitat Autònoma de Barcelona, 2016. 1 recurs electrònic (145 p.). ISBN 9788449064234. Tesi doctoral - Universitat Autònoma de Barcelona. Departament de Ciències de la Computació, 2016 url: https://ddd.uab.cat/record/165949
  6. Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, Mathieu Bastian. ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software, Published: June 10, 2014, doi: 10.1371/journal.pone.0098679