How it works

Similarity Computation

SightHouse relies on BSIM, a Ghidra feature (released in December 2023) that computes the similarity between functions. BSIM stands for Behavior SIMilarity and aims to identify similar parts of functions. The idea behind BSIM is to generate a feature vector for each function in a binary. The vectors are generated from the Pcode extracted when the function is decompiled. Each feature represents a small piece of the data flow and/or control flow of the associated function.

The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce the same features. Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated into the features. BSIM vectors are compared using cosine similarity. For two functions, say foo and bar, discrepancies caused by differences in compilers, target architectures, and/or small changes to the source code typically result in vectors that are close but not identical.
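As a rough illustration of the comparison step, the sketch below computes a plain cosine similarity between two sparse vectors stored as hash-to-count mappings. It is not Ghidra's implementation: the function name `cosine_similarity` and the vector for `bar` are hypothetical, and any per-feature weighting that BSIM may apply is deliberately left out.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as {hash: count} dicts."""
    # Only hashes present in both vectors contribute to the dot product.
    dot = sum(count * b.get(h, 0) for h, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical vectors: foo and bar share four of their five features.
foo = {0x545c6155: 1, 0x7086215d: 1, 0xbd945601: 1, 0xca0bb8a0: 1, 0xe123ddbb: 1}
bar = {0x545c6155: 1, 0x7086215d: 1, 0xbd945601: 1, 0xca0bb8a0: 1, 0x0badf00d: 1}

print(round(cosine_similarity(foo, bar), 3))  # 0.8
```

With four shared features out of five in each vector, the score is 4 / (sqrt(5) * sqrt(5)) = 0.8: close to 1, but not identical.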

BSIM vectors typically look something like this:

(1:545c6155,1:7086215d,1:bd945601,1:ca0bb8a0,1:e123ddbb)
Here, the number before the colon is the count of consecutive identical hash elements (which allows vectors to be factorized when possible), and the value after the colon is the hash of a small part of the Pcode.
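To make the textual format concrete, here is a minimal parsing sketch that turns such a string into a mapping from hash to count. The helper name `parse_bsim_vector` is hypothetical, and the sketch assumes the format is exactly the parenthesized, comma-separated `count:hexhash` list shown above.

```python
def parse_bsim_vector(text):
    """Parse '(count:hash,count:hash,...)' into a {hash: count} dict."""
    features = {}
    body = text.strip().strip("()")
    if not body:
        return features
    for item in body.split(","):
        count_str, _, hash_str = item.strip().partition(":")
        h = int(hash_str, 16)  # hashes are written in hexadecimal
        # Accumulate in case the same hash appears in several entries.
        features[h] = features.get(h, 0) + int(count_str)
    return features

vec = parse_bsim_vector("(1:545c6155,1:7086215d,1:bd945601,1:ca0bb8a0,1:e123ddbb)")
print(len(vec))  # 5
```

A dictionary keeps the representation sparse: only the hashes that actually occur in a function are stored, each with its count.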

To display these vectors, you can use the script provided by Ghidra: DumpBSimSignaturesScript.py.