How it works
Similarity Computation
SightHouse relies on BSIM, a Ghidra feature released in December 2023 that computes similarity between functions. BSIM stands for Behavior SIMilarity and aims to identify similar parts of functions. The idea behind BSIM is to generate a feature vector for each function in a binary. The vectors are generated from the P-code that Ghidra extracts when it decompiles a function. Each feature represents a small piece of the data flow and/or control flow of the associated function.
The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce
the same features. Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated
into the features. BSIM vectors are compared using cosine similarity. When two functions, say foo and bar, differ only because of
different compilers, target architectures, and/or small changes to the source code, their vectors are typically close but not
identical.
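To illustrate the comparison step, here is a minimal Python sketch of cosine similarity between two sparse feature vectors. It is not BSIM's implementation: the function name `cosine_similarity`, the dictionary layout (feature hash mapped to a count), and the sample vectors for foo and bar are assumptions made for the example, and any feature weighting that BSIM applies internally is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Plain (unweighted) cosine similarity between two sparse vectors.

    Each vector is a dict mapping a feature hash to its count.
    """
    # Only features present in both vectors contribute to the dot product.
    dot = sum(count * b.get(feature, 0) for feature, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical vectors: foo and bar share four of their five features.
foo = {0x545C6155: 1, 0x7086215D: 1, 0xBD945601: 1, 0xCA0BB8A0: 1, 0xE123DDBB: 1}
bar = {0x545C6155: 1, 0x7086215D: 1, 0xBD945601: 1, 0xCA0BB8A0: 1, 0x0BADC0DE: 1}

print(round(cosine_similarity(foo, bar), 3))  # prints 0.8
```

With these sample values the score is 0.8: high, but below 1.0, which is the "close but not identical" situation described above.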
BSIM vectors typically look something like this:
(1:545c6155,1:7086215d,1:bd945601,1:ca0bb8a0,1:e123ddbb)
To display these vectors, you can use the script provided by Ghidra: DumpBSimSignaturesScript.py.
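If you want to post-process such a dump outside Ghidra, the textual form can be parsed with a few lines of Python. This is only a sketch under an assumption that the text above does not confirm: each entry is read as `count:hash`, with the hash in hexadecimal. The helper name `parse_bsim_vector` is hypothetical and is not part of Ghidra.

```python
def parse_bsim_vector(text):
    """Parse a string such as '(1:545c6155,1:7086215d)' into a dict.

    Assumption: each entry is 'count:hash' with the hash in hexadecimal.
    Returns a dict mapping the integer hash to its accumulated count.
    """
    body = text.strip().strip("()")
    features = {}
    if not body:
        return features  # empty vector: "()"
    for entry in body.split(","):
        count, feature_hash = entry.strip().split(":")
        key = int(feature_hash, 16)
        features[key] = features.get(key, 0) + int(count)
    return features


vector = parse_bsim_vector("(1:545c6155,1:7086215d,1:bd945601,1:ca0bb8a0,1:e123ddbb)")
print(len(vector))  # prints 5
```

The resulting dictionary is a convenient sparse representation for comparing two dumped vectors offline.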