Block features

In this first part, let's look at how to extract the block features used in the paper.

Final snippet

from typing import Dict, Union, List
import quokka

# Use the code from arch.py in this repo, originally from:
# https://github.com/Cisco-Talos/binary_function_similarity/blob/main/IDA_scripts/IDA_acfg_features/core/architecture.py
ARCH_MNEM = ...

FeaturesDict = Dict[str, Union[int, List[str], List[int]]]

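def get_bb_features(block: quokka.Block) -> FeaturesDict:
    """Extract the features of a single basic block."""
    # Mnemonics of the block's instructions, read through the capstone bindings
    mnemonics = [inst.cs_inst.mnemonic for inst in block.instructions]
    # Architecture name, used as the first-level key of ARCH_MNEM
    arch = block.program.isa.name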

    return {
        "bb_len": block.size, # (1)!
        # List features
        "bb_numerics": block.constants, # (2)!
        "bb_strings": block.strings, # (3)!
        # Numeric features
        "n_numeric_consts": len(block.constants), # (4)!
        "n_string_consts": len(block.strings), # (5)!
        "n_instructions": len(mnemonics), # (6)!
        "n_arith_instrs": sum(
            1 for m in mnemonics if m in ARCH_MNEM[arch]["arithmetic"]  # (7)!
        ),
        "n_call_instrs": sum(1 for m in mnemonics if m in ARCH_MNEM[arch]["call"]),
        "n_logic_instrs": sum(1 for m in mnemonics if m in ARCH_MNEM[arch]["logic"]),
        "n_transfer_instrs": sum(
            1 for m in mnemonics if m in ARCH_MNEM[arch]["transfer"]
        ),
        "n_redirect_instrs": sum(
            1
            for m in mnemonics
            if (m in ARCH_MNEM[arch]["unconditional"])
            or (m in ARCH_MNEM[arch]["conditional"])
            or (m in ARCH_MNEM[arch]["call"])
        ),
    }

(1) First, take the block's size as its length.
(2) The list of numeric constants used in the block is accessible through the .constants attribute.
(3) The list of strings found in the block is accessible through the .strings attribute.
(4) The number of constants is the len of the constants list.
(5) The number of strings is the len of the strings list.
(6) The number of instructions is the number of mnemonics in the list.
(7) Instructions are classified using the provided ARCH_MNEM mapping.
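
The classification step can be tried without loading a binary. The following is a minimal sketch that works on a plain list of mnemonics. TOY_MNEM is a hypothetical, heavily truncated stand-in for ARCH_MNEM (its keys and contents are illustrative only), and count_category and count_redirect are helper names introduced here, not part of quokka.

```python
from typing import Dict, List, Set

# Hypothetical, truncated stand-in for ARCH_MNEM (illustrative values only)
TOY_MNEM: Dict[str, Dict[str, Set[str]]] = {
    "x86": {
        "arithmetic": {"add", "sub", "imul"},
        "call": {"call"},
        "unconditional": {"jmp"},
        "conditional": {"je", "jne"},
    }
}


def count_category(mnemonics: List[str], arch: str, category: str) -> int:
    """Count the mnemonics that belong to one category."""
    allowed = TOY_MNEM[arch][category]
    return sum(1 for m in mnemonics if m in allowed)


def count_redirect(mnemonics: List[str], arch: str) -> int:
    """Count control-flow redirections: both kinds of jumps, plus calls."""
    table = TOY_MNEM[arch]
    redirect = table["unconditional"] | table["conditional"] | table["call"]
    return sum(1 for m in mnemonics if m in redirect)


mnemonics = ["push", "add", "call", "sub", "jne", "jmp"]
n_arith = count_category(mnemonics, "x86", "arithmetic")
n_call = count_category(mnemonics, "x86", "call")
n_redirect = count_redirect(mnemonics, "x86")
```

Merging the three sets first gives the same count as the chained or conditions in the snippet, because each mnemonic is still counted at most once.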

ARCH_MNEM

This mapping was created by the paper's authors to classify the instructions of each architecture. For example, the mnemonics that touch the stack on ARM are the following:

# The "ARM" level must exist before a category can be assigned to it
ARCH_MNEM = {"ARM": {}}
ARCH_MNEM["ARM"]["stack"] = {
    'pop',
    'popeq',
    'popne',
    'pople',
    'pophs',
    'poplt',
    'push'
}
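
Lookups in this mapping are exact string comparisons, which is why each conditional variant (popeq, popne, and so on) is listed on its own. Here is a small sketch using the ARM stack set shown above. is_stack_instr is a helper name introduced for illustration, and lowercasing the input is an assumption about how mnemonics are spelled in the mapping.

```python
from typing import Dict, Set

# Same set as above, built so that the "ARM" level exists before it is filled
ARCH_MNEM: Dict[str, Dict[str, Set[str]]] = {"ARM": {}}
ARCH_MNEM["ARM"]["stack"] = {
    "pop", "popeq", "popne", "pople", "pophs", "poplt", "push",
}


def is_stack_instr(mnemonic: str, arch: str = "ARM") -> bool:
    """Return True if the mnemonic is listed in the "stack" category."""
    # .get() avoids a KeyError for an unknown architecture or category
    return mnemonic.lower() in ARCH_MNEM.get(arch, {}).get("stack", set())


# "popgt" and "mov" are not in the set shown here, so they are not matched
checks = {m: is_stack_instr(m) for m in ("push", "popeq", "popgt", "mov")}
```

Because membership is tested against a set, each lookup stays cheap even when a category lists many mnemonics.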

Obtaining the mnemonics

If you paid attention to the snippet, this line was used to obtain the mnemonics:

...
mnemonics = [inst.cs_inst.mnemonic for inst in block.instructions]
...

Why did we use the cs_inst attribute instead of the simpler mnemonic attribute of the Instruction class?

Simply to demonstrate the usage of the capstone bindings.