
Implement EMBER Feature Extraction

Open valbucci opened this issue 1 year ago • 0 comments

I propose implementing extraction of the EMBER feature set. EMBER is a widely used benchmark originally designed for PE (Portable Executable) files. While some EMBER features are PE-specific, others are format-agnostic and could benefit analysis across multiple binary formats. Below, the EMBER features are grouped into two categories:

Format-Agnostic Features (applicable to PE, ELF, and Mach-O)

  • Byte Histogram: byte frequencies (0–255) over the entire file.
  • Byte-Entropy Histogram: joint histogram of byte values and local entropy to approximate information density.
  • Strings: extract printable strings and compute related statistics (e.g., number, average length, character distribution, entropy).
  • Section Information: section details (names, sizes, entropy, virtual sizes, and hashed properties).
  • Imports Information: imported libraries and functions.
  • Exports Information: exported symbols and functions.
  • General File Information: metadata such as file size, virtual size, presence of debug data, and counts of relocations, resources, etc.
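
To make the first three items concrete, here is a minimal Python sketch of the byte histogram, byte-entropy histogram, and string statistics. It is only an illustration under stated assumptions, not EMBER's or bin2ml's actual code: the function names are hypothetical, and the window size (2048), step (1024), 16x16 binning, and 5-character minimum string length are assumed defaults that should be checked against features.py before relying on them.

```python
import math
import re
from collections import Counter


def byte_histogram(data: bytes) -> list[int]:
    """Count occurrences of each byte value (0-255) over the whole buffer."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts


def shannon_entropy(counts: list[int]) -> float:
    """Shannon entropy, in bits, of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def byte_entropy_histogram(data: bytes, window: int = 2048, step: int = 1024) -> list[int]:
    """Joint histogram of (entropy bin, coarse byte bin), flattened to 256 values.

    window and step are assumed defaults for this sketch.
    """
    out = [[0] * 16 for _ in range(16)]
    # Slide over the data; a buffer shorter than one window is a single block.
    for start in range(0, max(len(data) - window, 0) + 1, step):
        block = data[start:start + window]
        coarse = [0] * 16
        for b in block:
            coarse[b >> 4] += 1  # 16 byte values per coarse bin
        # Entropy of 16 bins is at most 4 bits: scale to 0..16, clamp to bin 15.
        h_bin = min(int(shannon_entropy(coarse) * 4), 15)
        for i, c in enumerate(coarse):
            out[h_bin][i] += c
    return [v for row in out for v in row]


def string_stats(data: bytes, min_len: int = 5) -> dict:
    """Statistics over runs of printable ASCII (0x20-0x7e) of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    strings = pattern.findall(data)
    chars = Counter(b"".join(strings))  # keys are byte values (ints)
    total_len = sum(len(s) for s in strings)
    return {
        "num_strings": len(strings),
        "avg_length": total_len / len(strings) if strings else 0.0,
        "char_counts": dict(chars),
        "entropy": shannon_entropy(list(chars.values())),
    }
```

All three functions take the raw file bytes only, so they would apply unchanged to PE, ELF, and Mach-O inputs; the remaining items in this group need a format-aware parser.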

PE-Specific Features

  • Header File Information: information extracted from PE-specific headers (COFF and Optional Headers), such as timestamp, machine type, subsystem, linker versions, etc.
  • Data Directories: size and virtual address for each of the PE data directories (e.g., export table, import table, resource table, etc.).
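
As a rough illustration of where these PE-specific values live, the sketch below reads the COFF header fields and the data directory entries directly from an in-memory image with Python's struct module. The function name and the returned dictionary layout are hypothetical; EMBER itself relies on a PE parsing library rather than manual offsets, so treat this as a sketch of the file layout, not as the reference implementation.

```python
import struct


def parse_pe_headers(data: bytes) -> dict:
    """Read COFF header fields and data directories from a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew at 0x3C holds the little-endian offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")

    # COFF file header: 20 bytes right after the 4-byte signature.
    coff = pe_offset + 4
    (machine, num_sections, timestamp, _symtab, _num_symbols,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, coff)

    # Optional header: the magic selects PE32 (0x10B) or PE32+ (0x20B),
    # which determines where the data directory array starts.
    opt = coff + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:
        dir_offset = 96
    elif magic == 0x20B:
        dir_offset = 112
    else:
        raise ValueError(f"unexpected optional header magic {magic:#x}")

    # NumberOfRvaAndSizes sits immediately before the directory array.
    (count,) = struct.unpack_from("<I", data, opt + dir_offset - 4)
    # Never read past the declared optional header size or beyond 16 entries.
    count = min(count, 16, max((opt_size - dir_offset) // 8, 0))
    directories = [
        struct.unpack_from("<II", data, opt + dir_offset + 8 * i)
        for i in range(count)
    ]  # each entry is (virtual_address, size)

    return {
        "machine": machine,
        "num_sections": num_sections,
        "timestamp": timestamp,
        "characteristics": characteristics,
        "magic": magic,
        "data_directories": directories,
    }
```

A call such as `parse_pe_headers(open(path, "rb").read())` would return the timestamp and machine type alongside a list of `(virtual_address, size)` pairs in directory order (export table first, then import table, resource table, and so on).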

Reference

features.py in EMBER's GitHub repo

valbucci, Mar 08 '25 17:03