bin2ml
Implement EMBER Feature Extraction
I propose adding extraction of the feature set defined by EMBER, a widely used benchmark originally designed for PE (Portable Executable) files. Some EMBER features are PE-specific, but others are format-agnostic and could benefit analysis across multiple binary formats. The EMBER features fall into two categories:
Format-Agnostic Features (applicable to PE, ELF, and Mach-O)
- Byte Histogram: byte frequencies (0–255) over the entire file.
- Byte-Entropy Histogram: joint histogram of byte values and local entropy to approximate information density.
- Strings: extract printable strings and compute related statistics (e.g., count, average length, character distribution, entropy).
- Section Information: section details (names, sizes, entropy, virtual sizes, and hashed properties).
- Imports Information: imported libraries and functions.
- Exports Information: exported symbols and functions.
- General File Information: metadata such as file size, virtual size, presence of debug data, and counts of relocations, resources, etc.
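To make the raw-byte features above concrete, here is a minimal Python sketch of the byte histogram, byte-entropy histogram, and string statistics. It loosely follows the approach in EMBER's features.py but is not a copy of it: the function names are hypothetical, and the 2048-byte window, 1024-byte step, 16x16 binning, and 5-character minimum string length are assumptions chosen to illustrate the idea. A bin2ml implementation would need to check these against the reference.

```python
import math
import re

# Assumption: a "string" is a run of at least 5 printable ASCII bytes.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{5,}")


def byte_histogram(data: bytes) -> list[int]:
    """Count occurrences of each byte value (0-255) over the whole input."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts


def shannon_entropy(counts: list[int]) -> float:
    """Shannon entropy, in bits, of a histogram of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def byte_entropy_histogram(
    data: bytes, window: int = 2048, step: int = 1024
) -> list[list[int]]:
    """16x16 joint histogram: rows are entropy bins, columns are high nibbles."""
    joint = [[0] * 16 for _ in range(16)]
    if not data:
        return joint
    # Slide a window over the data; a short input is treated as one window.
    last_start = max(len(data) - window, 0)
    for start in range(0, last_start + 1, step):
        block = data[start:start + window]
        nibble_counts = [0] * 16
        for b in block:
            nibble_counts[b >> 4] += 1
        # Entropy over 16 bins lies in [0, 4] bits; scale it onto 16 rows.
        row = min(int(shannon_entropy(nibble_counts) * 4), 15)
        for col, c in enumerate(nibble_counts):
            joint[row][col] += c
    return joint


def string_stats(data: bytes) -> dict:
    """Basic statistics over printable strings found in the input."""
    strings = PRINTABLE_RUN.findall(data)
    n = len(strings)
    avg = sum(len(s) for s in strings) / n if n else 0.0
    return {"count": n, "avg_length": avg}
```

All three functions take only the raw file bytes, which is why they apply equally to PE, ELF, and Mach-O inputs. The section, import, and export features would additionally need a format-aware parser.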
PE-Specific Features
- Header File Information: information extracted from PE-specific headers (COFF and Optional Headers), such as timestamp, machine type, subsystem, linker versions, etc.
- Data Directories: size and virtual address for each of the PE data directories (e.g., export table, import table, resource table).
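As a rough illustration of the header features, the sketch below reads the COFF file header fields (machine type, section count, timestamp, characteristics) directly from raw bytes using Python's standard `struct` module. The function name `parse_coff_header` and the returned dictionary keys are hypothetical; the field layout follows the PE/COFF format (the PE signature offset stored at 0x3C, then a 20-byte COFF header). The Optional Header and data directories are left out here because their offsets differ between PE32 and PE32+.

```python
import struct


def parse_coff_header(data: bytes) -> dict:
    """Extract COFF file header fields from the bytes of a PE image."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ/PE image")
    # e_lfanew: offset of the PE signature, stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature.
    (
        machine,
        number_of_sections,
        timestamp,
        _pointer_to_symbol_table,
        _number_of_symbols,
        size_of_optional_header,
        characteristics,
    ) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": number_of_sections,
        "timestamp": timestamp,
        "size_of_optional_header": size_of_optional_header,
        "characteristics": characteristics,
    }
```

For non-PE inputs this raises `ValueError`, so a caller could use that to skip the PE-specific feature group while still emitting the format-agnostic features.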
Reference
features.py in EMBER's GitHub repo