attack_data Add bulk replay capabilities to replay.py

Today a user cannot point to a folder and ingest all datasets with the tool.

Apr 11 '22 23:04 josehelps

One idea to start the conversation is to split the current replay.yml into two parts ...

config.yml, which contains the Splunk params (host/user/pass) + default index + update_timestamp
dataset.yml, which would exist in each dataset's directory (adding info to the existing yml file) and contains name + source + sourcetype + index (if the user wants to override the default one in config.yml)

Propose we standardize the per-directory yml filename to dataset.yml so it can easily be found/recognized.
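
To make the split concrete, here is a minimal sketch of how the two files could combine once both are parsed into dictionaries. The function name `resolve_replay_settings` is hypothetical, the key names are taken from the fields listed above, and the precedence (a per-dataset index wins over the config default) is an assumption.

```python
def resolve_replay_settings(config, entry):
    """Merge Splunk defaults from config.yml with one entry from dataset.yml.

    Both arguments are plain dicts; parsing the YAML files is out of scope here.
    """
    # Assumed precedence: an index set in dataset.yml overrides the default in config.yml.
    index = entry.get("index") or config["index"]
    return {
        "host": config["host"],
        "index": index,
        "name": entry["name"],
        "source": entry["source"],
        "sourcetype": entry["sourcetype"],
        # Timestamps are left untouched unless config.yml asks for an update.
        "update_timestamp": bool(config.get("update_timestamp", False)),
    }
```

An entry without an `index` key falls back to the default from config.yml, so existing datasets would not need one.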

Calling replay.py could look like this ...

python replay.py -h
       -c config.yml     Splunk configuration (host/user/pass/index/override timestamp) (required)
       -d <directory>    Directory to recursively search for dataset.yml to start ingesting (required)

       -i <index>        Override index in config.yml (optional)
       -t                Override config.yml and update timestamps (optional)
       -s <seconds>      Seconds to sleep between directory ingests (allows Splunk to catch up on indexing) (optional)
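
The options above could be wired up with `argparse` roughly as follows. This is only a sketch: the short flags come from the usage text, while the long option names and the `build_parser` helper are assumptions added for readability. `-h` does not need to be declared because `argparse` provides it automatically.

```python
import argparse


def build_parser():
    """Build a parser for the proposed bulk-replay options."""
    parser = argparse.ArgumentParser(
        prog="replay.py",
        description="Replay every dataset found under a directory into Splunk.",
    )
    parser.add_argument("-c", "--config", required=True,
                        help="Splunk configuration file (host/user/pass/index/timestamp)")
    parser.add_argument("-d", "--directory", required=True,
                        help="directory searched recursively for dataset.yml")
    parser.add_argument("-i", "--index", default=None,
                        help="override the index set in config.yml")
    parser.add_argument("-t", "--update-timestamp", action="store_true",
                        help="update timestamps regardless of config.yml")
    parser.add_argument("-s", "--sleep", type=float, default=0.0,
                        help="seconds to sleep between directory ingests")
    return parser
```

Calling `build_parser().parse_args()` in replay.py would then yield `args.config`, `args.directory`, `args.index`, `args.update_timestamp`, and `args.sleep`.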

Each directory's *.yml currently seems to list the sourcetypes, but they are not linked to or ordered with the filenames. Here's an example:

author: Patrick Bareiss, Michael Haag
id: cc9b25d6-efc9-11eb-926b-550bf0943fbb
date: '2022-01-12'
description: 'Atomic Test Results: Successful Execution of test T1003.001-1 Windows
  Credential Editor Successful Execution of test T1003.001-2 Dump LSASS.exe Memory
  using ProcDump Return value unclear for test T1003.001-3 Dump LSASS.exe Memory using
  comsvcs.dll Successful Execution of test T1003.001-4 Dump LSASS.exe Memory using
  direct system calls and API unhooking Return value unclear for test T1003.001-6
  Offline Credential Theft With Mimikatz Return value unclear for test T1003.001-7
  LSASS read with pypykatz '
environment: attack_range
dataset:
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-powershell.log
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-security.log
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-sysmon.log
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-sysmon_creddump.log
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-system.log
sourcetypes:
- XmlWinEventLog:Microsoft-Windows-Sysmon/Operational
- WinEventLog:Microsoft-Windows-PowerShell/Operational
- WinEventLog:System
- WinEventLog:Security
references:
- https://attack.mitre.org/techniques/T1003/001/
- https://github.com/redcanaryco/atomic-red-team/blob/master/atomics/T1003.001/T1003.001.md
- https://github.com/splunk/security-content/blob/develop/tests/T1003_001.yml

As you can see, the 'dataset' files are in a different order than the 'sourcetypes' (and there are five files but only four sourcetypes). Propose we add a formal linkage from each filename to its source/sourcetype (basically moving the replay_parameters logic from replay.yml into each directory's dataset.yml file) so it can be documented per dataset capture and replayed:

author: Patrick Bareiss, Michael Haag
id: cc9b25d6-efc9-11eb-926b-550bf0943fbb
date: '2022-01-12'
description: 'Atomic Test Results: Successful Execution of test T1003.001-1 Windows
  Credential Editor Successful Execution of test T1003.001-2 Dump LSASS.exe Memory
  using ProcDump Return value unclear for test T1003.001-3 Dump LSASS.exe Memory using
  comsvcs.dll Successful Execution of test T1003.001-4 Dump LSASS.exe Memory using
  direct system calls and API unhooking Return value unclear for test T1003.001-6
  Offline Credential Theft With Mimikatz Return value unclear for test T1003.001-7
  LSASS read with pypykatz '
environment: attack_range
references:
- https://attack.mitre.org/techniques/T1003/001/
- https://github.com/redcanaryco/atomic-red-team/blob/master/atomics/T1003.001/T1003.001.md
- https://github.com/splunk/security-content/blob/develop/tests/T1003_001.yml

replay_parameters:
  - name: atomic_red_team/windows-powershell.log
    source: XmlWinEventLog:Microsoft-Windows-Sysmon/Operational
    sourcetype: xmlwineventlog
    notes: <optional>
  - name: windows-sysmon.log
    source: XmlWinEventLog:Microsoft-Windows-Sysmon/Operational
    sourcetype: xmlwineventlog
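
With replay_parameters living in each dataset.yml, a bulk run could reduce to a walk over the directory tree. The sketch below assumes the standardized filename dataset.yml; `find_dataset_files`, `replay_all`, and the `ingest` callback are hypothetical names, and the actual parsing and upload to Splunk are left to that callback.

```python
import time
from pathlib import Path


def find_dataset_files(root):
    """Return every dataset.yml under root, sorted for a stable ingest order."""
    return sorted(Path(root).rglob("dataset.yml"))


def replay_all(root, ingest, sleep_seconds=0):
    """Call ingest(path) for each dataset.yml found, pausing between directories.

    Returns the number of dataset.yml files that were handed to ingest.
    """
    files = find_dataset_files(root)
    for position, path in enumerate(files):
        ingest(path)
        # Give Splunk time to catch up, but skip the pause after the last directory.
        if sleep_seconds and position < len(files) - 1:
            time.sleep(sleep_seconds)
    return len(files)
```

Passing the ingest step in as a callback keeps the directory walk testable without a Splunk instance.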

Apr 12 '22 13:04 fryguy04

I really dig this proposal, although it will require us to refactor a few aspects of our testing pipeline to read from the new YAML structures. With this approach we can/should also create a spec for dataset.yml and run CI/CD validation against it on every PR, similar to what the security_content repo does. Let me bring this back to the team and think it through, but on the surface it looks absolutely doable :smile:. Thank you so much for spending the time to write this up, super useful!
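
As a starting point for that validation, a check along these lines could run in CI on an already-parsed dataset.yml. This is a sketch, not an agreed spec: the required keys are taken from the proposal above, and `validate_dataset` and `REQUIRED_KEYS` are hypothetical names.

```python
REQUIRED_KEYS = ("name", "source", "sourcetype")


def validate_dataset(doc):
    """Return a list of problems found in a parsed dataset.yml; empty means it passes."""
    entries = doc.get("replay_parameters")
    if not isinstance(entries, list) or not entries:
        return ["replay_parameters must be a non-empty list"]
    errors = []
    for position, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"replay_parameters[{position}] must be a mapping")
            continue
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            # Each required field has to be a non-blank string.
            if not isinstance(value, str) or not value.strip():
                errors.append(f"replay_parameters[{position}] is missing '{key}'")
    return errors
```

A CI job could fail the PR whenever the returned list is non-empty and print the messages as the reason.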

Apr 15 '22 20:04 josehelps