graylog2-server Automatically generate datanode.conf.example

Generate the datanode.conf.example file with all possible configuration options. Also generate the CSV documentation for the same options.

/nocl

Description

This PR adds two Maven tasks that generate the datanode.conf.example and datanode-conf-docs.csv files. The datanode.conf.example file will be included in our dist packages. The CSV may be used for documentation purposes.

The datanode assembly now includes this generated file instead of the manually created one.
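
To illustrate the idea (this is not the PR's actual implementation), a generator of this kind can read documentation metadata from annotated fields via reflection and emit one CSV row per option. The `ConfigDoc` annotation, the `SampleConfig` bean, and the column layout below are all hypothetical stand-ins.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical annotation; the PR's real annotation may look different.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ConfigDoc {
    String name();
    String description();
}

// Hypothetical config bean with two documented options and one undocumented field.
class SampleConfig {
    @ConfigDoc(name = "shutdown_timeout", description = "Milliseconds to wait for tasks during shutdown.")
    int shutdownTimeout = 30000;

    @ConfigDoc(name = "opensearch_heap", description = "Heap size of the embedded OpenSearch.")
    String opensearchHeap = "1g";

    String undocumented = "ignored";
}

class CsvDocsSketch {
    // Quotes a CSV cell and doubles any embedded quote characters.
    static String cell(Object value) {
        return "\"" + String.valueOf(value).replace("\"", "\"\"") + "\"";
    }

    // Returns a header row followed by one row per annotated field.
    // Note: the JDK does not formally guarantee the order of getDeclaredFields().
    static List<String> rows(Object bean) {
        List<String> rows = new ArrayList<>();
        rows.add("name,default,description");
        for (Field field : bean.getClass().getDeclaredFields()) {
            ConfigDoc doc = field.getAnnotation(ConfigDoc.class);
            if (doc == null) {
                continue; // fields without documentation are skipped
            }
            field.setAccessible(true);
            try {
                rows.add(cell(doc.name()) + "," + cell(field.get(bean)) + "," + cell(doc.description()));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        rows(new SampleConfig()).forEach(System.out::println);
    }
}
```

The same metadata could feed both outputs, so the example file and the CSV cannot drift apart.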

Motivation and Context

Automate and enforce documentation of configuration options.

How Has This Been Tested?

Manually

Screenshots (if appropriate):

#####################################
# GRAYLOG DATANODE CONFIGURATION FILE
#####################################
# This is the Graylog DataNode configuration file. The file has to use ISO 8859-1/Latin-1 character encoding.
# Characters that cannot be directly represented in this encoding can be written using Unicode escapes
# as defined in https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.3, using the \u prefix.
# For example, \u002c.
# 
# * Entries are generally expected to be a single line in one of the following forms:
# 
# propertyName=propertyValue
# propertyName:propertyValue
# 
# * White space that appears between the property name and property value is ignored,
# so the following are equivalent:
# 
# name=Stephen
# name = Stephen
# 
# * White space at the beginning of the line is also ignored.
# 
# * Lines that start with the comment characters ! or # are ignored. Blank lines are also ignored.
# 
# * The property value is generally terminated by the end of the line. White space following the
# property value is not ignored, and is treated as part of the property value.
# 
# * A property value can span several lines if each line is terminated by a backslash (‘\’) character.
# For example:
# 
# targetCities=\
# Detroit,\
# Chicago,\
# Los Angeles
# 
# This is equivalent to targetCities=Detroit,Chicago,Los Angeles (white space at the beginning of lines is ignored).
# 
# * The characters newline, carriage return, and tab can be inserted with characters \n, \r, and \t, respectively.
# 
# * The backslash character must be escaped as a double backslash. For example:
# 
# path=c:\\docs\\doc1
# 
# 

# You MUST set a secret to secure/pepper the stored user passwords here. Use at least 16 characters.
# Generate one by using for example: pwgen -N 1 -s 96
# ATTENTION: This value must be the same on all Graylog and Datanode nodes in the cluster.
# Changing this value after installation will render all user sessions and encrypted values
# in the database invalid. (e.g. encrypted access tokens)
password_secret = 

# Do not perform any preflight checks when starting Datanode.
#skip_preflight_checks = false

# How many milliseconds should datanode wait for termination of all tasks during the shutdown.
#shutdown_timeout = 30000

# Directory where Datanode will search for an opensearch distribution.
#opensearch_location = dist

# Data directory of the embedded opensearch. Contains indices of the opensearch.
# May be pointed to an existing opensearch directory during in-place migration to Datanode
#opensearch_data_location = datanode/data

# Logs directory of the embedded opensearch
#opensearch_logs_location = datanode/logs

# Configuration directory of the embedded opensearch. This is the directory where the opensearch
# process will store its configuration files. Caution, each start of the Datanode will regenerate
# the complete content of the directory!
#opensearch_config_location = datanode/config

# Source directory of the additional configuration files for the Datanode. Additional certificates can be provided here.
#config_location = 

# How many log entries of the opensearch process should Datanode hold in memory and make accessible via API calls.
#process_logs_buffer_size = 500

# Unique name of this Datanode instance. Use this if your node name should be different from the hostname
# that's found by programmatically looking it up.
#node_name = 

# Comma separated list of opensearch nodes that are eligible as manager nodes.
#initial_cluster_manager_nodes = 

# Opensearch heap memory. Initial and maximum heap must be identical for OpenSearch, otherwise the boot fails.
# So it's only one config option.
#opensearch_heap = 1g

# HTTP port on which the embedded opensearch listens
#opensearch_http_port = 9200

# Transport port on which the embedded opensearch listens
#opensearch_transport_port = 9300

# Provides a list of the addresses of the master-eligible nodes in the cluster.
#opensearch_discovery_seed_hosts = []

# Binds an OpenSearch node to an address. Use 0.0.0.0 to include all available network interfaces,
# or specify an IP address assigned to a specific interface.
#opensearch_network_host = 

# Relative path (to config_location) to a keystore used for opensearch transport layer TLS
#transport_certificate = 

# Password for a keystore defined in transport_certificate
#transport_certificate_password = 

# Relative path (to config_location) to a keystore used for opensearch REST layer TLS
#http_certificate = 

# Password for a keystore defined in http_certificate
#http_certificate_password = 

# The auto-generated node ID will be stored in this file and read after restarts. It is a good idea
# to use an absolute file path here if you are starting Graylog DataNode from init scripts or similar.
#node_id_file = data/node-id

# HTTP bind address. The network interface used by the Graylog DataNode to bind all services.
#bind_address = 0.0.0.0

# HTTP port. The port on which the DataNode REST API listens.
#datanode_http_port = 8999

# Name of the cluster that the embedded opensearch will form. Should be the same for all Datanodes in one cluster.
#clustername = datanode-cluster

# This configuration should be used if you want to connect to this Graylog DataNode's REST API
# and it is available on another network interface than $http_bind_address,
# for example if the machine has multiple network interfaces or is behind a NAT gateway.
#http_publish_uri = 

# Enable GZIP support for HTTP interface. This compresses API responses and therefore helps to reduce
# overall round trip times.
#http_enable_gzip = true

# The maximum size of the HTTP request headers in bytes
#http_max_header_size = 8192

# The size of the thread pool used exclusively for serving the HTTP interface.
#http_thread_pool_size = 64

# Cache size for searchable snapshots
#node_search_cache_size = 10gb

# Filesystem path where searchable snapshots should be stored
#path_repo = 

# This setting limits the number of clauses a Lucene BooleanQuery can have.
#opensearch_indices_query_bool_max_clause_count = 32768

# The list of the opensearch node’s roles.
#node_roles = [cluster_manager, data, ingest, remote_cluster_client, search]

# Configures verbosity of embedded opensearch logs.
# Possible values: OFF, FATAL, ERROR, WARN, INFO, DEBUG, and TRACE. The default is INFO.
#opensearch_logger_org_opensearch = 

# Configures opensearch audit log storage type. See https://opensearch.org/docs/2.13/security/audit-logs/storage-types/
#opensearch_plugins_security_audit_type = 

#### OpenSearch JWT token usage
# Communication between Graylog and OpenSearch is secured by JWT. These are the defaults used for the token usage.
# Adjust them if you have special needs.

# This configuration defines interval between token regenerations.
#indexer_jwt_auth_token_caching_duration = 60 seconds

# This configuration defines validity interval of JWT tokens
#indexer_jwt_auth_token_expiration_duration = 180 seconds

# Increase this value according to the maximum connections your MongoDB server can handle from a single client
# if you encounter MongoDB connection problems.
#mongodb_max_connections = 1000

# MongoDB connection string. See https://docs.mongodb.com/manual/reference/connection-string/ for details
#mongodb_uri = mongodb://localhost/graylog

# Maximum number of attempts to connect to MongoDB on boot for the version probe.
# Default 0 means retry indefinitely until a connection can be established
#mongodb_version_probe_attempts = 0

# Allowed TLS protocols for system-wide TLS-enabled servers (e.g. message inputs, HTTP interface).
# Setting this to an empty value leaves it up to system libraries and the JDK in use to choose a default.
#enabled_tls_protocols = 

# S3 repository access key for searchable snapshots
#s3_client_default_access_key = 

# S3 repository secret key for searchable snapshots
#s3_client_default_secret_key = 

# S3 repository protocol for searchable snapshots
#s3_client_default_protocol = http

# S3 repository endpoint for searchable snapshots
#s3_client_default_endpoint = 

# S3 repository region for searchable snapshots
#s3_client_default_region = us-east-2

# S3 repository path-style access for searchable snapshots
#s3_client_default_path_style_access = true
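
The format rules in the header of the generated file match those of Java's `java.util.Properties`. A minimal sketch, assuming the file is read with `Properties.load` (the thread does not say which loader is actually used), shows how a few of those rules behave. The class name `PropertiesFormatDemo` and the sample keys are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class PropertiesFormatDemo {
    // Parses properties-format text with the JDK's own loader.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Sample input exercising comments, whitespace, continuation lines, and escapes.
    static Properties sample() {
        String text = String.join("\n",
                "# a comment line",
                "name = Stephen",
                "targetCities=\\",
                "    Detroit,\\",
                "    Chicago,\\",
                "    Los Angeles",
                "path=c:\\\\docs\\\\doc1",
                "trailing = value  ",
                "");
        return parse(text);
    }

    public static void main(String[] args) {
        Properties p = sample();
        System.out.println(p.getProperty("name"));                    // Stephen
        System.out.println(p.getProperty("targetCities"));            // Detroit,Chicago,Los Angeles
        System.out.println(p.getProperty("path"));                    // c:\docs\doc1
        System.out.println("[" + p.getProperty("trailing") + "]");    // [value  ]
    }
}
```

The last line shows why trailing white space after a value matters: it becomes part of the value.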

Types of changes

[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Refactoring (non-breaking change)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

[x] My code follows the code style of this project.
[ ] My change requires a change to the documentation.
[ ] I have updated the documentation accordingly.
[x] I have read the CONTRIBUTING document.
[ ] I have added tests to cover my changes.

Apr 23 '24 16:04 todvora

@bernd what do you think? Is this a valid and usable approach? Can we merge it? Thanks!

May 13 '24 08:05 todvora

@bernd what do you think? Is this a valid and usable approach? Can we merge it? Thanks!

@todvora, I like the general approach! :+1: We should improve a few details before we ship the generated config, though.

Our server config uses a space character around the =. I think that makes it more readable. The generated config is not doing that yet. (mongodb_uri=mongodb://localhost/graylog vs mongodb_uri = mongodb://localhost/graylog)
The generator currently doesn't support ordering. In the server config, we put important settings that users must set at the top of the file. That makes them easy to spot.
In the server config we comment the setting with its default value without a space between the comment character and the setting name. That makes the setting easier to spot.
The generator currently doesn't support sections to create groups of related settings.

May 13 '24 11:05 bernd

Our server config uses a space character around the =. I think that makes it more readable

Easy to implement, done.

The generator currently doesn't support ordering. In the server config, we put important settings that users must set at the top of the file. That makes them easy to spot.

That happens already. Mandatory properties without a default value, the ones users need to fill in, are placed at the top of the configuration file. Otherwise, the order follows the order of the properties in the Java config beans. Reordering in Java leads to reordering in the config file. That seems a natural way to handle this.
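
A minimal sketch of the layout rules from this exchange, assuming each option is known by its name, its default value, and whether it is mandatory: options that must be filled in come first as active `name = ` lines, and everything else keeps its declaration order as `#name = default`. The `Option` class and the `render` method are hypothetical, not the generator's real code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ConfLayoutSketch {
    // Hypothetical description of one configuration option.
    static final class Option {
        final String name;
        final String defaultValue; // null means "no default value"
        final boolean mandatory;

        Option(String name, String defaultValue, boolean mandatory) {
            this.name = name;
            this.defaultValue = defaultValue;
            this.mandatory = mandatory;
        }

        // True for options the user has to set before the node can start.
        boolean mustBeFilledIn() {
            return mandatory && defaultValue == null;
        }
    }

    // Renders one line per option, with must-fill options moved to the top.
    static List<String> render(List<Option> declared) {
        List<Option> ordered = new ArrayList<>(declared);
        // List.sort is stable, so declaration order is kept within each group.
        ordered.sort(Comparator.comparing((Option o) -> !o.mustBeFilledIn()));

        List<String> lines = new ArrayList<>();
        for (Option o : ordered) {
            if (o.mustBeFilledIn()) {
                lines.add(o.name + " = ");
            } else {
                // No space between '#' and the name, spaces around '='.
                lines.add("#" + o.name + " = " + (o.defaultValue == null ? "" : o.defaultValue));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Option> options = List.of(
                new Option("shutdown_timeout", "30000", false),
                new Option("password_secret", null, true),
                new Option("node_name", null, false));
        render(options).forEach(System.out::println);
    }
}
```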

In the server config we comment the setting with its default value without a space between the comment character and the setting name. That makes the setting easier to spot.

Fixed

The generator currently doesn't support sections to create groups of related settings.

Indeed. IDK how to add this in the current situation without significant overhead in the Java config files. From my POV, the sections should correspond to Java configuration classes: one class == one section. Then we can add a heading and group description to the class itself. But for now, at least in the datanode, we have (almost) everything crammed into one configuration file. Given this configuration part:

#### OpenSearch JWT token usage
#
# communication between Graylog and OpenSearch is secured by JWT. These are the defaults used for the token usage
# adjust them, if you have special needs.
#
# indexer_jwt_auth_token_caching_duration = 60s
# indexer_jwt_auth_token_expiration_duration = 180s

I see a JwtTokenConfiguration.java class with its own header "OpenSearch JWT token usage" and "Communication between Graylog and OpenSearch is secured by JWT. These are the defaults used for the token usage. Adjust them if you have special needs." as its description.

This would be rather simple to implement but would require significant changes in the configuration classes and all of their usages. Is that something we'd consider? I personally think it would offer benefits for the code base as well. Right now we drag the whole configuration everywhere or, even worse from the maintenance perspective, use named injects, which make any refactoring a lot harder and more error-prone.

My suggestion: go without sections for now. Split the datanode configuration into more specific config beans. Add section documentation to the beans and see if that works well for us.

May 14 '24 12:05 todvora

Updated the generated configuration in the PR description.

May 15 '24 07:05 todvora

Our server config uses a space character around the =. I think that makes it more readable

Easy to implement, done.

:+1: Thanks!

The generator currently doesn't support ordering. In the server config, we put important settings that users must set at the top of the file. That makes them easy to spot.

That happens already. Mandatory properties without a default value, the ones users need to fill in, are placed at the top of the configuration file. Otherwise, the order follows the order of the properties in the Java config beans. Reordering in Java leads to reordering in the config file. That seems a natural way to handle this.

Ah cool. Sorry, I have missed that. I think my eye caught the node_id_file setting, which is currently at the top of the server conf file and is now somewhere down below. Old habits. :smile:

In the server config we comment the setting with its default value without a space between the comment character and the setting name. That makes the setting easier to spot.

Fixed

:+1: Thanks!

The generator currently doesn't support sections to create groups of related settings.

Indeed. IDK how to add this in the current situation without significant overhead in the Java config files. From my POV, the sections should correspond to Java configuration classes: one class == one section. Then we can add a heading and group description to the class itself. But for now, at least in the datanode, we have (almost) everything crammed into one configuration file. Given this configuration part:

#### OpenSearch JWT token usage
#
# communication between Graylog and OpenSearch is secured by JWT. These are the defaults used for the token usage
# adjust them, if you have special needs.
#
# indexer_jwt_auth_token_caching_duration = 60s
# indexer_jwt_auth_token_expiration_duration = 180s

I see a JwtTokenConfiguration.java class with its own header "OpenSearch JWT token usage" and "Communication between Graylog and OpenSearch is secured by JWT. These are the defaults used for the token usage. Adjust them if you have special needs." as its description.

This would be rather simple to implement but would require significant changes in the configuration classes and all of their usages. Is that something we'd consider? I personally think it would offer benefits for the code base as well. Right now we drag the whole configuration everywhere or, even worse from the maintenance perspective, use named injects, which make any refactoring a lot harder and more error-prone.

My suggestion: go without sections for now. Split the datanode configuration into more specific config beans. Add section documentation to the beans and see if that works well for us.

I think we could add an optional "section" value to the @Documentation annotation. It's not perfect because it's repetitive, but it would be an easy way to emulate the structure of the server config file. It's okay to generate the file without sections for the Data Node config. But before we auto-generate the server config, we should add a way to use sections.

May 16 '24 07:05 bernd

Added a @DocumentationSection annotation that allows defining a heading and description at both class and field level. Fields with the same heading are grouped together. This allows simulating sections even for a flat class structure, at the price of annotation repetition. The clean solution would still be to separate sections into standalone configuration classes. But now we have some options and can emulate a lot without massive changes to the configuration structure.
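
As a rough illustration of the grouping described here (not the PR's code), the sketch below uses a stand-in annotation named `DocSection`; its attribute names `heading` and `description` are assumptions, as are the `FlatConfig` bean and the `group` helper. Fields without their own annotation fall back to the class-level heading.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the PR's @DocumentationSection; attribute names are assumed.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.FIELD})
@interface DocSection {
    String heading();
    String description() default "";
}

// Hypothetical flat config bean: one class-level section, one field-level section.
@DocSection(heading = "GRAYLOG DATANODE CONFIGURATION FILE")
class FlatConfig {
    String nodeName;

    @DocSection(heading = "OpenSearch JWT token usage",
            description = "Communication between Graylog and OpenSearch is secured by JWT.")
    String indexerJwtAuthTokenCachingDuration = "60 seconds";

    // Repeating the heading puts this field into the same section.
    @DocSection(heading = "OpenSearch JWT token usage")
    String indexerJwtAuthTokenExpirationDuration = "180 seconds";
}

class SectionGroupingSketch {
    // Maps each section heading to the names of the fields that belong to it.
    static Map<String, List<String>> group(Class<?> type) {
        DocSection top = type.getAnnotation(DocSection.class);
        String topHeading = top == null ? "" : top.heading();

        // LinkedHashMap keeps sections in the order they are first seen.
        Map<String, List<String>> sections = new LinkedHashMap<>();
        for (Field field : type.getDeclaredFields()) {
            if (field.isSynthetic()) {
                continue; // ignore compiler- or tool-generated fields
            }
            DocSection section = field.getAnnotation(DocSection.class);
            String heading = section == null ? topHeading : section.heading();
            sections.computeIfAbsent(heading, h -> new ArrayList<>()).add(field.getName());
        }
        return sections;
    }

    public static void main(String[] args) {
        group(FlatConfig.class).forEach((heading, fields) -> System.out.println(heading + " -> " + fields));
    }
}
```

The repetition of the heading on every field is the cost mentioned above; a dedicated configuration class per section would carry the heading once.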

The generated datanode.conf in the PR description is now updated with the latest changes. It contains two levels of sections: a top level with the overall heading and description, and a second-level JWT section that is defined at field level.

Jun 19 '24 09:06 todvora

@bernd could you please have a look if this sections support is sufficient? Thanks!

Jun 25 '24 08:06 todvora

@bernd could you please have a look if this sections support is sufficient? Thanks!

I currently don't have time for a review. The approach looks good to me, though. :+1:

Jun 26 '24 12:06 bernd

@bernd could you please have a look if this sections support is sufficient? Thanks!

I currently don't have time for a review. The approach looks good to me, though. 👍

I will do the review

Jun 26 '24 12:06 moesterheld

In general, automatic generation works.

However, I think that sections should be generated differently. Currently, we show them like this:

###############
# HTTP settings
###############

#### HTTP bind address
#
# The network interface used by the Graylog HTTP interface.
#
# This network interface must be accessible by all Graylog nodes in the cluster and by all clients
# using the Graylog web interface.
#
# If the port is omitted, Graylog will use port 9000 by default.
#
# Default: 127.0.0.1:9000
#http_bind_address = 127.0.0.1:9000
#http_bind_address = [2001:db8::1]:9000

Here, HTTP settings is the section header, and HTTP bind address would be something like a heading for the parameter.
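
The layout in this example could be produced by two small helpers, sketched below with hypothetical names: `banner` draws a `#` rule as wide as the section title line, and `parameterHeading` renders a `####` heading followed by the description as comment lines. This only illustrates the requested output format and is not taken from the generator.

```java
import java.util.ArrayList;
import java.util.List;

class SectionBannerSketch {
    // Renders a section header framed by '#' rules of matching width.
    static List<String> banner(String title) {
        String line = "# " + title;
        String rule = "#".repeat(line.length());
        return List.of(rule, line, rule);
    }

    // Renders a parameter heading plus its description as comment lines.
    // Empty description lines become a bare '#', as in the example layout.
    static List<String> parameterHeading(String heading, List<String> descriptionLines) {
        List<String> out = new ArrayList<>();
        out.add("#### " + heading);
        out.add("#");
        for (String text : descriptionLines) {
            out.add(text.isEmpty() ? "#" : "# " + text);
        }
        return out;
    }

    public static void main(String[] args) {
        banner("HTTP settings").forEach(System.out::println);
        System.out.println();
        parameterHeading("HTTP bind address", List.of(
                "The network interface used by the Graylog HTTP interface.",
                "",
                "Default: 127.0.0.1:9000")).forEach(System.out::println);
        System.out.println("#http_bind_address = 127.0.0.1:9000");
    }
}
```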

Jul 01 '24 09:07 moesterheld