Sudden increase in epollWait() CPU utilization after a particular commit
Greetings! I am a master's student researching performance diagnosis. My current focus is the evolution of distributed systems and the performance regressions that occur during their development. As part of this research, I am examining Apache Ignite as a case study.
For this study, I used the YCSB benchmark (https://github.com/brianfrankcooper/YCSB/tree/master/ignite), configured to initialize with 100,000 records and then perform 1,000,000 update operations. I ran the test with a single thread against 3 nodes on a single machine, each node using a different port. The average update latency increased by approximately 10% from version 2.7.6 to version 2.14.0 (compiled locally with JDK 8), and I am trying to understand the cause of this increase. For your reference, the detailed configuration of one of the nodes is included below.
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2018 YCSB contributors. All rights reserved.
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!--
Ignite Spring configuration file to startup Ignite cache.
This file demonstrates how to configure cache using Spring. Provided cache
will be created on node startup.
Use this configuration file when running HTTP REST examples (see 'examples/rest' folder).
When starting a standalone node, you need to execute the following command:
{IGNITE_HOME}/bin/ignite.{bat|sh} examples/config/example-cache.xml
When starting Ignite from Java IDE, pass path to this file to Ignition:
Ignition.start("examples/config/example-cache.xml");
-->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="walMode" value="LOG_ONLY"/>
                <property name="storagePath" value="/data/dbignite2.10.0-SNAPSHOT"/>
                <property name="walPath" value="/data/walignite2.10.0-SNAPSHOT"/>
                <property name="walArchivePath" value="/data/walarchignite2.10.0-SNAPSHOT"/>
                <property name="walHistorySize" value="1"/>
                <property name="metricsEnabled" value="true"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_data_region"/>
                        <property name="persistenceEnabled" value="false"/>
                        <!-- Setting the max size of the default region to 10GB. -->
                        <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
                        <!-- Setting the initial size of the default region to 10GB. -->
                        <property name="initialSize" value="#{10L * 1024 * 1024 * 1024}"/>
                        <property name="checkpointPageBufferSize" value="#{1L * 1024 * 1024 * 1024}"/>
                        <property name="metricsEnabled" value="true"/>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="cacheConfiguration">
            <list>
                <bean class="org.apache.ignite.configuration.CacheConfiguration">
                    <property name="name" value="usertable"/>
                    <property name="atomicityMode" value="ATOMIC"/>
                    <property name="cacheMode" value="PARTITIONED"/>
                    <property name="backups" value="1"/>
                    <property name="writeSynchronizationMode" value="FULL_SYNC"/>
                </bean>
            </list>
        </property>
        <!-- Explicitly configure TCP discovery SPI to provide list of initial nodes. -->
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="47500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <!-- The list of hosts includes client host. -->
                                <!--<value><hostname_or_IP>:47500..47509</value>-->
                                <!--<value><hostname_or_IP>:47500..47509</value>-->
                                <value>10.1.0.16:47500</value>
                                <value>10.1.0.16:47501</value>
                                <value>10.1.0.16:47502</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>
Next, I profiled the runtime with JFR and found a significant increase in the profiling samples of sun.nio.ch.EPollArrayWrapper.epollWait(). Specifically, the overall number of method profiling samples increased by approximately 2000 (around 10% of the overall samples for v2.7.6, which matches the latency increase ratio). Within that, samples of epollWait() increased by around 5500, while samples of java.net.PlainSocketImpl.socketAccept() decreased by approximately 3900. To gather this data, I ran ten rounds of the benchmark described above on an Ubuntu 18.04.1 LTS system with an x86_64 architecture, 16 cores, and 132GB of memory.
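As a rough check on how these sample counts fit together, here is a minimal Python sketch. It uses only the approximate figures quoted above; the variable names are my own, and the "implied baseline" assumes the 10% figure is exact, which it is not.

```python
# Approximate sample-count changes from v2.7.6 to v2.14.0, as quoted above.
overall_delta = 2000         # change in overall method profiling samples
epoll_wait_delta = 5500      # sun.nio.ch.EPollArrayWrapper.epollWait()
socket_accept_delta = -3900  # java.net.PlainSocketImpl.socketAccept()

# The overall increase is stated to be about 10% of the v2.7.6 total,
# so the baseline total is roughly overall_delta / 0.10 (assumption: exactly 10%).
baseline_total = overall_delta / 0.10

# Net change contributed by the two methods together, and the part of the
# overall change that must come from all other methods combined.
net_two_methods = epoll_wait_delta + socket_accept_delta
remaining_delta = overall_delta - net_two_methods

print(f"implied v2.7.6 total samples: {baseline_total:.0f}")
print(f"epollWait + socketAccept net change: {net_two_methods:+d}")
print(f"change left for all other methods: {remaining_delta:+d}")
```

Under these approximate figures, the two methods together account for most of the overall change in samples, with a comparatively small remainder spread across everything else.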
Based on these findings, I concluded that epollWait() is the primary cause of the latency increase. I then tried to narrow the issue down to the commit(s) responsible. Commit 1094fff shows approximately 3400 more epollWait() samples than its parent, which accounts for about 61.82% (3400 of 5500) of the total increase in epollWait() samples from v2.7.6 to v2.14.0. However, latency did not increase significantly relative to the parent commit, as socketAccept() samples decreased by approximately 3400 at the same commit.
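The share attributed to commit 1094fff can be reproduced with a short Python sketch. The inputs are the approximate sample counts quoted above, and the variable names are hypothetical.

```python
# Approximate sample-count changes quoted above.
epoll_wait_total_delta = 5500       # epollWait(), v2.7.6 -> v2.14.0
epoll_wait_commit_delta = 3400      # epollWait(), commit 1094fff vs. its parent
socket_accept_commit_delta = -3400  # socketAccept(), commit 1094fff vs. its parent

# Fraction of the total epollWait() increase that appears at this one commit.
share = epoll_wait_commit_delta / epoll_wait_total_delta

# Net change of the two methods at this commit; a value near zero is
# consistent with latency staying flat relative to the parent.
net_at_commit = epoll_wait_commit_delta + socket_accept_commit_delta

print(f"share of epollWait() increase: {share:.2%}")
print(f"net change at commit: {net_at_commit:+d}")
```

The two changes cancel at this commit, which suggests that samples moved from socketAccept() to epollWait() rather than new work being added, though the sample counts alone cannot confirm that.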
At this point, I would like to understand why this particular commit caused such a significant increase in epollWait() samples. The commit appears to only disable JMX monitoring, but since I lack context, I am seeking suggestions on where to focus my further study. Can you please provide any advice or recommendations?