
[BUG] ApacheHttpClient5Transport hangs when content is too large

Open hendryluk opened this issue 1 year ago • 8 comments

What is the bug?

When using the Java client with ApacheHttpClient5Transport, large requests that exceed the AWS network limit (e.g. 10 MB on m6g.large.search) cause the request to hang and permanently block the thread. In actuality, AWS responds with 400 (or 413 if the Content-Length header is present in the request; it's missing in this case, but that's a separate issue).

This bug only affects ApacheHttpClient5Transport; the legacy RestClientTransport behaves correctly (i.e. throws an exception). It also works with curl (i.e. returns a 400 or 413 response).

How can one reproduce the bug?

Steps to reproduce the behavior.

  1. Create an AWS OpenSearch Service domain, with data nodes of an instance type that has a 10 MB network quota, e.g. m6g.large.search
  2. Send a bulk request larger than 10 MB, using the following JUnit test:
public class BulkLimitTest {
    private static final String OPENSEARCH_URL = "https://my-domain.ap-southeast-2.es.amazonaws.com";
    private static final String INDEX = "sheep-index";
    private OpenSearchClient client;

    @Before
    public void setup() throws IOException, URISyntaxException {
        ConnectionConfig connectionConfig = ConnectionConfig.custom()
                .setConnectTimeout(5L, TimeUnit.SECONDS)
                .setSocketTimeout(5, TimeUnit.SECONDS)
                .build();
        PoolingAsyncClientConnectionManager connectionManager = PoolingAsyncClientConnectionManagerBuilder
                .create()
                .setDefaultConnectionConfig(connectionConfig)
                .build();
        ApacheHttpClient5Transport transport = ApacheHttpClient5TransportBuilder.builder(HttpHost.create(OPENSEARCH_URL))
                .setHttpClientConfigCallback(builder -> builder.setConnectionManager(connectionManager))
                .build();

        client = new OpenSearchClient(transport);
        client.indices().delete(d -> d.index(INDEX).ignoreUnavailable(true));
        client.indices().create(r -> r.index(INDEX));
    }

    @Test
    public void testBulkLimit() throws IOException {
//        final int SIZE_MB = 8; // ---> this works
        final int SIZE_MB = 10; // ---> this hangs
        BulkRequest request = generateRequest(SIZE_MB * 1024 * 1024);
        client.bulk(request);
    }

    private BulkRequest generateRequest(int size) {
        BulkRequest.Builder builder = new BulkRequest.Builder();
        final String SAMPLE = """
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed efficitur in metus quis lacinia.
            Nullam vel blandit lacus. Nam ornare purus nibh, et varius nunc finibus non. 
        """;

        int unitSize = SAMPLE.length();
        for (int i=0, s=0; s < size; i++, s+=unitSize) {
            String id = "sheep-" + i;
            builder.operations(o -> o.index(d -> d
                    .index(INDEX)
                    .id(id)
                    .document(Map.of("content", SAMPLE))));
        }
        return builder.build();
    }
}
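(Editor's note: a quick standalone sanity check of the sizing loop above — pure JDK, no client dependencies. The real bulk request is somewhat larger still, since each document also carries JSON and bulk-action metadata, so the payload comfortably crosses the quota:)

```java
public class SizeEstimate {
    public static void main(String[] args) {
        final String SAMPLE = """
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed efficitur in metus quis lacinia.
            Nullam vel blandit lacus. Nam ornare purus nibh, et varius nunc finibus non. 
            """;
        int unitSize = SAMPLE.length();
        int target = 10 * 1024 * 1024; // the 10 MB quota on m6g.large.search

        // Same loop condition as generateRequest: keep adding operations until s >= size
        int ops = 0;
        for (int s = 0; s < target; s += unitSize) {
            ops++;
        }
        System.out.println(ops * unitSize >= target); // true: raw sample text alone exceeds the quota
    }
}
```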

What is the expected behavior?

Throws an OpenSearchException or ResponseException with a 400 or 413 error.

What is the actual behavior?

The thread hangs forever. No exceptions, no timeouts.

What is your host/environment?

  • Tested on:
    • Mac OS (M1)
    • EC2 Amazon Linux
  • AWS OpenSearch Service
    • engine version: 2.11
    • data-nodes: 3 x m6g.large.search
  • opensearch-java client version: 2.10.1

Do you have any screenshots?

No. See the JUnit code above.

Do you have any additional context?

Only affects ApacheHttpClient5Transport. RestClientTransport, curl, Postman, etc. all behave correctly, i.e. produce 400 or 413 errors.

hendryluk avatar Jul 19 '24 06:07 hendryluk

Looks like a bug :( Do you think you can try to turn your repro into a (failing) unit test with a server mock response?

dblock avatar Jul 19 '24 13:07 dblock

The issue indeed seems related to the quota setting (at least it is not reproducible locally):

Create an AWS OpenSearch Service domain, with data nodes of an instance-type with 10MB network quota, e.g. m6g.large.search

@dblock I may need your help here, wondering if we could have an AWS OpenSearch Service domain for a few hours to troubleshoot the issue?

reta avatar Jul 19 '24 13:07 reta

@reta I can help offline, yes.

dblock avatar Jul 19 '24 16:07 dblock

Do you think you can try to turn your repro into a (failing) unit test with a server mock response?

@dblock Unfortunately it's not easily replicable with a mock server that returns a specific error response. The bug only occurs with AWS's specific low-level network behavior when the network quota is reached: the client sees a sequence of events that differs from the normal case, e.g. AWS closes the request stream (preventing the client from sending more data) before writing the status to the response.

So I think the best way to test this bug is against actual AWS.

hendryluk avatar Jul 21 '24 23:07 hendryluk

Force-closing the request stream in an intercepted call is usually a solution, but I am talking without trying.

dblock avatar Jul 22 '24 14:07 dblock

@hendryluk @dblock interesting, the issue is related to the HTTP protocol version being chosen:

  • RestClientTransport for 2.x always uses HTTP/1.1 (HC4), but
  • ApacheHttpClient5Transport talks HTTP/2.0, and there is no response callback (looking into why now)

Setting the HttpVersionPolicy for the client fixes the issue, but I am not sure where exactly the problem is: OS 2.x does not support HTTP/2.0, so why is the protocol negotiated this way? A gateway or LB in front?

ApacheHttpClient5Transport transport = ApacheHttpClient5TransportBuilder.builder(HttpHost.create(OPENSEARCH_URL))
     .setHttpClientConfigCallback(builder -> builder.setVersionPolicy(HttpVersionPolicy.FORCE_HTTP_1))
     .build();

reta avatar Jul 29 '24 16:07 reta
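(Editor's note: the workaround above boils down to pinning the protocol version before negotiation happens. As a self-contained illustration of the same idea using the JDK's built-in java.net.http.HttpClient — not opensearch-java or HC5, purely for comparison:)

```java
import java.net.http.HttpClient;

public class ForceHttp1 {
    public static void main(String[] args) {
        // Pin the client to HTTP/1.1 up front instead of letting the
        // TLS/ALPN handshake negotiate HTTP/2 with whatever sits in
        // front of the domain (gateway or load balancer).
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();
        System.out.println(client.version()); // HTTP_1_1
    }
}
```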

A bit more on that: the behaviour varies depending on whether the payload is chunked (by buffer size) or not (sent at once): 400 vs 413 is returned, respectively.

reta avatar Jul 30 '24 21:07 reta
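(Editor's note: the chunked-vs-at-once distinction maps to whether the request carries a Content-Length header, which matches the original report — 413 with Content-Length, 400 without. A self-contained JDK sketch, again using java.net.http rather than HC5 purely to illustrate: a body with known length is sent with Content-Length, while a streamed body of unknown length goes out chunked:)

```java
import java.io.ByteArrayInputStream;
import java.net.http.HttpRequest;

public class BodyLength {
    public static void main(String[] args) {
        // Known length -> sent with a Content-Length header (the 413 path)
        HttpRequest.BodyPublisher fixed = HttpRequest.BodyPublishers.ofString("payload");
        // Unknown length -> sent chunked, no Content-Length (the 400 path)
        HttpRequest.BodyPublisher chunked = HttpRequest.BodyPublishers.ofInputStream(
                () -> new ByteArrayInputStream("payload".getBytes()));

        System.out.println(fixed.contentLength());   // 7
        System.out.println(chunked.contentLength()); // -1 means unknown length, i.e. chunked
    }
}
```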

Thanks @reta for looking into the issue and for the workaround - we'll give that a try.

hendryluk avatar Aug 06 '24 03:08 hendryluk