
[QUERY] Download speed using datalake client


Query/Question: Is there an upper limit when using the datalake client to download a large blob? When I use AzCopy to download a blob, the speed can reach 1Gb/s or even 5GB/s on the same VM. But when I use the datalake FileClient to download a file, it only reaches 60+MB/s, even after changing the block size to 100MB from the default 4MB. Could you please give more insights or suggestions on how to speed up downloading a blob?

BTW, I am using a VM with sufficient resources in the same region as the Storage Account.

Here's part of my code; it just reads the bytes from the InputStream.

    DataLakeFileSystemClient fileSystemClient = serviceClient.getFileSystemClient("download");
    DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient("test");
    DataLakeFileClient fileClient = directoryClient.getFileClient("2gb.test");

    File file = new File("K:\\10gbdown.test" + UUID.randomUUID());

    DataLakeFileInputStreamOptions options = new DataLakeFileInputStreamOptions();
    options.setBlockSize(100 * 1024 * 1024); // 100MB instead of the default 4MB

    // openInputStream reads the file as a single sequential stream
    try (InputStream inputStream =
            new BufferedInputStream(fileClient.openInputStream(options).getInputStream())) {

        long start = System.currentTimeMillis();
        inputStream.readAllBytes();
        // OutputStream targetStream = new FileOutputStream(file);
        // fileClient.read(targetStream);
        // targetStream.close();

        System.out.println(System.currentTimeMillis() - start);
    } catch (IOException e) {
        e.printStackTrace();
    }

Why is this not a Bug or a feature Request?

Setup (please complete the following information if applicable):

  • OS: Windows
  • IDE: VS Code
  • Library/Libraries: com.azure:azure-storage-file-datalake (latest)

Information Checklist Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report

  • [ ] Query Added
  • [ ] Setup information Added

chuckz1321 avatar Aug 08 '22 02:08 chuckz1321

To be more specific: the customer would like to read the stream into their client. But DataLakeFileInputStreamOptions can only configure a limited set of parameters; it does not have parallelism parameters like the blob client does. Or do we need to modify some configuration in the Netty HTTP client, like the buffer size?

chuckz1321 avatar Aug 08 '22 09:08 chuckz1321
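
For context, the streaming path really does expose only a few knobs. Below is a minimal sketch of that options surface, assuming the setters on DataLakeFileInputStreamOptions in recent azure-storage-file-datalake 12.x releases (block size, request conditions, consistent read control; ConsistentReadControl comes from azure-storage-common):

    DataLakeFileInputStreamOptions options = new DataLakeFileInputStreamOptions();
    options.setBlockSize(8 * 1024 * 1024);                         // bytes fetched per range request
    options.setRequestConditions(new DataLakeRequestConditions()); // ETag/lease preconditions
    options.setConsistentReadControl(ConsistentReadControl.ETAG);  // fail the read if the file changes mid-stream
    // There is no concurrency setter on this path: ranges are fetched one at a
    // time, which is why the stream tops out well below a parallel download.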

Hi, @chuckz1321. If the main concern is performance, I would recommend using the downloadToFile method and then reading the file from disk into memory. It's true that disk->memory is an extra step, but it'll be overshadowed by the speedup gained from the downloadToFile method. The inputStream you're using reads the entire blob sequentially whereas downloadToFile will download chunks in parallel. The latter is more akin to the strategy azcopy uses. Can you please give this a try and let us know if you're still having issues?

rickle-msft avatar Aug 08 '22 17:08 rickle-msft
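
For readers who want to stay on the DataLake client rather than dropping to BlobClient: the parallel, file-targeted download there is readToFile / readToFileWithResponse on DataLakeFileClient. A minimal sketch, assuming the 12.x readToFileWithResponse overload that accepts ParallelTransferOptions; the path and tuning values are placeholders:

    ParallelTransferOptions pto = new ParallelTransferOptions();
    pto.setBlockSizeLong(8L * 1024 * 1024); // size of each parallel range request
    pto.setMaxConcurrency(16);              // range requests in flight at once

    String path = "K:\\" + UUID.randomUUID() + "-2gb.test";
    long start = System.currentTimeMillis();
    // nulls fall back to defaults: whole-file range, default retry options, no
    // request conditions, default OpenOptions, no timeout; false skips per-range MD5
    fileClient.readToFileWithResponse(path, null, pto, null, null, false,
            null, null, Context.NONE);
    System.out.println(System.currentTimeMillis() - start + " ms, " + path);

The next comment tries the same idea through BlobClient against the same storage account.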

    BlobContainerClient blobContainerClient = blobServiceClient.getBlobContainerClient("download");
    BlobClient blobClient = blobContainerClient.getBlobClient("test/2gb.test");

    String filename = "K:\\" + UUID.randomUUID() + "2gb.test";
    BlobDownloadToFileOptions options = new BlobDownloadToFileOptions(filename);

    ParallelTransferOptions parallelTransferOptions = new ParallelTransferOptions();
    parallelTransferOptions.setMaxConcurrency(100);
    parallelTransferOptions.setBlockSizeLong(100L * 1024 * 1024);
    // Note: setMaxSingleUploadSizeLong only applies to uploads; it has no effect on a download.
    parallelTransferOptions.setMaxSingleUploadSizeLong(100L * 1024 * 1024);
    options.setParallelTransferOptions(parallelTransferOptions);

    long start = System.currentTimeMillis();
    blobClient.downloadToFileWithResponse(options, null, Context.NONE);
    System.out.println(System.currentTimeMillis() - start + ", " + filename);

The download speed reached 200MB/s when I used the code above, but it is still much slower than AzCopy. Could you please give more advice?

chuckz1321 avatar Aug 09 '22 02:08 chuckz1321

Realistically, you won't see the same performance using the sdk that you will using azcopy. Azcopy, being its own process, is designed specifically to be hyper-performant, because it can make assumptions and set expectations that an sdk can't; an sdk has to integrate into an app environment. The downloadToFile option is the most performant option in the sdk, and from this point the best thing to do is to run some experiments that vary the block size and concurrency levels and see what values work best in your environment.

rickle-msft avatar Aug 09 '22 21:08 rickle-msft
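
A minimal sketch of that experiment, reusing the blobClient from the snippet above; the candidate block sizes and concurrency levels below are arbitrary starting points, not recommendations:

    long[] blockSizes = {8L * 1024 * 1024, 32L * 1024 * 1024, 100L * 1024 * 1024};
    int[] concurrencies = {8, 32, 100};

    for (long blockSize : blockSizes) {
        for (int concurrency : concurrencies) {
            String filename = "K:\\" + UUID.randomUUID() + "-bench.test";
            ParallelTransferOptions pto = new ParallelTransferOptions();
            pto.setBlockSizeLong(blockSize);
            pto.setMaxConcurrency(concurrency);
            BlobDownloadToFileOptions options = new BlobDownloadToFileOptions(filename);
            options.setParallelTransferOptions(pto);

            long start = System.currentTimeMillis();
            blobClient.downloadToFileWithResponse(options, null, Context.NONE);
            long elapsedMs = System.currentTimeMillis() - start;
            System.out.println(blockSize + " B x " + concurrency + " -> " + elapsedMs + " ms");

            new File(filename).delete(); // avoid filling the disk between runs
        }
    }

Running each combination more than once and averaging helps smooth out network variance.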

@rickle-msft Could you please be more specific about "it can make assumptions and expectations", if possible? What assumptions and expectations does AzCopy make?

chuckz1321 avatar Aug 10 '22 02:08 chuckz1321

I don't work on azcopy, so I can't get too specific, but I know it can be greedier in the way it uses memory and concurrency resources. Because it is a dedicated app, it can assume it is free to use system resources more liberally or to have a slightly larger package size. Conversely, an sdk needs to have a minimal footprint on the size of the final application, and it must be more moderate in its resource usage because of other workloads an app may be doing.

rickle-msft avatar Aug 10 '22 17:08 rickle-msft

Hi @chuckz1321

Following up on this thread. Has your issue been resolved? If so, we can go ahead and close this thread. If you're still running into some issues or have additional questions, can you please let us know what blockers you are running into?

Thanks!

ibrahimrabab avatar Sep 20 '22 21:09 ibrahimrabab

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Jan 20 '23 20:01 ghost