Ziyue Huang

20 comments by Ziyue Huang

This is not a killer problem: the backend can switch to a faster kernel whenever the attention pattern has an optimized handcrafted kernel. Block-sparse attention does not seem very appealing to me, ...

Waiting for https://github.com/apache/incubator-mxnet/pull/19387 to be merged.

benchmark script:

```python
import numpy as np
from numpy.testing import assert_allclose
import mxnet as mx
from gluonnlp.attention_cell import masked_softmax, MultiHeadAttentionCell, MultiHeadSlidingWindowAttentionCell
import time

def test_multi_head_sliding_window_dot_attention_cell():
    def gen_sliding_window_mask_full(batch_size, seq_length, w, symmetric, ...
```
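For readers of the snippet: a minimal sketch of what a dense sliding-window mask generator like the truncated `gen_sliding_window_mask_full` above presumably computes (my reconstruction, not the actual helper): query position `i` may attend to keys in `[i - w, i + w]` when `symmetric`, and `[i - w, i]` otherwise.

```python
import numpy as np

# Assumed reconstruction, not the actual gluon-nlp helper: a dense boolean
# mask where query position i attends to keys inside its sliding window.
def gen_sliding_window_mask(batch_size, seq_length, w, symmetric=True):
    mask = np.zeros((batch_size, seq_length, seq_length), dtype=bool)
    for i in range(seq_length):
        lo = max(0, i - w)
        hi = min(seq_length, i + w + 1) if symmetric else i + 1
        mask[:, i, lo:hi] = True
    return mask

# Small demo: a banded attention pattern for seq_length=5, w=1.
print(gen_sliding_window_mask(1, 5, 1)[0].astype(int))
```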

Just found that the API behavior on the BytePS master branch changed recently... Not sure whether that is intended or a bug. Tracked here: https://github.com/bytedance/byteps/issues/292.

Sorry for the late reply. @szha The core dump due to undefined symbols is fixed as of the 0820 wheel, though I didn't record which symbols were undefined. For the segfault, below is the...

May I ask where "1.2 Structured Streaming: Analysis of Output Modes" can be found now?

@ymjiang Hi, did you test the accuracy of the BERT model trained with https://github.com/byteps/examples/blob/master/mxnet/bert-large? It seems that in this script the NSP loss is normalized (by batch_size) on each...
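To make the normalization concern concrete, a toy arithmetic check (illustrative numbers, not code from the script): if each worker normalizes its NSP loss by its local batch_size and the results are then summed across workers, the total over-counts the intended global average by a factor of num_workers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, local_batch_size = 4, 8
# Per-worker NSP losses, each already normalized by the local batch_size.
local_means = [rng.random(local_batch_size).mean() for _ in range(num_workers)]

summed = sum(local_means)           # summing across workers
global_mean = summed / num_workers  # the intended global average
# The sum over-counts by exactly num_workers.
print(summed, num_workers * global_mean)
```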

Why not multiply the gradients by `num_workers` at the end of `_allreduce_grads`? That would let this API compute the sum, which is consistent with the previous API (and...
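A quick check of the arithmetic behind this suggestion (illustrative numbers, not BytePS code): an averaging all-reduce followed by multiplying `num_workers` back reproduces the summed gradient exactly.

```python
import numpy as np

num_workers = 4
grads = [np.random.rand(3) for _ in range(num_workers)]  # one gradient per worker

averaged = sum(grads) / num_workers  # what the averaging all-reduce returns
recovered = averaged * num_workers   # multiply num_workers back at the end
assert np.allclose(recovered, sum(grads))
```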

Let me summarize the API behavior before and after this PR; feel free to correct me if I make any mistakes :)

- `allreduce_grads` computes the average instead of the...

https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/__init__.py#L201 only takes effect in `step` or `update`, so `bps.trainer.allreduce_grads` will compute the sum. Calling `allreduce_grads` followed by `update` is allowed by the MXNet API and heavily used in gluon-nlp.
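For concreteness, roughly the pattern I mean, sketched with BytePS's `DistributedTrainer` and gluonnlp's `clip_grad_global_norm` (a sketch assuming a launched BytePS job; the model and batch are placeholders): since the rescale linked above applies only inside `update`, the clipping step in between observes the summed gradients.

```python
import mxnet as mx
import gluonnlp as nlp
import byteps.mxnet as bps

bps.init()  # assumes the script is started via the BytePS launcher
net = mx.gluon.nn.Dense(2)  # placeholder model
net.initialize()
params = net.collect_params()
trainer = bps.DistributedTrainer(params, 'sgd', {'learning_rate': 0.01})
loss_fn = mx.gluon.loss.SoftmaxCrossEntropyLoss()

data = mx.nd.random.uniform(shape=(4, 8))  # placeholder batch
label = mx.nd.array([0, 1, 0, 1])
with mx.autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()

trainer.allreduce_grads()  # gradients hold the cross-worker sum at this point
nlp.utils.clip_grad_global_norm(
    [p for p in params.values() if p.grad_req != 'null'], 1.0)  # clipping sees the sum
trainer.update(4)  # the 1/num_workers rescale applies only here
```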