deep-code-search: about converting text to numbers

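In the Keras version, the vocabulary size in the config is set to 10000. In the pkl file you provided, dict['<s>']=0, dict['</s>']=0 and dict['UNK']=1, so <s> and </s> are effectively the same entry, which gives len=10000+1. In the Keras version's convert function, return [vocab.get(w, 0) for w in words] uses 0 as the default for unknown words when converting text to numbers, but UNK is 1 in the pkl. Also, return pad_sequences(data, maxlen=len, padding='post', truncating='post', value=0) pads with 0, while in the pkl 0 is the id of <s> and </s>. Does that match the meaning of padding? Is there a problem here? I am new to NLP, so I am not sure my understanding is correct.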

As I understand it, padding should use a dedicated pad token, as in the data for your PyTorch version: pad=0, <s>=1, </s>=2, <unk>=3, where pad is 0. That matches the json data.

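For the Keras version, when I convert text to numbers (text -> id mapping -> pkl) with dict['<s>']=0, dict['</s>']=0, dict['UNK']=1, would it be reasonable to change the functions as follows?
(1) return [vocab.get(w, 1) for w in words], changing the default from vocab.get(w, 0) to vocab.get(w, 1)
(2) return pad_sequences(data, maxlen=len, padding='post', truncating='post', value=0), here treating the padding value as the id of <s> and </s>?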
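
To make the two proposed changes concrete, here is a minimal sketch in plain Python. The toy vocabulary, the convert signature, and the pad helper are assumptions for illustration only; pad is a hand-written stand-in for pad_sequences with padding='post' and truncating='post', not the repository's actual code.

```python
# Hypothetical toy vocabulary that mirrors the pkl layout described above.
vocab = {'<s>': 0, '</s>': 0, 'UNK': 1, 'read': 2, 'file': 3}
UNK_ID = vocab['UNK']  # 1 in the pkl
PAD_ID = 0             # shares id 0 with <s> and </s> in the pkl

def convert(vocab, words):
    """Map words to ids, sending out-of-vocabulary words to UNK_ID."""
    if isinstance(words, str):
        words = words.strip().lower().split(' ')
    return [vocab.get(w, UNK_ID) for w in words]

def pad(ids, maxlen, value=PAD_ID):
    """Truncate at the end, then pad at the end, to exactly maxlen ids."""
    ids = ids[:maxlen]
    return ids + [value] * (maxlen - len(ids))

ids = convert(vocab, 'read the file')  # 'the' is not in the toy vocabulary
padded = pad(ids, 6)
print(ids, padded)
```

With this change an unknown word no longer collides with the padding id, although padding still shares id 0 with <s> and </s>.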

May 20 '20 09:05 hellokitty753159

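Thanks for pointing this out. This is indeed a problem, and it has already been fixed in the PyTorch version. Originally pad, <s> and </s> shared id 0 because the padding is already indicated by seq_len, and <s> and </s> normally do not appear consecutively, so using the same symbol should have little effect. The model also has no decoder, so these two tokens play almost no role.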
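
The seq_len argument can be illustrated with a small sketch. It assumes the model only looks at the first seq_len positions of each sequence (here a simple masked average); masked_mean and the sample values are hypothetical and are not taken from the repository.

```python
def masked_mean(values, seq_len):
    """Average only the first seq_len positions; padded positions are ignored."""
    valid = values[:seq_len]
    return sum(valid) / len(valid)

# Same real tokens, different values in the padded tail.
PAD_AS_ZERO = [5.0, 7.0, 0.0, 0.0]
PAD_AS_NINE = [5.0, 7.0, 9.0, 9.0]
seq_len = 2

print(masked_mean(PAD_AS_ZERO, seq_len), masked_mean(PAD_AS_NINE, seq_len))
```

Under that assumption the value stored in the padded positions never reaches the result, which is why sharing id 0 was expected to be harmless. If any layer reads the padded positions without a mask, the argument no longer holds.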

May 21 '20 02:05 guxd

Then in the convert function, return [vocab.get(w, 0) for w in words] maps unknown words to 0, which conflicts with pad, <s> and </s>. Should it be changed to vocab.get(w, 1)? (UNK is 1 in the pkl you provided.)

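I have seen that your PyTorch version already fixes these issues. However, my current work is all based on the Keras version, so I plan to validate the results in Keras first and then move to the PyTorch version. I would appreciate your guidance on the question above. If convenient, could you share the code that the Keras and PyTorch versions use to convert the desc.txt text data into the vocabulary mapping, so that I can check whether my processing is correct? These details seem to matter quite a bit. Thank you.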

May 21 '20 15:05 hellokitty753159

Sure. Try changing it to vocab.get(w,1) and see whether the results improve. The code for building the vocabulary is below.

from collections import Counter

def create_dictionary(lang_file, vocabsize):
    # Count how often each space-separated token occurs in the corpus.
    with open(lang_file, 'r', encoding='latin1') as input_file:
        print("Counting words in %s" % lang_file)
        counter = Counter()
        for line in input_file:
            words = line.strip().split(' ')
            counter.update(words)
    print("%d unique words with a total of %d words."
          % (len(counter), sum(counter.values())))
    # Reserved ids: padding, sentence start, sentence end, unknown word.
    vocab = {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3}
    num_reserved = len(vocab)
    # Keep the most frequent words so the vocabulary has at most vocabsize entries.
    vocab_count = counter.most_common(vocabsize - num_reserved)
    print("Creating dictionary of %s most common words, covering "
          "%2.1f%% of the text."
          % (vocabsize, 100.0 * sum([count for word, count in vocab_count]) /
             sum(counter.values())))
    for i, (word, count) in enumerate(vocab_count):
        if word == '':  # skip empty tokens from blank lines or repeated spaces
            continue
        vocab[word] = i + num_reserved
    return vocab

May 23 '20 14:05 guxd

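I found a problem: when the 10000 most common words are used as the vocabulary, the vocabulary of the apiseq sequences turns out to be especially large. When I convert text to numbers (because the .h5 data stores numbers) with def sent2ids(sents): if type(sents) == str: words = sents.strip().lower().split(' ') return [vocab.get(w, word2id['<unk>']) for w in words], I find that a large share of tokens fall outside the 10000 words, that is, they get unk=3. For the API data the log shows: Creating dictionary of 10000 most common words, covering 76.2% of the text. All the other files are above 95%. Do you have any suggestions for this? Because apiseq is ordered, some APIs appear several times within one method, for example around conditional statements such as if/else, and there are other cases as well. How did you handle this?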
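
One way to quantify this is to measure coverage and the resulting unk rate for several vocabulary sizes. The sketch below is only an illustration: the coverage helper and the toy corpus are hypothetical, and it assumes four reserved ids as in create_dictionary.

```python
from collections import Counter

def coverage(counter, vocabsize, num_reserved=4):
    """Fraction of all tokens covered by the (vocabsize - num_reserved) most common words."""
    total = sum(counter.values())
    kept = counter.most_common(vocabsize - num_reserved)
    return sum(count for _, count in kept) / total

# Toy API-sequence corpus (hypothetical data).
lines = ['a b a c', 'a b d e', 'a f']
counter = Counter(w for line in lines for w in line.split(' '))

for size in (5, 6, 10):
    cov = coverage(counter, size)
    # Every token outside the kept words is mapped to <unk>.
    print("vocabsize=%d coverage=%.1f%% unk=%.1f%%"
          % (size, 100 * cov, 100 * (1 - cov)))
```

Running the same loop over the real apiseq counter would show how quickly coverage grows past 10000 entries.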

May 28 '20 05:05 hellokitty753159

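In this case, consider keeping the top 20000 APIs. Getting the coverage above 90% should be enough. Conditional statements were omitted, and consecutive duplicate APIs were deduplicated during preprocessing. You could also add control statements as special symbols.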
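
The deduplication step could look roughly like the sketch below. It is an assumption about the preprocessing rather than the original script; dedupe_consecutive and the sample sequence are made up for illustration.

```python
from itertools import groupby

def dedupe_consecutive(apis):
    """Collapse each run of identical consecutive API calls into one occurrence."""
    return [api for api, _ in groupby(apis)]

# Hypothetical API sequence extracted from one method.
seq = ['File.exists', 'File.read', 'File.read', 'File.read',
       'File.close', 'File.read']
deduped = dedupe_consecutive(seq)
print(deduped)
```

Only adjacent repeats are merged, so an API that reappears later in the method is kept and the order of the sequence is preserved.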

May 29 '20 04:05 guxd