
RedisScheduler should delete the entry from the hash after reading it; otherwise it consumes a large amount of memory

Open · janson1983 opened this issue 7 years ago · 1 comment

@Override
public synchronized Request poll(Task task) {
    Jedis jedis = pool.getResource();
    try {
        String url = jedis.lpop(getQueueKey(task));
        if (url == null) {
            return null;
        }
        String key = ITEM_PREFIX + task.getUUID();
        String field = DigestUtils.shaHex(url);
        byte[] bytes = jedis.hget(key.getBytes(), field.getBytes());
        if (bytes != null) {
            Request o = JSON.parseObject(new String(bytes), Request.class);
            jedis.hdel(key.getBytes(), field.getBytes());
            return o;
        }
        Request request = new Request(url);
        return request;
    } finally {
        pool.returnResource(jedis);
    }
}

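Once the Request has been retrieved here, its entry should be deleted from Redis (the hdel call above); otherwise it just stays behind as garbage data.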
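To make the effect of that cleanup concrete, here is a minimal in-memory sketch. It is not the webmagic code: the class `PollCleanupSketch` and its methods are hypothetical, a `Deque` stands in for the Redis list, a `HashMap` stands in for the Redis hash, and entries are keyed by the plain URL instead of its SHA-1 digest to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the scheduler's Redis list and hash.
public class PollCleanupSketch {
    // Stands in for the Redis list holding queued URLs.
    private final Deque<String> queue = new ArrayDeque<>();
    // Stands in for the Redis hash holding serialized request data.
    private final Map<String, String> items = new HashMap<>();

    // Queue a URL; store extra serialized data only when there is some.
    public void push(String url, String serialized) {
        queue.addLast(url);
        if (serialized != null) {
            items.put(url, serialized);
        }
    }

    // Take the next URL and remove its stored data in the same step,
    // so nothing is left behind once the request has been handed out.
    public String poll() {
        String url = queue.pollFirst();
        if (url == null) {
            return null;
        }
        String serialized = items.remove(url); // read and delete together
        return serialized != null ? serialized : url;
    }

    // Number of entries still held in the hash stand-in.
    public int storedItems() {
        return items.size();
    }
}
```

With a plain read (`items.get(url)`) in place of `items.remove(url)`, `storedItems()` would keep growing for as long as the crawl runs, which is the memory growth this issue describes.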

@Override
public boolean isDuplicate(Request request, Task task) {
    Jedis jedis = pool.getResource();
    try {
        return jedis.sadd(getSetKey(task), DigestUtils.md5Hex(request.getUrl())) == 0;
    } finally {
        pool.returnResource(jedis);
    }

}

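Here, request.getUrl() should also be hashed with MD5 before it is stored, which saves memory. This set is what decides whether a URL has already been crawled, so its data stays in Redis permanently; saving even a few bytes per entry adds up to a lot.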
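As a rough illustration of the size argument, the sketch below computes a hex MD5 digest with the JDK's `MessageDigest` rather than commons-codec's `DigestUtils.md5Hex`, so that it runs without extra dependencies. The class name `UrlDigestDemo` and the sample URL are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; only JDK APIs are used.
public class UrlDigestDemo {
    // Returns the MD5 digest of s as 32 lowercase hex characters.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "https://example.com/some/long/path?with=query&and=more";
        // Compare the length of the raw URL with the length of its digest.
        System.out.println(url.length() + " chars -> " + md5Hex(url).length() + " chars");
    }
}
```

The hex digest is always 32 characters, so the saving only applies to URLs longer than that; storing the raw 16-byte digest instead of its hex form would halve the size again. MD5 is used here purely as a compact fingerprint, not for security.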

janson1983 · Jun 21 '18 10:06

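Agreed. I also think converting to MD5 is the more resource-efficient approach. I raised this with the author as well, but never got a reply.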

yuweiming2016 · Mar 10 '21 02:03