Issue with pre-processing the Outbrain dataset
Hi,
I tried to process the Outbrain dataset with `java -cp StreamingRec.jar org.streamingrec.data.loading.ReadOutbrain --input-folder=<folder_to_outbrain_files> --out-items=<path_to_item_output_file> --out-clicks=<path_to_clicks_output_file> --publisher=43` and got the Events.csv. The first 5 rows are as follows:
| Publisher | Category | ItemID | Cookie | Timestamp | keywords |
|---|---|---|---|---|---|
| 1707327 | 18787 | 1465876800836 | NaN | NaN | NaN |
| 1513276 | 134357 | 1465876801429 | NaN | NaN | NaN |
| 1766890 | 38208 | 1465876801758 | NaN | NaN | NaN |
| 1513276 | 136834 | 1465876802078 | NaN | NaN | NaN |
| 830700 | 2429 | 1465876802136 | NaN | NaN | NaN |
I reviewed the code and found that Publisher = document_id, Category = user_id, and ItemID = timestamp, which is really peculiar. Could you please explain why the dataset is processed this way? Why is the Timestamp NaN instead of the true timestamp? Thanks a lot.
Hi Hao,
Sorry for the late reply. It seems the notification mail I got from GitHub went to the spam folder.
I had a look at the code, but I cannot find any line where these assignments (e.g., Category = user_id) happen. If you are still interested in this topic, please send me a line number so that I can investigate further.
Best, Michael