Issue of pre-processing the outbrain dataset

Open nightlyjourney opened this issue 6 years ago • 1 comments

Hi,

I processed the Outbrain dataset with 'java -cp StreamingRec.jar org.streamingrec.data.loading.ReadOutbrain --input-folder=<folder_to_outbrain_files> --out-items=<path_to_item_output_file> --out-clicks=<path_to_clicks_output_file> --publisher=43' and got Events.csv. The first 5 rows are as follows:

Publisher Category ItemID Cookie Timestamp keywords
1707327 18787 1465876800836 NaN NaN NaN
1513276 134357 1465876801429 NaN NaN NaN
1766890 38208 1465876801758 NaN NaN NaN
1513276 136834 1465876802078 NaN NaN NaN
830700 2429 1465876802136 NaN NaN NaN

I reviewed the code and found that Publisher = document_id, Category = user_id, and ItemID = timestamp, which is really peculiar. Could you please explain why the dataset is processed this way? Why is the Timestamp NaN instead of the true timestamp? Thanks a lot.
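
One way to narrow this down is to check whether the header and the data rows in the output file have the same number of fields. The sketch below is only a diagnostic aid and is not part of StreamingRec. The helper name `field_count_report`, the comma delimiter, and the sample text are all assumptions made for illustration; the real Events.csv may use a different delimiter or layout. A header with more names than the rows have values would be one possible explanation for trailing NaN columns, but nothing in this thread confirms it.

```python
import csv
import io


def field_count_report(text, delimiter):
    """Return the header fields and a {field_count: row_count} summary.

    Hypothetical helper: it only counts fields per row and does not
    interpret the values in any way.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    counts = {}
    for row in reader:
        counts[len(row)] = counts.get(len(row), 0) + 1
    return header, counts


# Illustrative sample only: a six-name header followed by three-value rows.
# The delimiter and the row layout are assumptions, not the actual file format.
sample = (
    "Publisher,Category,ItemID,Cookie,Timestamp,keywords\n"
    "1707327,18787,1465876800836\n"
    "1513276,134357,1465876801429\n"
)

header, counts = field_count_report(sample, ",")
print(len(header), counts)
```

If the header length and the row field counts differ for the real file, the column names are shifted relative to the values, and the mismatch comes from the file layout or from the reader settings rather than from the values themselves.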

nightlyjourney, Oct 10 '19 23:10

Hi Hao,

Sorry for the late reply. It seems the notification mail I got from GitHub went to the spam folder.

I had a look at the code, but I cannot find any line where these assignments (e.g., Category = user_id) happen. If you are still interested in this topic, please send me a line number so that I can investigate further.

Best, Michael

mjugo, Jun 26 '20 18:06