Issue with pre-processing the Outbrain dataset

Open nightlyjourney opened this issue 6 years ago • 1 comments

Hi,

I tried to process the Outbrain dataset with 'java -cp StreamingRec.jar org.streamingrec.data.loading.ReadOutbrain --input-folder=<folder_to_outbrain_files> --out-items=<path_to_item_output_file> --out-clicks=<path_to_clicks_output_file> --publisher=43' and got Events.csv. The first 5 rows are as follows:

Publisher	Category	ItemID	Cookie	Timestamp	keywords
1707327	18787	1465876800836	NaN	NaN	NaN
1513276	134357	1465876801429	NaN	NaN	NaN
1766890	38208	1465876801758	NaN	NaN	NaN
1513276	136834	1465876802078	NaN	NaN	NaN
830700	2429	1465876802136	NaN	NaN	NaN

I reviewed the code and found that Publisher = document_id, Category = user_id, and ItemID = timestamp, which is really peculiar. Could you please explain why the dataset is processed this way? Why is the Timestamp NaN instead of the true timestamp? Thanks a lot.
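
One check that might help narrow this down: trailing NaN columns like the ones above would be consistent with the data rows having fewer fields than the header line, or with the file being read using a different delimiter than the one it was written with. That is only a guess, not something confirmed from the StreamingRec code. The sketch below is a minimal Python illustration that compares the number of header fields with the number of fields in the first few rows. The helper name `field_counts`, the tab delimiter, and the `sample` text (which assumes the rows really contain only three fields) are all hypothetical and not taken from the actual output file.

```python
import csv
import io


def field_counts(text, delimiter="\t", max_rows=5):
    """Return the header fields and the field count of the first data rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # first line is treated as the header
    counts = []
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        counts.append(len(row))
    return header, counts


# Hypothetical sample: a six-column header followed by three-field rows.
# This is an assumption about the file layout, not the real file content.
sample = (
    "Publisher\tCategory\tItemID\tCookie\tTimestamp\tkeywords\n"
    "1707327\t18787\t1465876800836\n"
    "1513276\t134357\t1465876801429\n"
)

header, counts = field_counts(sample)
# Rows whose field count differs from the header point to a misaligned file.
mismatched = [n for n in counts if n != len(header)]
print(len(header), counts, mismatched)
```

If the counts differ from the header length, the column names and values are simply misaligned, and the values under Publisher, Category, and ItemID may belong to other columns. Running the same check with `delimiter=","` would show whether the delimiter is the cause instead.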

Oct 10 '19 23:10 nightlyjourney

Hi Hao,

Sorry for the late reply. It seems the notification mail I got from GitHub went to the spam folder.

I had a look at the code but I cannot find any line where these assignments (e.g., Category = user_id) happen. If you are still interested in this topic, please send me a line number so that I can investigate further.

Best, Michael

Jun 26 '20 18:06 mjugo