JS16_ProjectA

Classify importance of Data (Characters, Houses,...) by Centrality measure

Open sacdallago opened this issue 9 years ago • 54 comments

Hi there,

as mentioned around and in other teams, it would be extremely useful to have a measure of importance of a character, house, and everything that can be scraped from the wiki and has a unique link.

This can be done by creating a graph of in- and outgoing links via https://www.npmjs.com/package/ngraph.centrality and then storing the in-degree, out-degree, and betweenness centralities as separate values. Just be careful: the centralities are graph dependent, so you need to think about whether you want one mega-graph or separate graphs for characters, houses, ... Both are correct; it's a matter of choice. Also think about how the package handles multiple links from A -> B.
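A minimal, dependency-free sketch of the idea described above. It does not use the ngraph.centrality API; it only illustrates what in-degree, out-degree, and betweenness (via Brandes' algorithm, unweighted and directed) compute on a link graph. The function names, the `collapseDuplicates` option, and the sample links are assumptions made for illustration.

```javascript
// Build an adjacency map (node -> array of link targets) from [from, to] pairs.
// collapseDuplicates decides how repeated links A -> B are treated:
// true keeps a single edge, false keeps every occurrence.
function buildGraph(links, { collapseDuplicates = true } = {}) {
  const out = new Map();
  const ensure = (node) => { if (!out.has(node)) out.set(node, []); };
  for (const [from, to] of links) {
    ensure(from);
    ensure(to);
    if (collapseDuplicates && out.get(from).includes(to)) continue;
    out.get(from).push(to);
  }
  return out;
}

// In- and out-degree for every node.
function degreeCentrality(graph) {
  const result = new Map();
  for (const node of graph.keys()) result.set(node, { in: 0, out: 0 });
  for (const [from, targets] of graph) {
    result.get(from).out += targets.length;
    for (const to of targets) result.get(to).in += 1;
  }
  return result;
}

// Betweenness: how many shortest paths between other nodes pass through a node.
function betweennessCentrality(graph) {
  const score = new Map();
  for (const node of graph.keys()) score.set(node, 0);
  for (const source of graph.keys()) {
    const stack = [];
    const preds = new Map();
    const sigma = new Map(); // number of shortest paths from source
    const dist = new Map();  // BFS distance from source, -1 = unreached
    for (const n of graph.keys()) {
      preds.set(n, []);
      sigma.set(n, 0);
      dist.set(n, -1);
    }
    sigma.set(source, 1);
    dist.set(source, 0);
    const queue = [source];
    while (queue.length > 0) {
      const v = queue.shift();
      stack.push(v);
      for (const w of graph.get(v)) {
        if (dist.get(w) < 0) {
          dist.set(w, dist.get(v) + 1);
          queue.push(w);
        }
        if (dist.get(w) === dist.get(v) + 1) {
          sigma.set(w, sigma.get(w) + sigma.get(v));
          preds.get(w).push(v);
        }
      }
    }
    // Accumulate dependencies in order of decreasing distance from source.
    const delta = new Map();
    for (const n of graph.keys()) delta.set(n, 0);
    while (stack.length > 0) {
      const w = stack.pop();
      for (const v of preds.get(w)) {
        const share = (sigma.get(v) / sigma.get(w)) * (1 + delta.get(w));
        delta.set(v, delta.get(v) + share);
      }
      if (w !== source) score.set(w, score.get(w) + delta.get(w));
    }
  }
  return score;
}

// Hypothetical wiki links; note the repeated Jon_Snow -> Eddard_Stark link.
const sampleLinks = [
  ['Jon_Snow', 'Eddard_Stark'],
  ['Jon_Snow', 'Eddard_Stark'],
  ['Eddard_Stark', 'Robert_Baratheon'],
  ['Arya_Stark', 'Eddard_Stark'],
];
const sampleGraph = buildGraph(sampleLinks);
console.log(degreeCentrality(sampleGraph).get('Eddard_Stark'));
console.log(betweennessCentrality(sampleGraph).get('Eddard_Stark'));
```

Whether repeated links should raise a page's in-degree is exactly the modelling choice raised above, so the sketch leaves it as an explicit option rather than deciding it.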

Last but not least, some other teams have already been working on these things, so @Rostlab/js_cs_sose_2016_students please raise your voice. For now we have been wrongly advising this as page rank but that's just one flavor of centrality measure.

EDIT: Centrality is: https://en.wikipedia.org/wiki/Centrality

sacdallago avatar Mar 20 '16 14:03 sacdallago

Is this a call for help/collaboration with another team, or is it an issue for Project_A?

boriside avatar Mar 20 '16 18:03 boriside

I did something . I took the data from Guy's wiki scraper and generated some jsons from it. I then matched this with the character list provided by and ordered by score.

The code can be found here . It's not great, but it works. Maybe this helps someone...

kajo404 avatar Mar 20 '16 19:03 kajo404

It's a call for collab if someone has done something about it. But I think I described quite extensively how you could implement this, regardless of other people's work. So :D you know... :D

sacdallago avatar Mar 21 '16 09:03 sacdallago

Well, we have just gathered some images for the characters, we created paths for and which are most important/popular. Those images can be found here: https://github.com/Rostlab/JS16_ProjectC_Group10/tree/develop/mockup/img/persons

AlexBeischl avatar Mar 21 '16 16:03 AlexBeischl

@AlexBeischl that's still not really what I had in mind :) but again, a start: @kordianbruck @Adiolis can you assign someone except you two or @togiberlin to take care of this one? Assign as in, this issue is assigned.

sacdallago avatar Mar 21 '16 21:03 sacdallago

I already asked on the Facebook group who wants to do this. So far, there are no volunteers. I am really not able to do this as well. :angry: Should we really just assign someone, @sacdallago? :smile:

Legenzoo avatar Mar 22 '16 17:03 Legenzoo

Is this even a part of Project A?

Legenzoo avatar Mar 22 '16 18:03 Legenzoo

Yeah, I don't think this is in the project scope of A, nor can it be done by the 25th. I vote to move this issue to 'someday'.

kordianbruck avatar Mar 22 '16 18:03 kordianbruck

I actually think it is within the scope of A, but since this requirement showed up late in the game we can defer it to the next version.

In the meantime, please integrate the data from here: https://rostlab.org/~gyachdav/awoiaf/Data/pageRank/allchars.tgz

It's not perfect, but it at least gives a measure of which characters are referenced more than others. I believe the range is 1-300 (unknown to popular). All you need to do here is read the first item in the array, pick up the page_name from it to identify the character, and assign that character the "score" value.

We really need this popularity measure in, to be able to sort the characters by the most important ones. Otherwise we run into a case where we show a bunch of negligible characters on the character portal.
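A rough sketch of the extraction step described above, assuming each file in the archive parses to a JSON array whose first item is an object carrying both `page_name` and `score`. That layout is an assumption based on the description here, not something checked against the actual archive, and the function names and sample values are made up.

```javascript
// Pull { name, score } out of one parsed file, or null if the shape differs
// from the assumed layout (array whose first item has page_name and score).
function toCharacterScore(parsedFile) {
  if (!Array.isArray(parsedFile) || parsedFile.length === 0) return null;
  const first = parsedFile[0];
  if (!first || typeof first.page_name !== 'string' || typeof first.score !== 'number') {
    return null;
  }
  return { name: first.page_name, score: first.score };
}

// Convert all parsed files and sort so the most referenced characters come first.
function rankCharacters(parsedFiles) {
  return parsedFiles
    .map(toCharacterScore)
    .filter((entry) => entry !== null)
    .sort((a, b) => b.score - a.score);
}

// Made-up sample data in the assumed layout.
const sampleFiles = [
  [{ page_name: 'Minor_Character', score: 3 }],
  [{ page_name: 'Major_Character', score: 250 }],
  [], // an empty file is skipped
];
console.log(rankCharacters(sampleFiles));
```

Files that do not match the assumed shape are skipped rather than given a default score, so a missing score is not confused with a low one.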

gyachdav avatar Mar 22 '16 18:03 gyachdav

@gyachdav thank you. We will do that.

Legenzoo avatar Mar 22 '16 19:03 Legenzoo

I have implemented the updating of the pageRanks. With every refill, update etc. of the characters the pageRanks are added.

To just add the pageRanks and do nothing more, run `npm run updatePageRanks --update=characters`. The pageRanks.json in the data/ directory is then added to the db. With the `--file=dir/file.json` option, one can change which JSON file is used.

To create this JSON from the directory that Guy provided, run `npm run updatePageRanks --dir=PATHTODIR`. This transforms the many _data files into a single JSON file that contains the names and scores of the characters. With the `--to=dir/file.json` option, one can define which JSON file the result is saved to.

So, @kordianbruck, please run `npm run updatePageRanks --update=characters` on the public server.

Legenzoo avatar Mar 23 '16 17:03 Legenzoo

But this is not done with the fancy alg I suggested above, right @Adiolis? In that case please leave this open with the sometime milestone and no assignee :P

sacdallago avatar Mar 23 '16 20:03 sacdallago

@Adiolis can you please run quick stats on all characters and report min, max, median, mean, and stddev for "pageRank"? It will help with interpreting the level of importance of a character. A histogram of pageRank would also be very helpful. Thanks!
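The requested statistics could be computed along these lines. This is a sketch over a plain array of pageRank numbers; the function names, the bin width, and the sample values are assumptions, and the standard deviation is the population form (dividing by n).

```javascript
// Summary statistics for a non-empty array of numbers.
function summarize(values) {
  if (values.length === 0) throw new Error('no values to summarize');
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, v) => sum + v, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population variance: average squared distance from the mean.
  const variance = sorted.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  return { min: sorted[0], max: sorted[n - 1], median, mean, stddev: Math.sqrt(variance) };
}

// Histogram as a Map from bin start to count; bin [k, k + binWidth).
function histogram(values, binWidth) {
  const bins = new Map();
  for (const v of values) {
    const start = Math.floor(v / binWidth) * binWidth;
    bins.set(start, (bins.get(start) || 0) + 1);
  }
  return new Map([...bins].sort((a, b) => a[0] - b[0]));
}

// Made-up pageRank values, only to show the call shape.
const sampleRanks = [0, 2, 4, 4, 60, 300];
console.log(summarize(sampleRanks));
console.log(histogram(sampleRanks, 50));
```

With a long tail of barely referenced characters, the median tends to say more than the mean about where a "typical" character sits.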

gyachdav avatar Mar 25 '16 13:03 gyachdav

@gyachdav I used pagerank for PLOD: min = 0, max = 300. I normalized the values, and only around 300 characters have a normalized rank over 0.1.
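The thread does not say how the normalization was done. One common choice that fits the numbers quoted here (min 0, max 300, threshold 0.1) is min-max scaling, sketched below as an assumption. The field names, function names, and sample characters are hypothetical.

```javascript
// Min-max scaling: map each pageRank linearly onto [0, 1].
// Assumed method; the normalization actually used is not stated in the thread.
function normalizeRanks(characters) {
  const ranks = characters.map((c) => c.pageRank);
  const min = Math.min(...ranks);
  const max = Math.max(...ranks);
  const span = max - min;
  return characters.map((c) => ({
    name: c.name,
    // If every rank is identical there is nothing to scale, so use 0.
    normalized: span === 0 ? 0 : (c.pageRank - min) / span,
  }));
}

// Keep only characters above a popularity threshold (0.1 in the thread).
function popular(normalized, threshold = 0.1) {
  return normalized.filter((c) => c.normalized > threshold);
}

// Made-up sample values.
const sampleCharacters = [
  { name: 'A', pageRank: 0 },
  { name: 'B', pageRank: 15 },
  { name: 'C', pageRank: 60 },
  { name: 'D', pageRank: 300 },
];
console.log(popular(normalizeRanks(sampleCharacters)).map((c) => c.name));
```

Under this scaling a raw score of 60 maps to 0.2, which is above the 0.1 threshold.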

Hack3l avatar Mar 25 '16 13:03 Hack3l

thanks can you post here your normalized ranking?

gyachdav avatar Mar 25 '16 13:03 gyachdav

I'm basically interested in cases like this https://got-api.bruck.me/api/characters/Tormund where the pageRank is rather low, but the character still plays a prominent role in the show and so gets tweets.

gyachdav avatar Mar 25 '16 13:03 gyachdav

Here is my normalized pagerank. 60 is not that low once normalized; everything above 0.1 is pretty popular. pagerank_normalized_json.txt

Hack3l avatar Mar 25 '16 13:03 Hack3l

Thanks @Hack3l!

@Adiolis see yourself as excused from this task :smile:

gyachdav avatar Mar 25 '16 14:03 gyachdav

As mentioned before, the "pagerank" is far from perfect, and so characters like http://awoiaf.westeros.org/index.php/Betharios_of_Braavos have mysteriously got top rank. Can you please rescan the list and halve the page rank points if the character does not have an image associated with it? That will make sure that all minor characters are placed back in their proper place.
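The halving hack could look roughly like this. It is a sketch only: `pageRank` is the field name used in this thread, while `imageLink` is a hypothetical name for wherever the character's image is stored, and the sample values are made up.

```javascript
// Halve the pageRank of every character that has no image.
// imageLink is a hypothetical field name; adjust to the real schema.
function penalizeMissingImages(characters) {
  return characters.map((c) => {
    const hasImage = typeof c.imageLink === 'string' && c.imageLink.length > 0;
    // Return copies so the input list is left untouched.
    return hasImage ? { ...c } : { ...c, pageRank: c.pageRank / 2 };
  });
}

// Made-up sample values.
const sampleList = [
  { name: 'With image', pageRank: 60, imageLink: '/img/example.jpg' },
  { name: 'Without image', pageRank: 300 },
];
console.log(penalizeMissingImages(sampleList));
```

The penalty should be applied to the raw scores on each import, not to already penalized values, otherwise repeated runs would keep halving the same characters.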

@adiolis would you do the honor?

@kordianbruck promised to improve this one day. When this happens we would get rid of this nasty hack.

gyachdav avatar Mar 26 '16 23:03 gyachdav

@gyachdav Sure, i can do that. Not sure, when, because i am running into two exams, but i will find some time in between my learning sessions :stuck_out_tongue: .

Legenzoo avatar Mar 26 '16 23:03 Legenzoo

thanks and good luck!

gyachdav avatar Mar 26 '16 23:03 gyachdav

@gyachdav Okay, that was very easy and so i have done it right now ;)

@kordianbruck Please run `npm run updatePageRanks --update=characters`

Legenzoo avatar Mar 26 '16 23:03 Legenzoo

How do I access this updated data?

kajo404 avatar Mar 27 '16 00:03 kajo404

@Hack3l once the data is updated can you also update your normalized rankings and post it here?

gyachdav avatar Mar 27 '16 01:03 gyachdav

Yes, it would be nice if you notify me once it is updated.

Hack3l avatar Mar 27 '16 08:03 Hack3l

@kajo404 there should be API calls for this

sacdallago avatar Mar 27 '16 09:03 sacdallago

I know I can get the page rank from A but our system works with the normalized page ranks from @Hack3l so an update of that would be nice

kajo404 avatar Mar 28 '16 18:03 kajo404

Data updated!

kordianbruck avatar Mar 29 '16 11:03 kordianbruck

Cool, now I am only waiting for @Hack3l to post a link to the updated normalized ranks.

kajo404 avatar Mar 29 '16 15:03 kajo404

Here are the updated normalized ranks. pagerank_normalized_json.txt

Hack3l avatar Mar 30 '16 11:03 Hack3l