TopicModelComparison WackyPedia corpus processing

The code for ExtractUCIStats.scala seems to process the tab-delimited combined corpus rather than the external WackyPedia corpus. Is there a newer version of ExtractUCIStats.scala that uses WackyPedia?

Many thanks.

Apr 18 '13 09:04 aneesha

Another method I implemented to compute UCI co-occurrence stats:

1. Download a Wikipedia dump: the latest pages-articles.xml.bz2 (9GB compressed, 42GB uncompressed).
2. Index it into Solr. I'll be happy to give more detail about my config files if needed.
3. Query Solr with pairs of words to get the number of documents containing both.

import re
import requests

# Cache of counts already fetched, keyed by a word or a sorted word pair
remote_occurrences = {}

def solr_occurrence(w1, w2=None):
    # Caching to avoid querying the same word or pair several times;
    # pairs are sorted so that (w1, w2) and (w2, w1) share one entry
    key = tuple(sorted((w1, w2))) if w2 else w1
    if key in remote_occurrences:
        return remote_occurrences[key]

    solr_url = "http://localhost:8080/wikipedia/select"

    # '+' marks a term as required; rows=0 returns only the hit count.
    # wt=xml is requested explicitly because the regex below parses XML.
    params = {'q': '+' + w1, 'rows': 0, 'wt': 'xml'}
    if w2:
        params['q'] += ' +' + w2

    response = requests.get(solr_url, params=params).text
    docs_found = re.findall('numFound="([0-9]+)"', response)
    count = int(docs_found[0])
    remote_occurrences[key] = count
    return count


# Single-word document counts; a pair is queried the same way,
# e.g. solr_occurrence('france', 'cheese')
words = ['france', 'cheese', 'baguette']
for w in words:
    print(w, solr_occurrence(w))

outputs

france 345552
cheese 15048
baguette 412
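
Document counts like these can be turned into a co-occurrence score. The sketch below assumes the UCI measure is a pairwise PMI, log(p(w1, w2) / (p(w1) p(w2))), with probabilities estimated as document counts divided by the total number of indexed documents. The function name `pmi`, the `n_docs` argument and the smoothing constant `eps` are illustrative assumptions, not taken from ExtractUCIStats.scala.

```python
import math

def pmi(count_w1, count_w2, count_both, n_docs, eps=1.0):
    """Pointwise mutual information estimated from document counts.

    eps (a hypothetical smoothing constant) is added to the joint count so
    that pairs that never co-occur get a finite score instead of log(0).
    """
    p_w1 = count_w1 / n_docs
    p_w2 = count_w2 / n_docs
    p_both = (count_both + eps) / n_docs
    return math.log(p_both / (p_w1 * p_w2))

# Made-up counts, for illustration only (not real Wikipedia numbers)
print(pmi(count_w1=100, count_w2=100, count_both=50, n_docs=1000))
```

With real data, the single-word counts would come from solr_occurrence(w), the joint count from solr_occurrence(w1, w2), and n_docs from a match-all query against the same index.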

May 10 '13 15:05 qpleple

thanks @qpleple

Nov 07 '13 13:11 aneesha

@qpleple it would be great if you could share your Solr config. Thanks.

Nov 08 '13 01:11 aneesha

@qpleple the method you suggest does not use a sliding window. Is my interpretation correct?

Nov 09 '13 11:11 aneesha

@aneesha I no longer have access to the server I was running Solr on, so I can't help you with the config files. But I remember the guide http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia was pretty clear.

What do you mean by sliding window?

Nov 09 '13 11:11 qpleple

@qpleple I have got Solr working and indexed the Wikipedia corpus. I now need to integrate this with https://github.com/fozziethebeat/TopicModelComparison/blob/master/src/main/scala/edu/ucla/sspace/ExtractUCIStats.scala but in the code (see line 35) and in the paper, a sliding window of 20 tokens is used.

Nov 09 '13 11:11 aneesha

@aneesha Indeed, my method doesn't use a sliding window. Two words co-occur if they are present in the same document, even if they are far from one another.
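
To make the difference concrete, here is a minimal sketch that counts co-occurrences both ways on already-tokenized documents. It is a generic illustration, not a port of ExtractUCIStats.scala: the function names are hypothetical, and the only detail taken from this thread is the default window of 20 tokens.

```python
from collections import Counter
from itertools import combinations

def document_cooccurrences(docs):
    """For each unordered word pair, count the documents containing both."""
    counts = Counter()
    for tokens in docs:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts

def window_cooccurrences(docs, window=20):
    """For each unordered word pair, count the sliding windows containing both."""
    counts = Counter()
    for tokens in docs:
        # Slide the window one token at a time; a document shorter than
        # the window is treated as a single window.
        n_windows = max(1, len(tokens) - window + 1)
        for start in range(n_windows):
            seen = set(tokens[start:start + window])
            for pair in combinations(sorted(seen), 2):
                counts[pair] += 1
    return counts

docs = [["france", "is", "known", "for", "cheese"]]
print(document_cooccurrences(docs)[("cheese", "france")])          # 1
print(window_cooccurrences(docs, window=3)[("cheese", "france")])  # 0
```

Pairs are stored in sorted order, so ("cheese", "france") is the key whichever word comes first. With a window of 3 tokens the two words never fall in the same window, even though they share a document.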

Nov 09 '13 11:11 qpleple