
WackyPedia corpus processing

Open aneesha opened this issue 12 years ago • 7 comments

The code for ExtractUCIStats.scala seems to process the tab-delimited combined corpus rather than the external WackyPedia corpus. Is there a newer version of ExtractUCIStats.scala that uses WackyPedia?

Many thanks.

aneesha avatar Apr 18 '13 09:04 aneesha

Another method I implemented to compute UCI co-occurrence stats is to:

  1. Download the latest Wikipedia dump, pages-articles.xml.bz2 (9GB compressed, 42GB uncompressed).
  2. Index it into Solr. I'll be happy to give more detail about my config files if needed.
  3. Query Solr with pairs of words to get the number of documents containing both:
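import re

import requests

# Cache of counts already fetched, keyed by a word or a (w1, w2) pair
remote_occurrences = {}

def solr_occurrence(w1, w2=None):
    """Return the number of indexed documents containing w1 (and w2, if given)."""
    # Caching to avoid querying the same word or pair several times
    key = (w1, w2) if w2 else w1
    if key in remote_occurrences:
        return remote_occurrences[key]

    solr_url = "http://localhost:8080/wikipedia/select"

    # '+' marks a term as required, so both words must appear in the document
    params = {'q': '+' + w1, 'rows': 0}
    if w2:
        params['q'] += ' +' + w2

    # Read the hit count from the numFound attribute of Solr's XML response
    response = requests.get(solr_url, params=params).text
    docs_found = re.findall('numFound="([0-9]+)"', response)
    count = int(docs_found[0])
    remote_occurrences[key] = count
    return count


words = ['france', 'cheese', 'baguette']
for w1 in words:
    for w2 in words:
        print(w1, w2, solr_occurrence(w1, w2))

The output below comes from the version as first posted, whose signature had a stray leading `self` parameter. That shifted the arguments so that only the second word of each pair was queried, which is why every count depends on the second word alone: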

france france 345552
france cheese 15048
france baguette 412
cheese france 345552
cheese cheese 15048
cheese baguette 412
baguette france 345552
baguette cheese 15048
baguette baguette 412

qpleple avatar May 10 '13 15:05 qpleple

thanks @qpleple

aneesha avatar Nov 07 '13 13:11 aneesha

@qpleple it would be great if you could share your solr config. thanks

aneesha avatar Nov 08 '13 01:11 aneesha

@qpleple the method you suggest does not use a sliding window. Is my interpretation correct?

aneesha avatar Nov 09 '13 11:11 aneesha

@aneesha I no longer have access to the server I was running Solr on, so I can't help you with the config files. But I remember the guide http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia was pretty clear.

What do you mean by sliding window?

qpleple avatar Nov 09 '13 11:11 qpleple

@qpleple I have got Solr working and indexed the Wikipedia corpus. I now need to integrate this with https://github.com/fozziethebeat/TopicModelComparison/blob/master/src/main/scala/edu/ucla/sspace/ExtractUCIStats.scala but in the code (see line 35) and the paper a sliding window of 20 tokens is used.

aneesha avatar Nov 09 '13 11:11 aneesha

@aneesha Indeed, my method doesn't use a sliding window. Two words co-occur if they are present in the same document, even if they are far apart.

qpleple avatar Nov 09 '13 11:11 qpleple
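
To make the difference discussed above concrete, here is a minimal sketch contrasting document-level co-occurrence (what the Solr queries count) with sliding-window co-occurrence (what the thread says ExtractUCIStats.scala and the paper use, with a window of 20 tokens). The function names, the toy corpus, and the choice to count each pair at most once per window are assumptions made for illustration; this is not the repository's implementation.

```python
from collections import Counter
from itertools import combinations


def document_cooccurrence(docs):
    """Count, per unordered word pair, the documents that contain both words."""
    counts = Counter()
    for tokens in docs:
        # Each pair is counted once per document, however far apart the words are.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


def window_cooccurrence(docs, window=20):
    """Count, per unordered word pair, the sliding windows that contain both words."""
    counts = Counter()
    for tokens in docs:
        # Assumption: a document shorter than the window yields a single window.
        n_windows = max(1, len(tokens) - window + 1)
        for start in range(n_windows):
            vocab = set(tokens[start:start + window])
            for pair in combinations(sorted(vocab), 2):
                counts[pair] += 1
    return counts


# Hypothetical toy corpus: one tokenized document.
docs = [["france", "cheese", "x", "x", "x", "baguette"]]

doc_counts = document_cooccurrence(docs)
win_counts = window_cooccurrence(docs, window=3)

print(doc_counts[("baguette", "france")])  # 1: same document
print(win_counts[("baguette", "france")])  # 0: never within 3 tokens of each other
print(win_counts[("cheese", "france")])    # 1: adjacent, so they share one window
```

Pairs are stored as alphabetically sorted tuples so that (w1, w2) and (w2, w1) share one counter. On a Wikipedia-sized corpus the window variant has to read the token stream itself, which document counts from a search index cannot provide.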