WackyPedia corpus processing
The code for ExtractUCIStats.scala seems to process the tab-delimited combined corpus, not the external WackyPedia corpus. Is there a newer version of ExtractUCIStats.scala that uses WackyPedia?
Many thanks.
Another method I implemented to compute UCI co-occurrence stats is to:
- Download the latest Wikipedia dump, pages-articles.xml.bz2 (9GB compressed, 42GB uncompressed)
- Index it into Solr (I'll be happy to give more detail about my config files if needed)
- Query Solr with pairs of words to get the number of documents containing both words
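```python
import re
import requests

# Cache of counts already fetched, keyed by word or (w1, w2) pair
remote_occurrences = {}

def solr_occurrence(w1, w2=None):
    """Return the number of indexed documents containing w1 (and w2, if given)."""
    # Caching to avoid querying the same pair several times
    key = (w1, w2) if w2 else w1
    if key in remote_occurrences:
        return remote_occurrences[key]

    solr_url = "http://localhost:8080/wikipedia/select"
    # '+' marks a term as required, so both words must appear in the document
    params = {'q': '+' + w1, 'rows': 0}
    if w2:
        params['q'] += ' +' + w2

    response = requests.get(solr_url, params=params).text
    # The XML response carries the hit count in a numFound="..." attribute
    docs_found = re.findall('numFound="([0-9]+)"', response)
    count = int(docs_found[0])
    remote_occurrences[key] = count
    return count

words = ['france', 'cheese', 'baguette']
for w1 in words:
    for w2 in words:
        print(w1, w2, solr_occurrence(w1, w2))
```

The output below comes from a run where the function still had a leading `self` parameter, so `solr_occurrence(w1, w2)` actually queried the second word alone. Each row therefore shows the single-word document count for the second word, not the pair count:

```
france france 345552
france cheese 15048
france baguette 412
cheese france 345552
cheese cheese 15048
cheese baguette 412
baguette france 345552
baguette cheese 15048
baguette baguette 412
```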
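Document counts like these are typically turned into a pointwise mutual information (PMI) score for each word pair. The following is only a minimal sketch of that step, assuming the total number of indexed documents is known. The function name `pmi` and its parameters are hypothetical, and the numbers in the usage line are made up for illustration; none of them come from the Solr index discussed here.

```python
import math

def pmi(count_w1, count_w2, count_pair, total_docs):
    """PMI of two words from document counts: log( p(w1, w2) / (p(w1) * p(w2)) )."""
    if count_w1 <= 0 or count_w2 <= 0 or total_docs <= 0:
        raise ValueError("word counts and total_docs must be positive")
    if count_pair == 0:
        # The words never co-occur, so the log is undefined; report -inf
        return float("-inf")
    p_w1 = count_w1 / total_docs
    p_w2 = count_w2 / total_docs
    p_pair = count_pair / total_docs
    return math.log(p_pair / (p_w1 * p_w2))

# Illustrative numbers only: two words in 10 of 100 documents each, together in 10
print(pmi(10, 10, 10, 100))
```

A positive score means the two words appear together more often than they would if they were independent.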
thanks @qpleple
@qpleple it would be great if you could share your solr config. thanks
@qpleple the method you suggest does not use a sliding window. Is my interpretation correct?
@aneesha I no longer have access to the server I was running Solr on, so I can't help you with the config files. But I remember the guide http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia being pretty clear.
What do you mean by sliding window?
@qpleple I have got Solr working and indexed the Wikipedia corpus. I now need to integrate this with https://github.com/fozziethebeat/TopicModelComparison/blob/master/src/main/scala/edu/ucla/sspace/ExtractUCIStats.scala but in the code (see line 35) and the paper a sliding window of 20 tokens is used.
@aneesha Indeed, my method doesn't use a sliding window. Two words co-occur if they are present in the same document, even if they are far apart.
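To make the difference concrete, here is a rough sketch of sliding-window counting over a tokenized document. It is not the ExtractUCIStats.scala implementation. The function name `window_cooccurrences`, the whitespace tokenization in the usage lines, and the choice to treat a document shorter than the window as a single window are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def window_cooccurrences(tokens, window_size=20):
    """Count how many sliding windows contain each word and each unordered word pair."""
    word_counts = Counter()
    pair_counts = Counter()
    if not tokens:
        return word_counts, pair_counts, 0
    # Assumption: a document shorter than the window counts as one window
    n_windows = max(len(tokens) - window_size + 1, 1)
    for start in range(n_windows):
        # A set, so a word repeated inside one window is counted once
        window = set(tokens[start:start + window_size])
        word_counts.update(window)
        # Sorting gives each pair a canonical (a, b) order with a < b
        pair_counts.update(combinations(sorted(window), 2))
    return word_counts, pair_counts, n_windows

tokens = "france cheese baguette wine cheese".split()
word_counts, pair_counts, n_windows = window_cooccurrences(tokens, window_size=3)
print(n_windows, pair_counts[("baguette", "cheese")], pair_counts[("france", "wine")])
```

With document-level counting, `france` and `wine` would co-occur here because they share a document. With a window of 3 tokens they never fall in the same window, so their pair count stays at zero.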