Documentation

Open gabays opened this issue 4 years ago • 1 comments

State clearly, at the beginning, which steps are compulsory and which are not.
State clearly what kind of data one will need: a. a reference set and a test set? b. one file per author? Or are multiple files OK if they start with the same name?
Number the three steps to make the order clear (and the fact that there are three steps, the second being optional).
Give an example of debug_authors.csv, feature_list.json, feats_tests.csv, langcert_revised.csv… so that we know what kind of data you expect (what is a column, what is a row…).
Move the sentence "Alternatively, you can choose to do not specific split, but to use a leave-one-out approach." to just under the title, so that it is clear that it is not a compulsory step.
Add a couple of lines on how to choose the --sampling options.
Provide an example to play with, so that people can check that everything works and observe the structure of the data.

With that you should solve a lot of problems (and avoid a lot of emails like mine)

Dec 27 '21 09:12 gabays

Here is my script:

python main.py -s train/* -t chars -n 3 
mv feats_tests_n3_k_5000.csv train.csv
python main.py -s test/* -t chars -n 3 -f feature_list_chars3grams5000mf.json
mv feats_tests_n3_k_5000.csv test.csv
python train_svm.py train.csv --test_path test.csv --norms --final

Notice that, for the first main.py, I get "K Limit ignored because the size of the list is lower (3302 < 5000)".

Then I get this error from svm.py, l. 190:

myclasses = pipe.classes_
decs = pipe.decision_function(test)
dists = {}
for myclass in enumerate(myclasses):
    dists[myclass[1]] = [d[myclass[0]] for d in decs]

-->

dists[myclass[1]] = [d[myclass[0]] for d in decs]
IndexError: invalid index to scalar variable.
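
A possible explanation, assuming pipe is a scikit-learn pipeline ending in an SVM and that the training set contains only two authors: for a binary problem, scikit-learn's decision_function returns a 1-D array with one signed score per sample (positive values favour classes_[1]), so each d is a scalar and d[myclass[0]] cannot be indexed. The sketch below is not the project's code; per_class_scores is a hypothetical helper, and reporting -d for the first class in the binary case is an illustrative choice, not something the library prescribes.

```python
def per_class_scores(classes, decs):
    """Map each class label to its list of decision scores, one per sample.

    Hypothetical helper (not part of the project). `decs` is either a
    sequence of rows (multiclass: one score per class) or a sequence of
    scalars (binary: one signed score per sample).
    """
    dists = {label: [] for label in classes}
    for d in decs:
        try:
            row = list(d)      # multiclass: d already holds one score per class
        except TypeError:
            row = [-d, d]      # binary: d is a scalar; assumed convention, see lead-in
        for label, score in zip(classes, row):
            dists[label].append(score)
    return dists


# Binary case: scalar scores no longer raise IndexError.
binary = per_class_scores(["A", "B"], [0.5, -1.0])

# Multiclass case: behaves like the original loop.
multi = per_class_scores(["A", "B", "C"], [[1, 2, 3], [4, 5, 6]])
```

If the test set really is meant to contain more than two authors, checking the number of distinct labels in train.csv would be a reasonable first step before changing any code.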

May 30 '22 20:05 EtienneFerrandi