Question on going from synthetic data generation to training data
After a synthetic proof is generated, how is it used as training data? For example, if a synthetic proof has N auxiliary constructions between the premise statement s_1 and the conclusion statement s_{N+1}, would you make multiple training data entries from this proof by taking the intermediate statement after each auxiliary construction, one construction at a time? I.e., you would generate N data entries from this single N-construction proof (I also sketch my reading in code after the list below).
If each data entry follows this format:
- <s_1> <s_{N+1}> <auxiliary construction 1>
- <s_2> <s_{N+1}> <auxiliary construction 2>
- <s_3> <s_{N+1}> <auxiliary construction 3>
- ...
- <s_N> <s_{N+1}> <auxiliary construction N>, where s_{k+1} is the statement obtained after applying auxiliary construction k to <s_k>
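
To make my reading concrete, here is a minimal sketch of how I imagine one N-construction proof being split into N entries. This is my guess, not the paper's actual code; `SyntheticProof` and `make_training_entries` are hypothetical names I made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class SyntheticProof:
    premises: str             # serialized premise statement s_1
    conclusion: str           # serialized conclusion statement s_{N+1}
    constructions: list[str]  # auxiliary constructions 1..N, in order
    intermediate: list[str]   # s_2 .. s_{N+1}: statement after each construction


def make_training_entries(proof: SyntheticProof) -> list[tuple[str, str, str]]:
    """Return one (<s_k>, <s_{N+1}>, <construction k>) entry per construction."""
    states = [proof.premises] + proof.intermediate[:-1]  # s_1 .. s_N
    return [
        (state, proof.conclusion, construction)
        for state, construction in zip(states, proof.constructions)
    ]


# Example: a proof with two constructions yields two entries.
proof = SyntheticProof(
    premises="s_1",
    conclusion="s_3",
    constructions=["construction 1", "construction 2"],
    intermediate=["s_2", "s_3"],
)
print(make_training_entries(proof))
# [('s_1', 's_3', 'construction 1'), ('s_2', 's_3', 'construction 2')]
```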
I believe the paper says it generated 100M proofs, of which 9M have at least one auxiliary construction, and it later says the fine-tuning used 9M examples. How are the proofs with multiple constructions handled? I assume the transformer model only predicts one auxiliary construction at a time. I might have misunderstood this part, so please let me know if I can clarify my questions. Thanks so much for your help.
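
For context on that assumption, this is how I picture the inference-time loop, with exactly one construction emitted per model call. The callables `engine_proves` and `propose_construction` are placeholders I invented, not the released API:

```python
from typing import Callable


def solve(premises: str,
          conclusion: str,
          engine_proves: Callable[[str, str], bool],
          propose_construction: Callable[[str, str], str],
          max_constructions: int = 16) -> bool:
    """Alternate between the symbolic engine and the language model:
    if the engine cannot close the proof, ask the model for exactly one
    auxiliary construction, append it to the premises, and try again."""
    for _ in range(max_constructions):
        if engine_proves(premises, conclusion):
            return True
        premises = premises + "; " + propose_construction(premises, conclusion)
    return engine_proves(premises, conclusion)


# Dummy stand-ins just to show the call shape.
print(solve("premises", "conclusion",
            engine_proves=lambda p, c: "aux" in p,
            propose_construction=lambda p, c: "aux point X"))
```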
How do you generate the random proofs in the first place? I mean concretely, not conceptually. Is there a code snippet?
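
In case it helps to show what level of concreteness I mean, here is a toy, self-contained sketch of the pipeline as I understand it from the paper: sample random premises, forward-chain a symbolic deduction closure, then trace back from each derived fact to its minimal supporting premises. Everything here (the fact/rule encoding, `deduce_closure`, `traceback`) is invented for illustration and is not the actual implementation:

```python
import random

# Toy Horn-clause rules: frozenset of antecedent facts -> consequent fact.
RULES = [
    (frozenset({"cong A B C D", "cong C D E F"}), "cong A B E F"),
    (frozenset({"para A B C D", "para C D E F"}), "para A B E F"),
    (frozenset({"cong A B E F", "para A B E F"}), "eqratio A B E F"),
]

PREMISE_POOL = ["cong A B C D", "cong C D E F", "para A B C D", "para C D E F"]


def deduce_closure(premises):
    """Forward-chain until no new facts; record which facts derived each fact."""
    known = set(premises)
    parents = {}  # derived fact -> antecedent facts used to derive it
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in RULES:
            if antecedents <= known and consequent not in known:
                known.add(consequent)
                parents[consequent] = antecedents
                changed = True
    return known, parents


def traceback(fact, parents):
    """Minimal set of premises and intermediate facts supporting `fact`."""
    if fact not in parents:  # the fact is itself a premise
        return {fact}
    needed = {fact}
    for parent in parents[fact]:
        needed |= traceback(parent, parents)
    return needed


# Sample random premises, deduce, and turn each derived fact into a record of
# (conclusion, premises actually used). In the real pipeline, premises used in
# the proof but not mentioned in the conclusion statement would become the
# auxiliary constructions.
random.seed(0)
premises = random.sample(PREMISE_POOL, k=3)
known, parents = deduce_closure(premises)
for conclusion in known - set(premises):
    support = traceback(conclusion, parents)
    used_premises = sorted(support & set(premises))
    print(conclusion, "<=", used_premises)
```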