
Question on turning synthetic proofs into training data

Open · shufan1 opened this issue 1 year ago · 1 comment

After a synthetic proof is generated, how is it used as training data? For example, suppose a synthetic proof has N auxiliary constructions between the premise statement s1 and the conclusion statement sN+1. Would you make multiple training entries from that one proof by peeling off one auxiliary point construction at a time, so that each intermediate statement becomes its own entry? That is, would you generate N data entries from this single N-construction proof, where each entry follows this format:

  • <s1><sN+1><auxiliary construction 1>
  • <s2><sN+1><auxiliary construction 2>
  • <s3><sN+1><auxiliary construction 3>
  • ...
  • <sN><sN+1><auxiliary construction N>

Here sk+1 denotes the statement obtained after applying auxiliary construction k to <sk>.
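The peeling scheme described above can be sketched in a few lines. This is a hypothetical illustration, not AlphaGeometry's actual code; `make_training_entries` and `apply_construction` are names invented here, and plain strings stand in for geometry statements.

```python
# Hypothetical sketch (not AlphaGeometry's actual code) of the scheme the
# question describes: an N-construction synthetic proof is split into N
# training entries, each supervising ONE auxiliary construction.

def make_training_entries(premise, conclusion, constructions, apply_construction):
    """Yield (state, conclusion, next_construction) triples, peeling one
    auxiliary construction at a time starting from the premise."""
    state = premise
    for construction in constructions:
        yield (state, conclusion, construction)  # one training entry
        state = apply_construction(state, construction)

# Toy usage with strings standing in for geometry statements:
entries = list(make_training_entries(
    "s1", "s4", ["c1", "c2", "c3"],
    lambda s, c: "s" + str(int(s[1:]) + 1),  # toy "apply": s1 -> s2 -> s3
))
# entries == [("s1", "s4", "c1"), ("s2", "s4", "c2"), ("s3", "s4", "c3")]
```

Under this reading, one proof with N constructions contributes N supervised examples, each conditioning on the current state and the fixed conclusion.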

I believe the paper says it generated 100M synthetic proofs, of which 9M have at least one auxiliary construction, and that fine-tuning used those 9M. How are the proofs with multiple constructions handled? I assume the transformer only predicts one auxiliary construction at a time, but I may have misunderstood this part. Please let me know if I can clarify my question. Thanks so much for your help.
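The one-construction-at-a-time assumption in the question corresponds to an inference loop that alternates a symbolic engine with a language model. A minimal sketch of that loop, with all names and stubs invented here (none of this is the repository's API):

```python
# Hedged sketch of the loop the question assumes: a symbolic engine tries to
# close the proof; when it cannot, a language model proposes ONE auxiliary
# construction, which is added to the state before retrying.

def solve(premises, goal, symbolic_close, propose_construction, max_aux=5):
    state = list(premises)
    for _ in range(max_aux + 1):
        if symbolic_close(state, goal):  # deduction alone reaches the goal
            return state
        state.append(propose_construction(state, goal))  # one construction
    return None  # gave up after max_aux constructions

# Toy stand-ins: "proof closes" once the state holds three facts.
result = solve(
    ["p1"], "g",
    symbolic_close=lambda state, goal: len(state) >= 3,
    propose_construction=lambda state, goal: "aux" + str(len(state)),
)
# result == ["p1", "aux1", "aux2"]
```

Each training entry in the format above would then supervise exactly one iteration of this loop.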

shufan1 · Mar 08 '24 20:03

How do you generate the random proofs in the first place? I mean concretely, not conceptually. Is there a code snippet?

ParthaEth · Mar 21 '24 16:03
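On the follow-up about generating random proofs: the paper describes sampling random premises, running a symbolic deduction engine to its closure, and then tracing back the minimal proof of any derived fact. A toy sketch of that idea, with made-up facts and rules rather than real geometry predicates:

```python
# Toy sketch of the described pipeline: run forward deduction from random
# premises to closure, recording which antecedents derived each new fact,
# then pick a derived fact as the "conclusion" and trace back its proof.
# RULES and all facts here are invented stand-ins, not geometry predicates.

RULES = [  # (antecedents, consequent)
    (("a", "b"), "c"),
    (("c",), "d"),
    (("b", "d"), "e"),
]

def deductive_closure(premises):
    known = set(premises)
    parents = {}  # fact -> antecedents used to derive it
    changed = True
    while changed:
        changed = False
        for ants, cons in RULES:
            if cons not in known and all(a in known for a in ants):
                known.add(cons)
                parents[cons] = ants
                changed = True
    return known, parents

def traceback(fact, parents):
    """Collect the fact plus every premise/intermediate fact behind it."""
    if fact not in parents:
        return {fact}  # a premise
    needed = {fact}
    for ant in parents[fact]:
        needed |= traceback(ant, parents)
    return needed

known, parents = deductive_closure(["a", "b"])
proof_facts = traceback("e", parents)  # minimal facts behind conclusion "e"
```

Each (traced-back premises, conclusion) pair would then be one synthetic theorem; whether that matches the repository's concrete implementation is exactly what this comment asks.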