Code Dataset

Open mtasic85 opened this issue 1 year ago • 1 comments

Hi there, great work!

Do you have plans for a code dataset, and if so, when can we expect it?

mtasic85 avatar Oct 07 '24 10:10 mtasic85

Hi @mtasic85, thank you for your interest in ProX! We will try it on code data in the coming days; however, I can't confirm the exact timeline yet. Unlike our web and math data, which largely come from web documents, code data mainly comes from GitHub, which requires a different approach to downloading, organizing, and training. We're still figuring out how to conduct some initial experiments.

koalazf99 avatar Oct 08 '24 12:10 koalazf99

@mtasic85 If you are still interested in a large-scale, high-quality code dataset, you may find our new MegaMath dataset helpful, especially the megamath-code subset. Although our primary goal was to find high-quality math-related code, I think overall it is highly relevant to your needs as well. We've also included more synthetic code data in MegaMath, so feel free to take a look!

We also give many details in our paper; please see: https://hf.co/papers/2504.02807

koalazf99 avatar Apr 09 '25 13:04 koalazf99