Code Dataset
Hi there, great work!
Do you have plans for a code dataset, and if so, when can we expect it?
Hi @mtasic85, thank you for your interest in ProX! We will try it on code data in the coming days; however, I can't confirm the exact timeline yet. Unlike our web and math data, which largely come from web documents, code data mainly comes from GitHub, which requires a different approach to downloading, organizing, and training. We're still figuring out how to conduct some initial experiments.
@mtasic85 If you are still interested in a large-scale, high-quality code dataset, you may find our new MegaMath dataset helpful, especially the megamath-code subset. Although our primary goal was to find high-quality math-related code, I think overall it is highly relevant to your needs as well. We've also included more synthetic code data in MegaMath, so feel free to take a look!
We also give many details in our paper; please see: https://hf.co/papers/2504.02807