Data samples URL?

Open futurechimp opened this issue 11 years ago • 8 comments

Hi,

I am interested in completing the tutorials, although I'm not at an Ampcamp (so I don't have access to the AMIs you're using there).

Is there anywhere I can download the Wikipedia data set you're using as the basis of the tutorials? I have looked on the Wikipedia public datasets pages but I don't see anything that looks right. A link to the dataset at the very start of the tutorials would be really helpful.

futurechimp avatar Mar 13 '14 11:03 futurechimp

Did you see these instructions? http://ampcamp.berkeley.edu/big-data-mini-course/launching-a-bdas-cluster-on-ec2.html

These point you to a script that will launch EC2 instances for you and automatically load the data; those will work even if you're not at an AMPCamp. Are those scripts not working for you?

kayousterhout avatar Mar 13 '14 15:03 kayousterhout

Hey, thanks for the pointer - I didn't realize I'd need to actually use the EC2 setup (I have Spark and Shark running locally and I was in a "run it on my machine" mindset when I asked the question).

I'm sure the scripts run fine (and will try them out to be sure), I was just wondering if that dataset is available publicly anywhere. If not, I'll grab it off the server and pull it down to my local setup.

futurechimp avatar Mar 13 '14 15:03 futurechimp

You can get the wiki stats data from S3 in the bucket s3://ampcamp-data/wikistats_20090505-01

shivaram avatar Mar 13 '14 16:03 shivaram

I'm using a local cluster too; it would be nice to have a public URL for the dataset.

petro-rudenko avatar May 05 '14 13:05 petro-rudenko

I too think it would be great to have a public URL for the datasets.

dossett avatar Jun 18 '14 18:06 dossett

The files are publicly accessible - you can copy them down via a tool like s3cmd (https://github.com/s3tools/s3cmd).

Alternatively - the files in that bucket are numbered part-00096 through part-00167. It is possible to access them at a URL like this:

http://ampcamp-data.s3.amazonaws.com/wikistats_20090505-01/part-00167
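
For anyone scripting this, here is a minimal Python sketch of the URL pattern described above. It assumes the part numbers run contiguously from 00096 through 00167 and that the files are still served over plain HTTP; the function names and the local directory name are illustrative only, not part of the tutorial.

```python
import shutil
import urllib.request
from pathlib import Path

# Base URL and part range are taken from the comment above.
BASE_URL = "http://ampcamp-data.s3.amazonaws.com/wikistats_20090505-01"
FIRST_PART, LAST_PART = 96, 167  # part-00096 through part-00167 (assumed contiguous)


def part_urls(first=FIRST_PART, last=LAST_PART):
    """Build the URL of every part file in the inclusive range."""
    return [f"{BASE_URL}/part-{n:05d}" for n in range(first, last + 1)]


def download_all(dest_dir="wikistats_20090505-01"):
    """Fetch each part file into dest_dir, skipping files already present.

    dest_dir is a hypothetical local directory name.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in part_urls():
        target = dest / url.rsplit("/", 1)[1]
        if target.exists():
            continue
        with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
            shutil.copyfileobj(resp, out)
```

Calling `download_all()` pulls the files into a local directory, which can then be passed to a local Spark or Shark setup in place of the S3 path.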

etrain avatar Jun 18 '14 18:06 etrain

Thank you! What about the MovieLens data used in the MLlib section?

dossett avatar Jun 18 '14 18:06 dossett

Those files are small and so we just included them in the AMI - they are available here: http://files.grouplens.org/datasets/movielens/ml-1m.zip
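
For a local setup, fetching and reading that archive might look like the sketch below. It assumes the zip contains `ml-1m/ratings.dat` with one `UserID::MovieID::Rating::Timestamp` record per line, as described in the GroupLens README for the 1M dataset; the function names are illustrative and not taken from the tutorial.

```python
import io
import urllib.request
import zipfile

ML_1M_URL = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"


def parse_rating(line):
    """Split one 'UserID::MovieID::Rating::Timestamp' line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)


def load_ratings(url=ML_1M_URL, member="ml-1m/ratings.dat"):
    """Download the archive into memory and parse the ratings file.

    The member path is an assumption about the archive layout.
    """
    with urllib.request.urlopen(url) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    with archive.open(member) as f:
        return [parse_rating(raw.decode("latin-1")) for raw in f]
```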

etrain avatar Jun 18 '14 19:06 etrain