Create dataset spotify_podcast_dataset
- uid: spotify_podcast_dataset
- type: processed
- description:
-
name: Spotify Podcast Dataset
-
description: Podcasts are a rapidly growing audio-only medium that involves new patterns of usage and new communicative conventions and motivates research in many new directions. To facilitate such research, we present the Spotify English-Language Podcast Dataset.
This dataset consists of 100,000 episodes from different podcast shows on Spotify. The dataset is available for research purposes.
The dataset was initially created for use in the TREC Podcasts Track shared tasks. Participants were asked to work on two tasks: understanding podcast content and enhancing the search functionality within podcasts.
We are releasing this dataset more widely to facilitate research on podcasts through the lens of speech and audio technology, natural language processing, information retrieval, and linguistics. The dataset contains about 50,000 hours of audio, and over 600 million transcribed words. The episodes span a variety of lengths, topics, styles, and qualities.
-
homepage: https://podcastsdataset.byspotify.com/
-
validated: True
-
- languages:
- language_names:
- English
- language_comments:
- language_locations:
- Northern America
- Europe
- validated: False
- language_names:
- custodian:
- name: Spotify
- in_catalogue:
- type: A commercial entity
- location: United States of America
- contact_name: Ann Clifton
- contact_email: [email protected]
- contact_submitter: False
- additional: https://aclanthology.org/2020.coling-main.519.pdf
- validated: False
- availability:
- procurement:
- for_download: Yes - after signing a user agreement
- download_url: https://forms.gle/kywjSQg5VsrCUeTm8
- download_email:
- licensing:
- has_licenses: Yes
- license_text:
- license_properties:
- research use
- do not distribute
- license_list:
- pii:
- has_pii: Yes
- generic_pii_likely: very likely
- generic_pii_list:
- names
- URLs
- numeric_pii_likely: unlikely
- numeric_pii_list:
- telephone numbers
- sensitive_pii_likely: very likely
- sensitive_pii_list:
- racial or ethnic origin
- political opinions
- religious or philosophical beliefs
- no_pii_justification_class:
- no_pii_justification_text:
- validated: False
- procurement:
- processed_from_primary:
- from_primary: Taken from primary source
- primary_availability: Yes - their documentation/homepage/description is available
- primary_license: Yes - the dataset curators have obtained consent from the source material owners
- primary_types:
- podcasts
- validated: False
- from_primary_entries:
- media:
- category:
- text
- audiovisual
- text_format:
- .TXT
- audiovisual_format:
- .OGG
- image_format:
- database_format:
- text_is_transcribed: Yes - audiovisual
- instance_type: episode
- instance_count: 10K<n<100K
- instance_size: 100<n<10,000
- validated: False
- category:
- fname: spotify_podcast_dataset.json
#self-assign
Hi @yjernite, what is the status of this dataset?