data_tooling Create dataset spotify_podcast

uid: spotify_podcast_dataset
type: processed
description:
- name: Spotify Podcast Dataset
- description: Podcasts are a rapidly growing audio-only medium that involve new patterns of usage and new communicative conventions and motivate research in many new directions. To facilitate such research, we present the Spotify English-Language Podcast Dataset.
  
  This dataset consists of 100,000 episodes from different podcast shows on Spotify. The dataset is available for research purposes.
  
  The dataset was initially created for use in the TREC Podcasts Track shared tasks. Participants were asked to work on two tasks focusing on understanding podcast content and enhancing the search functionality within podcasts.
  
  We are releasing this dataset more widely to facilitate research on podcasts through the lens of speech and audio technology, natural language processing, information retrieval, and linguistics. The dataset contains about 50,000 hours of audio, and over 600 million transcribed words. The episodes span a variety of lengths, topics, styles, and qualities.
- homepage: https://podcastsdataset.byspotify.com/
- validated: True
languages:
- language_names:
  - English
- language_comments:
- language_locations:
  - Northern America
  - Europe
- validated: False
custodian:
- name: Spotify
- in_catalogue:
- type: A commercial entity
- location: United States of America
- contact_name: Ann Clifton
- contact_email: [email protected]
- contact_submitter: False
- additional: https://aclanthology.org/2020.coling-main.519.pdf
- validated: False
availability:
- procurement:
  - for_download: Yes - after signing a user agreement
  - download_url: https://forms.gle/kywjSQg5VsrCUeTm8
  - download_email:
- licensing:
  - has_licenses: Yes
  - license_text:
  - license_properties:
    - research use
    - do not distribute
  - license_list:
- pii:
  - has_pii: Yes
  - generic_pii_likely: very likely
  - generic_pii_list:
    - names
    - URLs
  - numeric_pii_likely: unlikely
  - numeric_pii_list:
    - telephone numbers
  - sensitive_pii_likely: very likely
  - sensitive_pii_list:
    - racial or ethnic origin
    - political opinions
    - religious or philosophical beliefs
  - no_pii_justification_class:
  - no_pii_justification_text:
- validated: False
processed_from_primary:
- from_primary: Taken from primary source
- primary_availability: Yes - their documentation/homepage/description is available
- primary_license: Yes - the dataset curators have obtained consent from the source material owners
- primary_types:
  - podcasts
- validated: False
- from_primary_entries:
media:
- category:
  - text
  - audiovisual
- text_format:
  - .TXT
- audiovisual_format:
  - .OGG
- image_format:
- database_format:
- text_is_transcribed: Yes - audiovisual
- instance_type: episode
- instance_count: 10K<n<100K
- instance_size: 100<n<10,000
- validated: False
fname: spotify_podcast_dataset.json

Nov 23 '21 10:11 albertvillanova

#self-assign

Dec 08 '21 15:12 yjernite

Hi @yjernite, what is the status of this dataset?

Jan 21 '22 07:01 albertvillanova

Create dataset spotify_podcast_dataset