2019-08 Open archive of 240,000 hours of talk radio, including 2.8 billion words of machine transcription

A group of MIT Media Lab researchers has published Radiotalk, a massive corpus of talk radio audio with machine-generated transcriptions. It totals 240,000 hours of speech and is marked up with machine-readable metadata.

The audio was scraped from streaming radio services between October 2018 and March 2019, and the transcripts run to 2.8 billion words. The researchers hope the corpus will be used by "researchers in the fields of natural language processing, conversational analysis, and the social sciences."

https://boingboing.net/2019/08/01/pump-up-the-volume.html
