2019-08 Open archive of 240,000 hours' worth of talk radio, including 2.8 billion words of machine-transcription

A group of MIT Media Lab researchers have published Radiotalk, a massive corpus of talk radio audio with machine-generated transcriptions, with a total of 240,000 hours' worth of speech, marked up with machine-readable metadata.

 The audio was scraped from streaming radio services between Oct 2018 and Mar 2019, and the transcripts run to 2.8 billion words. The researchers hope the corpus will be used by "researchers in the fields of natural language processing, conversational analysis, and the social sciences."

https://boingboing.net/2019/08/01/pump-up-the-volume.html

Comentarios

Popular

Herramientas de Evaluación de Sistemas Algorítmicos

Sistemas multiagentes: Desafíos técnicos y éticos del funcionamiento en un grupo mixto

Controversias éticas en torno a la privacidad, la confidencialidad y el anonimato en investigación social