2019-08 Open archive of 240,000 hours' worth of talk radio, including 2.8 billion words of machine transcription
A group of MIT Media Lab researchers has published Radiotalk, a massive corpus of talk radio audio with machine-generated transcriptions: 240,000 hours of speech in total, marked up with machine-readable metadata.
The audio was scraped from streaming radio services between Oct 2018 and Mar 2019, and the transcripts run to 2.8 billion words. The researchers hope the corpus will be used by "researchers in the fields of natural language processing, conversational analysis, and the social sciences."
https://boingboing.net/2019/08/01/pump-up-the-volume.html