Posts

Showing posts from August 2019

2019-08 Waymo Open Dataset

https://waymo.com/open/

2016-01 1.5 TB dataset of anonymized user interactions released by Yahoo

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015. In addition to the interaction data, we are providing the demographic information (age segment and gender) and the city in which the user is based for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the user’s local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining. https://www.d...

2019-08 Waymo is going to share its self-driving data—but it’s still not enough

Waymo says it will share some of the data it’s gathered from its vehicles for free so other researchers working on autonomous driving can use it. Waymo isn’t the first to do this: Lyft, Argo AI, and other firms have already open-sourced some data sets. But Waymo’s move is notable because its vehicles have covered millions of miles on roads already. https://www.technologyreview.com/f/614211/waymo-is-going-to-share-its-self-driving-databut-its-still-not-enough/?utm_medium=tr_social&utm_campaign=site_visitor.unpaid.engagement&utm_source=Twitter#Echobox=1566491935

2019-08 20 Open Datasets for Natural Language Processing

https://noeliagorod.com/2019/08/19/20-open-datasets-for-natural-language-processing/amp/

2018-12 It all Boils Down to the Training Data

Is your model not performing well? Try digging into your data. Instead of getting marginal improvements in performance by searching for state-of-the-art models, drastically improve your model’s accuracy by improving the quality of your data. https://medium.com/labelbox/it-all-boils-down-to-the-training-data-393376f24e6a

2019-08 AI NEEDS YOUR DATA—AND YOU SHOULD GET PAID FOR IT

Robert Chang, a Stanford ophthalmologist, normally stays busy prescribing drops and performing eye surgery. But a few years ago, he decided to jump on a hot new trend in his field: artificial intelligence. Doctors like Chang often rely on eye imaging to track the development of conditions like glaucoma. With enough scans, he reasoned, he might find patterns that could help him better interpret test results. https://www.wired.com/story/ai-needs-data-you-should-get-paid/

2019-08 Dataset search tool

https://toolbox.google.com/datasetsearch

2019-08 Open archive of 240,000 hours' worth of talk radio, including 2.8 billion words of machine-transcription

A group of MIT Media Lab researchers has published Radiotalk, a massive corpus of talk radio audio with machine-generated transcriptions, with a total of 240,000 hours' worth of speech, marked up with machine-readable metadata. The audio was scraped from streaming radio services between Oct 2018 and Mar 2019, and the transcripts run to 2.8 billion words. The researchers hope the corpus will be used by "researchers in the fields of natural language processing, conversational analysis, and the social sciences." https://boingboing.net/2019/08/01/pump-up-the-volume.html

2019-07 Transforming Skewed Data for Machine Learning

Skewed data is common in data science; skew is the degree to which a distribution departs from the symmetry of a normal distribution. For example, the house prices from Kaggle’s House Price Competition are right skewed, meaning a minority of very large values stretches the right tail of the distribution. https://medium.com/@ODSC/transforming-skewed-data-for-machine-learning-90e6cc364b0
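
To make the idea of right skew concrete, here is a minimal sketch of one common remedy, a log transform. It does not use the Kaggle data: the `prices` list is synthetic, drawn from a log-normal distribution as a stand-in for house prices, and the helper names `skewness` and `log_transform` are illustrative, not taken from the linked article.

```python
import math
import random


def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def log_transform(values):
    """Apply log(1 + x), which compresses large values more than small ones."""
    return [math.log1p(v) for v in values]


# Assumption: synthetic, right-skewed "prices" instead of the real dataset.
random.seed(0)
prices = [random.lognormvariate(12, 0.5) for _ in range(5000)]

raw_skew = skewness(prices)
log_skew = skewness(log_transform(prices))

print(f"skewness before transform: {raw_skew:.2f}")
print(f"skewness after log1p:      {log_skew:.2f}")
```

On data like this the skewness should be clearly positive before the transform and close to zero afterwards. Other transforms, such as square root or Box-Cox, are also widely used, and which one works best depends on the data.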