9 Free Machine Learning Datasets

A collection of various datasets for machine learning.

 

UCI Machine Learning Repository

The UC Irvine Machine Learning Repository currently maintains 335 data sets. You can view all of them through its searchable interface. The old website is still available for those who prefer the old format.

 

Landsat on AWS

Landsat 8 data is available for anyone to use via Amazon S3. All Landsat 8 scenes from 2015 are available along with a selection of cloud-free scenes from 2013 and 2014. All new Landsat 8 scenes are made available each day, often within hours of production. MathWorks has created a freely-downloadable tool for accessing, processing, and visualizing Landsat on AWS data in MATLAB. With this tool, you can create a map display of scene locations with markers that show each scene’s metadata.

 

Modeling Online Auctions

Modeling Online Auctions provides data sets from eBay. All files are available in comma-separated values (CSV) format, with descriptions of the data fields.
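
Because the files are plain CSV, they can be read with standard tools. Below is a minimal Python sketch of that idea. The sample rows and the column names (auction_id, bid, bidder) are made up for illustration and are not the actual fields of the eBay files; check the field descriptions that come with each file.

```python
import csv
import io

# Hypothetical sample data. The column names are illustrative assumptions,
# not the real fields of the eBay auction files.
SAMPLE_CSV = """auction_id,bid,bidder
1001,12.50,alice
1001,13.00,bob
1002,5.25,carol
"""


def highest_bid_per_auction(csv_file):
    """Return {auction_id: highest bid} from an open CSV file object."""
    highest = {}
    for row in csv.DictReader(csv_file):
        auction = row["auction_id"]
        bid = float(row["bid"])
        # Keep the largest bid seen so far for each auction.
        if bid > highest.get(auction, float("-inf")):
            highest[auction] = bid
    return highest


result = highest_bid_per_auction(io.StringIO(SAMPLE_CSV))
print(result)
```

For a downloaded file, you would pass open("some_file.csv", newline="") instead of the in-memory sample, where the file name is a placeholder.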

 

Million Song Dataset

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

 

Delve Datasets

The Delve datasets and families are available from this page. Every dataset (or family) has a brief overview page and many also have detailed documentation. You can download gzipped-tar files of the datasets. Datasets are categorized as primarily assessment, development or historical according to their recommended use.
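
A gzipped-tar archive can be inspected with Python's standard tarfile module. The sketch below builds a tiny archive in a temporary directory and reads it back, so it runs without any download. The archive name and the file inside it are placeholders, not the real layout of a Delve dataset.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def read_members(archive_path):
    """Return {member name: text content} for regular files in a .tar.gz."""
    contents = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                data = archive.extractfile(member).read()
                contents[member.name] = data.decode("utf-8")
    return contents


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder archive standing in for a downloaded dataset file.
    archive_path = Path(tmp) / "example.tar.gz"
    payload = b"x,y\n1,2\n"
    with tarfile.open(archive_path, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="example/data.csv")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    members = read_members(archive_path)

print(members)
```

Reading members in memory avoids writing untrusted paths to disk; for large datasets, extracting to a directory is likely more practical.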

 

KEEL-dataset

The KEEL-dataset repository aims to provide machine learning researchers with a set of benchmarks for analyzing the behavior of learning methods. Specifically, it offers benchmarks already formatted in KEEL format for classification (standard, multi-instance, or imbalanced data), semi-supervised classification, regression, time series, and unsupervised learning. The repository also maintains a set of low-quality data benchmarks.

 

1000 Genomes Project and AWS

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals. The Amazon mirror contains the complete data set from the project and the data can be found at: s3.amazonaws.com/1000genomes.

 

Mammographic Image Analysis

Links to various mammography datasets, including the Mammographic Image Analysis Society (MIAS) database and the Digital Database for Screening Mammography (DDSM).

 

Auton Lab Datasets

Various datasets, such as Alias Detection Datasets, Link Datasets, Logistic Regression Datasets, and Optimal Reinsertion Datasets.