For teaching machines reading comprehension, Microsoft opens MS MARCO dataset

Microsoft is trying to help create machines that can have conversations by releasing a new set of data for free.

The data, called the Microsoft Machine Reading Comprehension dataset (MS MARCO), is a bundle of 100,000 English queries along with corresponding answers. It’s supposed to help people build artificial intelligence systems that can understand written human language.

The company is opening up its dataset in the hope that Microsoft can work with other organizations on making machines better at reading comprehension, said Rangan Majumder, program manager for the Microsoft Partner Group, in a blog post on Friday.

The queries in MS MARCO are based on anonymized questions that were submitted to Microsoft’s Bing search engine and Cortana virtual assistant. The answers are based on information found online, written by humans and checked for accuracy. The queries and responses are built for use with deep learning models.

Right now, the dataset is free to download for non-commercial use. Microsoft is sharing it in the same spirit as other open datasets that are used for training artificial intelligence programs.

One of those is ImageNet, a database of tagged pictures that’s used for training image recognition algorithms. Microsoft used that database in developing the image recognition technology that now underpins products like Microsoft’s Computer Vision API.

Readers who want to learn more about MS MARCO can download a research paper written by the Microsoft team that built it. The team is also putting together a challenge that will evaluate models trained using the MS MARCO data. Evaluation scripts for that challenge are still under development.

 

Read the source article at computerworld.com