More public data key to democratizing ML, says MLCommons • The Register

By Lois V. Aguirre

Apr 18, 2022


Unless you’re an English speaker, and one with as neutral an American accent as possible, you’ve probably butted heads with a digital assistant that couldn’t understand you. With any luck, a couple of open-source datasets from MLCommons could help future systems grok your voice.

The two datasets, which were made generally available in December, are the People’s Speech Dataset (PSD), a 30,000-hour database of spontaneous English speech; and the Multilingual Spoken Words Corpus (MSWC), a dataset of some 340,000 keywords in 50 languages. 

By making both datasets publicly available under CC-BY and CC-BY-SA licenses, MLCommons hopes to democratize machine learning – that is to say, make it available to everyone – and help push the industry toward data-centric AI.

David Kanter, executive director and founder of MLCommons, told Nvidia in a podcast this week that he sees data-centric AI as a conceptual pivot from “which model is the most accurate,” to “what can we do with data to improve model accuracy.” For that, Kanter said, the world needs lots of data.

Increasing understanding with the People’s Speech

Spontaneous speech recognition is still challenging for AIs, and the PSD could help learning machines better understand colloquial speech, speech disorders and accents. Had a database like this existed earlier, said PSD project lead Daniel Galvez, “we’d likely be speaking to our digital assistants in a much less robotic way.” 

The 30,000 hours of speech in the People's Speech Dataset were culled from some 50,000 hours of publicly available audio pulled from the Internet Archive digital library, and the collection has two unique qualities. First, it's entirely spontaneous speech, meaning it contains all the tics and imprecisions of the average conversation. Second, it all came with transcripts.

Using some CUDA-powered inference engine tricks, the team behind the PSD cut the labeling time for that massive dataset to just two days. The result is a dataset that can help chatbots and other speech recognition programs better understand voices that differ from those of white, American English-speaking males.
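The article doesn't detail the labeling pipeline, but the headline numbers imply serious throughput. A back-of-the-envelope sketch (assuming the full ~50,000-hour pool was transcribed within the two-day window):

```python
# Rough throughput implied by labeling ~50,000 hours of audio in two days.
# (Assumption: the full 50,000-hour pool was processed in that window.)
audio_hours = 50_000
wall_clock_hours = 2 * 24  # two days

realtime_factor = audio_hours / wall_clock_hours
print(f"~{realtime_factor:.0f}x real time")  # ~1042x real time
```

In other words, the inference pipeline had to chew through more than a thousand hours of audio for every hour of wall-clock time, which is why GPU-accelerated batch inference matters here.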

Galvez said that speech disorders, neurological issues and accents are all poorly represented in datasets, and as a result, “[those types of speech] aren’t well understood by commercial products.”

Again, said Kanter, such systems fall short because their training data lacks diverse speakers.

A corpus to broaden the reach of digital assistants 

The Multilingual Spoken Words Corpus is a different animal from the PSD. Instead of complete sentences, the Corpus consists of 340,000 keywords in 50 languages. "To our knowledge, this is the only open-source spoken word dataset for 46 of these 50 languages," Kanter said.
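Keyword corpora like the MSWC are typically used to train keyword-spotting (wake-word) models. As a toy illustration of the task only, the sketch below matches transcribed tokens against per-language keyword lists; the keywords shown are hypothetical examples, not actual MSWC entries:

```python
# Toy keyword spotting over a transcript: find which target keywords
# (for a given language) appear in a tokenized utterance.
# The keyword lists below are hypothetical examples, not MSWC data.
KEYWORDS = {
    "en": {"lights", "stop", "play"},
    "es": {"luces", "para", "reproduce"},
}

def spot_keywords(tokens, lang):
    """Return the sorted target keywords present in a tokenized utterance."""
    vocab = KEYWORDS.get(lang, set())
    return sorted(t for t in set(tokens) if t.lower() in vocab)

print(spot_keywords(["please", "play", "the", "lights"], "en"))  # ['lights', 'play']
```

A production system would of course spot keywords in raw audio with a trained acoustic model rather than in text, but the per-language vocabulary structure is the part the MSWC supplies for 50 languages.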

Digital assistants, like chatbots, are prone to bias based on their training datasets, which has led to them not catching on as quickly as they could have. Kanter predicts that digital assistants will be available worldwide “by mid-decade,” and he sees the MSWC as a key base for making that happen. 

“When you look at equivalent databases, it’s Mandarin, English, Spanish, and then it falls off pretty quick,” Kanter said. 

Kanter said the datasets have already been tested by some MLCommons member companies. So far, he said, they're being used to de-noise audio and video recordings of crowded rooms and conferences, and to improve speech recognition.

In the near future, Kanter said he hopes the datasets will be widely adopted and used alongside other public datasets that commonly serve as sources for ML and AI researchers. ®
