Expanding Debbie Dahl's 2014 SpeechTek presentation on Tools for MultiModal Development

DRAFT

AVIOS: Applied Voice Input/Output Society

Application Development links

There are many free and open source components available for building every part of a multimodal application. Understand your requirements and evaluate carefully!

Speech Recognizers

Free

  • Microsoft Windows Speech Recognition 8.0
  • Android Speech Recognition API
  • Web Speech API for Chrome. Intro: The new JavaScript Web Speech API makes it easy to add speech recognition to your web pages. This API allows fine control and flexibility over the speech recognition capabilities in Chrome version 25 and later.
  • iSpeech (for mobile)
    • The iSpeech API allows developers to implement Text-To-Speech (TTS) and Automated Voice Recognition (ASR) in any Internet-enabled application. They have SDKs for various mobile, computer, and web platforms:
    • Mobile
      • iPhone
      • Android
      • BlackBerry
    • Desktop/Server
      • Java
      • .NET
    • Web
      • Javascript (TTS)
      • Flash/Flex/Air
    Developers can build programs that use iSpeech services by using the correct iSpeech SDKs.
    • Note: while it says "Free" and there is no cost to sign up as a developer, there is also a price-per-use and price-per-download model, and it isn't completely clear what is free and what costs money.
  • Speech Mashup Guide (more on the Mashup from Thomas)

Open source

  • Sphinx-3 (C/C++), Sphinx-4 (Java)
  • PocketSphinx (for embedded systems) and PocketSphinx.js (Javascript)
  • Kaldi
    • Kaldi is developed at Johns Hopkins University. It is similar in aims and scope to HTK. The goal is to have modern and flexible code, written in C++, that is easy to modify and extend.
  • OpenEars (uses PocketSphinx)
    • OpenEars makes it simple for you to add speech recognition and synthesized speech/TTS to your iPhone app quickly and easily. It doesn't use the network and there are no hidden costs or accounts to set up. If you have more specific app requirements, the OpenEars Plugin Platform lets you drag and drop advanced functionality into your app when you're ready. It lets you easily implement round-trip English and Spanish language speech recognition and English text-to-speech on the iPhone, iPod and iPad. It uses the open source CMU Pocketsphinx, CMU Flite, and CMUCLMTK libraries, and it is free to use in an iPhone, iPad or iPod app.

Low cost for development

  • AT&T Mobile Developer program
  • Nuance Mobile Developer program (NDEV)

Speech Synthesis

Speech Technology and Speech Recognition: AT&T Labs

http://www.youtube.com/watch?v=V0uwydE0HaA&feature=youtube_gdata

Open source

  • Festival
  • DFKI Mary (supports SSML and EmotionML)
  • Flite (Festival Lite, small footprint)
  • eSpeak (formant synthesis)

Free

  • OpenEars (iPhone)
  • Google (Android and Chrome)

Audio Analysis

Analyze your audio with the "Swiss Army Knife" of audio analysis programs

Other Speech Processing

  • EmoVoice (open source emotion recognition from voice)
  • ALIZE (open source speaker recognition and diarization)
  • MSR Identity Toolbox (open source speaker recognition)

Natural Language Understanding

Open source

  • Stanford CoreNLP tools (Java)
    • Stanford CoreNLP provides a set of natural language analysis tools which can take raw text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, etc. Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications. It includes the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, the sentiment analysis, and the bootstrapped pattern learning tools.
  • Apache OpenNLP (Java)
    • The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.
  • NLTK (Python)
    • NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.
  • LingPipe (Java)
    • LingPipe is a tool kit for processing text using computational linguistics. LingPipe is used to do tasks like:
      • Find the names of people, organizations or locations in news
      • Automatically classify Twitter search results into categories
      • Suggest correct spellings of queries
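
To make tasks such as sentence segmentation and tokenization concrete, here is a minimal sketch using only Python's standard library. It is not how any of the toolkits above implement these steps (they typically rely on trained models), and the function names `split_sentences` and `tokenize` and the regular expressions are assumptions chosen for illustration.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # A word may contain one internal apostrophe ("don't"); any other
    # non-space, non-word character becomes its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


if __name__ == "__main__":
    text = "Book a table for two. Is 7 pm OK?"
    for sentence in split_sentences(text):
        print(tokenize(sentence))
```

A rule-based splitter like this breaks on abbreviations such as "Dr." or "e.g.", which is one reason to evaluate a real toolkit against your own data.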

Free (with conditions)

  • Wit.AI
  • <get diagram>

Dialog

  • AIML (Artificial Intelligence Markup Language) http://www.alicebot.org/aiml.html
    • AIML (Artificial Intelligence Markup Language) is an XML-compliant language that's easy to learn, and makes it possible for you to begin customizing an Alicebot or creating one from scratch within minutes.
  • OpenDIAL https://code.google.com/p/opendial/
    • OpenDial is a Java-based, domain-independent software toolkit for the development of robust and adaptive dialogue systems. Dialogue understanding, management and generation are expressed in OpenDial through probabilistic rules encoded in a simple XML format.
  • JVoiceXML http://jvoicexml.sourceforge.net
    • A free VoiceXML interpreter for Java with an open architecture for custom extensions. Demo implementation platforms support Java APIs such as JSAPI and JTAPI.
  • Apache Commons SCXML http://commons.apache.org/proper/commons-scxml/
    • State Chart XML (SCXML) is currently a Working Draft specification published by the World Wide Web Consortium (W3C). SCXML provides a generic state-machine based execution environment based on Harel State Tables. SCXML is a candidate for the control language within multiple markup languages coming out of the W3C (see the latest Working Draft for details). Commons SCXML is an implementation aimed at creating and maintaining a Java SCXML engine capable of executing a state machine defined using a SCXML document, while abstracting out the environment interfaces.
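
To illustrate what "executing a state machine defined using a SCXML document" can look like, here is a rough Python sketch that reads a flat SCXML document and follows event-triggered transitions. It is not the Commons SCXML API and covers only a tiny subset of the specification (no nested or parallel states, guards, data model, or executable content). The dialog states, the event names, and the helper names `load_transitions` and `run` are hypothetical.

```python
import xml.etree.ElementTree as ET

SCXML_NS = "{http://www.w3.org/2005/07/scxml}"

# Hypothetical dialog flow, invented for illustration.
DOCUMENT = """
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="greeting">
  <state id="greeting">
    <transition event="user.request" target="collecting"/>
  </state>
  <state id="collecting">
    <transition event="user.confirm" target="done"/>
    <transition event="user.cancel" target="greeting"/>
  </state>
  <final id="done"/>
</scxml>
"""


def load_transitions(document):
    """Return (initial state, {(state, event): target}) for a flat document."""
    root = ET.fromstring(document.strip())
    table = {}
    for state in root.findall(SCXML_NS + "state"):
        for transition in state.findall(SCXML_NS + "transition"):
            key = (state.get("id"), transition.get("event"))
            table[key] = transition.get("target")
    return root.get("initial"), table


def run(document, events):
    """Feed events to the machine; events with no matching transition are ignored."""
    current, table = load_transitions(document)
    for event in events:
        current = table.get((current, event), current)
    return current


if __name__ == "__main__":
    events = ["user.request", "user.cancel", "user.request", "user.confirm"]
    print(run(DOCUMENT, events))
```

Keeping the flow in the XML document rather than in code is the main design point: the dialog can be changed without touching the engine.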

Knowledge/Ontologies <get diagram>

  • Wolfram Alpha developer (personal/experimental)
  • OpenCYC
  • Freebase
  • BabelNet (WordNet + Wikipedia)
  • YAGO
  • NELL (Carnegie Mellon)

Meanings: WordNet

<get diagram>

Natural Language Generation Software

  • Allow more variation than templates
  • Easier to maintain
  • More complex to implement
  • Downloadable systems: http://aclweb.org/aclwiki/index.php?title=Downloadable_NLG_systems

Development Tools

  • Dialog (and more generally graph) layout
    • yEd
  • IDEs
    • Eclipse
    • NetBeans
  • Audio
    • Audacity – audio recording and editing
    • ffmpeg – command line audio processing
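
As an example of command line audio processing with ffmpeg, the sketch below builds a command that converts a recording to 16 kHz mono, a format that speech recognizers commonly accept (check the requirements of the recognizer you choose). The flags `-y`, `-i`, `-ar` and `-ac` are standard ffmpeg options. The file names and the helper names `build_resample_command` and `resample` are assumptions for illustration.

```python
import shutil
import subprocess


def build_resample_command(source, target, sample_rate=16000, channels=1):
    """Build an ffmpeg argument list that resamples and downmixes audio."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", source,             # input file
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", str(channels),     # number of output channels
        target,
    ]


def resample(source, target):
    """Run ffmpeg if it is installed; return True on success."""
    if shutil.which("ffmpeg") is None:
        return False
    command = build_resample_command(source, target)
    return subprocess.run(command, check=False).returncode == 0


if __name__ == "__main__":
    # Hypothetical file names.
    print(" ".join(build_resample_command("prompt.wav", "prompt_16k.wav")))
```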

yEd Graphical Layout Example: Personal Assistant App <get diagram>

Development Environments

  • iOS
  • Android
  • Windows
  • Open Web Platform (HTML 5, etc.)
  • AppInventor (Android only)
  • Cross-platform development environments (Adobe PhoneGap/Apache Cordova, Appcelerator)

Open Standards

  • Often royalty-free (W3C standards)
  • Standards provide a head start on design: they incorporate significant design effort from the standards committees (use cases, requirements, syntax and semantics of representation, protocols)
  • Promote interoperability
  • Many standards have open source implementations

Some Standards

  • Control
    • VoiceXML (JVoiceXML)
    • SCXML
    • CCXML (call control)
  • Open Web Platform
    • GUI
      • HTML-5
      • CSS, SVG
    • Audio
      • WebRTC
      • WebAudio
  • Human input
    • EMMA – Extensible MultiModal Annotation
    • SRGS speech grammars
    • EmotionML
    • InkML
  • Sensor Input
    • Ambient Light, Proximity, Media Capture, Geolocation
  • TTS Prompts
    • SSML
    • Pronunciation Lexicon
  • Knowledge
    • OWL Ontology Language
    • RDF
  • Communication
    • MMI Architecture
    • HTTP/REST/SOAP
    • UPnP
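
As a small illustration of one of these standards, the sketch below builds an SSML prompt with a pause between two phrases, using only Python's standard library. The `speak` and `break` elements and the `time` attribute come from the SSML specification. The prompt wording and the helper name `build_prompt` are assumptions, and how a given synthesizer accepts SSML input varies by product.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_prompt(before_pause, after_pause, pause="500ms"):
    """Return an SSML string with a pause between two text segments."""
    # Serialize SSML elements without a prefix (default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        "{%s}speak" % SSML_NS,
        {"version": "1.0", XML_LANG: "en-US"},
    )
    speak.text = before_pause
    pause_element = ET.SubElement(speak, "{%s}break" % SSML_NS, {"time": pause})
    pause_element.tail = after_pause
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_prompt("Welcome.", "How can I help you?"))
```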

Phonetics

The ARPABET
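
As a quick illustration of the notation, the sketch below converts an ARPABET transcription to IPA. The table is deliberately partial, and the trailing stress digit on vowels follows the convention used by the CMU Pronouncing Dictionary. The helper name `to_ipa` is an assumption for this example.

```python
import re

# Partial ARPABET-to-IPA table; only the symbols needed for the example.
ARPABET_TO_IPA = {
    "S": "s",
    "P": "p",
    "IY": "i",
    "CH": "tʃ",
    "K": "k",
    "T": "t",
}


def to_ipa(transcription):
    """Convert a space-separated ARPABET string to IPA, dropping stress digits."""
    symbols = []
    for phone in transcription.split():
        base = re.sub(r"\d$", "", phone)  # "IY1" -> "IY"
        symbols.append(ARPABET_TO_IPA[base])
    return "".join(symbols)


if __name__ == "__main__":
    print(to_ipa("S P IY1 CH"))  # a transcription of "speech"
```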

Interesting Articles

Project Ouch - Outing Unfortunate Characteristics of HMMs (Used for Speech Recognition): http://www.icsi.berkeley.edu/icsi/projects/speech/ouch

http://voiceinthemachine.com/2012/07/03/whats-wrong-with-speech-recognition/