MEEMZEE

Mobile Conversational Solutions

Voice technology frameworks that enable mobile conversational solutions

BZMedia Framework (currently available for iOS only)

BZMedia is a powerful framework for iOS devices that enables mobile applications to become fully conversational (BZMedia for Android devices is currently under development). The BZMedia framework integrates speech recognition, speech synthesis, audio playback and audio recording with a real-time user, dialog and device event notification engine, giving mobile applications maximum control over conversations and dialogs with users.

With this framework, a mobile application can not only perform simple tasks such as speech recognition of a predetermined set of commands, speech synthesis of a sentence, and basic audio playback and recording, but also implement state-of-the-art conversational dialogs that put the application in the driver's seat, controlling the conversation with the user to execute complex tasks quickly and efficiently.

BZDialog Framework (under development for iOS & Android)

BZDialog is a powerful framework for iOS devices that enables mobile applications to become fully conversational using pre-defined, configurable dialog modules. The BZDialog framework provides a collection of dialog modules that range in functionality from simple audio playback to advanced dialogs combining audio playback, speech synthesis, speech recognition of collected user utterances, and automatic error-handling schemes for speech/silence detection and invalid recognized utterances.

During the automatic execution of a dialog module, the mobile application receives a full range of real-time event notifications about the state of the media operations executed by the dialog, as well as about the user and the device. With these notifications, a developer can add conversational dialogs to a mobile application quickly and effectively, while the application retains some control over the flow of the conversation with the user.

While the BZDialog framework contains a comprehensive set of dialog modules, the BZMedia framework is recommended instead for implementing custom dialogs that require finer application control.

Features of BZMedia Framework

  • Speech Recognition (Automatic Speech Recognition, ASR)
    • Speech Recognition engines supported: Apple Speech API and CMU Sphinx/Pocket Sphinx
    • Local offline speech recognition right on the mobile device without streaming audio data to cloud servers (Pocket Sphinx only)
    • The BZMedia Framework supports English only in Pocket Sphinx; ASR support for other languages in Pocket Sphinx is under development
    • The Apple Speech API supports a wide range of languages
    • Can request permission from the user to enable speech recognition functionality if the mobile application has not already done so
    • Supports PCM 16K audio encoding. Other encoding formats (PCM 8K, PCM 32K and PCM 48K, compressed and pre-encoded) are experimental
    • Supports speech recognition from live audio streams and recorded audio files (both raw and wav audio file formats are supported)
    • Supports live audio streams, file audio streams and direct file audio input
    • Pocket Sphinx task types: JSpeech Grammar Format (JSGF) grammars, JSGF strings, Finite State Grammar (FSG), Statistical Language Model (LM), key phrase, keyword spotting (KWS) and allphone
    • Apple Speech API task hints: Dictation, Search, Confirmation & Unspecified (illustrated in the speech recognition sketch after this list)
    • Continuous Speech Recognition and Single Speech Recognition
    • Real-time streaming of partial speech recognition hypotheses (Apple Speech API only)
    • N-best hypotheses with scoring
    • Custom dictionaries (Pocket Sphinx only)
    • Dynamic loading of custom word list in dictionary (Pocket Sphinx only)
    • Automatic pronunciation of unknown words by the Flite text-to-speech engine (Pocket Sphinx only)
    • Supports automatic audio playback barge-in when a full or partial speech recognition hypothesis is detected
    • Integrated with Acoustic Echo Cancellation (AEC), which cancels audio playback echoes so the speech recognizer is not falsely triggered. AEC is needed for applications running on the iOS Simulator and on some iOS devices without embedded hardware AEC support
    • Integrated with Voice Activity Detector so speech recognition tasks start and end automatically based on the following parameters: Max Audio Duration, Max Speech Duration, Pre-Speech Max Silence Duration and Post-Speech Max Silence Duration
    • Optional internal real-time event communication to stop audio playback when a speech recognition hypothesis, full or partial, is found
    • Internal audio recording of live and file audio utterances, with support for custom naming conventions such as timestamped file names
    • Real-time event notification of speech recognition events to the application main thread
  • Speech Synthesis (Text-To-Speech, TTS)
    • Speech Synthesis engines supported: Apple Speech Synthesis Framework and CMU Flite/Flite+HTS
    • Local offline speech synthesis right on the mobile device without streaming audio data from cloud servers (CMU Flite/Flite+HTS only & some Apple voices)
    • The BZMedia Framework supports English only in CMU Flite and Flite+HTS; TTS support for other languages is under development
    • The Apple Speech Synthesis Framework supports many voices across a wide range of languages. Many Apple voices must be downloaded and installed on individual mobile devices by users before they can be used by the BZMedia Framework
    • Supports real-time audio streaming for playback of large text
    • Supports text string and text file input
    • Support of a specific set of SSML tags (CMU Flite/Flite+HTS only)
    • Apple TTS supports control of volume, pitch multiplier, rate and other parameters (see the speech synthesis sketch after this list). CMU Flite/Flite+HTS TTS supports control of duration stretch, target mean, target deviation and "start play from character position"
    • Ability to record synthesized speech to audio files
    • Real-time event notification of TTS audio playback events to the application main thread
  • Audio Playback
    • Based on enhancements and modifications made to the audio playback functionality of the WebRTC project
    • Real-time streaming of audio data from audio file and speech synthesis audio sources to audio devices and to "Audio-Recording" destinations
    • Supports wav and raw file audio formats
    • Supports PCM 16K audio encoding. Other encoding formats (PCM 8K, PCM 32K and PCM 48K, compressed and pre-encoded) are experimental
    • Fine control over audio playback, including automatic audio looping and starting and stopping playback at specific time positions in the audio (see the audio playback sketch after this list)
    • Real-time event notification of a time position in the audio during audio playback
    • Can be combined with "Audio-Playback" of other audio file sources containing audio echoes, for offline testing of acoustic echo cancellation
    • Real-time event notification of "Audio-Playback" events to the application main thread
  • Audio Recording
    • Based on enhancements and modifications made to the audio recording functionality of the WebRTC project
    • Real-time audio streaming of audio data to audio file and speech recognition destinations from audio devices and "Audio-Playback" sources
    • Audio streaming to Speech Recognizers through "Audio-Recording" allows speech recognition on live audio streams coming from an input audio device and on simulated audio streams coming from recorded audio utterances for offline testing
    • Supports wav and raw file audio formats
    • Supports PCM 16K audio encoding. Other encoding formats (PCM 8K, PCM 32K and PCM 48K, compressed and pre-encoded) are experimental
    • Integrated with Voice Activity Detector so audio recording tasks start and end automatically based on the following parameters: File Max Record Duration, Max Audio Duration, Max Speech Duration, Pre-Speech Max Silence Duration and Post-Speech Max Silence Duration
    • Optional internal real-time event communication between "Audio-Recording" and "Audio-Playback" in order to stop audio playback when speech is detected or when a speech recognition hypothesis, full or partial, is found
    • Optional recording of audio with non-speech activity filtered out to "Audio-Recording" destinations
    • Real-time event notification of a time position in the audio during "Audio-Recording"
    • Real-time event notification of "Audio-Recording" events to the application main thread
  • Acoustic Echo Cancellation (AEC)
    • Based on enhancements and modifications made to the Acoustic Echo Cancellation functionality of the WebRTC project
    • AEC is available to cancel the acoustic echo that may be present in audio recorded through the microphone of a device. The acoustic echo is generated when audio is played back through the device's external speaker while the audio is being recorded.
    • Ability to test AEC offline with recorded audio files, and in real-time while "Audio-Recording" live audio to audio destinations
    • Local offline support of AEC on the iOS Simulator, and on iOS devices that do not support hardware echo cancellation
    • Client control over the following AEC parameters: drift compensation (enable/disable), delay offset duration and level estimation
    • Configurable support for multiple mobile modes: speakerphone, loud speakerphone, earpiece and loud earpiece
  • Voice Activity Detection (VAD)
    • Based on enhancements and modifications made to the Voice Activity Detection functionality of the WebRTC project
    • VAD is available to detect speech energy and silences in audio streams from audio device and "Audio-Playback" sources
    • Ability to test VAD offline with recorded audio files, and in real-time while "Audio-Recording" live audio to audio destinations
    • Configurable support of multiple VAD likelihood modes: very low, low, moderate and high
    • Support of more granular control of VAD thresholds using the local and global threshold parameters
    • Simpler configurable control of VAD behavior using the "min speech to detect" and "min silence to detect" duration parameters
    • Optional internal real-time event communication to stop audio playback when speech energy is detected
    • Real-time event notification of voice activity detection events to the application main thread
  • iOS Device Audio Configuration & Event Detection
    • The BZMedia Framework provides client applications with an iOS audio session wrapper that includes event notification. This allows the application to have maximum control over its behavior while executing
    • Support of iOS Audio Session category options: default to speaker, mix with others, allow Bluetooth, duck others, etc. (see the audio session sketch after this list)
    • Detection of iOS Audio Session events: interruption begin/end, audio route changes, media services lost/reset and play/record start/stop
    • Notifications with reasons for audio route changes
    • Real-time event notification of mobile device events to the application main thread
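
The speech recognition sketch below is illustrative only: it uses Apple's Speech framework and AVFoundation directly rather than the BZMedia API (whose calls are not documented here), and the locale, buffer size and sample text are arbitrary example values. It shows the permission request, task hints, partial hypotheses and N-best results referenced in the speech recognition feature list.

    import AVFoundation
    import Speech

    // Illustrative sketch: request user permission, then run recognition on live
    // microphone audio with a task hint and streaming partial hypotheses.
    func startAppleSpeechRecognition() {
        SFSpeechRecognizer.requestAuthorization { status in
            guard status == .authorized,
                  let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")) else { return }

            let request = SFSpeechAudioBufferRecognitionRequest()
            request.taskHint = .dictation            // also .search, .confirmation, .unspecified
            request.shouldReportPartialResults = true

            // Feed live microphone audio into the recognition request.
            let audioEngine = AVAudioEngine()
            let input = audioEngine.inputNode
            let format = input.outputFormat(forBus: 0)
            input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
                request.append(buffer)
            }
            audioEngine.prepare()
            try? audioEngine.start()

            _ = recognizer.recognitionTask(with: request) { result, error in
                if let result = result {
                    // Partial hypotheses stream in until result.isFinal is true;
                    // result.transcriptions carries the N-best alternatives.
                    print(result.bestTranscription.formattedString)
                }
                if error != nil || (result?.isFinal ?? false) {
                    audioEngine.stop()
                    input.removeTap(onBus: 0)
                }
            }
        }
    }
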
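
The speech synthesis sketch below illustrates the Apple TTS controls listed above (volume, pitch multiplier, rate) using Apple's AVSpeechSynthesizer directly; the sample text, parameter values and voice are arbitrary examples, and voice availability depends on what the user has installed on the device.

    import AVFoundation

    // Illustrative sketch: speak a short sentence with explicit volume, pitch and rate.
    let synthesizer = AVSpeechSynthesizer()
    let utterance = AVSpeechUtterance(string: "Your order has been confirmed.")
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")   // nil if no matching voice is installed
    utterance.volume = 0.9                                        // 0.0 ... 1.0
    utterance.pitchMultiplier = 1.1                               // 0.5 ... 2.0
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)
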
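
The audio playback sketch below illustrates looping and positional start/stop using Apple's AVAudioPlayer; note that BZMedia's own playback path is based on the WebRTC project rather than AVAudioPlayer, and the file name and durations here are hypothetical.

    import AVFoundation

    // Illustrative sketch: loop a bundled prompt, start 2.5 seconds into the file,
    // and stop again after roughly 10 seconds of playback.
    guard let url = Bundle.main.url(forResource: "prompt", withExtension: "wav"),   // hypothetical file
          let player = try? AVAudioPlayer(contentsOf: url) else { fatalError("missing audio file") }

    player.numberOfLoops = -1      // -1 loops indefinitely, 0 plays once
    player.currentTime = 2.5       // start position in seconds
    player.prepareToPlay()
    player.play()

    DispatchQueue.main.asyncAfter(deadline: .now() + 10) {
        player.stop()
    }
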
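
The audio session sketch below shows the AVAudioSession category options and event notifications that the BZMedia wrapper builds on; the play-and-record category and the chosen options are example assumptions, not the framework's defaults.

    import AVFoundation

    // Illustrative sketch: configure the shared audio session and observe
    // route-change and interruption events on the main queue.
    let session = AVAudioSession.sharedInstance()
    try? session.setCategory(.playAndRecord,
                             mode: .default,
                             options: [.defaultToSpeaker, .allowBluetooth, .duckOthers])
    try? session.setActive(true)

    // Keep the returned tokens if the observers need to be removed later.
    _ = NotificationCenter.default.addObserver(forName: AVAudioSession.routeChangeNotification,
                                               object: session, queue: .main) { note in
        if let raw = note.userInfo?[AVAudioSessionRouteChangeReasonKey] as? UInt,
           let reason = AVAudioSession.RouteChangeReason(rawValue: raw) {
            print("Audio route changed, reason code: \(reason.rawValue)")
        }
    }
    _ = NotificationCenter.default.addObserver(forName: AVAudioSession.interruptionNotification,
                                               object: session, queue: .main) { note in
        print("Audio session interruption: \(String(describing: note.userInfo))")
    }
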
  BZMedia Developer Guide