Note: this recognizer runs on a web server, so the audio file will be uploaded using HTTP.
How to use the Model Speech Alignment recognizer from within ELAN
The Model Speech Alignment recognizer written at the Max Planck
Institute for Psycholinguistics uses the CMU Sphinx speech
recognition engine to suggest a time-alignment between an audio
recording and a plain text file provided by the user. It expects the
input data to be in one of a few common spoken languages, as
selected by the user. Models for other languages can be added once
they become available in a Sphinx-compatible format.
The Sphinx source code is available from the Sphinx
website at CMU, which also gives information about models.
- Input: A single audio recording (preferably in
uncompressed PCM RIFF WAVE audio *.wav file format, as this is
compatible with ELAN visualization) and a plain text file in
ASCII or UTF-8 encoding (*.txt file) with a transcription of
all spoken text in the audio recording.
- Settings: Model, giving the user the choice between
the English, German and Dutch language models of Sphinx. Other
Sphinx models can be integrated when available.
- Output: A single annotation tier in CSV format,
aligning the input text with the time course of the input audio
recording. The recognizer attempts to re-synchronize after
missed or misdetected words, but shorter recordings work
better.
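As a rough illustration of how the CSV output could be inspected outside ELAN, the following Python sketch reads an alignment tier and lists long pauses between consecutive annotations, which may be worth checking by hand where the recognizer had to re-synchronize. The column layout (start time, end time, text, with times in seconds) is an assumption made for this example and is not taken from the recognizer's documentation; the names Annotation, read_tier and find_gaps are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    start: float  # assumed unit: seconds
    end: float    # assumed unit: seconds
    text: str


def read_tier(csv_text: str) -> List[Annotation]:
    """Parse rows of the ASSUMED form 'start,end,text' into annotations."""
    rows = csv.reader(io.StringIO(csv_text))
    return [Annotation(float(r[0]), float(r[1]), r[2]) for r in rows if r]


def find_gaps(tier: List[Annotation], min_gap: float = 1.0) -> List[Tuple[float, float]]:
    """Return (end, next_start) pairs where two consecutive annotations
    are at least min_gap seconds apart."""
    return [
        (a.end, b.start)
        for a, b in zip(tier, tier[1:])
        if b.start - a.end >= min_gap
    ]


# Made-up sample data in the assumed layout.
sample = "0.00,0.42,hello\n0.42,0.90,world\n3.10,3.55,again\n"
tier = read_tier(sample)
gaps = find_gaps(tier)
print(gaps)
```

If the real output uses different columns, units or delimiters, only read_tier needs to change.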
AVATecH and AUVIS compatible recognizers have the following
categories of settings, input and output elements:
- input media: ELAN automatically uses the first suitable media
file of your current annotation session, but you can change
that to other supported files belonging to the session. Very
few recognizers expect multiple input media files or extra
input files in 'timeseries' or recognizer-specific formats.
- input tiers: Some recognizers need input in the form of an
annotation tier, for example to select timespans of interest.
For some recognizers, the input is expected to be the output
of another recognizer. This gives you a chance to edit and
correct data - often simply tiers - between the two steps.
- numerical input: Recognizers can be configured through
numerical 'knobs'. ELAN can show those as a slider or a text field.
Recognizers often work well enough with the defaults.
- choice input: Recognizers can give you the option
to select settings from a pre-defined list. An example can
be 'verbose/normal/silent' messages or 'high/low' sensitivity.
ELAN shows drop down selectors here. In special cases, a
recognizer can also have 'any text' configuration items.
- output: Recognizers often produce one or more annotation
tiers. ELAN will offer to add those to your annotation
session as new tiers. It is also possible for recognizers
to output timeseries (which ELAN can show as curves) or
even audio, video or other files. Most recognizers only
produce zero or more tiers (plus log messages) as output.
It is often possible to selectively skip some output steps.
- log: You can open a window showing general messages from
the recognizer, tagged by type (e.g. DEBUG, INFO, WARN,
ERROR, RESULT or PROGRESS). Messages of higher priority
also update the processing status display, so they can
be seen directly without having to review the log text.
- basic or advanced recognizer settings: ELAN gives you
the choice to either hide or show 'advanced' settings. Default
values will be used for those settings which are hidden.
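To illustrate the tagged log messages described in the list above, here is a minimal Python sketch that groups saved log lines by their tag. The line format 'TAG: message' is an assumption made for this example; the actual format written by ELAN or by a given recognizer may differ. The function name group_by_tag and the UNTAGGED bucket are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Message types named in the text above.
TAGS = ("DEBUG", "INFO", "WARN", "ERROR", "RESULT", "PROGRESS")


def group_by_tag(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines of the ASSUMED form 'TAG: message' by their tag.

    Lines without a known tag are collected under 'UNTAGGED'.
    """
    grouped = defaultdict(list)
    for line in lines:
        tag, sep, message = line.partition(":")
        tag = tag.strip()
        if sep and tag in TAGS:
            grouped[tag].append(message.strip())
        else:
            grouped["UNTAGGED"].append(line.strip())
    return dict(grouped)


# Made-up sample log in the assumed format.
sample_log = [
    "INFO: model loaded",
    "PROGRESS: 0.5",
    "WARN: low confidence at 12.3 s",
    "plain line without a tag",
]
grouped = group_by_tag(sample_log)
print(grouped.get("WARN", []))
```

Grouping like this makes it easy to look only at WARN and ERROR messages after a long run.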
Your default ELAN configuration invokes a CLAM REST web service
wrapper on catalog.clarin.eu to have your files analyzed.
In other words, your media files and, if applicable, input tiers will
be uploaded for processing and ELAN will process the downloaded (tier
or other) results as if you had done the processing locally. For use
in situations where a web service cannot be used (files that are too
large or no internet connection available) you can also request a copy of the
recognizer for local installation on Linux or Windows, protected by a USB dongle.
For this and for general support with the use of this recognizer,
please contact auvis@mpi.nl or use
the ELAN and AUVIS forums on the website of
The Language Archive.
CLAM, ELAN and the client-side recognizer proxy are free open source
software under the GNU General Public License. However, some of the
recognizers can be proprietary closed source software. Licenses for academic use are
available on request. Use of the web services is free at the moment,
but may be limited to the academic community if it becomes necessary.