Note: this recognizer runs on a web server; the audio file will be uploaded using HTTP.
How to use the Speaker Diarization recognizer from within ELAN
The Speaker Diarization recognizer, written by
Fraunhofer IAIS (Fraunhofer-Institut für Intelligente Analyse- und
Informationssysteme), analyzes an audio file and groups the audio
segments provided as second input by similarity. The similarity
measure aims to group all utterances by the same speaker together.
By using a tier of speech segments as input and splitting the output
into multiple tiers by their speaker number, a draft segmentation of
an audio recording featuring multiple speakers can be created. This
helps to reduce the time required to annotate common recordings.
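The splitting step described above can be sketched as follows. This is a minimal illustration only: the `(start_ms, end_ms, label)` records and the `split_by_speaker` helper are hypothetical and do not reflect the recognizer's actual XML or CSV output format or any ELAN API.

```python
from collections import defaultdict

# Hypothetical segment records: (start_ms, end_ms, speaker_label).
# The labels stand in for the speaker numbers assigned by the recognizer.
segments = [
    (0, 1500, "1"),
    (1500, 4200, "2"),
    (4200, 6000, "1"),
    (6000, 7300, "3"),
]


def split_by_speaker(segments):
    """Group (start, end, label) segments into one list of spans per label."""
    tiers = defaultdict(list)
    for start, end, label in segments:
        tiers[label].append((start, end))
    # Sort each per-speaker list by start time so spans appear in order.
    return {label: sorted(spans) for label, spans in tiers.items()}


tiers = split_by_speaker(segments)
for label in sorted(tiers):
    print(f"speaker_{label}: {tiers[label]}")
```

Each key of the resulting dictionary corresponds to one draft tier, which can then be corrected by hand.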
- Input: A single WAV audio file; the recommended format is
uncompressed PCM RIFF WAVE. Other formats may be supported by
the recognizer, but ELAN, for example, will not visualize
them. A tier marking the timespans of speech in the
recording is also required as second input.
- Settings:
- Global perc threshold: selects how similar audio
snippets have to be to be judged as coming from the same
speaker. In some cases, this allows grouping of utterances
by speaker gender (relaxed) or further subdivision of
utterances by the same speaker into several groups
(strict). The main use, of course, is to adjust the
recognition parameters so that utterances are grouped
exactly by speaker.
- Lambda: selects how complex the recognizer's notion of
'speaker' should be. Small lambda values make the
recognizer more likely to put utterances of the same speaker
into several sub-groups.
- Num max segs: influences the balance between speed and
accuracy. Larger values can mean slower but more accurate
processing.
- Output: A single tier annotating the provided speech
segments by possible speaker number. Numbering of the speakers
can vary depending on multiple factors, but annotations with the
same label will generally correspond to speech from the same
speaker. Available in default XML and alternate CSV format.
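Before submitting a recording, it can be useful to confirm that it really is in the recommended uncompressed PCM WAVE format. The following is a minimal sketch using Python's standard `wave` module, which does not support compressed WAV data; the helper names `describe_pcm_wav` and `is_pcm_wav` are made up for this illustration and are not part of ELAN or the recognizer.

```python
import wave


def describe_pcm_wav(path):
    """Return basic properties of a WAV file readable by the wave module.

    Raises wave.Error if the file is not a RIFF WAVE file in a format
    that the standard-library reader supports.
    """
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def is_pcm_wav(path):
    """Return True if the file can be opened as an uncompressed WAV file."""
    try:
        describe_pcm_wav(path)
        return True
    except (wave.Error, EOFError):
        # wave.Error: not RIFF/WAVE or unsupported encoding.
        # EOFError: file is empty or truncated.
        return False
```

For example, `is_pcm_wav("recording.wav")` (with a file name of your own) returns `False` for a compressed or non-WAV file, which suggests converting the file before selecting it as input media.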
AVATecH and AUVIS compatible recognizers have the following
categories of settings, input and output elements:
- input media: ELAN automatically uses the first suitable media
file of your current annotation session, but you can change
that to other supported files belonging to the session. Very
few recognizers expect multiple input media files or extra
input files in 'timeseries' or recognizer-specific formats.
- input tiers: Some recognizers need input in the form of an
annotation tier, for example to select timespans of interest.
For some recognizers, the input is expected to be the output
of another recognizer. This gives you a chance to edit and
correct data - often simply tiers - between the two steps.
- numerical input: Recognizers can be configurable by
numerical 'knobs'. ELAN can show those as a slider or a field.
Recognizers often work well enough with the default values.
- choice input: Recognizers can give you the option
to select settings from a pre-defined list. An example can
be 'verbose/normal/silent' messages or 'high/low' sensitivity.
ELAN shows drop down selectors here. In special cases, a
recognizer can also have 'any text' configuration items.
- output: Recognizers often produce one or more annotation
tiers. ELAN will offer to add those to your annotation
session as new tiers. It is also possible for recognizers
to output timeseries (which ELAN can show as curves) or
even audio, video or other files. Most recognizers only
produce zero or more tiers (plus log messages) as output.
It is often possible to selectively skip some output steps.
- log: You can open a window showing general messages from
the recognizer, tagged by type (e.g. DEBUG, INFO, WARN,
ERROR, RESULT or PROGRESS). Messages of higher priority
also update the processing status display, so they can
be seen directly without having to review the log text.
- basic or advanced recognizer settings: ELAN gives you
the choice to either hide or show 'advanced' settings. Default
values will be used for those settings which are hidden.
Your default ELAN configuration invokes a CLAM REST
web service wrapper on catalog.clarin.eu to have your files analyzed.
In other words, your media files and, if applicable, input tiers will
be uploaded for processing and ELAN will process the downloaded (tier
or other) results as if you had done the processing locally. For
situations where a web service cannot be used (files too large or
no internet connection available), you can also request a copy of the
recognizer for local installation on Linux or Windows, protected by a
USB dongle.
For this and for general support with the use of this recognizer,
please contact auvis@mpi.nl or use the ELAN and AUVIS forums on the
website of The Language Archive.
CLAM, ELAN and the client-side recognizer proxy are free open source
software under the GNU General Public License. However, some of the
recognizers can be proprietary closed source software. Licenses for
academic use are
available on request. Use of the web services is free at the moment,
but may be limited to the academic community if it becomes necessary.