Frequently Asked Questions

  • Recipes — push audio, capture speech after a wake word, push-to-talk, UI-thread callbacks.
  • Concepts — thread safety, sample rate, model types, LVCSR, RTOS integration.
  • Performance — code size, memory, wake-word tuning.
  • Troubleshooting — debug audio, model compatibility, display issues.

Recipes

Short answers for tasks that come right after Your first program. Follow the links for full samples and API detail.

How do I push audio?

Use push mode when your app owns the audio path — for example a custom driver, an RTOS without a blocking read API, or fixed-size buffers from another thread. The library does not read the microphone for you; you pass each chunk to push on ->audio-pcm.

Contrast with pull mode on Your first program: there you attach fromAudioDevice (or a file stream) and call run.

  1. new / new SnsrSession() — empty Session.
  2. load your .snsr model.
  3. setHandler for ^result and any other events you need.
  4. Loop: read a chunk from your driver (often 10–20 ms of 16 kHz PCM), then snsrPush(s, SNSR_SOURCE_AUDIO_PCM, buffer, nbytes) (Java: session.push(...)).
  5. When finished, call stop once to flush buffered audio, then release.

Do not attach an input stream with setStream on ->audio-pcm in push mode — audio enters only through push. Handlers run on the thread that calls push (see UI thread callbacks).

C:

/* 15 ms @ 16 kHz mono 16-bit LE */
#define CHUNK 480  /* bytes: 240 samples x 2 bytes per sample */
char pcm[CHUNK];
size_t n = myAudioRead(pcm, CHUNK);  /* your driver; returns bytes read */
snsrPush(s, SNSR_SOURCE_AUDIO_PCM, pcm, n);

Java:

byte[] pcm = myAudioRead();  /* your driver */
session.push(Snsr.SOURCE_AUDIO_PCM, pcm);
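
The 480-byte chunk in the C snippet follows directly from the audio format. A minimal sketch of that arithmetic, assuming interleaved PCM; pcm_chunk_bytes is a hypothetical helper for illustration and is not part of the SDK:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an SDK function): number of bytes in one
 * chunk of interleaved PCM audio of the given duration. */
size_t pcm_chunk_bytes(unsigned sample_rate_hz, unsigned channels,
                       unsigned bytes_per_sample, unsigned chunk_ms)
{
    assert(sample_rate_hz > 0 && channels > 0 && bytes_per_sample > 0);
    /* Multiply before dividing so whole-millisecond chunks stay exact. */
    size_t frames = ((size_t)sample_rate_hz * chunk_ms) / 1000u;
    return frames * channels * bytes_per_sample;
}
```

For 15 ms of 16 kHz mono 16-bit audio, pcm_chunk_bytes(16000, 1, 2, 15) gives 480; 10 ms gives 320 and 20 ms gives 640.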

API overview § Push mode, push-audio.c, spot-data.c

How do I capture the audio that fired the wake word?

Use a composed spotter + VAD model so the SDK segments speech after the trigger and writes PCM to an output stream.

  1. Build or download a pipeline model — for example compose tpl-spot-vad with snsr-edit (see tpl-spot-vad-type), or use a pre-built spot+VAD .snsr from your SDK tree.
  2. load the pipeline into a Session.
  3. Attach live input on ->audio-pcm (fromAudioDevice or your push loop).
  4. Attach a WAV (or buffer) sink on <-audio-pcm: snsrSetStream(s, SNSR_SINK_AUDIO_PCM, snsrStreamFromFileName("out.wav", "w")) (wrap with fromAudioStream if needed — see live-segment.c).
  5. Register ^result to note the spot, and ^end (or ^limit) to know when the following utterance ended; stop run or return STOP from the endpoint handler.
  6. Optional: set include-leading-silence to 1 (or include-wake-word-audio) if the saved clip should include the wake word audio, not only the command.

The Java sample segmentSpottedAudio.java runs this flow with Gradle; C uses the same settings in live-segment.c.

Read begin-ms / end-ms in the endpoint handler if you need timestamps without writing a file.

How do I gate a command set with a push-to-talk button or other external event?

Use tpl-spot-sequential with loop = 2. This template normally listens for a wake word in slot 0, then a command in slot 1; with loop = 2 it skips slot 0 and pins listening to the command recognizer in slot 1 until you reset it.

A typical "wake word or push-to-talk" flow:

  1. Build a sequential model with your wake word in slot 0 and your command set in slot 1 (tpl-spot-sequential § Examples).
  2. Run the Session in the default mode (loop = 0). Slot 0 listens for the wake word and hands off to slot 1 after a spot.
  3. When the user presses the push-to-talk button, set loop = 2 from your UI thread (or whichever thread receives the button event) and the recognizer will treat the next utterance as a slot-1 command.
  4. After the command spots, set loop = 0 to resume always-listening behavior.
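
The four steps above amount to a small state machine around the loop setting. The sketch below shows only that bookkeeping. It assumes the application supplies its own function for writing the loop value to the model; set_loop_fn, ptt_gate, and the event names are hypothetical and are not SDK identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Values of the loop setting described in the steps above. */
enum { LOOP_WAKE_THEN_COMMAND = 0, LOOP_COMMAND_ONLY = 2 };

typedef enum {
    EV_BUTTON_PRESSED,     /* push-to-talk button went down */
    EV_WAKE_WORD_SPOTTED,  /* slot 0 spotted */
    EV_COMMAND_SPOTTED     /* slot 1 spotted */
} ptt_event;

/* Hypothetical hook: application code that writes the value to the
 * model's loop setting. May be NULL for a dry run. */
typedef void (*set_loop_fn)(int loop_value, void *user);

typedef struct {
    int loop;              /* loop value currently in effect */
    set_loop_fn set_loop;
    void *user;
} ptt_gate;

void ptt_gate_init(ptt_gate *g, set_loop_fn set_loop, void *user)
{
    assert(g != NULL);
    g->loop = LOOP_WAKE_THEN_COMMAND;  /* default, always-listening mode */
    g->set_loop = set_loop;
    g->user = user;
}

/* Applies one event and returns the loop value now in effect. */
int ptt_gate_on_event(ptt_gate *g, ptt_event ev)
{
    assert(g != NULL);
    int wanted = g->loop;
    if (ev == EV_BUTTON_PRESSED)
        wanted = LOOP_COMMAND_ONLY;
    else if (ev == EV_COMMAND_SPOTTED)
        wanted = LOOP_WAKE_THEN_COMMAND;
    /* A wake word spot needs no change: the model hands off to slot 1. */
    if (wanted != g->loop) {
        g->loop = wanted;
        if (g->set_loop != NULL)
            g->set_loop(wanted, g->user);
    }
    return g->loop;
}
```

Keeping the current value in ptt_gate means the hook is called only when the setting actually changes, for example once per button press rather than on every repeat event.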

If you want regular wake-word-gated recognition, but with the wake word allowed at the end of the utterance (for example, "… please, computer"), use wake-word-at-end on a tpl-spot-vad-lvcsr, tpl-opt-spot-vad-lvcsr, or tpl-spot-vad pipeline.

The older two-session pattern (one Session for the wake word, another for push-to-talk) is no longer recommended; a single sequential model has the same behavior and shares one audio path.

tpl-spot-sequential, loop, wake-word-at-end

How do I wire callbacks into a UI thread?

Session events and ^result handlers run on the same thread that calls run or push — not on your UI thread. Keep handlers short: copy what you need, then return. Update the UI from your toolkit's main thread.

Platform patterns:

  • Android: Run run / push on a worker thread or HandlerThread; post UI work with Handler / runOnUiThread. See snsr-debug (PhraseSpot worker thread).
  • iOS (C API): Call run from a background DispatchQueue; update SwiftUI/UIKit on the main actor. See PhraseSpot.
  • Java desktop: Run recognition off the EDT; use SwingUtilities.invokeLater (or equivalent) for UI updates.

The one cross-thread exception: you may call stop from another thread to unblock a run that is waiting on live audio (thread-safe FAQ).

Never share a Session or Stream handle across threads without your own lock; create one session per recognition worker.
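
One way to follow the "copy what you need, then return" advice in C is a small single-producer, single-consumer queue: the handler copies the result fields into a slot, and the UI thread drains the queue on its own schedule. This is a sketch under stated assumptions, not an SDK facility. spot_msg, spot_queue, and their functions are hypothetical names, and the handler is assumed to have already read the text and timestamps from the result. Only the copied data crosses threads, never the Session handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SPOT_QUEUE_SLOTS 8u  /* power of two, so index wrap-around is safe */

typedef struct {
    char text[64];
    long begin_ms;
    long end_ms;
} spot_msg;

typedef struct {
    spot_msg slots[SPOT_QUEUE_SLOTS];
    atomic_size_t head;  /* next slot to read; advanced by the UI thread */
    atomic_size_t tail;  /* next slot to write; advanced by the handler */
} spot_queue;

void spot_queue_init(spot_queue *q)
{
    assert(q != NULL);
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Call from the result handler (recognition thread only).
 * Returns 1 if queued, 0 if the queue is full and the message dropped. */
int spot_queue_push(spot_queue *q, const char *text, long begin_ms, long end_ms)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == SPOT_QUEUE_SLOTS)
        return 0;
    spot_msg *m = &q->slots[tail % SPOT_QUEUE_SLOTS];
    strncpy(m->text, text, sizeof m->text - 1);
    m->text[sizeof m->text - 1] = '\0';
    m->begin_ms = begin_ms;
    m->end_ms = end_ms;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 1;
}

/* Call from the UI thread only. Returns 1 and fills *out if a message
 * was waiting, 0 if the queue is empty. */
int spot_queue_pop(spot_queue *q, spot_msg *out)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return 0;
    *out = q->slots[head % SPOT_QUEUE_SLOTS];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 1;
}
```

The queue never blocks, so a slow UI cannot stall recognition; the trade-off is that messages are dropped when all slots are full.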

Your first program (platform tabs), push, live-spot.c

Concepts

Is this SDK thread-safe?

Yes, as long as Session and Stream handles are not shared between threads. The number of handles per thread is limited only by system resources.

If you need to share one of these handles across threads, you must provide application-level mutual exclusion locking.

Note

There is just one exception to this requirement: You may call stop on a Session handle from a different thread than the one run is executing on.

If you replace the dynamic memory allocator with config and CONFIG_ALLOC, the new allocator implementation must be thread-safe. Use allocLock to add thread safety to an allocator that is not thread-safe.

What sample rate does the SDK expect?

Sample rate is technically model-dependent — read the active model's samples-per-second setting if you need to confirm — but every model shipped in this TrulyNatural distribution requires 16 kHz, mono, 16-bit signed PCM. If you need a model that runs at a lower rate (typically 8 kHz for telephony audio), contact Sales.

When your capture device runs at a different rate, follow these rules:

  • Never up-sample to 16 kHz. Up-sampling does not add the high-frequency information the recognizer relies on; the resulting audio sounds similar to a human listener but recognition accuracy will be noticeably worse than on natively recorded 16 kHz audio. If 16 kHz capture is not available, contact Sales about a sub-16 kHz model instead.
  • Down-sampling from a higher rate is fine — for example, from 48 kHz on a typical USB microphone — provided you follow standard down-sampling practice. In particular, apply a low-pass anti-aliasing filter with a cut-off below the new Nyquist frequency before decimating; otherwise high-frequency content will fold back into the band the recognizer cares about and degrade accuracy.

Most platform audio APIs (Android AudioRecord, iOS Audio Queue Services, ALSA, Core Audio, Windows Multimedia Extensions) will resample correctly when you ask them for 16 kHz directly; prefer that over rolling your own filter chain.
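
If the platform cannot deliver 16 kHz and you have to decimate 48 kHz audio yourself, the filter-then-decimate rule above can be sketched as follows. This is an illustration of the technique, not the SDK's resampler: the function names are hypothetical, the 6 kHz cut-off and 49-tap length are arbitrary choices, and the triangular window is used only because it needs no math library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECIM 3               /* 48 kHz -> 16 kHz */
#define HALF  24              /* taps on each side of the center tap */
#define NTAPS (2 * HALF + 1)

/* Designs a low-pass FIR with a 6 kHz cut-off at a 48 kHz input rate,
 * below the 8 kHz Nyquist frequency of the 16 kHz output. The ideal
 * response sin(pi*n/4) / (pi*n) is shaped by a triangular window and
 * normalized to unity gain at DC. */
void design_lowpass(double taps[NTAPS])
{
    /* sin(pi*k/4) for k = 0..7, tabulated so no math library is needed */
    static const double s[8] = {
        0.0,  0.7071067811865476,  1.0,  0.7071067811865476,
        0.0, -0.7071067811865476, -1.0, -0.7071067811865476
    };
    const double pi = 3.141592653589793;
    double sum = 0.0;
    for (int i = 0; i < NTAPS; i++) {
        int n = (i < HALF) ? HALF - i : i - HALF;  /* distance from center */
        double ideal = (n == 0) ? 0.25 : s[n % 8] / (pi * n);
        double window = 1.0 - (double)n / (HALF + 1);
        taps[i] = ideal * window;
        sum += taps[i];
    }
    for (int i = 0; i < NTAPS; i++)
        taps[i] /= sum;
}

/* Low-pass filters the input and keeps every third sample. Samples
 * outside the buffer count as zero, so this suits whole recordings; a
 * streaming version would carry filter history between calls.
 * Returns the number of output samples written. */
size_t decimate_by_3(const double taps[NTAPS], const int16_t *in,
                     size_t n_in, int16_t *out, size_t out_cap)
{
    assert(taps != NULL && in != NULL && out != NULL);
    size_t n_out = (n_in + DECIM - 1) / DECIM;
    if (n_out > out_cap)
        n_out = out_cap;
    for (size_t j = 0; j < n_out; j++) {
        double acc = 0.0;
        for (int i = 0; i < NTAPS; i++) {
            long k = (long)(j * DECIM) + i - HALF;  /* input under tap i */
            if (k >= 0 && (size_t)k < n_in)
                acc += taps[i] * in[k];
        }
        acc += (acc >= 0.0) ? 0.5 : -0.5;           /* round to nearest */
        if (acc > 32767.0) acc = 32767.0;
        if (acc < -32768.0) acc = -32768.0;
        out[j] = (int16_t)acc;
    }
    return n_out;
}
```

A triangular window gives only modest stop-band rejection. For production audio, a properly designed filter or the platform resampler mentioned above is the safer choice.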

What is a Command Set?

Command sets are phrase spotters with more than one phrase. These are frequently tuned to have a limited listen-window.

Command set recognizers have task-type == phrasespot and can be used as a drop-in replacement for any wake word. No code changes are required.

Most command sets are tuned for use after an always-listening keyword spotter. The tpl-spot-sequential template provides a convenient way to build such a model.

Can I run two wake word models at the same time?

Yes, see tpl-spot-concurrent.

Yes. Create a new phrase spot model from the tpl-spot-vad template.

How do I enroll Fixed Trigger models?

EFT models use the same API, and follow the same enrollment recipe as UDT models.

Replace the UDT model udt-universal-3.67.1.0.snsr in any of the examples with an EFT enrollment model such as eft-hbg-enUS-23.0.0.9.snsr.

How do I improve the user experience for wake words in poor audio environments?

Use a spotter model with Smart Wake Word support. See low-fr-operating-point and duration-ms.

How do I spot phrases on a Real-Time Operating System (RTOS) with a custom audio driver and no filesystem?

Implement a custom stream modeled on data-stream.c, which spot-data-stream.c uses. The custom stream encapsulates your audio driver, and your Session pulls data from it.

An alternative is to push data onto a stream. See spot-data.c. You can take data chunks of any size (perhaps provided by your audio driver) and push them onto a stream to be read by a Session.

How do I use Large Vocabulary Continuous Speech Recognition?

This TrulyNatural release includes three different ways of running a speech-to-text recognizer: without audio segmentation, with VAD audio segmentation, and with wake word gated VAD.

Note

The ^result callback only happens when a VAD endpoint is detected, or the end of the input stream is reached. For applications with live audio recognition, LVCSR recognizers should always be used with a VAD, such as tpl-opt-spot-vad-lvcsr, tpl-spot-vad-lvcsr, or tpl-vad-lvcsr.

LVCSR without audio segmentation

The stt-enUS-automotive-medium-2.3.15-pnc.snsr model included in this distribution is a generic broad-domain US English speech-to-text recognizer with a special domain focus on automotive commands.

% bin/snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    data/enrollments/armadillo-1-3-c.wav
P     40    200 Im
P     80    640 Armadillo
P    120   1120 Armadillo playing
P    120   1520 Armadillo play marsa
P    120   1880 Armadillo play more songs by
P    120   2320 Armadillo play more songs by this art
P    120   2600 Armadillo play more songs by this artist
P    120   2640 Armadillo play more songs by this artist
NLU intent: music_player (0.9849) = armadillo play more songs by this artist
   120   2640 Armadillo play more songs by this artist.

Preliminary or partial results above are prefixed with P. Suppress these by setting the partial-result-interval to 0:

% bin/snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -s partial-result-interval=0 \
    data/enrollments/armadillo-1-3-c.wav
NLU intent: music_player (0.9849) = armadillo play more songs by this artist
   120   2640 Armadillo play more songs by this artist.

LVCSR with VAD-segmented audio

Large vocabulary recognizers perform better when used with a Voice Activity Detector that removes extraneous leading and trailing silence.

Create such a VAD-lvcsr model using the tpl-vad-lvcsr template:

% bin/snsr-edit -t model/tpl-vad-lvcsr-3.17.0.snsr \
    -f 0 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -o vad-stt-enUS-automotive-medium-pnc.snsr

Evaluate using snsr-eval:

% bin/snsr-eval -t vad-stt-enUS-automotive-medium-pnc.snsr \
    data/enrollments/armadillo-1-0-c.wav
P    230    830 Armadilla
P    270   1150 Armadillo, eight
P    310   1630 Armadillo, eighteen percent
P    310   1910 Armadillo. Eighteen percent of s
P    310   2430 Armadillo, eighteen percent of six hundred
P    310   2790 Armadillo, eighteen percent of six hundred and forty
P    310   3150 Armadillo, eighteen percent of six hundred forty three
NLU intent: no_command (0.9765) = armadillo eighteen percent of 643
NLU entity:   number (0.9564) = 643
   310   3190 Armadillo, eighteen percent of six hundred forty three.

LVCSR following a wake word

The tpl-spot-vad-lvcsr template provides a way to start a large-vocabulary recognizer with a spotted wake word. The example below enrolls a wake word, then uses the enrolled spotter with the broad-domain recognizer.

Create an enrolled spotter for "jackalope":

% spot-enroll -vt model/udt-universal-3.67.1.0.snsr \
    +jackalope \
    data/enrollments/jackalope-1-0.wav \
    data/enrollments/jackalope-1-1.wav \
    data/enrollments/jackalope-1-4.wav \
    data/enrollments/jackalope-1-3.wav
Adapting: 100% complete.
Enrolled model saved to "enrolled-sv.snsr"

Combine the enrolled spotter and the broad-domain recognizer using the tpl-spot-vad-lvcsr-3.23.0.snsr template:

% snsr-edit -vt model/tpl-spot-vad-lvcsr-3.23.0.snsr \
  -f 0 enrolled-sv.snsr \
  -f 1 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
  -s include-leading-silence=1 \
  -o jackalope-stt-enUS-automotive-medium-pnc.snsr
Saved edited model to "jackalope-stt-enUS-automotive-medium-pnc.snsr".

Evaluate using snsr-eval. The wake word is not included in the LVCSR transcription.

% snsr-eval -t jackalope-stt-enUS-automotive-medium-pnc.snsr \
    data/enrollments/jackalope-1-2-c.wav
P   1050   1530 Directions
P   1050   1930 Directions to sus
P   1050   2370 Directions to Susan's house
P   1050   2530 Directions to Susan's house
NLU intent: navigation (0.9973) = directions to susan's house
NLU entity:   navigation_location (0.9811) = susan's house
  1050   2530 Directions to Susan's house.

LVCSR with lightweight NLU parsing

The included LVCSR and STT models support a lightweight natural language mark-up. This can significantly simplify application code that has to interpret recognition results. See grammar-based recognition for a description of the grammar syntax.

NLU with custom grammar recognizers

% snsr-eval -t model/lvcsr-build-enUS-12.13.1-5MB.snsr \
    -s partial-result-interval=0 \
    -f grammar-stream data/grammars/enrollments-nlu-slot.txt \
    data/enrollments/armadillo-1-4-c.wav
NLU intent: avcontrol (0) =  record a video
NLU entity:   action (0) = record
NLU entity:   type (0) = video
   435   1995 armadillo record a video

NLU with broad-domain recognizers

In TrulyNatural v6.16.0 and later, NLU parsing is a separate processing step that occurs after the ^result event. NLU parsing includes a special . symbol that matches any input word. This allows crafting of more robust island parsers that can be used with free-form recognition results from a broad-domain model.

This example detects a small set of microwave control commands using lvcsr-lib-enUS-1.2.0.snsr.

Note

The stt-enUS-automotive-medium-2.3.15-pnc.snsr model includes machine-learned NLU processing for automotive command tasks. If you use nlu-grammar-stream with this model the grammar-based NLU will override the machine-learned NLU parsing.

tiny-microwave.nlu
# Microwave command NLU post-processor grammar
# tiny-microwave.nlu

# power level setting, "fifty percent". don't capture optional "power"
power = ~s.percent power?;

# timer duration, "two minutes and ten seconds"
duration = ~s.timer;

# defrost command: the word "defrost" followed by
# zero or more power or duration values, both captured
# .* matches any input word sequence
defrost = defrost ( .* ({power} | {duration}) .* )* ;

# default action matches any input and discards it
default = .:*;

# set clock time: the word "clock" or "time" followed by
# a time ("seven twenty nine pm").
# ignore spurious words before and after the time specification
clock = (clock | time) .* {time ~s.time} .*;

# list of all the actions we've defined, captured
action = {defrost} | {clock} | {default};

# match any one of the actions, ignoring unknown words before
# and after
nlu = <s> .* $action .* </s>;

Build and run a recognizer with live input.

% snsr-eval -vat model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -t model/lvcsr-lib-enUS-1.2.0.snsr \
    -f nlu-grammar-stream tiny-microwave.nlu \
    -s partial-result-interval=0
Using live audio from default capture device. ^C to stop.

# "Defrost my soup for 15 minutes at 30% power"
Using live audio from default capture device. ^C to stop.
  4035   8835 [^end] VAD speech region.
NLU intent: defrost (0) = defrost my soup for 15 minutes at thirty percent power
NLU entity:   duration (0) = 15 minutes
NLU entity:   power (0) = thirty percent power
  4310   8470 (0.4805) Defrost my soup for fifteen minutes at thirty percent power.

# "Could you set the clock to 3:43 pm?"
 48165  51810 [^end] VAD speech region.
NLU intent: clock (0) = clock to 15:43
NLU entity:   time (0) = 15:43
 48360  51360 (0.163) Could you set the clock to three? Forty three P? M.

Dealing with NLU parse ambiguity

It is possible to get more than one valid parse result if the NLU grammar introduces ambiguity. The NLU processor scores these alternates and returns the best hypotheses in order, up to nlu-match-max. During the ^nlu-slot callback, nlu-match-count reports the number of alternates available, with nlu-match-index the current alternate.

nlu-match-max defaults to 1 for best compatibility with earlier releases.

Warning

Resolving NLU ambiguity can be expensive both in terms of computation and heap memory use.

Avoid using patterns that match arbitrary input in multiple ways:

g = <s> {left .*} {right .*} </s>;

This example uses two NLU grammars: system.nlu for basic functionality provided by a product, and app.nlu to extend NLU processing for a plug-in application. If the application duplicates some of the system NLU actions, those duplicates need to be reported for the system to take appropriate action.

system.nlu
# system.nlu
volume = volume: {volume-level ~s.percent};
preset = preset: number:? ~s.number-integer-0-9;
system = {volume} | {preset};
# :/-0.1 adds a small weight bias towards the ~app class, so
# ~app will outscore $system for identical matches
plugin = :/-0.1 ~app;
action = {system} | {plugin};
nlu = <s> $action </s>;
app.nlu
# app.nlu
media-control = ~s.control.media;
preset = preset: ( one | two | three | four | five );
nlu = {media-control} | {preset};

Build and run a recognizer with live input. Set the value for nlu-match-max to allow up to ten alternate matches.

% snsr-eval -vvat model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -t model/lvcsr-lib-enUS-1.2.0.snsr \
    -f nlu-grammar-stream system.nlu \
    -f nlu-grammar-stream.app app.nlu \
    -s partial-result-interval=0 \
    -s nlu-match-max=10
Using live audio from default capture device. ^C to stop.

# "volume 50%"
# in system grammar
  5235 [^begin]
  4710   6645 [^end] VAD speech region.
NLU intent: system (0) =  fifty percent
NLU entity:   volume.volume-level (0) = fifty percent
NLU  1/1 nlu-slot-value.system (0) = { volume { volume-level fifty percent } }
NLU  1/1 nlu-slot-value.system.volume (0) = { volume-level fifty percent }
NLU  1/1 nlu-slot-value.system.volume.volume-level (0) = fifty percent
phrase:
  4990   6270 (0.8939) Volume. Fifty percent.
words:
  4990   5470 (0.8955) Volume.
  5550   5870 (0.9986) Fifty
  5950   6270 (0.9996) percent.

# "fast forward"
# in plugin grammar
 17070 [^begin]
 16545  17940 [^end] VAD speech region.
NLU intent: plugin (0) =  fast forward
NLU entity:   media-control (0) = fast forward
NLU  1/1 nlu-slot-value.plugin (0) = { media-control fast forward }
NLU  1/1 nlu-slot-value.plugin.media-control (0) = fast forward
phrase:
 16860  17540 (0.7646) Fast forward.
words:
 16860  17100 (0.9913) Fast
 17220  17540 (0.7713) forward.

# "preset 5"
# in both system and plugin grammars, but plugin reported first
# due to the weight bias
 22290 [^begin]
 21765  23325 [^end] VAD speech region.
NLU intent: plugin (0) =  five
NLU entity:   preset (0) = five
NLU  1/2 nlu-slot-value.plugin (0) = { preset five }
NLU  1/2 nlu-slot-value.plugin.preset (0) = five
NLU intent: system (0) =  five
NLU entity:   preset (0) = five
NLU  2/2 nlu-slot-value.system (0) = { preset five }
NLU  2/2 nlu-slot-value.system.preset (0) = five
phrase:
 22040  22920 (0.9432) Preset. Five.
words:
 22040  22480 (0.9443) Preset.
 22680  22920 (0.9988) Five.

How do I take action on an NLU result?

You can think of an intent as specifying which function or method you should call to perform an action. Entities identify parts of the utterance that include additional detail. For example, a call_contact intent might have a contact_name entity that specifies who to call.

  • Register a handler for ^nlu-intent
  • In this handler,
    • Retrieve nlu-intent-name as a string.
    • Map this intent name to an action. Do this by comparing the intent name to all valid intent names for which you want to perform an action.
    • If the matched action requires additional data, retrieve the expected nlu-entity-value by name.
    • Call a function (specified by the intent value) with zero or more arguments specified by the entity values.
    • Return from the intent event handler with OK.
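
The mapping from intent name to action can be kept in a small table. The sketch below shows only that dispatch step and makes several assumptions: the handler has already copied the intent name and entity values out of the Session (the SDK calls for that are not shown), nlu_entity and the function names are hypothetical, and hang_up is an invented second intent alongside the call_contact example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One entity, copied out of the recognition result by the application. */
typedef struct { const char *name; const char *value; } nlu_entity;

/* Returns the value of the named entity, or NULL if it is absent. */
const char *find_entity(const nlu_entity *e, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(e[i].name, name) == 0)
            return e[i].value;
    return NULL;
}

typedef int (*intent_action)(const nlu_entity *e, size_t n,
                             char *log, size_t log_cap);

static int do_call_contact(const nlu_entity *e, size_t n,
                           char *log, size_t log_cap)
{
    const char *who = find_entity(e, n, "contact_name");
    if (who == NULL)
        return -1;                      /* required entity is missing */
    snprintf(log, log_cap, "calling %s", who);
    return 0;
}

static int do_hang_up(const nlu_entity *e, size_t n,
                      char *log, size_t log_cap)
{
    (void)e; (void)n;                   /* this intent takes no entities */
    snprintf(log, log_cap, "hanging up");
    return 0;
}

static const struct { const char *intent; intent_action action; } ACTIONS[] = {
    { "call_contact", do_call_contact },
    { "hang_up",      do_hang_up },     /* hypothetical intent */
};

/* Returns 0 if an action ran, 1 if the intent is not one the application
 * acts on, and -1 if a required entity was missing. */
int dispatch_intent(const char *intent, const nlu_entity *e, size_t n,
                    char *log, size_t log_cap)
{
    assert(intent != NULL && log != NULL && log_cap > 0);
    for (size_t i = 0; i < sizeof ACTIONS / sizeof ACTIONS[0]; i++)
        if (strcmp(intent, ACTIONS[i].intent) == 0)
            return ACTIONS[i].action(e, n, log, log_cap);
    return 1;
}
```

The actions here only write a message to a log buffer; a real application would start the call or update its state instead, and then return OK from the event handler as described above.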

Performance

How can I reduce application code size?

By default, any application linked against the TrulyNatural library can run any model (.snsr) file supported by the library. You can reduce the overall code size of an application by limiting the library capabilities to only the models of interest.

Use snsr-edit with the -i flag to create custom initialization code that references only the modules used by the models included in your application. For example:

% snsr-edit -v -i -t spot-voicegenie-enUS-6.5.1-m.snsr
Output written to "snsr-custom-init.c".

This creates a custom initialization file, snsr-custom-init.c, that references only the code modules used by spot-voicegenie-enUS-6.5.1-m.snsr. Add this file to your application, and compile with -DSNSR_USE_SUBSET. This will replace all calls to snsrNew with a variant that initializes only the required modules.

You can further reduce code size by linking at the function instead of the module level. See sample/c/Makefile for compiler and linker flag examples (-ffunction-sections).

Can I avoid dynamic memory allocation?

You can avoid all calls to malloc(), realloc(), and free() by replacing the memory allocator with CONFIG_ALLOC.

For embedded use, allocTLSF is a good choice. Use it with one or more pre-defined read-write memory segments that remain valid for the lifetime of the application.

How do I improve wake word performance?

Contact Sensory if interested in pursuing these customizations. There may be additional cost involved. Not all combinations may be possible depending on platform and trigger specification.

How to measure real-time factor and MIPS

  • To measure the real-time factor, time how long it takes to run the spotter over a long audio file. Then, real time factor = (run time in seconds) / (length of audio in seconds).
  • To measure the MIPS on your device, use a profiler like perf when running the spotter over an audio file. Then, MIPS = (No. of instructions) / (length of audio in seconds * 1000000).
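
The two formulas above, written out as code. The function names are hypothetical, and the inputs are whatever your timer and profiler report.

```c
#include <assert.h>

/* Real-time factor: processing time divided by audio duration.
 * Values below 1.0 mean the spotter runs faster than real time. */
double real_time_factor(double run_seconds, double audio_seconds)
{
    assert(audio_seconds > 0.0);
    return run_seconds / audio_seconds;
}

/* MIPS: millions of instructions executed per second of audio. */
double mips(double instruction_count, double audio_seconds)
{
    assert(audio_seconds > 0.0);
    return instruction_count / (audio_seconds * 1000000.0);
}
```

With illustrative numbers (not measurements): a 600-second file processed in 12 seconds has a real-time factor of 0.02, and 3.0e10 instructions over the same file works out to 50 MIPS.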

What if the spotter runs too slow, or consumes too many cycles?

To reduce CPU load, try a multi-threaded, frame-stacked, or little-big spotter. A smaller spotter model also uses less CPU (in proportion to its size), at the cost of a small reduction in FA and FR performance. Contact Sensory to see if these options are right for you.

What if the spotter consumes too much memory?

  1. Contact Sensory for a smaller model.
  2. If your platform runs code directly from ROM, consider converting the spotter to compiled-in code. This will run from read-only code space, and reduce heap requirements. Use the snsr-edit tool to create a C source file from any spotter model. See fromCode and the examples spot-data-stream.c and spot-data.c.

What is a little-big spotter?

A little-big spotter does sequential recognition by first running a low-power spotter. When this spots, it re-processes the audio with a high-power state-of-the-art spotter. This reduces average CPU cycles (and hence power) required to run a spotter with a small increase in latency. This one combined model has the behavior of a high-power spotter.

What is a frame-stacked spotter?

Frame-stacked spotters reduce the CPU load by 30-45%, in exchange for a small reduction in FA and FR performance. The resolution of time alignments is also reduced by a factor of two.

What is a multi-threaded spotter?

Multi-threaded spotters speed up execution on CPUs with more than one core.

Troubleshooting

How do I diagnose wake word audio issues?

Create a new wake word model from the tpl-spot-debug template. See the notes and example.

Can I use models from the beta releases?

Yes. This release is compatible with older models, but it requires a modification to the task requirement sanity checks.

Use "~0.5.0 || 1.0.0" instead of "1.0.0", for example:

C:

snsrRequire(session, SNSR_TASK_VERSION, "~0.5.0 || 1.0.0");

Java:

session.require(Snsr.TASK_VERSION, "~0.5.0 || 1.0.0");

The models included in the v6.0.0 release use task-version values of 1.0.0. This makes these models incompatible with 5.0.0-beta releases.

How do I display international characters in results?

On Windows systems, when using Sensory STT models with snsr-eval v7.3.0 or earlier, international characters such as Chinese (zhCN) may appear as garbled symbols such as "Σ╜á σÑ╜ σÉù" instead of the correct UTF-8 characters "你 好 吗".

This is a display encoding issue, not an issue with the recognition output itself.

Solution Options

  1. Set Console Code Page to UTF-8

    Before running snsr-eval.exe, run the following command in the Windows Command Prompt:

    chcp 65001
    

    This sets the console's code page to UTF-8, enabling correct display of international characters.

    snsr-eval v7.4.0 and later does this before writing any output.

  2. Enable System-Wide UTF-8 Support (Recommended for Long-Term Use)

    • Open Settings > Time & Language > Administrative Language Settings
    • Under Change system locale, check: "Beta: Use Unicode UTF-8 for worldwide language support"
    • Save your changes and restart your computer to apply them

    This setting ensures that all applications and the console will handle UTF-8 properly by default.
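
Option 1 can also be applied from inside your own program, which is what snsr-eval v7.4.0 and later is described as doing. A minimal sketch using the Win32 SetConsoleOutputCP call; enable_utf8_console is a hypothetical helper name, and the non-Windows branch simply assumes the terminal is already set up for UTF-8.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: call once, before printing any recognition
 * results. Returns 1 on success, 0 on failure. */
int enable_utf8_console(void)
{
#ifdef _WIN32
    /* CP_UTF8 is code page 65001, the same value passed to chcp.
     * This call changes the console output code page only. */
    assert(CP_UTF8 == 65001);
    return SetConsoleOutputCP(CP_UTF8) ? 1 : 0;
#else
    /* Assumption: non-Windows terminals already use UTF-8. */
    return 1;
#endif
}
```

Result strings still have to be written as UTF-8 bytes; this call only changes how the console interprets them.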