Implementing Hotword Detection with AssemblyAI's Streaming Speech-to-Text in Go - Blockchain.News

Implementing Hotword Detection with AssemblyAI's Streaming Speech-to-Text in Go

Learn how to implement hotword detection using AssemblyAI's Streaming Speech-to-Text API with Go. This guide covers setup, coding, and execution.


  • Jun 26, 2024 07:43

Hotword detection is a core feature of voice-activated systems such as Siri and Alexa. A recent AssemblyAI tutorial shows developers how to implement it in Go using AssemblyAI's Streaming Speech-to-Text API.

Introduction to Hotword Detection

Hotword detection enables an AI system to respond to specific trigger words or phrases. Popular AI systems like Alexa and Siri use predefined hotwords to activate their functionalities. This tutorial from AssemblyAI demonstrates how to create a similar system, named 'Jarvis' in homage to Iron Man, using Go and AssemblyAI's API.

Setting Up the Environment

Before writing any code, developers need to set up their environment. This means installing the Go bindings for PortAudio, which capture raw audio data from the microphone, and the AssemblyAI Go SDK, which talks to the API. The Go bindings wrap the native PortAudio library, so that library must also be installed on the system. The following commands set up the project:

mkdir jarvis
cd jarvis
go mod init jarvis
go get github.com/gordonklaus/portaudio
go get github.com/AssemblyAI/assemblyai-go-sdk

Next, an AssemblyAI account is required to obtain an API key. Developers can sign up on the AssemblyAI website and configure their billing details to access the Streaming Speech-to-Text API.

Implementing the Recorder

The core functionality begins with recording raw audio data. The tutorial creates a recorder.go file that defines a recorder struct, which captures audio through PortAudio. The struct has methods for starting, stopping, closing, and reading from the audio stream.

package main

import (
    "bytes"
    "encoding/binary"

    "github.com/gordonklaus/portaudio"
)

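// recorder wraps a PortAudio input stream and the buffer it fills.
type recorder struct {
    stream *portaudio.Stream
    in     []int16 // most recent block of 16-bit PCM samples
}

// newRecorder opens the default input device with one input channel
// and no output channels.
func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
    in := make([]int16, framesPerBuffer)

    stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
    if err != nil {
        return nil, err
    }

    return &recorder{
        stream: stream,
        in:     in,
    }, nil
}

// Read blocks until the buffer is full, then returns its samples
// encoded as little-endian bytes.
func (r *recorder) Read() ([]byte, error) {
    if err := r.stream.Read(); err != nil {
        return nil, err
    }

    buf := new(bytes.Buffer)

    if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

// Start begins capturing audio.
func (r *recorder) Start() error {
    return r.stream.Start()
}

// Stop pauses capturing; the stream can be started again.
func (r *recorder) Stop() error {
    return r.stream.Stop()
}

// Close releases the underlying stream.
func (r *recorder) Close() error {
    return r.stream.Close()
}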
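
The byte conversion in Read can be tried without a microphone. The following is a minimal sketch that isolates that step; samplesToBytes and the sample values are hypothetical names and data chosen for illustration, not part of the tutorial.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// samplesToBytes encodes 16-bit PCM samples as little-endian bytes,
// the same conversion recorder.Read performs on its buffer.
// The function name is an assumption made for this sketch.
func samplesToBytes(samples []int16) ([]byte, error) {
	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.LittleEndian, samples); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Hypothetical sample values picked to make the byte order visible.
	b, err := samplesToBytes([]int16{1, -2, 256})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	// Each sample becomes two bytes, low byte first.
	fmt.Printf("% x\n", b) // prints "01 00 fe ff 00 01"
}
```

Every sample occupies two bytes, so a buffer of N frames on a single channel yields 2N bytes per call to Read.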

Creating the Real-Time Transcriber

AssemblyAI's real-time client reports each stage of the transcription process through event handlers. The handlers are fields of the SDK's RealTimeTranscriber struct: OnSessionBegins, OnSessionTerminated, OnPartialTranscript, OnFinalTranscript, and OnError.

package main

import (
    "fmt"

    "github.com/AssemblyAI/assemblyai-go-sdk"
)

var transcriber = &assemblyai.RealTimeTranscriber{
    OnSessionBegins: func(event assemblyai.SessionBegins) {
        fmt.Println("session begins")
    },

    OnSessionTerminated: func(event assemblyai.SessionTerminated) {
        fmt.Println("session terminated")
    },

    OnPartialTranscript: func(event assemblyai.PartialTranscript) {
        fmt.Printf("%s\r", event.Text)
    },

    OnFinalTranscript: func(event assemblyai.FinalTranscript) {
        fmt.Println(event.Text)
    },

    OnError: func(err error) {
        fmt.Println(err)
    },
}
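
The OnPartialTranscript handler prints each partial result followed by a carriage return, so the next one overwrites it on the same line. If a later partial is shorter than the previous one, leftover characters could remain visible. The following is a small sketch of one way to handle that, assuming the previous width is tracked by the caller; padLine and the sample strings are hypothetical and not part of the tutorial or the SDK.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// padLine right-pads text with spaces up to prevWidth runes so that a
// shorter line fully overwrites a longer one printed before it.
// The function name is an assumption made for this sketch.
func padLine(text string, prevWidth int) string {
	n := utf8.RuneCountInString(text)
	if n >= prevWidth {
		return text
	}
	return text + strings.Repeat(" ", prevWidth-n)
}

func main() {
	// Hypothetical partial transcripts; the last one is shorter than the second.
	partials := []string{"hey jar", "hey jarvis what", "hey Jarvis"}

	width := 0
	for _, p := range partials {
		fmt.Printf("%s\r", padLine(p, width))
		width = utf8.RuneCountInString(p)
	}
	fmt.Println()
}
```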

Stitching Everything Together

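The final step integrates all components in the main.go file: setting up the API client, initializing the recorder, and handling the transcription events. The transcriber shown earlier is extended here with a hotword check in OnFinalTranscript. Because Go does not allow two package-level variables with the same name, this version replaces the earlier definition rather than sitting alongside it.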

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "os/signal"
    "strings"
    "syscall"

    "github.com/AssemblyAI/assemblyai-go-sdk"
    "github.com/gordonklaus/portaudio"
)

var hotword string

var transcriber = &assemblyai.RealTimeTranscriber{
    OnSessionBegins: func(event assemblyai.SessionBegins) {
        fmt.Println("session begins")
    },

    OnSessionTerminated: func(event assemblyai.SessionTerminated) {
        fmt.Println("session terminated")
    },

    OnPartialTranscript: func(event assemblyai.PartialTranscript) {
        fmt.Printf("%s\r", event.Text)
    },

    OnFinalTranscript: func(event assemblyai.FinalTranscript) {
        fmt.Println(event.Text)
        hotwordDetected := strings.Contains(
            strings.ToLower(event.Text),
            strings.ToLower(hotword),
        )
        if hotwordDetected {
            fmt.Println("I am here!")
        }
    },

    OnError: func(err error) {
        fmt.Println(err)
    },
}

func main() {
    logger := log.New(os.Stderr, "", log.Lshortfile)

    // The hotword is the first command-line argument.
    if len(os.Args) < 2 {
        logger.Fatal("usage: go run . <hotword>")
    }
    hotword = os.Args[1]

    // Stop cleanly on Ctrl+C or SIGTERM.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

    if err := portaudio.Initialize(); err != nil {
        logger.Fatal(err)
    }
    defer portaudio.Terminate()

    device, err := portaudio.DefaultInputDevice()
    if err != nil {
        logger.Fatal(err)
    }

    var (
        apiKey     = os.Getenv("ASSEMBLYAI_API_KEY")
        sampleRate = device.DefaultSampleRate

        // Each buffer holds roughly 200 ms of audio.
        framesPerBuffer = int(0.2 * sampleRate)
    )

    client := assemblyai.NewRealTimeClientWithOptions(
        assemblyai.WithRealTimeAPIKey(apiKey),
        assemblyai.WithRealTimeSampleRate(int(sampleRate)),
        assemblyai.WithRealTimeTranscriber(transcriber),
    )

    ctx := context.Background()

    if err := client.Connect(ctx); err != nil {
        logger.Fatal(err)
    }

    rec, err := newRecorder(int(sampleRate), framesPerBuffer)
    if err != nil {
        logger.Fatal(err)
    }

    if err := rec.Start(); err != nil {
        logger.Fatal(err)
    }

    for {
        select {
        case <-sigs:
            fmt.Println("stopping recording...")
            if err := rec.Stop(); err != nil {
                logger.Fatal(err)
            }
            // os.Exit skips deferred calls, so release the stream here.
            if err := rec.Close(); err != nil {
                logger.Fatal(err)
            }
            if err := client.Disconnect(ctx, true); err != nil {
                logger.Fatal(err)
            }
            os.Exit(0)
        default:
            // Read one buffer from the microphone and stream it to the API.
            b, err := rec.Read()
            if err != nil {
                logger.Fatal(err)
            }
            if err := client.Send(ctx, b); err != nil {
                logger.Fatal(err)
            }
        }
    }
}
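
The hotword check in OnFinalTranscript relies on strings.Contains, which also matches inside longer words and reports a match for every transcript when the hotword is an empty string. The following is a sketch of a stricter alternative that compares whole words and ignores case and punctuation. It is not the tutorial's method; containsHotword and the sample sentences are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases s and splits it on anything that is not a letter or digit.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// containsHotword reports whether hotword appears in text as a whole word
// or as a run of consecutive whole words. An empty hotword never matches.
// The function name is an assumption made for this sketch.
func containsHotword(text, hotword string) bool {
	words, target := tokenize(text), tokenize(hotword)
	if len(target) == 0 {
		return false
	}
	for i := 0; i+len(target) <= len(words); i++ {
		match := true
		for j, w := range target {
			if words[i+j] != w {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical transcripts used only to exercise the function.
	fmt.Println(containsHotword("Hey, Jarvis!", "jarvis"))          // true
	fmt.Println(containsHotword("The jarvises are here", "Jarvis")) // false
	fmt.Println(containsHotword("OK computer, lights on", "ok computer")) // true
}
```

Whether whole-word matching is preferable depends on how the speech model transcribes the chosen hotword; the substring check is more forgiving of unexpected suffixes.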

Running the Application

To run the application, developers need to set their AssemblyAI API key as an environment variable and execute the Go program with the desired hotword:

export ASSEMBLYAI_API_KEY='***'
go run . Jarvis

This command sets 'Jarvis' as the hotword. The program prints 'I am here!' whenever a final transcript contains the hotword, compared without regard to case.
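
Both inputs, the hotword argument and the API key variable, are easy to forget, so a validation step at startup can give clearer errors than a failed connection. The following is a minimal sketch under that assumption; config, loadConfig, and the error messages are hypothetical and not part of the tutorial.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// config holds the two inputs the program needs at startup.
// The type is an assumption made for this sketch.
type config struct {
	hotword string
	apiKey  string
}

// loadConfig validates the command-line arguments and the environment.
// getenv is passed in so the function can be exercised without touching
// the real environment.
func loadConfig(args []string, getenv func(string) string) (config, error) {
	if len(args) < 2 || strings.TrimSpace(args[1]) == "" {
		return config{}, errors.New("missing hotword argument")
	}
	key := getenv("ASSEMBLYAI_API_KEY")
	if key == "" {
		return config{}, errors.New("ASSEMBLYAI_API_KEY is not set")
	}
	return config{hotword: strings.TrimSpace(args[1]), apiKey: key}, nil
}

func main() {
	// Hypothetical inputs standing in for os.Args and os.Getenv.
	emptyEnv := func(string) string { return "" }

	if _, err := loadConfig([]string{"jarvis", "Jarvis"}, emptyEnv); err != nil {
		fmt.Println("error:", err) // prints "error: ASSEMBLYAI_API_KEY is not set"
	}
}
```

In the real program the call would be loadConfig(os.Args, os.Getenv), made before the audio device or the API connection is opened.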

Conclusion

This tutorial by AssemblyAI provides a comprehensive guide for developers to implement hotword detection using their Streaming Speech-to-Text API and Go. The combination of PortAudio for capturing audio and AssemblyAI for transcription offers a powerful solution for creating voice-activated applications. For more details, visit the original tutorial.
