How ElevenLabs Voice AI Is Replacing Screens in Warehouse and Manufacturing Operations

I built a voice-guided picking system to prove it: what used to require a $5,000 proprietary headset now runs on a smartphone with ElevenLabs.

(Image generated with Gemini by Samir Saci)

A warehouse picking operation is the process of collecting items from storage locations to fulfil customer orders.

It is one of the most labour-intensive activities in logistics, accounting for up to 55% of total warehouse operating costs.

Example of a warehouse layout where operators need to pick in multiple locations - (Image by Samir Saci)

For each order, an operator receives a list of items to collect from their storage locations.

They walk to each location, identify the product, pick the right quantity, and confirm the operation before moving to the next line.

In most warehouses, operators rely on RF scanners or handheld tablets to receive instructions and confirm each pick.

  • What happens when your operators need both hands to handle heavy items?
  • How do you onboard operators who don't read the local language fluently?

Voice picking solves this by replacing the screen with audio instructions: the system tells the operator where to go and what to pick, and the operator confirms verbally.

Illustration of an operator using voice picking - (Image by Samir Saci)

When I was designing supply chain solutions in logistics companies, vocalisation was the default choice, especially for price-sensitive projects.

In my experience, operators using vocalisation can reach a productivity of 250 boxes/hour in retail and FMCG operations.

The concept is not new. Hardware providers and software vendors have offered voice-picking solutions since the early 2000s.

But these systems come with significant constraints:

  • Proprietary hardware at $2,000 to $5,000 per headset
  • Vendor-locked software with limited customisation
  • Long deployment cycles of 3 to 6 months per site
  • Rigid language support that requires retraining for each new language

For a 50-person warehouse, the total investment reaches $150K to $300K, excluding training costs.

What if you could achieve similar results using a smartphone, a browser, and modern AI voice technology?

In this article, I will show how I built a minimalist voice-picking module that integrates with Warehouse Management Systems, using ElevenLabs for text-to-speech and speech recognition.

Example screens of the app, designed for use on a smartphone with a voice interface - (Image by Samir Saci)

This web application has been deployed in the distribution centre of a small supermarket chain.

The objective is not to design solutions that compete with market leaders, but rather to offer an alternative to operations that lack the capacity to invest in expensive equipment.

Problem Statement

Before we get into voice picking, let me introduce the operations this AI-powered web application will support.

Layout of the distribution center - (Image by Samir Saci)

This is the central distribution centre of a small supermarket chain that delivers to 50 stores in Central Europe.

Layout of the warehouse with 10 aisles and 12 pallet positions displayed on the app - (Image by Samir Saci)

The facility is organised in a grid layout with aisles (A through L) and positions along each aisle:

  • Each location stores a specific SKU with a known quantity of boxes.
  • Operators need to know where to go and what to expect when they arrive.

In the application, our operators can tap any location on the grid to see its contents: the SKU reference, the number of boxes currently stored, and any notes attached to that position.

Operators can check their picking list but also detailed information per location - (Image by Samir Saci)
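A minimal sketch of the lookup behind a tap on the grid, assuming the WMS data has been loaded into a dictionary keyed by location code. The `Location` class, the `LAYOUT` sample and the `describe_location` helper are hypothetical names used for illustration, and the stock figures are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    code: str        # location code, e.g. "A-03-2" (format assumed)
    sku: str         # SKU reference stored at this location
    boxes: int       # number of boxes currently stored
    notes: str = ""  # free-text note attached to the position


# Hypothetical snapshot of the WMS data, keyed by location code
LAYOUT = {
    "A-03-2": Location("A-03-2", "SKU-1042", 24, "Fragile"),
    "B-07-1": Location("B-07-1", "SKU-2210", 60),
}


def describe_location(code: str) -> str:
    """Return the text shown when the operator taps a location."""
    loc = LAYOUT.get(code)
    if loc is None:
        return f"{code}: no stock record"
    note = f" ({loc.notes})" if loc.notes else ""
    return f"{loc.code}: {loc.sku}, {loc.boxes} boxes{note}"
```

For example, `describe_location("A-03-2")` returns the SKU, the box count and the note for that position.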

This visual layout, connected to the Warehouse Management System (WMS) database, also supports inventory cycle counting, but here we focus on its role as the spatial reference for picking operations.

Now let's see how operators use this visual interface to prepare orders.

How the Picking Flow Works

A picking batch is a group of customer orders consolidated into a single work assignment.

The system generates a batch with multiple order lines with instructions:

  • Where to go (the storage location)
  • What to pick (the SKU reference)
  • How many boxes to collect

Picking list (left), layout (middle), details of location (right) - (Image by Samir Saci)

The operator processes each line sequentially.

Once they confirm a pick, the system advances to the next instruction.
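The sequential flow can be sketched as a small data structure. This is an illustration under assumptions, not the application's actual model: `PickLine` and `PickingBatch` are hypothetical names, and the sketch assumes that reporting an issue also moves on to the next line.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PickLine:
    location: str            # where to go
    sku: str                 # what to pick
    boxes: int               # how many boxes to collect
    status: str = "pending"  # "pending", "picked" or "issue"


@dataclass
class PickingBatch:
    lines: List[PickLine]
    index: int = 0           # position of the current instruction

    def current(self) -> Optional[PickLine]:
        """Return the line to process, or None when the batch is done."""
        if self.index < len(self.lines):
            return self.lines[self.index]
        return None

    def confirm(self) -> None:
        self._close("picked")

    def report_issue(self) -> None:
        self._close("issue")

    def _close(self, status: str) -> None:
        line = self.current()
        if line is None:
            raise RuntimeError("batch already complete")
        line.status = status
        self.index += 1      # advance to the next instruction
```

Keeping the cursor inside the batch means the screen and the voice interface can share the same state.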

This sequence matters because it determines the walking path through the warehouse: a pathfinding algorithm orders the lines to minimise the total distance.

Example of the original pathfinding solution (bottom) and the optimized (top)
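The article does not specify which pathfinding algorithm is used, so the sketch below swaps in a simple nearest-neighbour heuristic to show the idea of reordering pick lines. It assumes location codes of the form aisle-position-level (such as "A-03-2") and uses a Manhattan distance on (aisle, position), which ignores racks and cross-aisles. All function names are hypothetical.

```python
def parse(code):
    """Turn "B-03-1" into grid coordinates (aisle index, position)."""
    aisle, position, _level = code.split("-")
    return ord(aisle) - ord("A"), int(position)


def distance(a, b):
    """Manhattan distance between two grid points (a simplification)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def order_picks(codes, start=(0, 0)):
    """Greedy route: always walk to the closest remaining location."""
    remaining = list(codes)
    route, here = [], start
    while remaining:
        nearest = min(remaining, key=lambda c: distance(here, parse(c)))
        remaining.remove(nearest)
        route.append(nearest)
        here = parse(nearest)
    return route


def route_length(codes, start=(0, 0)):
    """Total distance walked when visiting the codes in the given order."""
    total, here = 0, start
    for code in codes:
        total += distance(here, parse(code))
        here = parse(code)
    return total
```

A greedy route is not guaranteed to be the shortest one, but it is usually much shorter than processing lines in their order of arrival, and `route_length` makes the comparison measurable.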

Because this is a custom application, we control the picking sequence and can implement this optimisation without relying on an external software vendor.

Initially, the customer planned to purchase a commercial solution (for voice picking) and wanted me to integrate the pathfinding solution.

After investigation, they discovered that it would have been more expensive to integrate the app into the vendor solution than to build something from scratch.

What is the process without the AI-based voice feature?

Manual Mode: The Screen-Based Baseline

In manual mode, the operator reads each instruction on screen and confirms by tapping a button.

This is the simplest version of the picking flow, as it works on any device and requires no audio capability.

Two actions are available at each step:

  • Confirm Pick: operator collected the right quantity
  • Report Issue: the location is empty, the quantity doesn't match, or the product is damaged

Our operator has to press the button to confirm the picking or report an issue - (Image by Samir Saci)

I built the manual mode as a reliable fallback in case we have issues with ElevenLabs.

But it keeps the operator's eyes and one hand tied to the device at every step.
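The fallback idea can be sketched in a few lines, assuming the voice and screen outputs are passed in as plain callables. `deliver_instruction`, `speak` and `show` are hypothetical names, and a production version would likely catch narrower exceptions.

```python
from typing import Callable


def deliver_instruction(
    text: str,
    speak: Callable[[str], None],
    show: Callable[[str], None],
) -> str:
    """Try the voice channel first and fall back to the screen on failure."""
    try:
        speak(text)
        return "voice"
    except Exception:
        # Any failure of the voice service (network, quota, timeout)
        # switches this instruction to the screen-based manual mode.
        show(text)
        return "manual"
```

The return value lets the caller decide which confirmation method to offer, the microphone or the on-screen buttons.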

We need to add vocal commands!

Voice Mode: Hands-Free with ElevenLabs

Now that you know why we want the voice mode to replace screen interaction, let me explain how I added two AI-powered components.

Text-to-Speech: ElevenLabs Reads the Instructions

When the operator starts a picking session in voice mode, each instruction is converted to speech using the ElevenLabs API.

Instead of reading "Location A-03-2, pick 4 boxes of SKU-1042" on a screen, the operator hears a natural voice say:

"Location Alpha Three Two. Pick four boxes."

ElevenLabs provides several advantages over basic browser-based TTS:

  • Natural intonation that is easier to understand in a noisy warehouse
  • 29+ languages available out of the box, with no retraining
  • Consistent voice quality across all instructions
  • Sub-second generation for short sentences like pick instructions
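A minimal sketch of this step, in two parts: turning a pick line into a sentence that is easy to hear, then preparing the text-to-speech request. It assumes location codes of the form aisle-position-level and the public ElevenLabs REST endpoint `POST /v1/text-to-speech/{voice_id}` with an `xi-api-key` header. The helper names, the voice ID and the choice of `eleven_multilingual_v2` are illustrative assumptions to check against the current API reference, not details taken from the application.

```python
import json
import urllib.request

NATO = dict(zip("ABCDEFGHIJKL", [
    "Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot",
    "Golf", "Hotel", "India", "Juliet", "Kilo", "Lima",
]))
WORDS = ["zero", "one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven", "twelve"]
API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def say_number(n: int) -> str:
    """Spell out small numbers; leave larger ones as digits for the TTS."""
    return WORDS[n] if 0 <= n < len(WORDS) else str(n)


def spoken_instruction(location: str, boxes: int) -> str:
    """Turn ("A-03-2", 4) into a short sentence for the operator."""
    aisle, position, level = location.split("-")
    where = " ".join([
        NATO[aisle],
        say_number(int(position)).capitalize(),
        say_number(int(level)).capitalize(),
    ])
    unit = "box" if boxes == 1 else "boxes"
    return f"Location {where}. Pick {say_number(boxes)} {unit}."


def build_tts_request(text, voice_id, api_key,
                      model_id="eleven_multilingual_v2"):
    """Prepare the HTTP request without sending it."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, voice_id, api_key) -> bytes:
    """Send the request and return the audio bytes of the response."""
    request = build_tts_request(text, voice_id, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

Separating `build_tts_request` from `synthesize` keeps the text formatting and the request payload testable without network access or an API key.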

But what about speech recognition?

Speech-to-Text: The Operator Confirms Verbally

After hearing the instruction, the operator walks to the location, picks the items, and needs to confirm.

Here, I made a deliberate design choice: rely on the speech recognition and reasoning capabilities of ElevenLabs.

Using a single endpoint, we capture the response and match it against expected commands:

  • "Confirm" or "Done" to validate the pick
  • "Problem" or "Issue" to flag a discrepancy
  • "Repeat" to hear the instruction again

The complete process from left to right: Step 1 -> Step 2 -> Step 3 - (Image by Samir Saci)
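The matching step can be illustrated with a plain keyword matcher applied to the transcribed text. This is a deliberately simple, English-only stand-in: the application relies on ElevenLabs for multilingual understanding, and the `COMMANDS` table and `match_command` function are hypothetical.

```python
import re
from typing import Optional

# Expected commands and the spoken keywords that trigger them
COMMANDS = {
    "confirm": {"confirm", "done"},
    "issue": {"problem", "issue"},
    "repeat": {"repeat"},
}


def match_command(transcript: str) -> Optional[str]:
    """Map a transcript to one command, or None if unclear or ambiguous."""
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    hits = [name for name, keywords in COMMANDS.items() if words & keywords]
    # Exactly one command must match; otherwise the app should ask again
    return hits[0] if len(hits) == 1 else None
```

Returning `None` for an ambiguous transcript such as "confirm problem" is a safe default: a wrongly validated pick is more costly than asking the operator to repeat.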

For a multilingual warehouse, this is a significant benefit:

  • A Czech operator and a Filipino operator can both receive instructions in their native language from the same system, without any hardware change.
  • I don't have to account for every possible language in the design of the solution.

For another product, the inventory cycle count tool presented in this article, I have used n8n with AI agent nodes to perform the same task.

n8n workflow for the voice-powered inventory cycle count tools - (Image by Samir Saci)

This worked quite well, but it required a more complex setup:

  • Two AI nodes: one for the audio transcription and one AI agent to format the output of the transcription
  • System prompts that assumed the operator was speaking English

I have replaced that with a single ElevenLabs endpoint with multilingual capabilities.

Putting both components together, a single pick cycle looks like this:

  1. The app calls ElevenLabs to generate the audio instruction
  2. The operator hears: "Location Alpha Three Two. Pick four boxes."
  3. The operator walks to the location (hands free, eyes free)
  4. The operator picks the items
  5. The operator presses the microphone button
  6. The operator says: "Confirm"
  7. The speech recognition endpoint processes the confirmation and moves to the next pick

The Complete Voice Picking Cycle - (Image by Samir Saci)
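The cycle above can be sketched as a loop, assuming the text-to-speech and speech recognition calls are injected as two callables. `run_pick_cycle`, `speak`, `listen` and the `max_attempts` limit are hypothetical. After too many unclear answers, the sketch hands the line over to manual mode.

```python
from typing import Callable, List, Optional, Tuple


def run_pick_cycle(
    instructions: List[str],
    speak: Callable[[str], None],
    listen: Callable[[], Optional[str]],
    max_attempts: int = 3,
) -> List[Tuple[str, str]]:
    """Play each instruction and wait for "confirm" or "issue".

    `listen` returns "confirm", "issue", "repeat" or None (not understood).
    "repeat" and None both replay the instruction.
    """
    results = []
    for text in instructions:
        for _ in range(max_attempts):
            speak(text)                 # steps 1-2: generate and play audio
            command = listen()          # steps 5-7: capture the answer
            if command in ("confirm", "issue"):
                results.append((text, command))
                break
        else:
            # No usable answer: let the operator finish on screen
            results.append((text, "manual"))
    return results
```

Because the two services are passed in as functions, the loop can be exercised with scripted answers before it is connected to a real microphone.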

The entire interaction takes a few seconds of system time.

What about the costs?

This is where the comparison with traditional systems becomes striking.

Comparative study - (Image by Samir Saci)

For this mid-size warehouse with 35 operators, the customer estimated that the traditional approach would cost roughly $60K to $150K in the first year.

The AI-powered approach costs a few API calls.

The trade-off is clear: traditional systems offer proven reliability and offline capability for high-volume operations.

But we have the manual mode as a fallback.

Technical architecture of this application - (Image by Samir Saci)

This AI-powered approach offers accessibility and speed for organisations that cannot justify a six-figure investment.

What Does That Mean for Operations Leaders?

Voice picking is no longer a technology reserved for the largest 3PLs and retailers with deep pockets.

If your warehouse has WiFi and your operators have smartphones, you can prototype a voice-guided picking system in days and test it on a real batch to measure the impact before committing any significant budget.

Three scenarios where this approach makes particular sense:

  • Multilingual facilities where operators struggle with screen-based instructions in a language that is not their own
  • Multi-site operations where deploying proprietary hardware to every small warehouse is not economically viable
  • High-turnover environments where training time on complex scanning systems directly impacts productivity during peak periods

The good news is that the same architecture extends beyond picking.

Voice-guided workflows can support any process where an operator needs instructions while keeping their hands free.

You can find a live demo of an inventory cycle counting tool in this video:

About Me

Let's connect on LinkedIn and Twitter. I am a Supply Chain Engineer who is using data analytics to improve logistics operations and reduce costs.

If you're looking for tailored consulting solutions to optimise your supply chain and meet sustainability goals, please contact me.
