Microsoft Cloud Communications API - an Overview

To kick this series off, I’m going to take a look at business scenarios that can be enabled with Microsoft’s Cloud Communications API. The API was made generally available exactly a year ago today, and allows Teams to be enhanced in some very powerful ways, so it seems like a great place to start. But first, some background:

Bots

Most people nowadays are familiar with bots, as we bump up against them more and more in our daily lives. When trying to get customer support, the first point of contact will often be a bot (“Tell me about your problem, and I’ll route you to the correct department”). Bots can be written for a wide variety of platforms, such as web pages, WhatsApp, Facebook Messenger, Slack, Teams, and Zoom.

A bot could be defined as an application that you interact with in much the same way as you’d interact with a human, with the goal of performing a specific task.

When thinking of bots, it’s common to think of chatbots, where the communication between users and the application is via text written in a natural language. Communication with bots is often initiated by the user, as in the case of a user using the “Chat” functionality on a web page to request information about delivery times. It is also common for bots to initiate the conversation with users - for example, to send alerts when a stock’s price breaches a certain threshold, and ask if any action should be taken.

We also interact with bots when making calls - organisations often use a type of bot called an IVR (Interactive Voice Response) to route incoming calls to the right department. These have been around for so long, and are so familiar, that we don’t often think of them as bots, but they do fit the definition. Like chatbots, interaction is most commonly initiated by the user, but it’s also common to see the interaction initiated by the bot. Outbound diallers, often used in call centres, are a good example.

Bots in Teams

It’s been possible to build chatbots in Microsoft Teams for some time, using the Bot Framework. This lets developers build powerful chatbots that actually have a wider scope than just Teams, with channels available for Web Chat, Facebook Messenger and Slack, among others. Bot Framework also has some ability to synthesise speech and recognise spoken answers, allowing it to work with Alexa and Cortana.

The Cloud Communications API enhances this by adding audio and video calling features. Using this API, it’s possible to build bots that can be assigned a PSTN phone number, and:

  • Initiate and answer calls to/from other phone users
  • Receive the user’s audio for processing - e.g. recording, or responding to voice commands
  • Play audio to the user, e.g. synthesised or pre-recorded speech
  • Listen for and respond to DTMF tones - e.g. “Press 1 for customer services, 2 for sales”
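To make the DTMF bullet a little more concrete, here is a minimal sketch of the menu logic a bot might run once it has been told which key the caller pressed. It is plain Python and makes no calls to the API itself. The tone names ("tone1", "tone2") are my assumption about how key presses are reported to the bot, and MENU, route_tone and handle_key_press are hypothetical names invented for this illustration.

```python
from typing import Dict, Optional

# Hypothetical menu: tone name -> department.
# The "toneN" naming is an assumption about how key presses reach the bot.
MENU: Dict[str, str] = {
    "tone1": "customer services",
    "tone2": "sales",
}


def route_tone(tone: str, menu: Dict[str, str] = MENU) -> Optional[str]:
    """Return the department for a key press, or None if the key is not on the menu."""
    return menu.get(tone)


def handle_key_press(tone: str) -> str:
    """Decide what the bot should say next for a given key press."""
    department = route_tone(tone)
    if department is None:
        # Unknown key: repeat the menu rather than dropping the caller.
        return "Sorry, I didn't recognise that option. Press 1 for customer services, 2 for sales."
    return f"Transferring you to {department}."
```

In a real bot, the returned text would be turned into synthesised or pre-recorded audio and played back to the caller, rather than returned as a string.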

As well as PSTN users, bots using this API can interact with Teams users - either individually, or with a group of users in a Teams meeting:

  • Initiate and answer calls to/from Teams users
  • Receive any Teams user’s audio and video
  • Play audio and video to the Teams user or meeting participants
  • Manage Teams meetings
  • Receive notification of users’ availability in Teams - known as Presence
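As a rough idea of what "initiate a call to a Teams user" looks like from the bot's side, the sketch below builds the JSON body a bot might send when asking the platform to place an audio call. The field names reflect my understanding of the Microsoft Graph call resource (as I understand it, the body is POSTed to the /communications/calls endpoint) and should be checked against the current documentation. The function name and all argument values are hypothetical, and nothing here is sent over the network.

```python
import json
from typing import Any, Dict


def build_call_request(callback_uri: str, tenant_id: str,
                       user_id: str, display_name: str) -> Dict[str, Any]:
    """Build the JSON body for a request to call a single Teams user.

    Field names are an assumption based on the Graph call resource;
    verify them against the official documentation before use.
    """
    return {
        "@odata.type": "#microsoft.graph.call",
        # Where the platform should send notifications about this call.
        "callbackUri": callback_uri,
        "tenantId": tenant_id,
        "requestedModalities": ["audio"],
        # Service-hosted media: the platform handles the media streams.
        "mediaConfig": {"@odata.type": "#microsoft.graph.serviceHostedMediaConfig"},
        "targets": [
            {
                "@odata.type": "#microsoft.graph.invitationParticipantInfo",
                "identity": {
                    "@odata.type": "#microsoft.graph.identitySet",
                    "user": {
                        "@odata.type": "#microsoft.graph.identity",
                        "id": user_id,
                        "displayName": display_name,
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values only; none of these identify a real tenant or user.
    body = build_call_request(
        "https://bot.example.com/callback", "tenant-id-placeholder",
        "user-id-placeholder", "Abi",
    )
    print(json.dumps(body, indent=2))
```

Keeping the payload construction in a pure function like this makes it easy to unit test without an Azure tenant or any credentials.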

Scenarios

This API gives us the ability to build traditional IVR and dialler solutions for PSTN, as well as expanding the concept to include Teams users. Users on Microsoft Teams can place Teams calls to a bot, and the bot can play audio clips and listen to voice responses or DTMF tones to guide the user through a particular path. The dialler concept also works with Teams users - the bot can initiate contact with Teams users to ask questions, send audio updates etc.

Where things really get interesting is when we consider what we can do with video. We have the ability to send and receive video as well as audio, so we could use this to build video content delivery solutions in Teams - for example, delivering training content, or pre-recorded messages from the CEO. Given that our application can receive video, we can also store it - for example, recording video messages to deliver to a user once they become available. The received audio and video could also be analysed - for example, for speech transcription, or sentiment analysis based on facial expressions.

The API also allows for integration with other audio/video solutions. For example, it could enable a Teams user to reach field operatives who are using walkie talkies, or request the video feed from a CCTV security system.

As well as interacting with individuals, all of the above extends to users in Teams meetings too - in fact, the API has a few more tricks up its sleeve regarding Teams meetings. Bots are able to schedule Teams meetings on the fly, and invite and remove participants. This allows a bot to act as a meeting coordinator, for example in emergency situations, creating a Teams meeting and then attempting to reach all required participants by Teams call, mobile phone, SMS, etc. The bot appears to meeting participants as just another participant, so it is able to send Instant Messages and play its own audio, video and even screenshare feed, allowing it to present information to the participants in whichever way makes the most sense for the scenario.
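The meeting coordinator idea boils down to an escalation loop: for each required participant, try one channel after another until someone is reached. Here is a minimal sketch of that loop in plain Python. The channel functions are injected, so the real Teams call, phone call or SMS logic is left out entirely, and every name here (ContactAttempt, reach_participant, coordinate) is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A contact attempt takes a participant name and returns True if they were reached.
ContactAttempt = Callable[[str], bool]


def reach_participant(name: str, channels: List[Tuple[str, ContactAttempt]]) -> str:
    """Try each channel in order; return the first channel that succeeds, or "unreached"."""
    for channel_name, attempt in channels:
        if attempt(name):
            return channel_name
    return "unreached"


def coordinate(participants: List[str],
               channels: List[Tuple[str, ContactAttempt]]) -> Dict[str, str]:
    """Map each required participant to the channel that reached them."""
    return {name: reach_participant(name, channels) for name in participants}
```

The order of the channels list is the escalation policy: putting Teams first and SMS last mirrors the order described above, but that is a design choice rather than anything the API dictates.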

Using this API in combination with other technologies - for example, using Azure Cognitive Services for speech recognition and synthesis, and Azure Media Services for audio/video analysis and indexing - can lead to some powerful enterprise solutions. Imagine the following solution:

A bot is configured to join all departmental meetings, and record them. During a meeting a question is asked about a previous decision and there’s some uncertainty about the factors that led to that decision being taken. A spoken conversation with the bot proceeds:

Participant - “Hey, Recording Bot - we’re looking for a previous meeting where we were discussing the back-end architecture of Project Hailstorm”
Bot - “Roughly how long ago was this call?”
Participant - “Around 6 months - certainly after Christmas”
Bot - “Do you remember who was on the call?”
Participant - “Well, I was. I remember Abi and Kwame being there and I think Brad may have been around for part of it”
Bot - “OK, thanks. Anything else you can think of to help me find it?”
Participant - “We were debating moving to Microservices. We weren’t 100% sure on the rewrite effort for the billing module”

At this point, the Bot’s video feed starts, and displays a list of possible matches, showing meeting title, participants and snippets of conversation that may be relevant, along with who spoke them, and when they were spoken.

Participant - “Meeting 3 looks good - play from 10 minutes in”

All participants watch for a minute.

Participant - “Yes, that’s the one, send me a link to the recording.”

Summary

The Cloud Communications API allows applications to interact with PSTN and Teams users in a natural way, with the ability to request and present information in whichever way makes sense for the scenario - audio, video or screenshare. The ability to integrate with external systems leads to powerful possibilities for improving collaboration in a wide variety of scenarios.

However, this power comes at a cost. The Cloud Communications API is, in my opinion, the most complex API in Teams and comes with a steep learning curve - particularly for those who are new to Microsoft Azure development. The next few posts on this subject are going to dive deep into the technology to attempt to flatten that curve a little.

Written by: Paul Nearney

Paul is a software solution architect specialising in Microsoft Teams, and the founder of Chimu Software.