There are many ways to record audio on iOS, with the
AVFoundation framework being a veritable Swiss Army Knife of tools.
At the basic level,
AVAudioRecorder makes it dead simple to record audio in a variety of common formats, and save it to disk. However, the audio is only available after recording is complete. This is fine in most cases, but sometimes, we might need the audio while recording is happening, to do something with it — applying effects, streaming it to a web server, and so on.
This is where
AVAudioEngine comes in. Introduced in 2014, it provides a lower-level set of building blocks, with which it is possible to write custom audio-processing pipelines. In this post, we’ll see how to use
AVAudioEngine to record audio, and compress and stream it, even while recording is in progress.
An audio pipeline is built by creating an
AVAudioEngine instance and connecting a graph of
AVAudioNodes to it. As audio data flows from node to node, it is processed by each one, with the final processed audio reaching the output node. Using this, it is possible to architect complex pipelines that process audio entirely in real time.
There are a variety of types of nodes:
- AVAudioInputNode: This node is responsible for connecting to audio input sources such as the device microphone. The captured audio is passed on to other nodes down the graph.
- AVAudioUnit: This node is used to process input audio and apply various effects, such as speed, pitch, and reverb, in real time. There are different subclasses of
AVAudioUnit, each one responsible for a different effect.
- AVAudioMixerNode: This node accepts input audio from one or more other nodes and mixes them together. For example, multiple effects that were applied on AVAudioUnit nodes can be mixed together in these nodes, and the mixed audio passed on to the next node.
- AVAudioPlayerNode: This node can be used to play back audio, either from buffers incoming from other nodes or from previously saved files.
- AVAudioOutputNode: This node is responsible for connecting to the device output, such as speakers, and actually outputting the audio.
An AVAudioEngine instance by default has one input node, one mixer node (called the main mixer node), and one output node. Other nodes can be created and attached as necessary.
Using AVAudioEngine for recording
To demonstrate how recording using AVAudioEngine works, we’ll build a
Recorder class that encapsulates all the functionality we need.
The class holds a reference to an
AVAudioEngine instance, which we will be creating. It also holds a reference to a mixer node, through which we will capture audio. We could use the engine’s
inputNode directly, but having a separate node helps in case additional processing is needed before accessing the audio. And finally, we also have a property that holds the current recording state.
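Based on the description above, a minimal sketch of the class might look like this (the property names and the RecordingState cases are assumptions, chosen to match the rest of the post):

```swift
import AVFoundation

class Recorder {
    // The three possible states of the recorder.
    enum RecordingState {
        case recording, paused, stopped
    }

    // The engine and the custom mixer node we'll tap for audio.
    private var engine: AVAudioEngine!
    private var mixerNode: AVAudioMixerNode!

    // The current recording state.
    private var state: RecordingState = .stopped

    init() {
        // Session and engine setup (covered next) happens here.
    }
}
```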
When initialising the Recorder, the first thing to do is to set up the audio session.
Using the shared
AVAudioSession instance, we first set the category to
record, which allows the app to record audio. If the app needs to play back audio as well, the category can be set to
playAndRecord. We then activate the session.
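A sketch of that session setup, assuming it lives inside the Recorder class (the method name is an assumption, and error handling is kept minimal):

```swift
private func setupSession() {
    let session = AVAudioSession.sharedInstance()
    do {
        // .record allows recording; use .playAndRecord if playback is needed too.
        try session.setCategory(.record)
        try session.setActive(true, options: .notifyOthersOnDeactivation)
    } catch {
        print("Failed to set up audio session: \(error)")
    }
}
```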
Next, we setup the engine:
As mentioned earlier,
AVAudioEngine works by having an instance of the engine and attaching nodes to it. In the above code, we’re doing the following:
- We create the engine as well as the custom mixer node.
- We attach our mixer node to the engine. Attaching simply adds the node to the list of nodes in the engine.
- We then connect nodes together, in the
makeConnections method. This is where we actually construct the audio-processing graph. For our purposes, we’re connecting the input node to our mixer node, and our mixer node to the main mixer node. So the graph looks like this: input node → mixer node → main mixer node.
- Finally, we prepare the engine, so that the system allocates the necessary resources in advance.
When connecting nodes, we can specify the format of the audio, which will act as the output audio format for the source node (and therefore the input format of the destination node).
However, this cannot be any format — it needs to be a PCM audio format for reasons we will see later. If the format is nil, the input format of the source node is treated as its output format as well.
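Putting these steps together, the engine setup might look like this (the setupEngine name, the muted mixer volume, and the mono Float32 format are assumptions for this sketch):

```swift
private func setupEngine() {
    engine = AVAudioEngine()
    mixerNode = AVAudioMixerNode()

    // Mute the mixer's output so the microphone isn't played back through
    // the speaker while recording.
    mixerNode.volume = 0

    // Attaching adds the node to the engine's list of nodes.
    engine.attach(mixerNode)
    makeConnections()

    // Pre-allocate resources so the engine can start quickly.
    engine.prepare()
}

private func makeConnections() {
    // input node → mixer node, using the input node's own format.
    let inputNode = engine.inputNode
    let inputFormat = inputNode.outputFormat(forBus: 0)
    engine.connect(inputNode, to: mixerNode, format: inputFormat)

    // mixer node → main mixer node. A PCM format is required here,
    // since we'll tap this node later.
    let mixerFormat = AVAudioFormat(
        commonFormat: .pcmFormatFloat32,
        sampleRate: inputFormat.sampleRate,
        channels: 1,
        interleaved: false
    )
    engine.connect(mixerNode, to: engine.mainMixerNode, format: mixerFormat)
}
```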
Once the session and engine are set up, we can start recording, so let’s set up a startRecording method.
A key feature of AVAudioEngine is that it is possible to tap (i.e. capture) the audio at any point in the graph, again in real time. Audio continues to flow through the graph — we just get access to the stream at that point of the graph.
Let’s see what’s happening in startRecording:
- We get a reference to the node on which we want to install the tap. In this case, it’s our custom mixer node. As mentioned earlier, this could’ve been the input node or the main mixer node as well. We also get a reference to the node’s output format.
- We install the tap, providing a large enough buffer size, and the format of the node. We also pass an
AVAudioNodeTapBlock closure, which gives us the audio buffer in return.
- We then start the engine, which allocates resources, and connects the input/output nodes to the audio source/destination.
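The three steps above might be sketched like this (the buffer size is an arbitrary but typical choice):

```swift
func startRecording() throws {
    // 1. The node to tap — our custom mixer node — and its output format.
    let tapNode: AVAudioNode = mixerNode
    let format = tapNode.outputFormat(forBus: 0)

    // 2. Install the tap. The closure is called repeatedly with PCM
    //    buffers while audio flows through the graph.
    tapNode.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, time in
        // Process the AVAudioPCMBuffer here: write it, convert it, or stream it.
    }

    // 3. Start the engine, which connects the graph to the hardware.
    try engine.start()
    state = .recording
}
```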
The main part is of course the tap. The buffer that the closure gives back to us is an
AVAudioPCMBuffer. Remember how we had to use a PCM audio format earlier? This is the reason — the audio buffer is always expected to be in this format, and it’s up to us to take it and convert it into other formats.
Side note: In case your requirement is to just record audio, without having to do anything else, you’re pretty much done! You can write the buffer to disk using methods provided by
AVAudioFile. Once the recording is completed by stopping the engine, you can read and play back the file in your app. Of course, do stick around till the end, because we’ll be discussing a few special considerations.
Here’s how to use AVAudioFile to write to disk:
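A minimal sketch of writing the tapped buffers with AVAudioFile, placed inside the tap block from earlier (the file name and location are arbitrary choices):

```swift
// Create the destination file with the same settings as the tapped format.
let documentsURL = FileManager.default.urls(
    for: .documentDirectory, in: .userDomainMask)[0]
let fileURL = documentsURL.appendingPathComponent("recording.caf")
let file = try AVAudioFile(forWriting: fileURL, settings: format.settings)

tapNode.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, time in
    // Append each incoming PCM buffer to the file.
    try? file.write(from: buffer)
}
```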
Pause and resume is simple — just implement the following methods:
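For instance (assuming the state property from earlier, with a stop method included for completeness):

```swift
func pauseRecording() {
    engine.pause()
    state = .paused
}

func resumeRecording() throws {
    try engine.start()
    state = .recording
}

func stopRecording() {
    // Remove the tap before stopping, so no stale buffers are delivered.
    mixerNode.removeTap(onBus: 0)
    engine.stop()
    state = .stopped
}
```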
We have a buffer in PCM format, but this is uncompressed audio, and in order to avoid taking up too much disk space, or for faster data transmission, we’re going to have to compress it. We can compress audio by converting it into a compressed audio format.
There are a variety of formats, such as AAC and FLAC, some of which are lossy compressed formats and others lossless. Which format you choose would depend on your use-case. For things like Speech Recognition, it is better to use lossless formats, while for things like, say, audio notes in a notes app, a lossy format would suffice.
To convert audio, we need to use the AVAudioConverter class.
For a project that we worked on, our requirement was to convert the audio into FLAC before streaming it. So let’s take a look at how we did that.
Phew, a lot of things going on here, let’s break it down:
- We first create a couple of instance variables — an AVAudioConverter and an
AVAudioCompressedBuffer. We’ll be using the converter instance to reset it when stopping the recording. The buffer needs to be an instance variable, because there’s a bug where it gets deallocated too soon if it’s only a local variable, and the converter crashes.
- We create the converter inside the
startRecording method. The initialiser accepts two parameters — an input format and an output format. The input format is the format of our mixer node. The output format is the format we want to convert to, which is created as follows.
- The output format is constructed using an
AudioStreamBasicDescription, which is a struct that allows us to specify the settings for an
AVAudioFormat. In this example, we’ve used the FLAC format, along with settings recommended for the format. A note on sample rate: we’re using the same sample rate as that of the node. We faced audio stuttering and quality issues when we tried to change it. If you have suggestions on how to handle it, do let us know in the comments!
- Once the converter is constructed, we can use it inside the tap block. The first step is to initialise our compressed buffer with settings for the format, packet capacity and size. Again, these are the recommended settings.
- When the converter starts converting audio using
convert(to:error:inputBlock:), it pulls the data through an
AVAudioConverterInputBlock. The block returns the buffer to be used for conversion. We also need to use the outStatus parameter to specify whether there is data available to be used or not. For our requirement, we need to stream audio, and we can expect there to always be data as long as the recording is in progress. However, if the converter is operating on files, which have a specified end, we would need to return the endOfStream status once the data runs out.
- After calling
convert(to:error:inputBlock:), we now have data in the compressed buffer. We convert it to
Data, so that we can use it for our purposes. In our case, we had to stream it, so it was just a matter of passing the data along to another class, which was responsible for connecting over WebSockets and sending data to a server.
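Combining those steps, a sketch of the converter-based startRecording might look like this. The packet settings follow values commonly recommended for FLAC, the Swift spelling of the convert call is convert(to:error:withInputBlock:), and the streaming call at the end is hypothetical:

```swift
private var converter: AVAudioConverter!
private var compressedBuffer: AVAudioCompressedBuffer?

func startRecording() throws {
    let tapNode: AVAudioNode = mixerNode
    let format = tapNode.outputFormat(forBus: 0)

    // Describe the FLAC output format. The sample rate matches the
    // node's, to avoid the stuttering issues mentioned above.
    var outDesc = AudioStreamBasicDescription()
    outDesc.mSampleRate = format.sampleRate
    outDesc.mChannelsPerFrame = 1
    outDesc.mFormatID = kAudioFormatFLAC
    outDesc.mFramesPerPacket = 1152
    outDesc.mBitsPerChannel = 24
    outDesc.mBytesPerPacket = 0

    let convertFormat = AVAudioFormat(streamDescription: &outDesc)!
    converter = AVAudioConverter(from: format, to: convertFormat)

    let packetSize: UInt32 = 8

    tapNode.installTap(onBus: 0, bufferSize: 4096, format: format) { [weak self] buffer, _ in
        guard let self = self, let converter = self.converter else { return }

        // Initialise the compressed buffer with the recommended settings.
        self.compressedBuffer = AVAudioCompressedBuffer(
            format: convertFormat,
            packetCapacity: packetSize,
            maximumPacketSize: converter.maximumOutputPacketSize
        )

        // The input block hands the tapped PCM buffer to the converter
        // and reports that data is available.
        let inputBlock: AVAudioConverterInputBlock = { _, outStatus in
            outStatus.pointee = .haveData
            return buffer
        }

        var error: NSError?
        _ = converter.convert(to: self.compressedBuffer!,
                              error: &error,
                              withInputBlock: inputBlock)

        if let compressed = self.compressedBuffer {
            // Wrap the compressed bytes in Data; byteLength is the valid length.
            let data = Data(bytes: compressed.data, count: Int(compressed.byteLength))
            // Hypothetical: hand the data to a streaming class.
            // self.streamer.send(data)
            _ = data
        }
    }

    try engine.start()
    state = .recording
}
```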
We looked at using
AVAudioEngine to record and compress audio, but there are a few more things to keep in mind when using this in an actual app.
Any app that wants to record audio needs to request permission from the user first. You would need to call the
requestRecordPermission method at some point in the app before starting to record. You would also need to include the privacy description for microphone usage (the NSMicrophoneUsageDescription key) in your Info.plist file.
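For example, requesting permission on the shared session might look like this:

```swift
// Info.plist must contain the NSMicrophoneUsageDescription key,
// or the app will crash when recording starts.
AVAudioSession.sharedInstance().requestRecordPermission { granted in
    if granted {
        // The user allowed microphone access; recording can start.
    } else {
        // Handle the denial, e.g. by showing an explanation to the user.
    }
}
```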
To be able to record audio in the background, you need to enable the audio background mode in the Signing & Capabilities section of your project settings.
This is a crucial consideration, which is easy to overlook. When recording audio in your app, it’s not guaranteed that it will have access to the microphone at all times. It is possible that the recording is interrupted by phone calls, or by other processes that take over the microphone, such as Siri.
We need to take appropriate actions both when the interruption begins, and when it ends, and this is done by listening to the
AVAudioSession.interruptionNotification, whose payload contains info that lets us know when an interruption began and when it ended.
When the interruption begins, we need to pause the recording — this allows
AVAudioEngine resources to be temporarily freed up, while the microphone is being used by some other process.
When the interruption ends, we need to activate our audio session again, and resume the recording. We also need to handle any pending configuration changes, which we’ll discuss next.
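A sketch of the interruption handling described above. The isInterrupted and configChangePending flags, and the method name, are assumptions for this sketch; the pending-change handling ties in with the configuration changes discussed next:

```swift
func registerForInterruptions() {
    NotificationCenter.default.addObserver(
        forName: AVAudioSession.interruptionNotification,
        object: nil, queue: nil
    ) { [weak self] notification in
        guard let self = self,
              let userInfo = notification.userInfo,
              let typeValue = userInfo[AVAudioSessionInterruptionTypeKey] as? UInt,
              let type = AVAudioSession.InterruptionType(rawValue: typeValue)
        else { return }

        switch type {
        case .began:
            // Pause so engine resources are freed while the mic is taken.
            self.isInterrupted = true
            self.pauseRecording()
        case .ended:
            self.isInterrupted = false
            // Reactivate the session, apply any pending configuration
            // changes, and resume recording.
            try? AVAudioSession.sharedInstance().setActive(true)
            if self.configChangePending {
                self.makeConnections()
                self.configChangePending = false
            }
            try? self.resumeRecording()
        @unknown default:
            break
        }
    }
}
```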
When there are changes to the hardware configuration, such as when an external microphone is connected or disconnected, the
AVAudioEngineConfigurationChange notification is sent.
We need to listen to this notification, and depending on whether the session is interrupted or not, rewire the node connections in the engine.
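For instance (reusing the assumed isInterrupted and configChangePending flags from the interruption handling):

```swift
NotificationCenter.default.addObserver(
    forName: .AVAudioEngineConfigurationChange,
    object: engine, queue: nil
) { [weak self] _ in
    guard let self = self else { return }
    if self.isInterrupted {
        // Defer rewiring until the interruption ends.
        self.configChangePending = true
    } else {
        // Rewire the graph for the new hardware configuration.
        self.makeConnections()
    }
}
```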
Media Services Reset
According to Apple:
Under rare circumstances, the system terminates and restarts its media services daemon. Respond to these events by reinitializing your app’s audio objects (such as players, recorders, converters, or audio queues) and resetting your audio session’s category, options, and mode configuration. Your app shouldn’t restart its media playback, recording, or processing until initiated by user action.
So we’ll do exactly that:
Nothing but activating the session again, recreating the engine and its nodes, and rewiring the node connections.
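A sketch of that, assuming the setupSession and setupEngine methods from earlier:

```swift
NotificationCenter.default.addObserver(
    forName: AVAudioSession.mediaServicesWereResetNotification,
    object: nil, queue: nil
) { [weak self] _ in
    guard let self = self else { return }
    // Reconfigure the session and rebuild the engine and its
    // node connections from scratch.
    self.setupSession()
    self.setupEngine()
}
```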
In this post, we saw how to set up an audio recording pipeline using
AVAudioEngine. However, we’ve just barely scratched the surface of what is possible with
AVAudioEngine and the rest of
AVFoundation. To know more about AVAudioEngine, check out these resources:
- Here’s Apple’s official documentation.
- Apple’s AVAudioEngine Sample Code is a treasure trove of best practices that are not documented anywhere else.
- Ray Wenderlich’s AVAudioEngine tutorial talks about AVAudioEngine in general, and setting it up for playback purposes.
Working with AVAudioEngine has been a rewarding journey, but thanks to the breadth and depth of
AVFoundation, it has also not been an easy one. Hopefully, we helped save you some time and effort, and made it easy for you to set up your own recording stack.
How has your experience been working with
AVAudioEngine? Let us know in the comments!