streamlink stream.hls: packed audio support

Streamlink currently does not support "packed audio" HLS streams: https://datatracker.ietf.org/doc/html/rfc8216#section-3.4

A Packed Audio Segment contains encoded audio samples and ID3 tags that are simply packed together with minimal framing and no per-sample timestamps. Supported Packed Audio formats are Advanced Audio Coding (AAC) with Audio Data Transport Stream (ADTS) framing [ISO_13818_7], MP3 [ISO_13818_3], AC-3 [AC_3], and Enhanced AC-3 [AC_3].

A Packed Audio Segment has no Media Initialization Section.

Each Packed Audio Segment MUST signal the timestamp of its first sample with an ID3 Private frame (PRIV) tag [ID3] at the beginning of the segment. The ID3 PRIV owner identifier MUST be "com.apple.streaming.transportStreamTimestamp". The ID3 payload MUST be a 33-bit MPEG-2 Program Elementary Stream timestamp expressed as a big-endian eight-octet number, with the upper 31 bits set to zero. Clients SHOULD NOT play Packed Audio Segments without this ID3 tag.

Simply concatenating the segments and piping the data to FFmpeg does not work and the audio will be desynced (in addition to a long output delay), so additional stream output logic is required.

Example streams:

https://mcdn.daserste.de/daserste/de/master.m3u8
https://mcdn.daserste.de/daserste/de/master_audio1_128.m3u8

#4687, #4703, #3534

Implementing packed audio streams means:

1. Detecting the format of the media stream. There's no metadata, so only file extensions of the segment URL path can be used. Related: https://github.com/streamlink/streamlink/issues/791#issuecomment-913222776
2. Implementing an ID3 parser. Unfortunately, there's no maintained package on PyPI for that. Spec: https://id3.org/id3v2.4.0-structure
3. Parsing and removing the ID3 data from the output stream.
4. Applying the parsed com.apple.streaming.transportStreamTimestamp timestamp to the first sample of the audio segment.

1-3 is fairly trivial, but 4 is the big question mark.
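
For the first step, detecting the format from the segment URL could look roughly like the sketch below. The helper name `guess_packed_audio_format` and the extension map are assumptions for illustration, not existing Streamlink code, and the extensions that real streams use may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed mapping of segment file extensions to the packed audio
# formats listed in RFC 8216 section 3.4 (hypothetical, not exhaustive).
PACKED_AUDIO_EXTENSIONS = {
    ".aac": "aac",
    ".mp3": "mp3",
    ".ac3": "ac3",
    ".ec3": "eac3",
}


def guess_packed_audio_format(segment_url):
    """Return a format name if the URL path looks like packed audio, else None."""
    # Only the path is inspected, so query strings and fragments are ignored.
    path = urlparse(segment_url).path
    suffix = PurePosixPath(path).suffix.lower()
    return PACKED_AUDIO_EXTENSIONS.get(suffix)
```

Anything that returns `None` here (for example a `.ts` segment) would keep going through the regular MPEG-TS code path.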

When opening packed audio streams directly, setting the timestamp is irrelevant, so just removing the ID3 data works fine. However, when muxing the video and audio streams, the audio timestamp is required, otherwise there will be an audio desync. Another problem implementation-wise is that the timestamp can only be parsed after opening the audio stream. At that point, the FFmpeg process has already been spawned by the MuxedHLSStream and its argv has already been set, so the timestamp is missing (I also have no idea how to make FFmpeg use the timestamp).

Aug 09 '22 10:08 bastimeyer

I had another look at this yesterday.

As already explained, what needs to be done here is removing ID3v2 tags from each segment of the packed HLS audio stream and getting the embedded timestamp metadata from the first segment. This is fairly simple with a quick ID3v2 parser implementation (no additional python dependency required), so all ID3v2 tags can be cut off from the ADTS data and the timestamp can be read from the respective ID3v2 frame.
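
A quick ID3v2 parser of the kind described above might look like the following sketch. It is not the implementation referred to in this post. The function names are hypothetical, and it assumes that tags only appear at the start of a segment, that no extended header or unsynchronisation is used, and that the data is not truncated.

```python
import struct

OWNER = b"com.apple.streaming.transportStreamTimestamp"


def synchsafe(data):
    """Decode a synchsafe integer (7 significant bits per byte)."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
    return value


def strip_id3v2(segment):
    """Remove leading ID3v2 tags; return (timestamp or None, audio bytes)."""
    timestamp = None
    pos = 0
    while segment[pos:pos + 3] == b"ID3" and len(segment) >= pos + 10:
        major = segment[pos + 3]
        flags = segment[pos + 5]
        size = synchsafe(segment[pos + 6:pos + 10])
        body = segment[pos + 10:pos + 10 + size]
        found = find_timestamp(body, major)
        if timestamp is None:
            timestamp = found
        # v2.4 tags may carry a 10-byte footer, signalled by flag bit 0x10
        pos += 10 + size + (10 if flags & 0x10 else 0)
    return timestamp, segment[pos:]


def find_timestamp(body, major):
    """Scan the frames of one tag body for the Apple PRIV timestamp frame."""
    pos = 0
    while pos + 10 <= len(body):
        frame_id = body[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":  # padding, no more frames
            break
        raw = body[pos + 4:pos + 8]
        # Frame sizes are synchsafe in v2.4 and plain big-endian in v2.3
        size = synchsafe(raw) if major >= 4 else struct.unpack(">I", raw)[0]
        data = body[pos + 10:pos + 10 + size]
        if frame_id == b"PRIV":
            owner, _, payload = data.partition(b"\x00")
            if owner == OWNER and len(payload) == 8:
                # Eight octets, big-endian; only the lower 33 bits are used
                return struct.unpack(">Q", payload)[0] & 0x1FFFFFFFF
        pos += 10 + size
    return None
```

Calling `strip_id3v2()` on every segment and keeping only the timestamp returned for the first one would cover both the tag removal and the metadata lookup.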

Once all ID3v2 tags are removed, the PTS of the first ADTS-packet will be 0 and all following packets will be the implicit continuation of that, which is where the timestamp from the ID3 metadata comes into play, which needs to be set for the first packet.
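
The implicit continuation can be written down as simple arithmetic. The sketch below assumes 1024 samples per ADTS frame (AAC-LC) and the 90 kHz MPEG-2 clock; `frame_pts` is a hypothetical helper and the 48 kHz sample rate in the usage note is only an assumed value.

```python
MPEG_CLOCK = 90_000        # MPEG-2 timestamps tick at 90 kHz
SAMPLES_PER_FRAME = 1024   # assumption: AAC-LC, one ADTS frame = 1024 samples
PTS_MODULO = 1 << 33       # timestamps are 33 bits wide and wrap around


def frame_pts(first_pts, frame_index, sample_rate):
    """PTS of the n-th frame, given the timestamp of the first sample."""
    offset = frame_index * SAMPLES_PER_FRAME * MPEG_CLOCK // sample_rate
    return (first_pts + offset) % PTS_MODULO
```

With a first timestamp of 0 and 48 kHz audio, `frame_pts(0, 1, 48000)` gives 1920 ticks per frame. Setting `first_pts` to the value from the ID3 metadata shifts every frame by the same amount.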

When checking the timestamps of the input video stream via ffprobe -show_packets, the value of the first packet is similar (with a small delta) to the ID3v2 timestamp metadata of the audio stream (which itself starts at 0), which means the audio timestamp got parsed correctly in my ID3v2 parser implementation.

However, I haven't been able to set the presentation timestamp value of the first audio packet via FFmpeg. I've tried multiple things:

1. using the setts audio bitstream filter (https://ffmpeg.org/ffmpeg-bitstream-filters.html#setts):
```
ffmpeg -i video -i audio -c copy -bsf:a "setts=pts=$TIMESTAMP+(PTS-STARTPTS)" -f mpegts out
```
2. re-encoding the audio and using the asetpts audio filter (https://ffmpeg.org/ffmpeg-filters.html#setpts_002c-asetpts):
```
ffmpeg -i video -i audio -c:v copy -c:a aac -b:a 128k -af "asetpts=$TIMESTAMP+(PTS-STARTPTS)" -f mpegts out
```
3. using the setts video bitstream filter and subtracting the timestamp of the first packet from everything else:
```
ffmpeg -i video -i audio -c copy -bsf:v 'setts=pts=PTS-STARTPTS' -f mpegts out
```
4. not changing any timestamps at all:

ffmpeg -i video -i audio -c copy -f mpegts out

The idea behind 1 and 2 is updating the audio stream according to what the HLS spec wants. 3 on the other hand makes the timestamp of the video stream begin at 0, to make it identical to the audio stream.

Surprisingly, both 3 and 4 yielded the same (bitwise identical) results, with a working output stream, despite what I said at the end of my first post in regards to desync. When checking with ffprobe however, the timestamps of the video input were kept in the video output (in both 3 and 4) and the audio timestamps got aligned automatically (by guessing, from what it looks like).

I have no idea how to continue here. It could just be a coincidence that the output stream worked with just removing the ID3v2 tags and having FFmpeg do the rest, because I have only tested it with one stream so far (linked in the OP). That would make an implementation in Streamlink much easier though, because as said, if the timestamp from the ID3v2 metadata needs to be set explicitly, then the audio stream needs to be read first, which would require a reimplementation of the MuxedStream because the FFmpeg parameters need to be set after reading the start of the audio stream. But if changing the audio timestamps is not actually needed, then no big changes would be required apart from the ID3v2 parser.

Unfortunately, the ffmpeg docs are pretty sparse and I couldn't find anything useful elsewhere, and I'm also not well-versed in this low-level stuff.

Jan 17 '23 06:01 bastimeyer

I also looked into this a bit, since I've been trying to record audio HLS streams and ffmpeg kind of sucks at it natively. It seems option 1 (the setts audio bitstream filter) works if you specify both PTS and DTS.

$ ffmpeg -i out.aac -bsf:a "setts=pts=4384081248+(PTS-STARTPTS):dts=4384081248+(DTS-STARTDTS)" -f mpegts out_aac.ts
...
Input #0, aac, from 'out.aac':
  Metadata:
    id3v2_priv.com.apple.streaming.transportStreamTimestamp: \x00\x00\x00\x01\x05O\xc5`
    id3v2_priv.com.elementaltechnologies.timestamp.utc: 2023-06-13T04:36:12Z
...

$ ffprobe -show_packets out_aac.ts
...
[PACKET]
codec_type=audio
stream_index=0
pts=4384207248
pts_time=48713.413867
dts=4384207248
dts_time=48713.413867
duration=2160
duration_time=0.024000
size=1152
pos=564
flags=K_
[/PACKET]
[PACKET]
codec_type=audio
stream_index=0
pts=4384209408
pts_time=48713.437867
dts=4384209408
dts_time=48713.437867
duration=2160
duration_time=0.024000
size=1152
pos=N/A
flags=K_
[/PACKET]
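
As a cross-check, the `id3v2_priv.com.apple.streaming.transportStreamTimestamp` value printed by ffmpeg above decodes to the number used in the `setts` expression. A small sketch, assuming the escaped string shown in the metadata is the raw eight-octet payload:

```python
# Bytes as printed by ffmpeg: \x00\x00\x00\x01\x05O\xc5`
payload = b"\x00\x00\x00\x01\x05O\xc5`"

# Big-endian 64-bit value, of which only the lower 33 bits are meaningful
timestamp = int.from_bytes(payload, "big") & ((1 << 33) - 1)
seconds = timestamp / 90_000  # MPEG-2 timestamps use a 90 kHz clock

# Distance between the ID3 timestamp and the first packet in out_aac.ts
first_output_pts = 4384207248
delta = first_output_pts - timestamp
```

Here `timestamp` is 4384081248 and `delta` is 126000 ticks (1.4 s at 90 kHz). The sketch only shows the arithmetic and does not establish where that additional offset in the output comes from.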

Jun 13 '23 12:06 Hakkin

It seems you also need -copyts when muxing with video, or ffmpeg will try to correct the video timestamps without touching the audio timestamps.

$ ffmpeg -i 1903.aac -i 1903.ts -c copy -copyts -bsf:a "setts=pts=4514401248+(PTS-STARTPTS):dts=4514401248+(DTS-STARTDTS)" -f mpegts 1903_merge_corrected.ts
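
If the timestamp is only known at runtime, the same command line could be assembled programmatically. This is a minimal sketch with a hypothetical `build_mux_argv` helper, using only the options from the command above; it is not how Streamlink builds its FFmpeg arguments.

```python
def build_mux_argv(audio_path, video_path, timestamp, output_path):
    """Build an ffmpeg argv list that offsets audio PTS/DTS by the ID3 timestamp."""
    timestamp = int(timestamp)  # coerce to int; raises on non-numeric input
    setts = (
        f"setts=pts={timestamp}+(PTS-STARTPTS)"
        f":dts={timestamp}+(DTS-STARTDTS)"
    )
    return [
        "ffmpeg",
        "-i", audio_path,
        "-i", video_path,
        "-c", "copy",
        "-copyts",
        "-bsf:a", setts,
        "-f", "mpegts",
        output_path,
    ]
```

Because the result is a list, it can be passed to `subprocess.run()` without a shell, so the filter expression needs no extra quoting.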

$ ffprobe -show_packets 1903_merge_corrected.ts
[PACKET]
codec_type=video
stream_index=0
pts=4514525448
pts_time=50161.393867
dts=4514523648
dts_time=50161.373867
duration=1800
duration_time=0.020000
size=191293
pos=564
flags=K_
[/PACKET]
...
[PACKET]
codec_type=audio
stream_index=1
pts=4514527248
pts_time=50161.413867
dts=4514527248
dts_time=50161.413867
duration=1920
duration_time=0.021333
size=333
pos=334828
flags=K_
[/PACKET]
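
To compare the first audio and video timestamps without reading the ffprobe output by hand, the `[PACKET]` blocks can be parsed with a few lines of Python. This sketch assumes ffprobe's default output format as shown above; `parse_packets` and `first_pts` are hypothetical helpers, and a packet whose `pts` is `N/A` would make `first_pts` raise.

```python
def parse_packets(ffprobe_output):
    """Parse `ffprobe -show_packets` default output into a list of dicts."""
    packets, current = [], None
    for line in ffprobe_output.splitlines():
        line = line.strip()
        if line == "[PACKET]":
            current = {}
        elif line == "[/PACKET]":
            if current is not None:
                packets.append(current)
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return packets


def first_pts(packets, codec_type):
    """Return the PTS of the first packet of the given codec type, or None."""
    for packet in packets:
        if packet.get("codec_type") == codec_type:
            return int(packet["pts"])
    return None
```

For the two packets shown above, the audio PTS minus the video PTS is 4514527248 - 4514525448 = 1800 ticks, which is 0.02 s at 90 kHz.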

Jun 13 '23 13:06 Hakkin