Before playback can be initiated, the shared memory region should be mapped (see Rialto Application Session Management#ApplicationManagement).
The shared memory buffer shall be split into two regions for each playback session: one for the video stream and one for audio. Only one concurrent audio track is supported, so to select a different audio track the previous audio source must be removed first. Each region must be big enough to accommodate the largest possible frame of audio/video data plus the associated decryption parameters and metadata. The buffer shall initially be sized to 8 MB per playback session to allow some overhead.
For apps that support more than one concurrent playback, the shared memory buffer shall be sized accordingly and partitioned into a separate logical area for each playback session. The partitions need not be equally sized; for example, if an app supports one UHD and one HD playback, the 'HD' partition may be smaller. There can be 0 or more Web Audio regions per Rialto Client.
Note that the application is not directly aware of the layout of the shared memory region, or even of its existence; this is all managed internally by Rialto.
The following defines the format of the metadata (i.e. data about each AV frame) stored in the shm buffer. Note that all offsets are relative to the base of the shm region.
V1 metadata uses a fixed byte format similar to the AVBus specification but with the following changes:
| Parameter | Value |
|---|---|
| Shared memory buffer | 8 MB |
| Max frames to request | [24] |
| Metadata size per frame | ~100 bytes |
| Metadata region size | ~2.4 kB |
| Video frame region size | 7 MB |
| Max video frame size | 7 MB - 2.4 kB |
| Audio frame region size | 1 MB |
| Max audio frame size | 1 MB - 2.4 kB |
| Web Audio region size | - |
The following diagram shows schematically how the shared memory buffer is partitioned into two regions for a playback session, for audio and video data respectively. Within each region the metadata for each frame is written sequentially from the beginning and the media data is stored in the remainder of the region, the offset & length of each frame being specified in the metadata.
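The partitioning described above can be sketched as a small offset calculation. This is only an illustration: the names `RegionLayout` and `layoutRegion` are hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, the per-frame metadata size is the approximate 100 bytes from the table, and placing the video region before the audio region is an assumption rather than something the text specifies.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values taken from the V1 sizing table (approximate, not normative).
constexpr std::uint32_t kMiB = 1024 * 1024;
constexpr std::uint32_t kVersionFieldBytes = 4;
constexpr std::uint32_t kMaxFramesPerRequest = 24;
constexpr std::uint32_t kApproxMetadataBytesPerFrame = 100;

// Hypothetical description of one audio or video region.
// All offsets are relative to the base of the shm region.
struct RegionLayout
{
    std::uint32_t versionOffset;   // 4-byte metadata version field
    std::uint32_t metadataOffset;  // first per-frame metadata record
    std::uint32_t mediaDataOffset; // first byte available for media data
    std::uint32_t mediaDataBytes;  // space left for media data
};

inline RegionLayout layoutRegion(std::uint32_t regionOffset, std::uint32_t regionBytes)
{
    const std::uint32_t metadataBytes =
        kVersionFieldBytes + kMaxFramesPerRequest * kApproxMetadataBytesPerFrame;
    assert(regionBytes > metadataBytes); // region must leave room for media data

    RegionLayout layout;
    layout.versionOffset = regionOffset;
    layout.metadataOffset = regionOffset + kVersionFieldBytes;
    layout.mediaDataOffset = regionOffset + metadataBytes;
    layout.mediaDataBytes = regionBytes - metadataBytes;
    return layout;
}
```

For an 8 MB session, `layoutRegion(0, 7 * kMiB)` would describe the video region and `layoutRegion(7 * kMiB, 1 * kMiB)` the audio region under these assumptions.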
The metadata regions have the format shown in the diagram below, namely a 4-byte version field followed by concatenated metadata structures for each media frame, following the format specified in Metadata format below. The version field specifies the version of the metadata structure written to the buffer. This allows different versions of Rialto Client (which may write different metadata formats) to interoperate with the same version of Rialto Server. Rialto Server will only understand a certain set of metadata format versions, so only compatible client versions should be used. If Rialto Server sees an unsupported version number, it should trigger an error and fail streaming.
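A minimal sketch of the version check follows. The function names are hypothetical, the numeric values 1 and 2 for the V1 and V2 formats are assumed, and the field is assumed to be stored in host byte order; none of these details are confirmed by the text above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed numeric identifiers for the two metadata formats.
constexpr std::uint32_t kMetadataVersion1 = 1;
constexpr std::uint32_t kMetadataVersion2 = 2;

// Reads the 4-byte version field at the start of a metadata region.
// memcpy avoids unaligned access; host byte order is assumed.
inline std::uint32_t readMetadataVersion(const std::uint8_t *regionBase)
{
    assert(regionBase != nullptr);
    std::uint32_t version = 0;
    std::memcpy(&version, regionBase, sizeof(version));
    return version;
}

// The server side would fail streaming when this returns false.
inline bool isSupportedMetadataVersion(std::uint32_t version)
{
    return version == kMetadataVersion1 || version == kMetadataVersion2;
}
```

A server would call `readMetadataVersion()` once per region before parsing any frame metadata, and report an error instead of parsing if `isSupportedMetadataVersion()` is false.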
Frame metadata is stored in a fixed format in the shm region as follows. Undefined values should be set to 0.
    uint32_t offset;                              /* Offset of first byte of sample in shm buffer */
    uint32_t length;                              /* Number of bytes in sample */
    int64_t  time_position;                       /* Position in stream in nanoseconds */
    int64_t  sample_duration;                     /* Frame/sample duration in nanoseconds */
    uint32_t stream_id;                           /* Stream id (unique ID for ES, as defined in attachSource()) */
    uint32_t extra_data_size;                     /* extraData size */
    uint8_t  extra_data[32];                      /* Buffer containing extradata */
    uint32_t media_keys_id;                       /* Identifier of MediaKeys instance to use for decryption. If 0 use any CDM containing the MKS ID */
    uint32_t media_key_session_identifier_offset; /* Offset to the location of the MediaKeySessionIdentifier */
    uint32_t media_key_session_identifier_length; /* Length of the MediaKeySessionIdentifier */
    uint32_t init_vector_offset;                  /* Offset to the location of the initialization vector */
    uint32_t init_vector_length;                  /* Length of initialization vector */
    uint32_t sub_sample_info_offset;              /* Offset to the location of the sub sample info table */
    uint32_t sub_sample_info_len;                 /* Length of sub-sample info table */
    uint32_t init_with_last_15;
    if (IS_AUDIO(stream_id))
    {
        uint32_t sample_rate;                     /* Samples per second */
        uint32_t channels_num;                    /* Number of channels */
    }
    else if (IS_VIDEO(stream_id))
    {
        uint32_t width;                           /* Video width in pixels */
        uint32_t height;                          /* Video height in pixels */
    }
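One way to picture the fixed layout is as a C++ struct in which the stream-specific tail is a union. This is a sketch only: the struct name, the union, and the two helper functions are hypothetical, and it assumes the fields are laid out back to back with no padding, which happens to hold for natural alignment on common ABIs and gives 104 bytes per record, in line with the "~100 bytes" figure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical in-memory view of one V1 metadata record.
struct V1FrameMetadata
{
    std::uint32_t offset;
    std::uint32_t length;
    std::int64_t time_position;
    std::int64_t sample_duration;
    std::uint32_t stream_id;
    std::uint32_t extra_data_size;
    std::uint8_t extra_data[32];
    std::uint32_t media_keys_id;
    std::uint32_t media_key_session_identifier_offset;
    std::uint32_t media_key_session_identifier_length;
    std::uint32_t init_vector_offset;
    std::uint32_t init_vector_length;
    std::uint32_t sub_sample_info_offset;
    std::uint32_t sub_sample_info_len;
    std::uint32_t init_with_last_15;
    union // which member is valid depends on the stream type
    {
        struct { std::uint32_t sample_rate; std::uint32_t channels_num; } audio;
        struct { std::uint32_t width; std::uint32_t height; } video;
    } stream;
};

static_assert(sizeof(V1FrameMetadata) == 104, "unexpected padding in V1 metadata sketch");

// Copies at most 32 bytes of extra data into the fixed-size field.
inline void setExtraData(V1FrameMetadata &meta, const std::uint8_t *data, std::uint32_t size)
{
    assert(size <= sizeof(meta.extra_data));
    if (size > 0)
        std::memcpy(meta.extra_data, data, size);
    meta.extra_data_size = size;
}

// Checks that the frame referenced by offset/length lies inside a region.
// Offsets are relative to the base of the shm region; 64-bit sums avoid overflow.
inline bool frameLiesWithinRegion(const V1FrameMetadata &meta, std::uint32_t regionOffset,
                                  std::uint32_t regionBytes)
{
    const std::uint64_t frameEnd = static_cast<std::uint64_t>(meta.offset) + meta.length;
    const std::uint64_t regionEnd = static_cast<std::uint64_t>(regionOffset) + regionBytes;
    return meta.offset >= regionOffset && frameEnd <= regionEnd;
}
```

Value-initialising the struct (`V1FrameMetadata meta = {};`) is a convenient way to satisfy the rule that undefined values are set to 0.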
| Parameter | Value |
|---|---|
| Shared memory buffer | 8 MB * max_playback_sessions + 10 kB * max_web_audio_playback_sessions |
| Max frames to request | [24] |
| Metadata size per frame | Clear: variable but < 100 bytes; Encrypted: TODO |
| Video frame region size | 7 MB |
| Max video frame size | TODO |
| Audio frame region size | 1 MB |
| Max audio frame size | TODO |
| Web Audio region size | 10 kB |
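As a worked example of the buffer sizing formula in the table, the sketch below computes the total size for a given number of sessions. The function names are hypothetical, 1 MB and 1 kB are assumed to be 1024 * 1024 and 1024 bytes, and the offset helper assumes equally sized playback partitions placed before any Web Audio regions, which is only one possible layout.

```cpp
#include <cassert>
#include <cstdint>

// Assumed byte values for the sizes quoted in the table.
constexpr std::uint64_t kPlaybackSessionBytes = 8ULL * 1024 * 1024;
constexpr std::uint64_t kWebAudioSessionBytes = 10ULL * 1024;

// 8 MB * max_playback_sessions + 10 kB * max_web_audio_playback_sessions
constexpr std::uint64_t totalShmBytes(std::uint32_t maxPlaybackSessions,
                                      std::uint32_t maxWebAudioSessions)
{
    return kPlaybackSessionBytes * maxPlaybackSessions +
           kWebAudioSessionBytes * maxWebAudioSessions;
}

// Offset of a playback partition, assuming equal partitions laid out first.
inline std::uint64_t playbackPartitionOffset(std::uint32_t sessionIndex,
                                             std::uint32_t maxPlaybackSessions)
{
    assert(sessionIndex < maxPlaybackSessions);
    return kPlaybackSessionBytes * sessionIndex;
}

static_assert(totalShmBytes(1, 0) == 8388608, "one playback session is 8 MB");
```

With two playback sessions and one Web Audio session the total comes to 16,787,456 bytes under these assumptions.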
The following diagram shows schematically how the shared memory buffer is partitioned into two regions for a playback session, for audio and video data respectively. Within each region there is a 4-byte version field indicating V2 metadata, followed by concatenated metadata/frame pairs.
V2 metadata uses protobuf to serialise the frames' properties to the shared memory buffer. This use of protobuf aligns with the IPC protocol but also allows support for optional fields and for fields to be added and removed without causing backward/forward compatibility issues. It also supports variable length fields so the MKS ID, IV & sub-sample information can all be directly encoded in the metadata, avoiding the complexities of interleaving them with the media frames and referencing them with offsets/lengths as used in the V1 metadata format.
enum SegmentAlignment {
enum CipherMode {
    required uint32 length = 1;                       /* Number of bytes in sample */
    required sint64 time_position = 2;                /* Position in stream in nanoseconds */
    required sint64 sample_duration = 3;              /* Frame/sample duration in nanoseconds */
    required uint32 stream_id = 4;                    /* Stream id (unique ID for ES, as defined in attachSource()) */
    optional uint32 sample_rate = 5;                  /* Samples per second for audio segments */
    optional uint32 channels_num = 6;                 /* Number of channels for audio segments */
    optional uint32 width = 7;                        /* Frame width in pixels for video segments */
    optional uint32 height = 8;                       /* Frame height in pixels for video segments */
    optional SegmentAlignment segment_alignment = 9;  /* Segment alignment can be specified for H264/H265, will use NAL if not set */
    optional bytes extra_data = 10;                   /* Buffer containing extradata */
    optional bytes media_key_session_id = 11;         /* Buffer containing key session ID to use for decryption */
    optional bytes key_id = 12;                       /* Buffer containing Key ID to use for decryption */
    optional bytes init_vector = 13;                  /* Buffer containing the initialization vector for decryption */
    optional uint32 init_with_last_15 = 14;           /* initWithLast15 value for decryption */
    repeated SubsamplePair sub_sample_info = 15;      /* If present, use gather/scatter decryption based on this list of clear/encrypted byte lengths. */
                                                      /* If not present and content is encrypted then entire media segment needs decryption. */
    optional bytes codec_data = 16;                   /* Buffer containing updated codec data for video segments */
    optional bytes cipher_mode = 17;                  /* Block cipher mode of operation when common encryption used */
    optional uint32 crypt_byte_block = 18;            /* Crypt byte block value for AES-CBC/cbcs common encryption */
    optional uint32 skip_byte_block = 19;             /* Skip byte block value for AES-CBC/cbcs common encryption */
}
message SubsamplePair {
    required uint32 num_clear_bytes = 1;              /* How many of next bytes in sequence are clear */
    required uint32 num_encrypted_bytes = 2;          /* How many of next bytes in sequence are encrypted */
}
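The `sub_sample_info` comment describes gather/scatter decryption driven by clear/encrypted byte counts. The sketch below shows how such a list could be turned into the byte ranges that need decrypting. The plain structs stand in for the generated protobuf classes, the function name is hypothetical, and rejecting lists whose totals do not match the segment length exactly is an assumption rather than documented Rialto behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Plain stand-ins for the protobuf SubsamplePair message.
struct SubsamplePair
{
    std::uint32_t numClearBytes;
    std::uint32_t numEncryptedBytes;
};

struct ByteRange
{
    std::uint32_t offset; // relative to the start of the media segment
    std::uint32_t length;
};

// Fills 'out' with the encrypted ranges of an encrypted segment.
// Returns false if the pairs do not add up to exactly segmentLength.
inline bool encryptedRanges(const std::vector<SubsamplePair> &pairs,
                            std::uint32_t segmentLength, std::vector<ByteRange> &out)
{
    out.clear();
    if (pairs.empty())
    {
        // No sub-sample info: the entire segment needs decryption.
        if (segmentLength > 0)
            out.push_back({0, segmentLength});
        return true;
    }

    std::uint64_t cursor = 0; // 64-bit so the running sum cannot wrap
    for (const SubsamplePair &pair : pairs)
    {
        const std::uint64_t encryptedStart = cursor + pair.numClearBytes;
        cursor = encryptedStart + pair.numEncryptedBytes;
        if (cursor > segmentLength)
        {
            out.clear();
            return false;
        }
        if (pair.numEncryptedBytes > 0)
            out.push_back({static_cast<std::uint32_t>(encryptedStart), pair.numEncryptedBytes});
    }
    if (cursor != segmentLength)
    {
        out.clear();
        return false;
    }
    assert(out.empty() || out.back().offset + out.back().length <= segmentLength);
    return true;
}
```

For a 50-byte segment with pairs (10 clear, 20 encrypted) and (5 clear, 15 encrypted), this yields the ranges [10, 30) and [35, 50).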
Render frame may be called whilst playback is paused either at the start of playback or immediately after a seek operation. The client must wait for the readyToRenderFrame() callback first.
Note that the data pipelines for different data sources (e.g. audio & video) should operate entirely independently. Rialto should
Note: Due to the common APIs on the client and server the parameters must be used slightly differently depending on whether the app is running in a client process or directly on the Rialto server as shown in the following two diagrams. The shared memory buffer is refilled as follows when running in the client-server mode:
1. Rialto server notifies client that refill is required. sourceId should match that specified in attachSource() call for the A/V data stream. needDataRequestId must be a unique ID for this playback session.
Media data flows as follows when running in server only mode:
The code for populating the shm buffer from the parameters to addSegment() will be common on the client & server side so this should be stored in a common location to be used by both implementations.
See also Rialto Client MSE Player Session Streaming State Machine for additional clarity on how the Rialto client should manage the flow of data, particularly with regard to seek operations.
This algorithm should be run for all attached sources. A haveData() call in the above sequence can restart the algorithm when it previously stopped due to data exhaustion.
Frames are decrypted in the pipeline when they are pulled for playback.
The position reporting timer should be started whenever the PLAYING state is entered and stopped whenever the session moves to another playback state, i.e. stop polling during IDLE, BUFFERING, SEEKING etc.
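The timer rule above can be sketched as a small gate driven by playback state changes. The class name, the reduced state enum, and the start/stop callbacks are all hypothetical; a real implementation would hook these into whatever timer facility the server uses.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Reduced, hypothetical set of playback states for illustration only.
enum class PlaybackState { Idle, Buffering, Playing, Seeking };

// Starts position reporting on entry to Playing and stops it on any other state.
class PositionReportingGate
{
public:
    PositionReportingGate(std::function<void()> startTimer, std::function<void()> stopTimer)
        : m_startTimer(std::move(startTimer)), m_stopTimer(std::move(stopTimer))
    {
        assert(m_startTimer && m_stopTimer);
    }

    void onPlaybackStateChanged(PlaybackState newState)
    {
        const bool shouldRun = (newState == PlaybackState::Playing);
        if (shouldRun && !m_running)
            m_startTimer(); // entering Playing
        else if (!shouldRun && m_running)
            m_stopTimer(); // leaving Playing for any other state
        m_running = shouldRun;
    }

    bool isRunning() const { return m_running; }

private:
    std::function<void()> m_startTimer;
    std::function<void()> m_stopTimer;
    bool m_running = false;
};
```

Because the callbacks fire only on transitions, repeated notifications of the same state do not restart or double-stop the timer.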