paddlespeech.s2t.frontend.audio module

Contains the audio segment class.

class paddlespeech.s2t.frontend.audio.AudioSegment(samples, sample_rate)[source]

Bases: object

Monaural audio segment abstraction.

Parameters

samples (ndarray.float32) -- Audio samples [num_samples x num_channels].
sample_rate (int) -- Audio sample rate.

Raises

TypeError -- If the sample data type is not float or int.

Attributes

duration: Return audio duration.
num_samples: Return number of samples.
rms_db: Return root mean square energy of the audio in decibels.
sample_rate: Return audio sample rate.
samples: Return audio samples.

Methods

`add_noise`(noise, snr_dB[, ...])	Add the given noise segment at a specific signal-to-noise ratio.
`change_speed`(speed_rate)	Change the audio speed by linear interpolation.
`concatenate`(*segments)	Concatenate an arbitrary number of audio segments together.
`convolve`(impulse_segment[, allow_resample])	Convolve this audio segment with the given impulse segment.
`convolve_and_normalize`(impulse_segment[, ...])	Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.
`from_bytes`(bytes)	Create audio segment from a byte string containing audio samples.
`from_file`(file[, infos])	Create audio segment from audio file.
`from_pcm`(samples, sample_rate)	Create audio segment from a byte string containing audio samples.
`from_sequence_file`(filepath)	Create audio segment from sequence file.
`gain_db`(gain)	Apply gain in decibels to samples.
`make_silence`(duration, sample_rate)	Creates a silent audio segment of the given duration and sample rate.
`normalize`([target_db, max_gain_db])	Normalize audio to be of the desired RMS value in decibels.
`normalize_online_bayesian`(target_db, ...[, ...])	Normalize audio using a production-compatible online/causal algorithm.
`pad_silence`(duration[, sides])	Pad this audio sample with a period of silence.
`random_subsegment`(subsegment_length[, rng])	Cut the specified length of the audiosegment randomly.
`resample`(target_sample_rate[, filter])	Resample the audio to a target sample rate.
`shift`(shift_ms)	Shift the audio in time.
`slice_from_file`(file[, start, end])	Loads a small section of an audio without having to load the entire file into the memory which can be incredibly wasteful.
`subsegment`([start_sec, end_sec])	Cut the AudioSegment between given boundaries.
`superimpose`(other)	Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).
`to`([dtype])	Create a dtype audio content.
`to_bytes`([dtype])	Create a byte string containing the audio content.
`to_wav_file`(filepath[, dtype])	Save audio segment to disk as wav file.

add_noise(noise, snr_dB, allow_downsampling=False, max_gain_db=300.0, rng=None)[source]

Add the given noise segment at a specific signal-to-noise ratio. If the noise segment is longer than this segment, a random subsegment of matching length is sampled from it and used instead.

Note that this is an in-place transformation.

Parameters

noise (AudioSegment) -- Noise signal to add.
snr_dB (float) -- Signal-to-Noise Ratio, in decibels.
allow_downsampling (bool) -- Whether to allow the noise signal to be downsampled to match the base signal sample rate.
max_gain_db (float) -- Maximum amount of gain to apply to noise signal before adding it in. This is to prevent attempting to apply infinite gain to a zero signal.
rng (None|random.Random) -- Random number generator state.

Raises

ValueError -- If the sample rates of the two audio segments do not match and downsampling is not allowed, or if the noise segment is shorter than this audio segment.
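
As a rough illustration of mixing at a target signal-to-noise ratio, the sketch below scales the noise so that the difference between the signal and noise RMS levels equals snr_dB, capped by max_gain_db. It is a minimal sketch, not the library's code: the helpers `rms_db` and `mix_at_snr` are hypothetical, plain Python lists stand in for the sample arrays, and both inputs are assumed to already have equal length and sample rate.

```python
import math


def rms_db(samples):
    """RMS energy in decibels: 10 * log10(mean(x^2)). Hypothetical helper."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def mix_at_snr(signal, noise, snr_db, max_gain_db=300.0):
    """Scale `noise` to the requested SNR and add it to `signal`.

    Hypothetical helper; assumes equal lengths and sample rates.
    """
    if len(noise) != len(signal):
        raise ValueError("noise and signal must have the same length")
    # Gain (in dB) that puts the noise snr_db below the signal level.
    noise_gain_db = min(rms_db(signal) - rms_db(noise) - snr_db, max_gain_db)
    factor = 10.0 ** (noise_gain_db / 20.0)
    scaled_noise = [n * factor for n in noise]
    mixed = [s + n for s, n in zip(signal, scaled_noise)]
    return mixed, scaled_noise


signal = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
mixed, scaled_noise = mix_at_snr(signal, noise, snr_db=10.0)
achieved_snr = rms_db(signal) - rms_db(scaled_noise)
```

The cap on the gain is what keeps a near-silent noise segment from being amplified without bound.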

change_speed(speed_rate)[source]

Change the audio speed by linear interpolation.

Note that this is an in-place transformation.

Parameters: speed_rate (float) -- Rate of speed change: speed_rate > 1.0, speed up the audio; speed_rate = 1.0, unchanged; speed_rate < 1.0, slow down the audio; speed_rate <= 0.0, not allowed, raise ValueError.
Raises: ValueError -- If speed_rate <= 0.0.
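
To make the effect of speed_rate concrete, the sketch below resamples a list of samples onto a shorter or longer index grid with linear interpolation. It is a simplified illustration under stated assumptions: the function name `change_speed_sketch` is hypothetical, plain lists replace the sample array, and the exact grid the library uses may differ.

```python
def change_speed_sketch(samples, speed_rate):
    """Return samples stretched or compressed by linear interpolation.

    Hypothetical helper: speed_rate > 1.0 shortens the audio,
    speed_rate < 1.0 lengthens it.
    """
    if speed_rate <= 0.0:
        raise ValueError("speed_rate should be greater than zero")
    old_length = len(samples)
    new_length = int(old_length / speed_rate)
    if old_length < 2 or new_length < 2:
        return list(samples)  # too short to interpolate
    # Spacing of the new sample positions on the old index axis.
    step = (old_length - 1) / (new_length - 1)
    output = []
    for i in range(new_length):
        position = i * step
        left = min(int(position), old_length - 2)
        fraction = position - left
        output.append(samples[left] * (1.0 - fraction)
                      + samples[left + 1] * fraction)
    return output


ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
faster = change_speed_sketch(ramp, 2.0)   # half as many samples
slower = change_speed_sketch(ramp, 0.5)   # twice as many samples
```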

classmethod concatenate(*segments)[source]

Concatenate an arbitrary number of audio segments together.

Parameters

*segments --

Input audio segments to be concatenated.

Returns

Audio segment instance holding the concatenated result.

Return type

AudioSegment

Raises

ValueError -- If the number of segments is zero, or if the sample rates of the segments do not all match.
TypeError -- If any segment is not an AudioSegment instance.

convolve(impulse_segment, allow_resample=False)[source]

Convolve this audio segment with the given impulse segment.

Note that this is an in-place transformation.

Parameters

impulse_segment (AudioSegment) -- Impulse response segments.
allow_resample (bool) -- Indicates whether resampling is allowed when the impulse_segment has a different sample rate from this signal.

Raises

ValueError -- If the sample rates of the two audio segments do not match and resampling is not allowed.

convolve_and_normalize(impulse_segment, allow_resample=False)[source]

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

Note that this is an in-place transformation.

Parameters

impulse_segment (AudioSegment) -- Impulse response segments.
allow_resample (bool) -- Indicates whether resampling is allowed when the impulse_segment has a different sample rate from this signal.

property duration

Return audio duration.

Returns: Audio duration in seconds.
Return type: float

classmethod from_bytes(bytes)[source]

Create audio segment from a byte string containing audio samples.

Parameters: bytes (str) -- Byte string containing audio samples.
Returns: Audio segment instance.
Return type: AudioSegment

classmethod from_file(file, infos=None)[source]

Create audio segment from audio file.

Parameters

file (str|file) -- Filepath or file object of the audio file.
infos (TarLocalData, optional) -- tar2obj and tar2infos. Defaults to None.

Returns

Audio segment instance.

Return type

AudioSegment

classmethod from_pcm(samples, sample_rate)[source]

Create audio segment from a byte string containing audio samples.

Parameters

samples (numpy.ndarray) -- Audio samples [num_samples x num_channels].
sample_rate (int) -- Audio sample rate.

Returns

Audio segment instance.

Return type

AudioSegment

classmethod from_sequence_file(filepath)[source]

Create audio segment from sequence file. Sequence file is a binary file containing a collection of multiple audio files, with several header bytes in the head indicating the offsets of each audio byte data chunk.

The format is:

4 bytes (int, version), 4 bytes (int, num of utterance), 4 bytes (int, bytes per header), [bytes_per_header*(num_utterance+1)] bytes (offsets for each audio), audio_bytes_data_of_1st_utterance, audio_bytes_data_of_2nd_utterance, ......

Sequence file name must end with ".seqbin". And the filename of the 5th utterance's audio file in sequence file "xxx.seqbin" must be "xxx.seqbin_5", with "5" indicating the utterance index within this sequence file (starting from 1).

Parameters: filepath (str) -- Filepath of sequence file.
Returns: Audio segment instance.
Return type: AudioSegment
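
The layout described above can be sketched with the standard struct module. The example builds a tiny sequence blob in memory and reads one utterance back. Several details are assumptions, since the description does not state them: little-endian 32-bit integers, 4 bytes per header entry, and offsets counted from the start of the file. The function name `read_utterance` is hypothetical.

```python
import struct


def read_utterance(blob, index):
    """Return the raw bytes of utterance `index` (1-based). Hypothetical helper.

    Assumes little-endian int32 fields and absolute byte offsets.
    """
    version, num_utterances, bytes_per_header = struct.unpack_from("<iii", blob, 0)
    if bytes_per_header != 4:
        raise ValueError("this sketch only handles 4-byte offsets")
    if not 1 <= index <= num_utterances:
        raise ValueError("utterance index out of range")
    # num_utterances + 1 offsets follow the three 4-byte header fields.
    offsets = struct.unpack_from("<%di" % (num_utterances + 1), blob, 12)
    return blob[offsets[index - 1]:offsets[index]]


# Build a two-utterance blob with placeholder payloads instead of real audio.
utterances = [b"first-audio", b"second"]
header_size = 12 + 4 * (len(utterances) + 1)
offsets = [header_size]
for chunk in utterances:
    offsets.append(offsets[-1] + len(chunk))
blob = struct.pack("<iii", 1, len(utterances), 4)
blob += struct.pack("<%di" % len(offsets), *offsets)
blob += b"".join(utterances)

first = read_utterance(blob, 1)
second = read_utterance(blob, 2)
```

Storing one more offset than there are utterances lets each chunk be sliced as the span between two neighbouring offsets.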

gain_db(gain)[source]

Apply gain in decibels to samples.

Note that this is an in-place transformation.

Parameters: gain (float|1darray) -- Gain in decibels to apply to samples.
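
For reference, a gain in decibels corresponds to an amplitude factor of 10^(gain/20). The sketch below applies a single scalar gain to a list of samples; the helper name `apply_gain_db` is hypothetical and per-sample gain arrays are not covered.

```python
def apply_gain_db(samples, gain):
    """Scale samples by a gain given in decibels. Hypothetical helper."""
    factor = 10.0 ** (gain / 20.0)
    return [x * factor for x in samples]


louder = apply_gain_db([0.1, -0.2], 20.0)    # +20 dB is a factor of 10
quieter = apply_gain_db([0.5, -0.5], -20.0)  # -20 dB is a factor of 0.1
```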

classmethod make_silence(duration, sample_rate)[source]

Creates a silent audio segment of the given duration and sample rate.

Parameters

duration (float) -- Length of silence in seconds.
sample_rate (float) -- Sample rate.

Returns

Silent AudioSegment instance of the given duration.

Return type

AudioSegment

normalize(target_db=-20, max_gain_db=300.0)[source]

Normalize audio to be of the desired RMS value in decibels.

Note that this is an in-place transformation.

Parameters

target_db (float) -- Target RMS value in decibels. This value should be less than 0.0 as 0.0 is full-scale audio.
max_gain_db (float) -- Max amount of gain in dB that can be applied for normalization. This is to prevent nans when attempting to normalize a signal consisting of all zeros.

Raises

ValueError -- If the required gain to normalize the segment to the target_db value exceeds max_gain_db.
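
The relationship between target_db, the current RMS level, and max_gain_db can be sketched as follows. This is an illustration of the documented behavior rather than the library's implementation; the function name `normalization_gain` is hypothetical.

```python
def normalization_gain(current_rms_db, target_db=-20.0, max_gain_db=300.0):
    """Gain in dB that moves current_rms_db to target_db. Hypothetical helper."""
    gain = target_db - current_rms_db
    if gain > max_gain_db:
        raise ValueError(
            "required gain %.2f dB exceeds max_gain_db %.2f dB"
            % (gain, max_gain_db))
    return gain


quiet_gain = normalization_gain(-35.0)   # needs +15 dB to reach -20 dB
loud_gain = normalization_gain(-12.0)    # needs -8 dB to reach -20 dB
```

An all-zero signal has an RMS level of negative infinity, so the required gain is infinite and the check against max_gain_db fails instead of producing NaNs.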

normalize_online_bayesian(target_db, prior_db, prior_samples, startup_delay=0.0)[source]

Normalize audio using a production-compatible online/causal algorithm. This uses an exponential likelihood and gamma prior to make online estimates of the RMS even when there are very few samples.

Note that this is an in-place transformation.

Parameters

target_db (float) -- Target RMS value in decibels.
prior_db (float) -- Prior RMS estimate in decibels.
prior_samples (float) -- Prior strength in number of samples.
startup_delay (float) -- Default 0.0s. If provided, this function will accrue statistics for the first startup_delay seconds before applying online normalization.
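
One way to read the description above is that the running mean square is blended with a prior worth prior_samples samples at the prior_db level, and each sample receives the gain implied by the estimate available at that point. The sketch below follows that reading only; it is an assumption about the algorithm, omits startup_delay, and uses the hypothetical name `online_gains_db`.

```python
import math
from itertools import accumulate


def online_gains_db(samples, target_db, prior_db, prior_samples):
    """Per-sample gains (dB) from a causal RMS estimate. Hypothetical helper.

    Assumes prior_samples > 0 so the estimate is never exactly zero.
    """
    prior_mean_squared = 10.0 ** (prior_db / 10.0)
    prior_sum_of_squares = prior_mean_squared * prior_samples
    running_sum = accumulate(x * x for x in samples)
    gains = []
    for count, sum_of_squares in enumerate(running_sum, start=1):
        # Blend observed energy with the prior, weighted by sample counts.
        mean_squared = ((sum_of_squares + prior_sum_of_squares)
                        / (count + prior_samples))
        gains.append(target_db - 10.0 * math.log10(mean_squared))
    return gains


# A constant signal at -20 dB RMS with a prior that guesses -40 dB.
gains = online_gains_db([0.1] * 1000, target_db=-20.0,
                        prior_db=-40.0, prior_samples=10)
```

Early samples lean on the prior, so the first gain is large; as more samples arrive the estimate approaches the true level and the gain shrinks toward zero.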

property num_samples

Return number of samples.

Returns: Number of samples.
Return type: int

pad_silence(duration, sides='both')[source]

Pad this audio sample with a period of silence.

Note that this is an in-place transformation.

Parameters

duration (float) -- Length of silence in seconds to pad.
sides (str) -- Position for padding: 'beginning' - adds silence in the beginning; 'end' - adds silence in the end; 'both' - adds silence in both the beginning and the end.

Raises

ValueError -- If sides is not supported.
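
The three values of sides can be illustrated with a few lines of plain Python. The function name `pad_silence_sketch` is hypothetical, and lists stand in for the sample array.

```python
def pad_silence_sketch(samples, sample_rate, duration, sides="both"):
    """Return samples padded with zeros. Hypothetical helper."""
    silence = [0.0] * int(duration * sample_rate)
    if sides == "beginning":
        return silence + samples
    if sides == "end":
        return samples + silence
    if sides == "both":
        return silence + samples + silence
    raise ValueError("unknown value for sides: %s" % sides)


# 0.5 s of silence at 4 samples per second is 2 zero samples per side.
padded = pad_silence_sketch([1.0, 2.0], sample_rate=4, duration=0.5)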

random_subsegment(subsegment_length, rng=None)[source]

Cut the specified length of the audiosegment randomly.

Note that this is an in-place transformation.

Parameters

subsegment_length (float) -- Subsegment length in seconds.
rng (random.Random) -- Random number generator state.

Raises

ValueError -- If the length of the subsegment is greater than that of the original segment.
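
Choosing the random window amounts to drawing a start time so that the window still fits inside the segment. The sketch below shows that step only, with the hypothetical name `random_window`; passing a seeded random.Random makes the choice reproducible.

```python
import random


def random_window(duration, subsegment_length, rng=None):
    """Pick (start_sec, end_sec) of a random window. Hypothetical helper."""
    if rng is None:
        rng = random.Random()
    if subsegment_length > duration:
        raise ValueError("subsegment is longer than the original segment")
    start = rng.uniform(0.0, duration - subsegment_length)
    return start, start + subsegment_length


start_sec, end_sec = random_window(10.0, 2.5, rng=random.Random(0))
```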

resample(target_sample_rate, filter='kaiser_best')[source]

Resample the audio to a target sample rate.

Note that this is an in-place transformation.

Parameters

target_sample_rate (int) -- Target sample rate.
filter (str) -- The resampling filter to use, one of {'kaiser_best', 'kaiser_fast'}.

property rms_db

Return root mean square energy of the audio in decibels.

Returns: Root mean square energy in decibels.
Return type: float
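
RMS energy in decibels is conventionally 10 * log10(mean(x^2)). The sketch below computes it for plain lists; the helper name `rms_db_sketch` is hypothetical, and it is an assumption that the property uses this exact formula.

```python
import math


def rms_db_sketch(samples):
    """RMS energy in decibels for a list of samples. Hypothetical helper."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


level_full = rms_db_sketch([1.0, -1.0, 1.0, -1.0])   # full scale: 0 dB
level_half = rms_db_sketch([0.5, -0.5, 0.5, -0.5])   # about -6.02 dB
```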

property sample_rate

Return audio sample rate.

Returns: Audio sample rate.
Return type: int

property samples

Return audio samples.

Returns: Audio samples.
Return type: ndarray

shift(shift_ms)[source]

Shift the audio in time. If shift_ms is positive, shift with time advance; if negative, shift with time delay. Silence is padded to keep the duration unchanged.

Note that this is an in-place transformation.

Parameters: shift_ms (float) -- Shift time in milliseconds. If positive, shift with time advance; if negative, shift with time delay.
Raises: ValueError -- If shift_ms is longer than audio duration.
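
The two shift directions can be sketched with list slicing: a time advance drops samples from the start and pads zeros at the end, while a time delay does the opposite. The function name `shift_sketch` is hypothetical and lists stand in for the sample array.

```python
def shift_sketch(samples, sample_rate, shift_ms):
    """Return samples shifted in time, zero-padded to the same length.

    Hypothetical helper: positive shift_ms advances, negative delays.
    """
    if abs(shift_ms) / 1000.0 > len(samples) / sample_rate:
        raise ValueError("absolute value of shift_ms exceeds the duration")
    shift_samples = int(shift_ms * sample_rate / 1000)
    if shift_samples > 0:
        return samples[shift_samples:] + [0.0] * shift_samples
    if shift_samples < 0:
        return [0.0] * (-shift_samples) + samples[:shift_samples]
    return list(samples)


audio = [1.0, 2.0, 3.0, 4.0]                    # 4 ms at 1000 Hz
advanced = shift_sketch(audio, 1000, 1.0)       # content moves earlier
delayed = shift_sketch(audio, 1000, -2.0)       # content moves later
```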

classmethod slice_from_file(file, start=None, end=None)[source]

Load a small section of an audio file without reading the entire file into memory, which can be very wasteful for long recordings.

Parameters

file (str|file) -- Input audio filepath or file object.
start (float) -- Start time in seconds. If start is negative, it wraps around from the end. If not provided, this function reads from the very beginning.
end (float) -- End time in seconds. If end is negative, it wraps around from the end. If not provided, the default behavior is to read to the end of the file.

Returns

AudioSegment instance of the specified slice of the input audio file.

Return type

AudioSegment

Raises

ValueError -- If start or end is incorrectly set, e.g. out of bounds in time.
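
The wrap-around rule for negative start and end values can be sketched as simple arithmetic on the file duration. The function `resolve_slice` is hypothetical; it only converts times to frame indices and does not read any file.

```python
def resolve_slice(duration, sample_rate, start=None, end=None):
    """Map start/end in seconds to (start_frame, end_frame). Hypothetical helper."""
    start = 0.0 if start is None else start
    end = duration if end is None else end
    # Negative times count backwards from the end of the file.
    if start < 0.0:
        start += duration
    if end < 0.0:
        end += duration
    if not 0.0 <= start <= end <= duration:
        raise ValueError("slice (%f s, %f s) is out of bounds" % (start, end))
    return int(start * sample_rate), int(end * sample_rate)


whole_file = resolve_slice(10.0, 16000)
tail = resolve_slice(10.0, 16000, start=-3.0, end=-1.0)  # seconds 7 to 9
```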

subsegment(start_sec=None, end_sec=None)[source]

Cut the AudioSegment between given boundaries.

Note that this is an in-place transformation.

Parameters

start_sec (float) -- Beginning of subsegment in seconds.
end_sec (float) -- End of subsegment in seconds.

Raises

ValueError -- If start_sec or end_sec is incorrectly set, e.g. out of bounds in time.

superimpose(other)[source]

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

Note that this is an in-place transformation.

Parameters

other (AudioSegment) -- Segment containing samples to be added in.

Raises

TypeError -- If the types of the two segments do not match.
ValueError -- If the sample rates of the two segments are not equal, or if the lengths of segments don't match.
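
The difference between sample-wise addition and concatenation is easiest to see on short lists. The helper `superimpose_sketch` is hypothetical and ignores the type and sample rate checks listed above.

```python
def superimpose_sketch(first, second):
    """Add two equal-length sample lists element by element. Hypothetical helper."""
    if len(first) != len(second):
        raise ValueError("length does not match between the two segments")
    return [a + b for a, b in zip(first, second)]


first = [0.1, 0.2, 0.3]
second = [0.3, 0.2, 0.1]
mixed = superimpose_sketch(first, second)   # still 3 samples long
joined = first + second                     # concatenation: 6 samples long
```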

to(dtype='int16')[source]

Create a dtype audio content.

Parameters: dtype (str) -- Data type for export samples. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'int16'.
Returns: np.ndarray containing dtype audio content.
Return type: ndarray
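
A common convention for exporting floating-point samples to an integer type is to scale by 2^(bits-1) and clip to the representable range. The sketch below shows that convention for 16-bit output; whether the library uses the same scaling is an assumption, and the name `float_to_int` is hypothetical.

```python
def float_to_int(samples, bits=16):
    """Convert floats in [-1.0, 1.0] to signed integers. Hypothetical helper."""
    scale = 2 ** (bits - 1)
    lowest, highest = -scale, scale - 1
    # Clip so that +1.0 does not overflow the signed range.
    return [max(lowest, min(highest, int(x * scale))) for x in samples]


pcm = float_to_int([0.0, 0.5, -1.0, 1.0])
```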

to_bytes(dtype='float32')[source]

Create a byte string containing the audio content.

Parameters: dtype (str) -- Data type for export samples. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'float32'.
Returns: Byte string containing audio content.
Return type: str

to_wav_file(filepath, dtype='float32')[source]

Save audio segment to disk as wav file.

Parameters

filepath (str|file) -- WAV filepath or file object to save the audio segment.
dtype (str) -- Subtype for audio file. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'float32'.

Raises

TypeError -- If dtype is not supported.