paddlespeech.s2t.frontend.audio module

Contains the audio segment class.

class paddlespeech.s2t.frontend.audio.AudioSegment(samples, sample_rate)[source]

Bases: object

Monaural audio segment abstraction.

Parameters
  • samples (ndarray.float32) -- Audio samples [num_samples x num_channels].

  • sample_rate (int) -- Audio sample rate.

Raises

TypeError -- If the sample data type is not float or int.

Attributes
duration

Return audio duration.

num_samples

Return number of samples.

rms_db

Return root mean square energy of the audio in decibels.

sample_rate

Return audio sample rate.

samples

Return audio samples.

Methods

add_noise(noise, snr_dB[, ...])

Add the given noise segment at a specific signal-to-noise ratio.

change_speed(speed_rate)

Change the audio speed by linear interpolation.

concatenate(*segments)

Concatenate an arbitrary number of audio segments together.

convolve(impulse_segment[, allow_resample])

Convolve this audio segment with the given impulse segment.

convolve_and_normalize(impulse_segment[, ...])

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

from_bytes(bytes)

Create audio segment from a byte string containing audio samples.

from_file(file[, infos])

Create audio segment from audio file.

from_pcm(samples, sample_rate)

Create audio segment from an array of PCM audio samples.

from_sequence_file(filepath)

Create audio segment from sequence file.

gain_db(gain)

Apply gain in decibels to samples.

make_silence(duration, sample_rate)

Creates a silent audio segment of the given duration and sample rate.

normalize([target_db, max_gain_db])

Normalize audio to be of the desired RMS value in decibels.

normalize_online_bayesian(target_db, ...[, ...])

Normalize audio using a production-compatible online/causal algorithm.

pad_silence(duration[, sides])

Pad this audio sample with a period of silence.

random_subsegment(subsegment_length[, rng])

Randomly cut a subsegment of the specified length from the audio segment.

resample(target_sample_rate[, filter])

Resample the audio to a target sample rate.

shift(shift_ms)

Shift the audio in time.

slice_from_file(file[, start, end])

Load a small section of an audio file without loading the entire file into memory, which can be very wasteful.

subsegment([start_sec, end_sec])

Cut the AudioSegment between given boundaries.

superimpose(other)

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

to([dtype])

Return the audio content converted to the given dtype.

to_bytes([dtype])

Create a byte string containing the audio content.

to_wav_file(filepath[, dtype])

Save audio segment to disk as wav file.

add_noise(noise, snr_dB, allow_downsampling=False, max_gain_db=300.0, rng=None)[source]

Add the given noise segment at a specific signal-to-noise ratio. If the noise segment is longer than this segment, a random subsegment of matching length is sampled from it and used instead.

Note that this is an in-place transformation.

Parameters
  • noise (AudioSegment) -- Noise signal to add.

  • snr_dB (float) -- Signal-to-Noise Ratio, in decibels.

  • allow_downsampling (bool) -- Whether to allow the noise signal to be downsampled to match the base signal sample rate.

  • max_gain_db (float) -- Maximum amount of gain to apply to noise signal before adding it in. This is to prevent attempting to apply infinite gain to a zero signal.

  • rng (None|random.Random) -- Random number generator state.

Raises

ValueError -- If the sample rates of the two audio segments do not match and downsampling is not allowed, or if the noise segment is shorter than the original audio segment.
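
To make the role of snr_dB and max_gain_db more concrete, the following is a minimal sketch of mixing noise at a target signal-to-noise ratio. It is not the library's implementation: it works on plain Python lists, assumes the noise has already been cut to the same length as the signal, and the helper names rms_db_of and mix_at_snr are hypothetical.

```python
import math


def rms_db_of(samples):
    """RMS energy in dB of a non-silent list of float samples."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def mix_at_snr(signal, noise, snr_db, max_gain_db=300.0):
    """Scale `noise` so the signal-to-noise ratio is `snr_db`, then add it."""
    if len(noise) != len(signal):
        raise ValueError("this sketch expects noise and signal of equal length")
    # Gain (in dB) that places the noise snr_db below the signal, capped so a
    # near-silent noise segment is never amplified without bound.
    noise_gain_db = min(rms_db_of(signal) - rms_db_of(noise) - snr_db, max_gain_db)
    scale = 10.0 ** (noise_gain_db / 20.0)
    scaled_noise = [n * scale for n in noise]
    mixed = [s + n for s, n in zip(signal, scaled_noise)]
    return mixed, scaled_noise


signal = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
mixed, scaled_noise = mix_at_snr(signal, noise, snr_db=10.0)
```

After scaling, the noise RMS sits snr_db decibels below the signal RMS, which is the property the snr_dB parameter describes.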

change_speed(speed_rate)[source]

Change the audio speed by linear interpolation.

Note that this is an in-place transformation.

Parameters

speed_rate (float) -- Rate of speed change. speed_rate > 1.0 speeds up the audio; speed_rate = 1.0 leaves it unchanged; speed_rate < 1.0 slows it down; speed_rate <= 0.0 is not allowed and raises ValueError.

Raises

ValueError -- If speed_rate <= 0.0.
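
The idea of changing speed by linear interpolation can be sketched as follows. This is an illustration on plain Python lists, not the library's code: the function name change_speed_sketch is hypothetical, and the exact grid of interpolation positions used by AudioSegment.change_speed may differ.

```python
def change_speed_sketch(samples, speed_rate):
    """Return a new list resampled to len(samples) / speed_rate points."""
    if speed_rate <= 0.0:
        raise ValueError("speed_rate should be greater than zero")
    old_length = len(samples)
    new_length = int(old_length / speed_rate)
    if new_length < 1:
        return []
    if new_length == 1:
        return [float(samples[0])]
    # Spread new_length positions evenly over the original index range.
    step = (old_length - 1) / (new_length - 1)
    resampled = []
    for i in range(new_length):
        position = i * step
        left = min(int(position), old_length - 1)
        right = min(left + 1, old_length - 1)
        frac = position - left
        # Linear interpolation between the two neighbouring samples.
        resampled.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return resampled


ramp = [0.0, 1.0, 2.0, 3.0]
faster = change_speed_sketch(ramp, 2.0)  # half as many samples
slower = change_speed_sketch(ramp, 0.5)  # twice as many samples
```

Because the sample rate is unchanged, fewer samples play back faster and more samples play back slower.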

classmethod concatenate(*segments)[source]

Concatenate an arbitrary number of audio segments together.

Parameters

*segments --

Input audio segments to be concatenated.

Returns

Audio segment instance resulting from the concatenation.

Return type

AudioSegment

Raises
  • ValueError -- If the number of segments is zero, or if the sample rates of the segments do not all match.

  • TypeError -- If any segment is not an AudioSegment instance.

convolve(impulse_segment, allow_resample=False)[source]

Convolve this audio segment with the given impulse segment.

Note that this is an in-place transformation.

Parameters
  • impulse_segment (AudioSegment) -- Impulse response segment.

  • allow_resample (bool) -- Indicates whether resampling is allowed when the impulse_segment has a different sample rate from this signal.

Raises

ValueError -- If the sample rates of the two audio segments do not match and resampling is not allowed.

convolve_and_normalize(impulse_segment, allow_resample=False)[source]

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

Note that this is an in-place transformation.

Parameters
  • impulse_segment (AudioSegment) -- Impulse response segment.

  • allow_resample (bool) -- Indicates whether resampling is allowed when the impulse_segment has a different sample rate from this signal.

property duration

Return audio duration.

Returns

Audio duration in seconds.

Return type

float

classmethod from_bytes(bytes)[source]

Create audio segment from a byte string containing audio samples.

Parameters

bytes (str) -- Byte string containing audio samples.

Returns

Audio segment instance.

Return type

AudioSegment

classmethod from_file(file, infos=None)[source]

Create audio segment from audio file.

Parameters
  • file (str|file) -- Filepath or file object to audio file.

  • infos (TarLocalData, optional) -- tar2obj and tar2infos. Defaults to None.

Returns

Audio segment instance.

Return type

AudioSegment

classmethod from_pcm(samples, sample_rate)[source]

Create audio segment from an array of PCM audio samples.

Parameters
  • samples (numpy.ndarray) -- Audio samples [num_samples x num_channels].

  • sample_rate (int) -- Audio sample rate.

Returns

Audio segment instance.

Return type

AudioSegment

classmethod from_sequence_file(filepath)[source]

Create audio segment from sequence file. Sequence file is a binary file containing a collection of multiple audio files, with several header bytes in the head indicating the offsets of each audio byte data chunk.

The format is:

  4 bytes (int, version),
  4 bytes (int, num of utterance),
  4 bytes (int, bytes per header),
  [bytes_per_header*(num_utterance+1)] bytes (offsets for each audio),
  audio_bytes_data_of_1st_utterance,
  audio_bytes_data_of_2nd_utterance,
  ......

The sequence file name must end with ".seqbin". The 5th utterance's audio file in sequence file "xxx.seqbin" must be named "xxx.seqbin_5", where "5" is the utterance index within the sequence file (starting from 1).

Parameters

filepath (str) -- Filepath of sequence file.

Returns

Audio segment instance.

Return type

AudioSegment
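
The header layout described above can be read with the struct module. The sketch below builds a tiny in-memory sequence file and extracts one utterance from it. Several details are assumptions rather than documented facts: integers are taken to be little-endian signed 32-bit values, bytes per header is taken to be 4, and offsets are taken to be absolute positions from the start of the file. The function name read_utterance is hypothetical.

```python
import io
import struct


def read_utterance(stream, index):
    """Return the raw bytes of the utterance with 1-based `index`."""
    stream.seek(0)
    version, num_utterances, bytes_per_header = struct.unpack("<iii", stream.read(12))
    if bytes_per_header != 4:
        raise ValueError("this sketch only handles 4-byte offsets")
    if not 1 <= index <= num_utterances:
        raise ValueError("utterance index out of range")
    # num_utterances + 1 offsets: entry i marks where utterance i + 1 starts,
    # and the final entry marks the end of the last utterance.
    count = num_utterances + 1
    offsets = struct.unpack("<%di" % count, stream.read(4 * count))
    stream.seek(offsets[index - 1])
    return stream.read(offsets[index] - offsets[index - 1])


# Build a small fake sequence file holding two byte chunks.
chunks = [b"first-audio", b"second"]
header_size = 12 + 4 * (len(chunks) + 1)
offsets = [header_size]
for chunk in chunks:
    offsets.append(offsets[-1] + len(chunk))
blob = struct.pack("<iii", 1, len(chunks), 4)
blob += struct.pack("<%di" % len(offsets), *offsets)
blob += b"".join(chunks)

second = read_utterance(io.BytesIO(blob), 2)
```

In a real sequence file each chunk would hold the bytes of one audio file rather than the placeholder strings used here.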

gain_db(gain)[source]

Apply gain in decibels to samples.

Note that this is an in-place transformation.

Parameters

gain (float|1darray) -- Gain in decibels to apply to samples.
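
A gain in decibels corresponds to multiplying each sample by 10^(gain/20). The sketch below shows this for a scalar gain on a plain Python list; the helper name apply_gain_db is hypothetical, and the per-sample (1darray) gain case is not covered.

```python
def apply_gain_db(samples, gain_db):
    """Return samples scaled by a scalar gain given in decibels."""
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in samples]


unchanged = apply_gain_db([0.5, -0.25], 0.0)   # 0 dB leaves samples as they are
ten_times = apply_gain_db([0.5, -0.25], 20.0)  # +20 dB multiplies amplitude by 10
```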

classmethod make_silence(duration, sample_rate)[source]

Creates a silent audio segment of the given duration and sample rate.

Parameters
  • duration (float) -- Length of silence in seconds.

  • sample_rate (float) -- Sample rate.

Returns

Silent AudioSegment instance of the given duration.

Return type

AudioSegment

normalize(target_db=-20, max_gain_db=300.0)[source]

Normalize audio to be of the desired RMS value in decibels.

Note that this is an in-place transformation.

Parameters
  • target_db (float) -- Target RMS value in decibels. This value should be less than 0.0 as 0.0 is full-scale audio.

  • max_gain_db (float) -- Maximum amount of gain in dB that can be applied for normalization. This prevents NaNs when attempting to normalize a signal consisting of all zeros.

Raises

ValueError -- If the required gain to normalize the segment to the target_db value exceeds max_gain_db.
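
The relationship between target_db, the current RMS level, and max_gain_db can be sketched as below. This is a simplified illustration on plain Python lists, not the library's code; the function name normalize_sketch is hypothetical, and the explicit all-zero check is an assumption made so the example fails cleanly.

```python
import math


def normalize_sketch(samples, target_db=-20.0, max_gain_db=300.0):
    """Return samples scaled so their RMS energy equals target_db."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        raise ValueError("cannot normalize an all-zero signal")
    # Required gain is the distance from the current RMS level to the target.
    gain_db = target_db - 10.0 * math.log10(mean_square)
    if gain_db > max_gain_db:
        raise ValueError("required gain exceeds max_gain_db")
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in samples]


normalized = normalize_sketch([0.5, -0.5, 0.5, -0.5], target_db=-20.0)
```

With target_db=-20 the normalized samples have a mean square of 0.01, that is an amplitude of about 0.1 for this square-wave input.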

normalize_online_bayesian(target_db, prior_db, prior_samples, startup_delay=0.0)[source]

Normalize audio using a production-compatible online/causal algorithm. This uses an exponential likelihood and gamma prior to make online estimates of the RMS even when there are very few samples.

Note that this is an in-place transformation.

Parameters
  • target_db (float) -- Target RMS value in decibels.

  • prior_db (float) -- Prior RMS estimate in decibels.

  • prior_samples (float) -- Prior strength in number of samples.

  • startup_delay (float) -- Default 0.0s. If provided, this function will accrue statistics for the first startup_delay seconds before applying online normalization.

property num_samples

Return number of samples.

Returns

Number of samples.

Return type

int

pad_silence(duration, sides='both')[source]

Pad this audio sample with a period of silence.

Note that this is an in-place transformation.

Parameters
  • duration (float) -- Length of silence in seconds to pad.

  • sides (str) -- Position for padding: 'beginning' - adds silence in the beginning; 'end' - adds silence in the end; 'both' - adds silence in both the beginning and the end.

Raises

ValueError -- If sides is not supported.
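
The three values of sides can be illustrated with a small sketch on plain Python lists. The function name pad_silence_sketch is hypothetical, and converting the duration to a sample count with int(duration * sample_rate) is an assumption about rounding.

```python
def pad_silence_sketch(samples, sample_rate, duration, sides="both"):
    """Return samples with `duration` seconds of zeros added on the given side(s)."""
    silence = [0.0] * int(duration * sample_rate)
    if sides == "beginning":
        return silence + samples
    if sides == "end":
        return samples + silence
    if sides == "both":
        return silence + samples + silence
    raise ValueError("Unknown value for sides: %r" % sides)


padded = pad_silence_sketch([0.3, -0.3], sample_rate=4, duration=0.5, sides="both")
```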

random_subsegment(subsegment_length, rng=None)[source]

Randomly cut a subsegment of the specified length from the audio segment.

Note that this is an in-place transformation.

Parameters
  • subsegment_length (float) -- Subsegment length in seconds.

  • rng (random.Random) -- Random number generator state.

Raises

ValueError -- If the length of the subsegment is greater than that of the original segment.
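
One way to pick a random subsegment is to draw a start time uniformly from the range that still leaves room for the requested length. The sketch below only computes the boundaries in seconds; the function name random_subsegment_bounds is hypothetical and the uniform draw is an assumption about how the start time is chosen.

```python
import random


def random_subsegment_bounds(duration, subsegment_length, rng=None):
    """Return (start_sec, end_sec) of a random window inside [0, duration]."""
    rng = random.Random() if rng is None else rng
    if subsegment_length > duration:
        raise ValueError("subsegment is longer than the original segment")
    start = rng.uniform(0.0, duration - subsegment_length)
    return start, start + subsegment_length


# Passing a seeded generator makes the cut reproducible.
start, end = random_subsegment_bounds(10.0, 2.5, rng=random.Random(0))
```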

resample(target_sample_rate, filter='kaiser_best')[source]

Resample the audio to a target sample rate.

Note that this is an in-place transformation.

Parameters
  • target_sample_rate (int) -- Target sample rate.

  • filter (str) -- The resampling filter to use, one of {'kaiser_best', 'kaiser_fast'}.

property rms_db

Return root mean square energy of the audio in decibels.

Returns

Root mean square energy in decibels.

Return type

float
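
As a rough illustration of the quantity this property reports, the sketch below computes RMS energy in decibels as 10 * log10 of the mean squared sample value. It assumes floating-point samples where an amplitude of 1.0 is full scale (0 dB); the helper name rms_db_of is hypothetical and not part of the class.

```python
import math


def rms_db_of(samples):
    """RMS energy in dB of a non-empty, non-silent list of float samples."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


full_scale = rms_db_of([1.0, -1.0, 1.0, -1.0])  # 0 dB
half_scale = rms_db_of([0.5, -0.5, 0.5, -0.5])  # about -6.02 dB
```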

property sample_rate

Return audio sample rate.

Returns

Audio sample rate.

Return type

int

property samples

Return audio samples.

Returns

Audio samples.

Return type

ndarray

shift(shift_ms)[source]

Shift the audio in time. If shift_ms is positive, shift with time advance; if negative, shift with time delay. Silence is padded to keep the duration unchanged.

Note that this is an in-place transformation.

Parameters

shift_ms (float) -- Shift time in milliseconds. If positive, shift with time advance; if negative, shift with time delay.

Raises

ValueError -- If the absolute value of shift_ms is longer than the audio duration.
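
The two shift directions can be illustrated on a plain Python list. In this sketch a positive shift drops samples from the start and pads zeros at the end (time advance), while a negative shift pads zeros at the start and drops samples from the end (time delay). The function name shift_sketch is hypothetical, and the truncating conversion from milliseconds to samples is an assumption.

```python
def shift_sketch(samples, sample_rate, shift_ms):
    """Return samples shifted in time, padded with zeros to keep the length."""
    if abs(shift_ms) / 1000.0 > len(samples) / sample_rate:
        raise ValueError("shift_ms is longer than the audio duration")
    offset = int(shift_ms * sample_rate / 1000)
    if offset > 0:
        # Time advance: content moves earlier, silence fills the end.
        return samples[offset:] + [0.0] * offset
    if offset < 0:
        # Time delay: silence fills the start, content moves later.
        return [0.0] * (-offset) + samples[:offset]
    return list(samples)


audio = [1.0, 2.0, 3.0, 4.0]
advanced = shift_sketch(audio, sample_rate=1000, shift_ms=1)
delayed = shift_sketch(audio, sample_rate=1000, shift_ms=-2)
```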

classmethod slice_from_file(file, start=None, end=None)[source]

Load a small section of an audio file without loading the entire file into memory, which can be very wasteful.

Parameters
  • file (str|file) -- Input audio filepath or file object.

  • start (float) -- Start time in seconds. If start is negative, it wraps around from the end. If not provided, this function reads from the very beginning.

  • end (float) -- End time in seconds. If end is negative, it wraps around from the end. If not provided, the default behavior is to read to the end of the file.

Returns

AudioSegment instance of the specified slice of the input audio file.

Return type

AudioSegment

Raises

ValueError -- If start or end is incorrectly set, e.g. out of bounds in time.
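
The handling of missing and negative start and end values can be sketched as a small boundary-resolution helper. This only converts times to sample indices and does not read any file; the function name resolve_slice is hypothetical, and the exact validation rules are assumptions based on the parameter descriptions above.

```python
def resolve_slice(duration, sample_rate, start=None, end=None):
    """Return (start_index, end_index) in samples for a slice given in seconds."""
    start = 0.0 if start is None else start
    end = duration if end is None else end
    # Negative times count back from the end of the audio.
    if start < 0.0:
        start += duration
    if end < 0.0:
        end += duration
    if start < 0.0 or end < 0.0 or start > end or end > duration:
        raise ValueError("slice boundaries are out of bounds")
    return int(start * sample_rate), int(end * sample_rate)


last_two_seconds = resolve_slice(10.0, 16000, start=-2.0)
trimmed = resolve_slice(10.0, 16000, start=1.0, end=-1.0)
```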

subsegment(start_sec=None, end_sec=None)[source]

Cut the AudioSegment between given boundaries.

Note that this is an in-place transformation.

Parameters
  • start_sec (float) -- Beginning of subsegment in seconds.

  • end_sec (float) -- End of subsegment in seconds.

Raises

ValueError -- If start_sec or end_sec is incorrectly set, e.g. out of bounds in time.

superimpose(other)[source]

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

Note that this is an in-place transformation.

Parameters

other (AudioSegment) -- Segment containing samples to be added in.

Raises
  • TypeError -- If the types of the two segments don't match.

  • ValueError -- If the sample rates of the two segments are not equal, or if the lengths of segments don't match.
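
Sample-wise addition and the two ValueError conditions can be illustrated as below. The sketch takes raw sample lists and sample rates instead of AudioSegment objects, so the TypeError check is not shown; the function name superimpose_sketch is hypothetical.

```python
def superimpose_sketch(samples, sample_rate, other_samples, other_sample_rate):
    """Return the element-wise sum of two equally long sample lists."""
    if sample_rate != other_sample_rate:
        raise ValueError("sample rates must match")
    if len(samples) != len(other_samples):
        raise ValueError("segment lengths must match")
    return [a + b for a, b in zip(samples, other_samples)]


summed = superimpose_sketch([0.25, 0.5], 16000, [0.5, -0.5], 16000)
```

The result has the same length as the inputs, unlike concatenation, which would produce a segment twice as long.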

to(dtype='int16')[source]

Return the audio content converted to the given dtype.

Parameters

dtype (str) -- Data type for export samples. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'int16'.

Returns

Array containing the audio content in the given dtype.

Return type

np.ndarray

to_bytes(dtype='float32')[source]

Create a byte string containing the audio content.

Parameters

dtype (str) -- Data type for export samples. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'float32'.

Returns

Byte string containing audio content.

Return type

str
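
As an illustration of exporting with dtype='int16', the sketch below scales float samples to the 16-bit integer range, clips them, and packs them into a byte string. It is not the library's implementation: the scaling factor of 2^15, the little-endian byte order, and the function name to_int16_bytes are all assumptions.

```python
import struct


def to_int16_bytes(samples):
    """Pack float samples in [-1.0, 1.0] as little-endian 16-bit integers."""
    ints = []
    for s in samples:
        scaled = int(s * 32768)
        # Clip to the valid int16 range so +1.0 does not overflow.
        ints.append(max(-32768, min(32767, scaled)))
    return struct.pack("<%dh" % len(ints), *ints)


payload = to_int16_bytes([0.0, 0.5, 1.0, -1.0])
decoded = struct.unpack("<4h", payload)
```

Each sample occupies two bytes, so the byte string is twice as long as the number of samples.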

to_wav_file(filepath, dtype='float32')[source]

Save audio segment to disk as wav file.

Parameters
  • filepath (str|file) -- WAV filepath or file object to save the audio segment.

  • dtype (str) -- Subtype for audio file. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'float32'.

Raises

TypeError -- If dtype is not supported.