paddlespeech.s2t.frontend.speech module

Contains the speech segment class.

class paddlespeech.s2t.frontend.speech.SpeechSegment(samples, sample_rate, transcript, tokens=None, token_ids=None)[source]

Bases: AudioSegment

Speech Segment with Text

Args:

AudioSegment (AudioSegment): Audio Segment

Attributes
duration

Return audio duration.

has_token
num_samples

Return number of samples.

rms_db

Return root mean square energy of the audio in decibels.

sample_rate

Return audio sample rate.

samples

Return audio samples.

token_ids

Return the transcript text token ids.

tokens

Return the transcript text tokens.

transcript

Return the transcript text.

Methods

add_noise(noise, snr_dB[, ...])

Add the given noise segment at a specific signal-to-noise ratio.

change_speed(speed_rate)

Change the audio speed by linear interpolation.

concatenate(*segments)

Concatenate an arbitrary number of speech segments together; both audio and transcript are concatenated.

convolve(impulse_segment[, allow_resample])

Convolve this audio segment with the given impulse segment.

convolve_and_normalize(impulse_segment[, ...])

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

from_bytes(bytes, transcript[, tokens, ...])

Create speech segment from a byte string and corresponding transcript.

from_file(filepath, transcript[, tokens, ...])

Create speech segment from audio file and corresponding transcript.

from_pcm(samples, sample_rate, transcript[, ...])

Create speech segment from PCM samples in online mode.

from_sequence_file(filepath)

Create audio segment from sequence file.

gain_db(gain)

Apply gain in decibels to samples.

make_silence(duration, sample_rate)

Create a silent speech segment of the given duration and sample rate; the transcript will be an empty string.

normalize([target_db, max_gain_db])

Normalize audio to be of the desired RMS value in decibels.

normalize_online_bayesian(target_db, ...[, ...])

Normalize audio using a production-compatible online/causal algorithm.

pad_silence(duration[, sides])

Pad this audio sample with a period of silence.

random_subsegment(subsegment_length[, rng])

Cut a random subsegment of the specified length from the audio segment.

resample(target_sample_rate[, filter])

Resample the audio to a target sample rate.

shift(shift_ms)

Shift the audio in time.

slice_from_file(filepath, transcript[, ...])

Load a small section of a speech file without loading the entire file into memory, which can be very wasteful.

subsegment([start_sec, end_sec])

Cut the AudioSegment between given boundaries.

superimpose(other)

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

to([dtype])

Return the audio content converted to the given dtype.

to_bytes([dtype])

Create a byte string containing the audio content.

to_wav_file(filepath[, dtype])

Save audio segment to disk as wav file.

classmethod concatenate(*segments)[source]

Concatenate an arbitrary number of speech segments together; both audio and transcript are concatenated.

Parameters

*segments --

Input speech segments to be concatenated.

Returns

Speech segment instance.

Return type

SpeechSegment

Raises
  • ValueError -- If the number of segments is zero, or if the sample rates of any two segments do not match.

  • TypeError -- If any segment is not a SpeechSegment instance.
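The validation and joining behaviour described above can be sketched without the library. This is a minimal illustration only: `MiniSpeechSegment` is a hypothetical stand-in, not the PaddleSpeech class, samples are plain lists rather than arrays, and joining transcripts with no separator is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MiniSpeechSegment:
    """Hypothetical stand-in for SpeechSegment (not the PaddleSpeech class)."""
    samples: List[float]
    sample_rate: int
    transcript: str

    @classmethod
    def concatenate(cls, *segments):
        # Documented error case: no segments given.
        if len(segments) == 0:
            raise ValueError("No segments were given to concatenate.")
        sample_rate = segments[0].sample_rate
        samples = []
        transcript = ""
        for seg in segments:
            # Documented error case: wrong segment type.
            if not isinstance(seg, cls):
                raise TypeError("Only segments of the same type can be concatenated.")
            # Documented error case: mismatched sample rates.
            if seg.sample_rate != sample_rate:
                raise ValueError("Cannot concatenate segments with different sample rates.")
            samples.extend(seg.samples)
            # Assumption: transcripts are joined directly, with no separator.
            transcript += seg.transcript
        return cls(samples, sample_rate, transcript)


a = MiniSpeechSegment([0.1, 0.2], 16000, "hello ")
b = MiniSpeechSegment([0.3], 16000, "world")
joined = MiniSpeechSegment.concatenate(a, b)
print(joined.transcript, len(joined.samples))
```

Because no separator is assumed here, callers would need to include any spacing in the transcripts themselves.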

classmethod from_bytes(bytes, transcript, tokens=None, token_ids=None)[source]

Create speech segment from a byte string and corresponding transcript.

Args:

bytes (bytes): Byte string containing the audio content.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): text tokens. Defaults to None.
token_ids (List[int], optional): text token ids. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.
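To illustrate what decoding audio from a byte string involves, the sketch below uses Python's standard `wave` module on an in-memory 16-bit mono WAV. This is a swapped-in technique for illustration; it is not claimed to be how `from_bytes` decodes audio, and `wav_bytes_to_samples` is a hypothetical helper name.

```python
import io
import struct
import wave


def wav_bytes_to_samples(data):
    """Decode 16-bit PCM WAV bytes into floats in [-1, 1) plus the sample rate.

    Hypothetical helper; multi-channel audio would come back interleaved.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("This sketch only handles 16-bit PCM.")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    count = len(frames) // 2
    ints = struct.unpack("<%dh" % count, frames)
    # Scale signed 16-bit integers to floating point.
    return [v / 32768.0 for v in ints], sample_rate


# Build a tiny mono WAV in memory so the example needs no file on disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<4h", 0, 16384, -16384, 0))
audio_bytes = buffer.getvalue()

samples, sample_rate = wav_bytes_to_samples(audio_bytes)
print(samples, sample_rate)
```

The decoded samples and sample rate are the pieces that, together with a transcript, a speech segment is built from.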

classmethod from_file(filepath, transcript, tokens=None, token_ids=None, infos=None)[source]

Create speech segment from audio file and corresponding transcript.

Args:

filepath (str|file): Filepath or file object to audio file.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): text tokens. Defaults to None.
token_ids (List[int], optional): text token ids. Defaults to None.
infos (TarLocalData, optional): tar2obj and tar2infos. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.

classmethod from_pcm(samples, sample_rate, transcript, tokens=None, token_ids=None)[source]

Create speech segment from PCM samples in online mode.

Args:

samples (numpy.ndarray): Audio samples [num_samples x num_channels].
sample_rate (int): Audio sample rate.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): text tokens. Defaults to None.
token_ids (List[int], optional): text token ids. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.

property has_token
classmethod make_silence(duration, sample_rate)[source]

Create a silent speech segment of the given duration and sample rate; the transcript will be an empty string.

Args:

duration (float): Length of silence in seconds.
sample_rate (float): Sample rate.

Returns:

SpeechSegment: Silence of the given duration.
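The relationship between duration, sample rate, and the resulting silent samples can be sketched as follows. `make_silence_samples` is a hypothetical helper, and truncating the sample count to an integer is an assumption about the rounding behaviour.

```python
def make_silence_samples(duration, sample_rate):
    """Return (samples, transcript) for a silent segment.

    Hypothetical helper, not part of PaddleSpeech.
    """
    # Assumption: the sample count is duration * sample_rate, truncated.
    num_samples = int(duration * sample_rate)
    # Silence is all-zero samples; the transcript is an empty string.
    return [0.0] * num_samples, ""


samples, transcript = make_silence_samples(0.5, 16000)
print(len(samples), repr(transcript))
```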

classmethod slice_from_file(filepath, transcript, tokens=None, token_ids=None, start=None, end=None)[source]

Load a small section of a speech file without loading the entire file into memory, which can be very wasteful.

Parameters
  • filepath (str|file) -- Filepath or file object to audio file.

  • start (float) -- Start time in seconds. If start is negative, it wraps around from the end. If not provided, this function reads from the very beginning.

  • end (float) -- End time in seconds. If end is negative, it wraps around from the end. If not provided, the default behavior is to read to the end of the file.

  • transcript -- Transcript text for the speech. If not provided, the default is an empty string.

Returns

SpeechSegment instance of the specified slice of the input speech file.

Return type

SpeechSegment
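The wrap-around rule for negative `start` and `end` values can be sketched as a small standalone function. `resolve_slice_bounds` is a hypothetical helper, and the validation checks at the end are assumptions about reasonable error handling rather than documented behaviour.

```python
def resolve_slice_bounds(duration, start=None, end=None):
    """Turn optional, possibly negative bounds into absolute seconds.

    Hypothetical helper, not part of PaddleSpeech.
    """
    # Missing bounds default to the beginning and the end of the file.
    start = 0.0 if start is None else float(start)
    end = float(duration) if end is None else float(end)
    # Negative values wrap around from the end of the file.
    if start < 0.0:
        start += duration
    if end < 0.0:
        end += duration
    # Assumed sanity checks on the resolved bounds.
    if start < 0.0 or end < 0.0:
        raise ValueError("Slice bound is out of range after wrapping.")
    if start > end:
        raise ValueError("Slice start is later than slice end.")
    if end > duration:
        raise ValueError("Slice end is beyond the end of the file.")
    return start, end


print(resolve_slice_bounds(10.0, start=-3.0))           # last three seconds
print(resolve_slice_bounds(10.0, start=2.0, end=-1.0))  # from 2 s to 1 s before the end
```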

property token_ids

Return the transcript text token ids.

Returns:

List[int]: text token ids.

property tokens

Return the transcript text tokens.

Returns:

List[str]: text tokens.

property transcript

Return the transcript text.

Returns:

str: Transcript text for the speech.