paddlespeech.s2t.utils.error_rate module

This module provides functions to calculate error rates at different levels, e.g. WER at the word level and CER at the character level.

class paddlespeech.s2t.utils.error_rate.ErrorCalculator(char_list, sym_space, sym_blank, report_cer=False, report_wer=False)[source]

Bases: object

Calculate CER and WER for E2E_ASR and CTC models during training.

Parameters

y_hats -- numpy array with predicted text
y_pads -- numpy array with true (target) text
char_list -- list of characters (List[str])
sym_space -- the space symbol, e.g. <space>
sym_blank -- the blank symbol, e.g. <blank>

Returns

Methods

`__call__`(ys_hat, ys_pad[, is_ctc])	Calculate sentence-level WER/CER score.
`calculate_cer`(seqs_hat, seqs_true)	Calculate sentence-level CER score.
`calculate_cer_ctc`(ys_hat, ys_pad)	Calculate sentence-level CER score for CTC.
`calculate_wer`(seqs_hat, seqs_true)	Calculate sentence-level WER score.
`convert_to_char`(ys_hat, ys_pad)	Convert index to character.

calculate_cer(seqs_hat, seqs_true)[source]

Calculate sentence-level CER score.

Parameters

seqs_hat (list) -- prediction
seqs_true (list) -- reference

Returns

average sentence-level CER score

Return type

float

calculate_cer_ctc(ys_hat, ys_pad)[source]

Calculate sentence-level CER score for CTC.

Parameters

ys_hat (paddle.Tensor) -- prediction (batch, seqlen)
ys_pad (paddle.Tensor) -- reference (batch, seqlen)

Returns

average sentence-level CER score

Return type

float
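For orientation, CTC outputs are normally collapsed before they are scored: consecutive repeated indices are merged and blank symbols are dropped. The sketch below shows that standard collapse rule only. Whether `calculate_cer_ctc` performs exactly these steps is an assumption, and `ctc_collapse`, the vocabulary, and the frame indices are hypothetical names and values chosen for illustration.

```python
from itertools import groupby


def ctc_collapse(ids, blank_id):
    """Merge consecutive repeats, then drop blanks (standard CTC collapse rule)."""
    merged = [key for key, _ in groupby(ids)]
    return [i for i in merged if i != blank_id]


# Hypothetical vocabulary and per-frame predictions, for illustration only.
char_list = ["<blank>", "<space>", "a", "b", "c"]
frame_ids = [2, 2, 0, 2, 3, 3, 0, 0, 4]

token_ids = ctc_collapse(frame_ids, blank_id=0)
text = "".join(char_list[i] for i in token_ids)
```

The blank between the two runs of index 2 is what keeps them as two separate characters after collapsing.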

calculate_wer(seqs_hat, seqs_true)[source]

Calculate sentence-level WER score.

Parameters

seqs_hat (list) -- prediction
seqs_true (list) -- reference

Returns

average sentence-level WER score

Return type

float

convert_to_char(ys_hat, ys_pad)[source]

Convert index to character.

Parameters

ys_hat (paddle.Tensor) -- prediction (batch, seqlen)
ys_pad (paddle.Tensor) -- reference (batch, seqlen)

Returns

token list of prediction, token list of reference

Return type

list, list

paddlespeech.s2t.utils.error_rate.cer(reference, hypothesis, ignore_case=False, remove_space=False)[source]

Calculate character error rate (CER). CER compares the reference text and the hypothesis text at the character level. CER is defined as:

\[CER = (Sc + Dc + Ic) / Nc\]

where

Sc is the number of characters substituted,
Dc is the number of characters deleted,
Ic is the number of characters inserted,
Nc is the number of characters in the reference

The Levenshtein distance can be used to calculate CER. Chinese input should be encoded as Unicode. Note that leading and trailing space characters are stripped, and multiple consecutive space characters within a sentence are replaced by a single space character.

Parameters

reference (str) -- The reference sentence.
hypothesis (str) -- The hypothesis sentence.
ignore_case (bool) -- Whether to ignore case when comparing.
remove_space (bool) -- Whether to remove internal space characters.

Returns

Character error rate.

Return type

float

Raises

ValueError -- If the reference length is zero.
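The definition above can be sketched in plain Python. This is a minimal illustration of the formula and of the documented whitespace handling, not the module's own implementation; `levenshtein` and `cer_sketch` are hypothetical names introduced here.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists)."""
    # prev[j] is the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_sketch(reference, hypothesis, ignore_case=False, remove_space=False):
    """Illustrative CER: (Sc + Dc + Ic) / Nc via Levenshtein distance."""
    if ignore_case:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    # Strip leading/trailing spaces and collapse runs of spaces;
    # with remove_space=True, internal spaces are dropped entirely.
    joiner = "" if remove_space else " "
    reference = joiner.join(w for w in reference.split(" ") if w)
    hypothesis = joiner.join(w for w in hypothesis.split(" ") if w)
    if len(reference) == 0:
        raise ValueError("Length of reference should be greater than 0.")
    return levenshtein(reference, hypothesis) / len(reference)


print(cer_sketch("kitten", "sitting"))  # 3 edits over 6 reference characters
```

Because the denominator is the reference length, a hypothesis with many insertions can yield a CER greater than 1.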

paddlespeech.s2t.utils.error_rate.char_errors(reference, hypothesis, ignore_case=False, remove_space=False)[source]

Compute the Levenshtein distance between the reference sequence and the hypothesis sequence at the character level.

Parameters

reference (str) -- The reference sentence.
hypothesis (str) -- The hypothesis sentence.
ignore_case (bool) -- Whether to ignore case when comparing.
remove_space (bool) -- Whether to remove internal space characters.

Returns

Levenshtein distance and length of reference sentence.

Return type

list

paddlespeech.s2t.utils.error_rate.wer(reference, hypothesis, ignore_case=False, delimiter=' ')[source]

Calculate word error rate (WER). WER compares the reference text and the hypothesis text at the word level. WER is defined as:

\[WER = (Sw + Dw + Iw) / Nw\]

where

Sw is the number of words substituted,
Dw is the number of words deleted,
Iw is the number of words inserted,
Nw is the number of words in the reference

The Levenshtein distance can be used to calculate WER. Note that empty items are removed when sentences are split by the delimiter.

Parameters

reference (str) -- The reference sentence.
hypothesis (str) -- The hypothesis sentence.
ignore_case (bool) -- Whether to ignore case when comparing.
delimiter (char) -- Delimiter of input sentences.

Returns

Word error rate.

Return type

float

Raises

ValueError -- If the number of words in the reference is zero.
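As a worked example of the WER definition (the sentences are chosen here for illustration), take the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat". One word is substituted ("sat" becomes "sit") and one is deleted (the second "the"), so Sw = 1, Dw = 1, Iw = 0 and Nw = 6:

\[WER = (1 + 1 + 0) / 6 = 1/3 \approx 0.333\]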

paddlespeech.s2t.utils.error_rate.word_errors(reference, hypothesis, ignore_case=False, delimiter=' ')[source]

Compute the Levenshtein distance between the reference sequence and the hypothesis sequence at the word level.

Parameters

reference (str) -- The reference sentence.
hypothesis (str) -- The hypothesis sentence.
ignore_case (bool) -- Whether to ignore case when comparing.
delimiter (char) -- Delimiter of input sentences.

Returns

Levenshtein distance and number of words in the reference sentence.

Return type

list
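Returning the distance together with the reference word count makes it possible to pool errors over several utterances before dividing. The sketch below illustrates that idea in plain Python under stated assumptions: it is not the module's implementation, `edit_distance` and `word_errors_sketch` are hypothetical names, and the sentence pairs are made up for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_errors_sketch(reference, hypothesis, ignore_case=False, delimiter=" "):
    """Return (edit distance, number of reference words)."""
    if ignore_case:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    # Empty items produced by repeated delimiters are discarded.
    ref_words = [w for w in reference.split(delimiter) if w]
    hyp_words = [w for w in hypothesis.split(delimiter) if w]
    return edit_distance(ref_words, hyp_words), len(ref_words)


# Hypothetical (reference, hypothesis) pairs.
pairs = [
    ("the cat sat on the mat", "the cat sit on mat"),
    ("hello world", "hello world"),
]

total_errors = 0
total_words = 0
for ref, hyp in pairs:
    errors, n_words = word_errors_sketch(ref, hyp)
    total_errors += errors
    total_words += n_words

# Pooled WER over all pairs: total edits divided by total reference words.
corpus_wer = total_errors / total_words
```

Pooling weights each utterance by its length, which differs from averaging per-sentence WER values when sentence lengths vary.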