The present invention relates to a speech or voice synthesis apparatus and system which, in response to a remark, question or utterance made by voice input, provide replying output, as well as a coding/decoding device related to the voice synthesis.
In recent years, various voice synthesis techniques have been proposed. Examples include a technique that synthesizes and outputs voice matched to the speaking tone and voice quality of a user, thereby generating voice in a more human-like manner (see, for example, Patent Literature 1), and a technique that analyzes the voice of a user to diagnose the user's psychological state, health state, etc. (see, for example, Patent Literature 2).
Also proposed in recent years is a voice interaction or dialogue system which implements voice interaction with a user by outputting, in synthesized voice, content designated by a scenario while recognizing voice input by the user (see, for example, Patent Literature 3).
In view of the foregoing, it is an object of the present invention to realize, in a technique for responding to a question or remark by use of voice synthesis, synthesis of responsive or replying voice capable of giving a natural feeling to a user. More specifically, the present invention seeks to provide a technique which can easily and controllably realize replying voice that gives a good impression to the user, replying voice that gives a bad impression, etc.
In studying a man-machine system which synthesizes voice of a reply to a question (or remark) given by a user, the present inventors first considered what kinds of dialogues are actually conducted between persons, focusing on non-linguistic information (i.e., non-verbal information other than verbal or linguistic information) and particularly on the pitches (frequencies) characterizing dialogues.
Here, consider a dialogue between persons where one of the persons (hereinafter “person b”) returns a reply to a question given by the other person (hereinafter “person a”). Often, in such a case, when person a has uttered the question, not only person a but also person b, who is going to reply to the question, retains a strong impression of the pitch of a given segment of the question. In returning a reply with a meaning of agreement, approval, affirmation or the like, person b utters replying voice in such a manner that the pitch of a portion characterizing the reply, such as the word ending or word beginning of the reply, assumes a predetermined relationship, more specifically a consonant interval relationship, with the pitch of the question that left the impression. The inventors thought that, because the pitch of the question that left an impression in the mind of person a and the pitch of the portion characterizing the reply of person b are in the above-mentioned relationship, person a would have a comfortable, reassuring and good impression of the reply of person b.
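The pitch relationship described above can be illustrated with a small numerical sketch. This passage does not say which consonant interval is used or in which direction, so the interval table (just-intonation frequency ratios), the default of a perfect fifth below, the tolerance, and the function names `reply_pitch_hz` and `is_consonant` are all assumptions made for illustration, not the method of the invention.

```python
from fractions import Fraction

# Frequency ratios of intervals commonly treated as consonant (just intonation).
# ASSUMPTION: the passage only says "consonant interval"; this table is illustrative.
CONSONANT_RATIOS = {
    "unison": Fraction(1, 1),
    "minor_third": Fraction(6, 5),
    "major_third": Fraction(5, 4),
    "perfect_fourth": Fraction(4, 3),
    "perfect_fifth": Fraction(3, 2),
    "octave": Fraction(2, 1),
}


def reply_pitch_hz(question_pitch_hz, interval="perfect_fifth", below=True):
    """Return a target pitch (Hz) for the portion characterizing the reply.

    question_pitch_hz: pitch of the impressive segment of the question.
    interval, below:   hypothetical parameters choosing the consonant interval
                       and whether the reply lies below or above the question.
    """
    ratio = float(CONSONANT_RATIOS[interval])
    return question_pitch_hz / ratio if below else question_pitch_hz * ratio


def is_consonant(pitch_a_hz, pitch_b_hz, tolerance=0.01):
    """Check whether two pitches are within a relative tolerance of a listed ratio.

    Only intervals up to one octave are compared; wider intervals are not folded.
    """
    high, low = max(pitch_a_hz, pitch_b_hz), min(pitch_a_hz, pitch_b_hz)
    observed = high / low
    return any(
        abs(observed - float(r)) / float(r) <= tolerance
        for r in CONSONANT_RATIOS.values()
    )


if __name__ == "__main__":
    question_end = 300.0  # Hz, example value only
    target = reply_pitch_hz(question_end)
    print(target, is_consonant(question_end, target))
```

Under these assumptions, a question whose characteristic segment sits at 300 Hz would give a reply target of 200 Hz, a perfect fifth below. A real system would also need pitch extraction from the input voice signal, which this sketch leaves out.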
Further, people have communicated with one another since ancient times, when there was no language. It is presumed that the pitch and volume of the human voice played a very important role in human communication in such an environment. It is also presumed that, although voice-pitch-based communication has been forgotten in modern times, now that languages have developed, the “predetermined pitch relationship” used since ancient times can give a “somehow comfortable” feel because such a predetermined pitch relationship has been inscribed in human DNA and handed down to the present.
According to such an embodiment of the invention, it is possible to prevent the voice of the reply, synthesized in response to the input voice signal of a question (remark), from being accompanied by an unnatural feel. Note that the reply to the question (remark) is not limited to a specific or concrete reply and may sometimes be in the form of back-channel feedback (interjection), such as “ee” (romanized Japanese meaning “Yah.”), “naruhodo” (“I see.”) or “sou desune” (“I agree.”). Further, the reply is not limited to one in human voice and may sometimes be in the form of voice of an animal, such as “wan” (“bowwow”) or “nyâ” (“meow”). Namely, the terms “reply” and “voice” are used herein to refer to concepts embracing not only voice uttered by a person but also voice of an animal.