add: absorb soprano_to_rvc as regular subdirectory
Voice conversion pipeline (Soprano TTS → RVC) with Docker support. Previously tracked as bare gitlink; removed .git/ directories and absorbed into main repo for unified tracking. Includes: Soprano TTS, RVC WebUI integration, Docker configs, WebSocket API, and benchmark scripts. Updated .gitignore to exclude large model weights (*.pth, *.pt, *.onnx, *.index). 287 files (3.1GB of ML weights properly excluded via gitignore).
11
soprano_to_rvc/soprano/.gitignore
vendored
Normal file
@@ -0,0 +1,11 @@
__pycache__/
test.py
*.wav
dist/
*.egg-info/
.venv/
.task/
BACKEND-SELECTION.md
Taskfile.yml
benchmark.py
examples-backend.py
201
soprano_to_rvc/soprano/LICENSE
Normal file
@@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
222
soprano_to_rvc/soprano/README.md
Normal file
@@ -0,0 +1,222 @@
<!-- Version 0.1.0 -->
<div align="center">

# Soprano: Instant, Ultra‑Realistic Text‑to‑Speech

[](https://huggingface.co/ekwek/Soprano-1.1-80M)
[](https://huggingface.co/spaces/ekwek/Soprano-TTS)

<img width="640" height="320" alt="soprano-github" src="https://github.com/user-attachments/assets/4d612eac-23b8-44e6-8c59-d7ac14ebafd1" />
</div>

### 📰 News
**2026.01.14 - [Soprano-1.1-80M](https://huggingface.co/ekwek/Soprano-1.1-80M) released! 95% fewer hallucinations and a 63% preference rate over Soprano-80M.**
2026.01.13 - [Soprano-Factory](https://github.com/ekwek1/soprano-factory) released! You can now train/fine-tune your own Soprano models.
2025.12.22 - Soprano-80M released! [Model](https://huggingface.co/ekwek/Soprano-80M) | [Demo](https://huggingface.co/spaces/ekwek/Soprano-TTS)

---

## Overview

**Soprano** is an ultra‑lightweight, on-device text‑to‑speech (TTS) model designed for expressive, high‑fidelity speech synthesis at unprecedented speed. Soprano was designed with the following features:
- Up to **20x** real-time generation on CPU and **2000x** real-time on GPU
- **Lossless streaming** with **<250 ms** latency on CPU, **<15 ms** on GPU
- **<1 GB** memory usage with a compact 80M parameter architecture
- **Infinite generation length** with automatic text splitting
- Highly expressive, crystal clear audio generation at **32kHz**
- Widespread support for CUDA, CPU, and MPS devices on Windows, Linux, and Mac
- Supports WebUI, CLI, and OpenAI-compatible endpoint for easy and production-ready inference
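
To make the "20x real-time" figure concrete: a real-time factor is the duration of the generated audio divided by the wall-clock time spent generating it. The sketch below only illustrates that arithmetic. The helper name `real_time_factor` is hypothetical and not part of Soprano, and the 32 kHz default is taken from the sample rate listed above.

```python
def real_time_factor(num_samples: int, generation_seconds: float,
                     sample_rate: int = 32_000) -> float:
    """Return seconds of audio produced per second of compute.

    Hypothetical helper for illustration; not part of the Soprano API.
    """
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / generation_seconds


# Example: 10 s of 32 kHz audio (320,000 samples) generated in 0.5 s.
print(real_time_factor(num_samples=320_000, generation_seconds=0.5))
```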

https://github.com/user-attachments/assets/525cf529-e79e-4368-809f-6be620852826

---

## Table of Contents

- [Installation](#installation)
- [Usage](#usage)
  - [WebUI](#webui)
  - [CLI](#cli)
  - [OpenAI-compatible endpoint](#openai-compatible-endpoint)
  - [Python script](#python-script)
- [Usage tips](#usage-tips)
- [Roadmap](#roadmap)

## Installation

### Install with wheel (CUDA)

```bash
pip install soprano-tts[lmdeploy]
```

### Install with wheel (CPU/MPS)

```bash
pip install soprano-tts
```

To get the latest features, you can install from source instead.

### Install from source (CUDA)

```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .[lmdeploy]
```

### Install from source (CPU/MPS)

```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
```

> ### ⚠️ Warning: Windows CUDA users
>
> On Windows with CUDA, `pip` will install a CPU-only PyTorch build. To ensure CUDA support works as expected, reinstall PyTorch explicitly with the correct CUDA wheel **after** installing Soprano:
>
> ```bash
> pip uninstall -y torch
> pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
> ```
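
After reinstalling, it can be useful to confirm which PyTorch build is actually active. The following is a minimal sketch, not part of Soprano: the function name `describe_torch_build` is hypothetical, and it relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
def describe_torch_build() -> str:
    """Summarize the installed PyTorch build (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"

    cuda_build = torch.version.cuda  # None when the wheel has no CUDA support
    if cuda_build is None:
        return f"torch {torch.__version__} has no CUDA support"
    if not torch.cuda.is_available():
        return (f"torch {torch.__version__} was built for CUDA {cuda_build}, "
                "but no usable GPU was found")
    return f"torch {torch.__version__} with CUDA {cuda_build} is ready"


print(describe_torch_build())
```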

---

## Usage

### WebUI

Start WebUI:

```bash
soprano-webui # hosted on http://127.0.0.1:7860 by default
```
> **Tip:** You can increase cache size and decoder batch size to increase inference speed at the cost of higher memory usage. For example:
> ```bash
> soprano-webui --cache-size 1000 --decoder-batch-size 4
> ```

### CLI

```
soprano "Soprano is an extremely lightweight text to speech model."

optional arguments:
  --output, -o                Output audio file path (non-streaming only). Defaults to 'output.wav'
  --model-path, -m            Path to local model directory (optional)
  --device, -d                Device to use for inference. Supported: auto, cuda, cpu, mps. Defaults to 'auto'
  --backend, -b               Backend to use for inference. Supported: auto, transformers, lmdeploy. Defaults to 'auto'
  --cache-size, -c            Cache size in MB (for lmdeploy backend). Defaults to 100
  --decoder-batch-size, -bs   Decoder batch size. Defaults to 1
  --streaming, -s             Enable streaming playback to speakers
```
> **Tip:** You can increase cache size and decoder batch size to increase inference speed at the cost of higher memory usage.

> **Note:** The CLI will reload the model every time it is called. As a result, inference speed will be slower than other methods.
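
For the cache-size tip above, it may help to see how a megabyte budget relates to GPU memory. The lmdeploy backend added in this commit converts `cache_size_mb` into a fraction of total device memory before passing it to `TurbomindEngineConfig(cache_max_entry_count=...)`. The sketch below reproduces only that arithmetic; the function name `cache_ratio` and the 24 GiB device size are illustrative assumptions.

```python
def cache_ratio(cache_size_mb: float, total_memory_bytes: int) -> float:
    """Fraction of device memory taken by a cache of `cache_size_mb` MiB.

    Mirrors the expression cache_size_mb * 1024**2 / total_memory used by
    the lmdeploy backend; the function itself is a hypothetical helper.
    """
    if total_memory_bytes <= 0:
        raise ValueError("total_memory_bytes must be positive")
    return cache_size_mb * 1024**2 / total_memory_bytes


total = 24 * 1024**3  # assumed 24 GiB GPU, for illustration only
for mb in (100, 1000):
    print(f"{mb} MB -> {cache_ratio(mb, total):.4f} of device memory")
```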

### OpenAI-compatible endpoint

Start server:

```bash
uvicorn soprano.server:app --host 0.0.0.0 --port 8000
```

Use the endpoint like this:

```bash
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Soprano is an extremely lightweight text to speech model."
  }' \
  --output speech.wav
```

> **Note:** Currently, this endpoint only supports nonstreaming output.
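
The same request can be issued from Python with only the standard library. This is a sketch under the assumptions of the `curl` example above: the server is running on `localhost:8000`, the JSON body needs only an `input` field, and the response body is the audio file. The helper names `build_speech_request` and `synthesize_to_file` are hypothetical.

```python
import json
import urllib.request


def build_speech_request(text: str,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for /v1/audio/speech (hypothetical helper)."""
    payload = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str = "speech.wav") -> None:
    """Send the request and write the response bytes to `path`."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request) as response, open(path, "wb") as f:
        f.write(response.read())


# Requires the server from the previous step to be running:
# synthesize_to_file("Soprano is an extremely lightweight text to speech model.")
```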

### Python script

```python
from soprano import SopranoTTS

model = SopranoTTS(backend='auto', device='auto', cache_size_mb=100, decoder_batch_size=1)
```

> **Tip:** You can increase cache_size_mb and decoder_batch_size to increase inference speed at the cost of higher memory usage.

```python
# Basic inference
out = model.infer("Soprano is an extremely lightweight text to speech model.") # can achieve 2000x real-time with sufficiently long input!

# Save output to a file
out = model.infer("Soprano is an extremely lightweight text to speech model.", "out.wav")

# Custom sampling parameters
out = model.infer(
    "Soprano is an extremely lightweight text to speech model.",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)


# Batched inference
out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10) # can achieve 2000x real-time with sufficiently large input size!

# Save batch outputs to a directory
out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10, "/dir")


# Streaming inference
from soprano.utils.streaming import play_stream
stream = model.infer_stream("Soprano is an extremely lightweight text to speech model.", chunk_size=1)
play_stream(stream) # plays audio with <15 ms latency!
```
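
To check streaming latency on your own hardware, one option is to time how long the first chunk takes to arrive. The sketch below assumes only that `infer_stream` returns an iterable of audio chunks, which the streaming example above suggests. `first_chunk_latency` and `fake_stream` are hypothetical names, and the fake stream stands in for a real model so the snippet runs without Soprano installed. The measured value excludes audio-device latency.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def first_chunk_latency(stream: Iterable[T]) -> Tuple[float, Iterator[T]]:
    """Return (seconds until the first chunk, iterator over all chunks)."""
    iterator = iter(stream)
    start = time.perf_counter()
    try:
        first = next(iterator)
    except StopIteration:
        # Empty stream: report the elapsed time and an empty iterator.
        return time.perf_counter() - start, iter(())
    latency = time.perf_counter() - start

    def replay() -> Iterator[T]:
        # Re-emit the consumed first chunk, then the rest of the stream.
        yield first
        yield from iterator

    return latency, replay()


def fake_stream() -> Iterator[int]:
    # Stand-in for model.infer_stream(...); yields three dummy chunks.
    for i in range(3):
        time.sleep(0.01)
        yield i


latency, chunks = first_chunk_latency(fake_stream())
print(f"Time to first chunk: {latency * 1000:.2f} ms")
print(list(chunks))
```

With a real model, the returned iterator could be passed on to `play_stream` in place of the original stream.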

## Usage tips:

* Soprano works best when each sentence is between 2 and 30 seconds long.
* Although Soprano recognizes numbers and some special characters, it occasionally mispronounces them. For best results, spell them out as words (for example, "1+1" becomes "one plus one").
* If Soprano produces unsatisfactory results, regenerate the audio for a new, potentially better result. You can also change the sampling settings for more varied output.
* Avoid improper grammar, such as omitting contractions or using multiple consecutive spaces.
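
The tips above on numbers, special characters, and repeated spaces can be applied as a small preprocessing step before text reaches the model. This is a deliberately minimal sketch, not Soprano's own normalization: the name `normalize_for_tts` and the symbol table are hypothetical, and digits are spelled out one at a time, so "42" becomes "four two". A library such as `inflect`, which is listed in this project's dependencies, could be used for full number-to-words conversion.

```python
import re

# Hypothetical, deliberately tiny vocabulary; extend as needed.
_SYMBOLS = {"+": "plus", "=": "equals", "%": "percent", "&": "and"}
_DIGITS = ["zero", "one", "two", "three", "four",
           "five", "six", "seven", "eight", "nine"]


def normalize_for_tts(text: str) -> str:
    """Spell out digits and a few symbols, then collapse whitespace."""
    # Replace each digit with its word, padded so words do not run together.
    text = re.sub(r"\d", lambda m: f" {_DIGITS[int(m.group())]} ", text)
    # Replace known symbols with their spoken form.
    for symbol, word in _SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_for_tts("1+1  =  2"))
```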

---

## Roadmap

* [x] Add model and inference code
* [x] Seamless streaming
* [x] Batched inference
* [x] Command-line interface (CLI)
* [x] CPU support
* [x] Server / API inference
* [ ] ROCm support (see [#29](/../../issues/29))
* [ ] Additional LLM backends
* [ ] Voice cloning
* [ ] Multilingual support

---

## Limitations

Soprano is currently English-only and does not support voice cloning. In addition, Soprano was trained on only 1,000 hours of audio (~100x less than other TTS models), so mispronunciation of uncommon words may occur. This is expected to diminish as Soprano is trained on more data.

---

## Acknowledgements

Soprano uses and/or is inspired by the following projects:

* [Vocos](https://github.com/gemelo-ai/vocos)
* [XTTS](https://github.com/coqui-ai/TTS)
* [LMDeploy](https://github.com/InternLM/lmdeploy)

---

## License

This project is licensed under the **Apache-2.0** license. See `LICENSE` for details.
94
soprano_to_rvc/soprano/examples/streaming.ipynb
Normal file
@@ -0,0 +1,94 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "aaf83d46",
"metadata": {},
"outputs": [],
"source": [
"from soprano import SopranoTTS\n",
"from soprano.utils.streaming import play_stream"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a4d1065",
"metadata": {},
"outputs": [],
"source": [
"tts = SopranoTTS()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91417abb",
"metadata": {},
"outputs": [],
"source": [
"text = \"\"\"In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.\n",
"\n",
"\"Whenever you feel like criticizing any one,\" he told me, \"just remember that all the people in this world haven't had the advantages that you've had.\"\n",
"\n",
"He didn't say any more but we've always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence I'm inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought—frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon—for the intimate revelations of young men or at least the terms in which they express them are usually plagiaristic and marred by obvious suppressions. Reserving judgments is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat, a sense of the fundamental decencies is parcelled out unequally at birth.\n",
"\n",
"And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes but after a certain point I don't care what it's founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. Only Gatsby, the man who gives his name to this book, was exempt from my reaction—Gatsby who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the \"creative temperament\"—it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No—Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and short-winded elations of men.\n",
"\n",
"My family have been prominent, well-to-do people in this middle-western city for three generations. The Carraways are something of a clan and we have a tradition that we're descended from the Dukes of Buccleuch, but the actual founder of my line was my grandfather's brother who came here in fifty-one, sent a substitute to the Civil War and started the wholesale hardware business that my father carries on today.\n",
"\n",
"I never saw this great-uncle but I'm supposed to look like him—with special reference to the rather hard-boiled painting that hangs in Father's office. I graduated from New Haven in 1915, just a quarter of a century after my father, and a little later I participated in that delayed Teutonic migration known as the Great War. I enjoyed the counter-raid so thoroughly that I came back restless. Instead of being the warm center of the world the middle-west now seemed like the ragged edge of the universe—so I decided to go east and learn the bond business. Everybody I knew was in the bond business so I supposed it could support one more single man. All my aunts and uncles talked it over as if they were choosing a prep-school for me and finally said, \"Why—ye-es\" with very grave, hesitant faces. Father agreed to finance me for a year and after various delays I came east, permanently, I thought, in the spring of twenty-two.\"\"\"\n",
"chunk_size = 1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4917a8c9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Streaming latency: 21.01 ms\n"
]
}
],
"source": [
"stream = tts.infer_stream(text, chunk_size=chunk_size)\n",
"play_stream(stream)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78439cad",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "soprano-env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
43
soprano_to_rvc/soprano/pyproject.toml
Normal file
@@ -0,0 +1,43 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "soprano-tts"
version = "0.1.0"
authors = [
    { name="ekwek1", email="eugene.kwek.1@gmail.com" },
]
description = "Soprano: Instant, Ultra‑Realistic Text‑to‑Speech"
readme = "README.md"
requires-python = ">=3.10"
classifiers = [
    "Programming Language :: Python :: 3",
    "Operating System :: OS Independent",
]
dependencies = [
    "accelerate",
    "fastapi",
    "gradio",
    "huggingface_hub",
    "numpy",
    "scipy",
    "sounddevice",
    "torch>=2.1.0",
    "transformers>=4.51.0",
    "unidecode",
    "uvicorn",
    "inflect"
]
license = {file = "LICENSE"}

[project.optional-dependencies]
lmdeploy = ["lmdeploy"]

[project.urls]
Homepage = "https://github.com/ekwek1/soprano"
Issues = "https://github.com/ekwek1/soprano/issues"

[project.scripts]
soprano = "soprano.cli:main"
soprano-webui = "soprano.webui:main"
1
soprano_to_rvc/soprano/soprano/__init__.py
Normal file
@@ -0,0 +1 @@
from .tts import SopranoTTS
20
soprano_to_rvc/soprano/soprano/backends/base.py
Normal file
@@ -0,0 +1,20 @@
class BaseModel:
    def infer(self,
              prompts,
              top_p=0.95,
              temperature=0.3,
              repetition_penalty=1.2):
        '''
        Takes a list of prompts and returns the output hidden states
        '''
        pass

    def stream_infer(self,
                     prompt,
                     top_p=0.95,
                     temperature=0.3,
                     repetition_penalty=1.2):
        '''
        Takes a prompt and returns an iterator of the output hidden states
        '''
        pass
59
soprano_to_rvc/soprano/soprano/backends/lmdeploy.py
Normal file
@@ -0,0 +1,59 @@
import torch
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from .base import BaseModel


class LMDeployModel(BaseModel):
    def __init__(self,
                 device='cuda',
                 cache_size_mb=100,
                 model_path=None,
                 **kwargs):
        assert device == 'cuda', "lmdeploy only supports cuda devices, consider changing device or using a different backend instead."
        cache_size_ratio = cache_size_mb * 1024**2 / torch.cuda.get_device_properties('cuda').total_memory
        backend_config = TurbomindEngineConfig(cache_max_entry_count=cache_size_ratio)

        # Use local model if path provided, otherwise use HuggingFace
        model_name_or_path = model_path if model_path else 'ekwek/Soprano-1.1-80M'

        self.pipeline = pipeline(model_name_or_path,
                                 log_level='ERROR',
                                 backend_config=backend_config)

    def infer(self,
              prompts,
              top_p=0.95,
              temperature=0.3,
              repetition_penalty=1.2):
        gen_config=GenerationConfig(output_last_hidden_state='generation',
                                    do_sample=True,
                                    top_p=top_p,
                                    temperature=temperature,
                                    repetition_penalty=repetition_penalty,
                                    max_new_tokens=512)
        responses = self.pipeline(prompts, gen_config=gen_config)
        res = []
        for response in responses:
            res.append({
                'finish_reason': response.finish_reason,
                'hidden_state': response.last_hidden_state
            })
        return res

    def stream_infer(self,
                     prompt,
                     top_p=0.95,
                     temperature=0.3,
                     repetition_penalty=1.2):
        gen_config=GenerationConfig(output_last_hidden_state='generation',
                                    do_sample=True,
                                    top_p=top_p,
                                    temperature=temperature,
                                    repetition_penalty=repetition_penalty,
                                    max_new_tokens=512)
        responses = self.pipeline.stream_infer([prompt], gen_config=gen_config)
        for response in responses:
            yield {
                'finish_reason': response.finish_reason,
                'hidden_state': response.last_hidden_state
            }
154
soprano_to_rvc/soprano/soprano/backends/transformers.py
Normal file
@@ -0,0 +1,154 @@
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from transformers import LogitsProcessorList, RepetitionPenaltyLogitsProcessor, TemperatureLogitsWarper, TopPLogitsWarper
|
||||
from .base import BaseModel
|
||||
|
||||
|
||||
class TransformersModel(BaseModel):
|
||||
def __init__(self,
|
||||
device='cuda',
|
||||
model_path=None,
|
||||
**kwargs):
|
||||
self.device = device
|
||||
|
||||
# Use local model if path provided, otherwise use HuggingFace
|
||||
model_name_or_path = model_path if model_path else 'ekwek/Soprano-1.1-80M'
|
||||
|
||||
self.model = AutoModelForCausalLM.from_pretrained(
|
||||
model_name_or_path,
|
||||
dtype=torch.bfloat16 if device == 'cuda' else torch.float32,
|
||||
device_map=device
|
||||
)
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
|
||||
self.model.eval()
|
||||
|
||||
def infer(self,
|
||||
prompts,
|
||||
top_p=0.95,
|
||||
temperature=0.3,
|
||||
repetition_penalty=1.2):
|
||||
if temperature <= 0.0:
|
||||
temperature = 0.001 # temp must be nonzero
|
||||
inputs = self.tokenizer(
|
||||
prompts,
|
||||
return_tensors='pt',
|
||||
padding=True,
|
||||
truncation=True,
|
||||
max_length=512,
|
||||
).to(self.device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = self.model.generate(
|
||||
input_ids=inputs['input_ids'],
|
||||
attention_mask=inputs['attention_mask'],
|
||||
max_new_tokens=512,
|
||||
do_sample=True,
|
||||
top_p=top_p,
|
||||
temperature=temperature,
|
||||
repetition_penalty=repetition_penalty,
|
||||
pad_token_id=self.tokenizer.pad_token_id,
|
||||
return_dict_in_generate=True,
|
||||
output_hidden_states=True,
|
||||
)
|
||||
res = []
|
||||
eos_token_id = self.model.config.eos_token_id
|
||||
for i in range(len(prompts)):
|
||||
seq = outputs.sequences[i]
|
||||
hidden_states = []
|
||||
num_output_tokens = len(outputs.hidden_states)
|
||||
for j in range(num_output_tokens):
|
||||
token = seq[j + seq.size(0) - num_output_tokens]
|
||||
if token != eos_token_id: hidden_states.append(outputs.hidden_states[j][-1][i, -1, :])
|
||||
last_hidden_state = torch.stack(hidden_states).squeeze()
|
||||
finish_reason = 'stop' if seq[-1].item() == eos_token_id else 'length'
|
||||
res.append({
|
||||
'finish_reason': finish_reason,
|
||||
'hidden_state': last_hidden_state
|
||||
})
|
||||
return res
|
||||
|
||||
def stream_infer(self,
|
||||
prompt,
|
||||
top_p=0.95,
|
||||
temperature=0.3,
|
||||
repetition_penalty=1.2):
|
||||
if temperature <= 0.0:
|
||||
temperature = 0.001 # temp must be nonzero
|
||||
|
||||
# Tokenize input
|
||||
inputs = self.tokenizer(prompt, return_tensors='pt').to(self.device)
|
||||
input_ids = inputs['input_ids']
|
||||
|
||||
# Prepare Logits Processors for sampling
|
||||
logits_processor = LogitsProcessorList()
|
||||
if repetition_penalty != 1.0:
|
||||
logits_processor.append(RepetitionPenaltyLogitsProcessor(penalty=repetition_penalty))
|
||||
|
||||
logits_warper = LogitsProcessorList()
|
||||
if temperature != 1.0:
|
||||
logits_warper.append(TemperatureLogitsWarper(temperature=temperature))
|
||||
if top_p < 1.0:
|
||||
logits_warper.append(TopPLogitsWarper(top_p=top_p))
|
||||
|
||||
# Helper to sample next token
|
||||
def get_next_token(logits, input_seq):
|
||||
scores = logits_processor(input_seq, logits)
|
||||
scores = logits_warper(input_seq, scores)
|
||||
probs = torch.nn.functional.softmax(scores, dim=-1)
|
||||
# Sample from the distribution
|
||||
return torch.multinomial(probs, num_samples=1)
|
||||
|
||||
with torch.no_grad():
|
||||
# Initial forward pass with the prompt
|
||||
outputs = self.model(
|
||||
input_ids,
|
||||
use_cache=True,
|
||||
output_hidden_states=True
|
||||
)
|
||||
|
||||
past_key_values = outputs.past_key_values
|
||||
next_token_logits = outputs.logits[:, -1, :]
|
||||
|
||||
# We need to maintain the full sequence for repetition penalty
|
||||
generated_ids = input_ids
|
||||
|
||||
# Sample the first token
|
||||
next_token = get_next_token(next_token_logits, generated_ids)
|
||||
|
||||
max_new_tokens = 512
|
||||
eos_token_id = self.model.config.eos_token_id
|
||||
|
||||
for i in range(max_new_tokens):
|
||||
# Append generated token to sequence history
|
||||
generated_ids = torch.cat([generated_ids, next_token], dim=-1)
|
||||
|
||||
# Run forward pass for the single new token
|
||||
outputs = self.model(
|
||||
next_token,
|
||||
past_key_values=past_key_values,
|
||||
use_cache=True,
|
||||
output_hidden_states=True
|
||||
)
|
||||
|
||||
# Update cache and get hidden state
|
||||
past_key_values = outputs.past_key_values
|
||||
current_hidden_state = outputs.hidden_states[-1][:, -1, :] # Last layer, last token
|
||||
|
||||
finish_reason = None
|
||||
if next_token.item() == eos_token_id:
|
||||
finish_reason = 'stop'
|
||||
elif i == max_new_tokens - 1:
|
||||
finish_reason = 'length'
|
||||
|
||||
# Yield result matching lmdeploy format
|
||||
yield {
|
||||
'finish_reason': finish_reason,
|
||||
'hidden_state': current_hidden_state
|
||||
}
|
||||
|
||||
if finish_reason:
|
||||
break
|
||||
|
||||
# Prepare for next iteration
|
||||
next_token_logits = outputs.logits[:, -1, :]
|
||||
next_token = get_next_token(next_token_logits, generated_ids)
|
||||
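The `get_next_token` helper above chains a repetition penalty, temperature scaling and top-p filtering before sampling. The following is a dependency-free sketch of that chain on plain Python lists. All function names are hypothetical, and the top-p cut here keeps the smallest high-probability prefix whose mass reaches `top_p`, which approximates rather than reproduces `TopPLogitsWarper`.

```python
import math
import random


def apply_repetition_penalty(logits, history, penalty):
    """Down-weight tokens already generated: divide positive logits, multiply negative ones."""
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def top_p_probs(logits, temperature, top_p):
    """Return {token: prob} restricted to the nucleus and renormalized."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    kept, mass = [], 0.0
    for tok in sorted(range(len(probs)), key=lambda i: -probs[i]):
        kept.append(tok)
        mass += probs[tok]
        if mass >= top_p:
            break
    return {tok: probs[tok] / mass for tok in kept}


def sample_next(logits, history, temperature=0.3, top_p=0.95, penalty=1.2, rng=None):
    """Sample one token id after penalty, temperature and top-p filtering."""
    rng = rng or random.Random(0)
    dist = top_p_probs(apply_repetition_penalty(logits, history, penalty), temperature, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# With a very small top_p only the most likely token survives the filter.
print(sample_next([1.0, 3.0, 2.0], history=[], top_p=0.1))  # 1
```

The penalty needs the full token history, which is why the streaming loop above keeps `generated_ids` alongside the key/value cache.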
49
soprano_to_rvc/soprano/soprano/cli.py
Normal file
@@ -0,0 +1,49 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Soprano TTS Command Line Interface
|
||||
"""
|
||||
import argparse
|
||||
from soprano import SopranoTTS
|
||||
from soprano.utils.streaming import play_stream
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Soprano Text-to-Speech CLI')
|
||||
parser.add_argument('text', help='Text to synthesize')
|
||||
parser.add_argument('--output', '-o', default='output.wav',
|
||||
help='Output audio file path (non-streaming only)')
|
||||
parser.add_argument('--model-path', '-m',
|
||||
help='Path to local model directory (optional)')
|
||||
parser.add_argument('--device', '-d', default='auto',
|
||||
choices=['auto', 'cuda', 'cpu', 'mps'],
|
||||
help='Device to use for inference')
|
||||
parser.add_argument('--backend', '-b', default='auto',
|
||||
choices=['auto', 'transformers', 'lmdeploy'],
|
||||
help='Backend to use for inference')
|
||||
parser.add_argument('--cache-size', '-c', type=int, default=100,
|
||||
help='Cache size in MB (for lmdeploy backend)')
|
||||
parser.add_argument('--decoder-batch-size', '-bs', type=int, default=1,
|
||||
help='Batch size when decoding audio')
|
||||
parser.add_argument('--streaming', '-s', action='store_true',
|
||||
help='Enable streaming playback to speakers')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Initialize TTS
|
||||
tts = SopranoTTS(
|
||||
backend=args.backend,
|
||||
device=args.device,
|
||||
cache_size_mb=args.cache_size,
|
||||
decoder_batch_size=args.decoder_batch_size,
|
||||
model_path=args.model_path
|
||||
)
|
||||
|
||||
print(f"Generating speech for: '{args.text}'")
|
||||
if args.streaming:
|
||||
stream = tts.infer_stream(args.text, chunk_size=1)
|
||||
play_stream(stream)
|
||||
else:
|
||||
tts.infer(args.text, out_path=args.output)
|
||||
print(f"Audio saved to: {args.output}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
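To see how the flags defined above resolve without loading a model, the parser can be exercised on an explicit argument list. This is a reduced sketch that mirrors only four of the CLI's arguments; `build_parser` is a hypothetical helper, not part of the package.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced copy of the CLI parser: text, output path, device, streaming."""
    parser = argparse.ArgumentParser(description="Soprano Text-to-Speech CLI")
    parser.add_argument("text", help="Text to synthesize")
    parser.add_argument("--output", "-o", default="output.wav")
    parser.add_argument("--device", "-d", default="auto",
                        choices=["auto", "cuda", "cpu", "mps"])
    parser.add_argument("--streaming", "-s", action="store_true")
    return parser


# Passing a list to parse_args avoids reading sys.argv.
args = build_parser().parse_args(["Hello there.", "-d", "cpu", "-s"])
print(args.device, args.streaming, args.output)  # cpu True output.wav
```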
47
soprano_to_rvc/soprano/soprano/server.py
Normal file
@@ -0,0 +1,47 @@
|
||||
import base64
|
||||
import io
|
||||
import json
|
||||
from typing import Generator
|
||||
|
||||
import numpy as np
|
||||
from fastapi import FastAPI, HTTPException
|
||||
from fastapi.responses import Response
|
||||
from scipy.io.wavfile import write
|
||||
from torch import Tensor
|
||||
|
||||
from soprano.tts import SopranoTTS
|
||||
|
||||
# Load model at startup
|
||||
tts = SopranoTTS(cache_size_mb = 100)
|
||||
|
||||
app = FastAPI(title="Soprano TTS API")
|
||||
|
||||
def _tensor_to_wav_bytes(tensor: Tensor) -> bytes:
|
||||
"""
|
||||
Convert a 1D fp32 torch tensor to a WAV byte stream.
|
||||
"""
|
||||
# convert to int16
|
||||
audio_int16 = (np.clip(tensor.numpy(), -1.0, 1.0) * 32767).astype(np.int16)
|
||||
|
||||
wav_io = io.BytesIO()
|
||||
write(wav_io, 32000, audio_int16) # 32kHz sample rate
|
||||
wav_io.seek(0)
|
||||
return wav_io.read()
|
||||
|
||||
|
||||
@app.post("/v1/audio/speech")
|
||||
async def create_speech(payload: dict):
|
||||
"""
|
||||
Minimal implementation of OpenAI's Speech endpoint.
|
||||
Fields:
|
||||
- input: string - text to synthesize
|
||||
- model, voice, etc. are accepted but ignored.
|
||||
- response_format: str - ignored; only WAV output is supported.
|
||||
"""
|
||||
text = payload.get("input")
|
||||
if not isinstance(text, str) or not text.strip():
|
||||
raise HTTPException(status_code=400, detail="`input` field must be a non-empty string.")
|
||||
|
||||
audio_tensor = tts.infer(text)
|
||||
wav_bytes = _tensor_to_wav_bytes(audio_tensor)
|
||||
return Response(content=wav_bytes, media_type="audio/wav", headers={"Content-Disposition": 'attachment; filename="speech.wav"'})
|
||||
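`_tensor_to_wav_bytes` above clips float samples to [-1, 1], scales them to 16-bit integers and wraps them in a WAV container at 32 kHz. The same steps can be sketched with only the standard library; `floats_to_wav_bytes` is a hypothetical name, and the standard `wave` module stands in for `scipy.io.wavfile` here.

```python
import io
import struct
import wave


def floats_to_wav_bytes(samples, sample_rate=32000):
    """Encode float samples in [-1, 1] as a mono 16-bit PCM WAV byte string."""
    # Clip, scale, and truncate toward zero, as the int16 cast does.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = int16
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue()


data = floats_to_wav_bytes([0.0, 2.0, -2.0])  # out-of-range values are clipped
print(data[:4])  # b'RIFF'
```

Clipping before scaling matters: without it, a sample of 2.0 would overflow the int16 range instead of saturating at 32767.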
237
soprano_to_rvc/soprano/soprano/tts.py
Normal file
@@ -0,0 +1,237 @@
|
||||
from .vocos.decoder import SopranoDecoder
|
||||
from .utils.text_normalizer import clean_text
|
||||
from .utils.text_splitter import split_and_recombine_text
|
||||
from .utils.auto_select import select_device, select_backend
|
||||
import torch
|
||||
import re
|
||||
from unidecode import unidecode
|
||||
from scipy.io import wavfile
|
||||
from huggingface_hub import hf_hub_download
|
||||
import os
|
||||
import time
|
||||
|
||||
|
||||
class SopranoTTS:
|
||||
"""
|
||||
Soprano Text-to-Speech model.
|
||||
|
||||
Args:
|
||||
backend: Backend to use for inference. Options:
|
||||
- 'auto' (default): Automatically select best backend. Tries lmdeploy first (fastest),
|
||||
falls back to transformers. CPU always uses transformers.
|
||||
- 'lmdeploy': Force use of LMDeploy (fastest, CUDA only)
|
||||
- 'transformers': Force use of HuggingFace Transformers (slower, all devices)
|
||||
device: Device to run inference on ('auto', 'cuda', 'cpu', 'mps')
|
||||
cache_size_mb: Cache size in MB for lmdeploy backend
|
||||
decoder_batch_size: Batch size for decoder
|
||||
"""
|
||||
def __init__(self,
|
||||
backend='auto',
|
||||
device='auto',
|
||||
cache_size_mb=100,
|
||||
decoder_batch_size=1,
|
||||
model_path=None):
|
||||
device = select_device(device=device)
|
||||
backend = select_backend(backend=backend, device=device)
|
||||
|
||||
if backend == 'lmdeploy':
|
||||
from .backends.lmdeploy import LMDeployModel
|
||||
self.pipeline = LMDeployModel(device=device, cache_size_mb=cache_size_mb, model_path=model_path)
|
||||
elif backend == 'transformers':
|
||||
from .backends.transformers import TransformersModel
|
||||
self.pipeline = TransformersModel(device=device, model_path=model_path)
|
||||
|
||||
self.device = device
|
||||
self.backend = backend
|
||||
self.decoder = SopranoDecoder().to(device)
|
||||
if model_path:
|
||||
decoder_path = os.path.join(model_path, 'decoder.pth')
|
||||
else:
|
||||
decoder_path = hf_hub_download(repo_id='ekwek/Soprano-1.1-80M', filename='decoder.pth')
|
||||
self.decoder.load_state_dict(torch.load(decoder_path, map_location=device))
|
||||
self.decoder_batch_size=decoder_batch_size
|
||||
self.RECEPTIVE_FIELD = 4 # Decoder receptive field
|
||||
self.TOKEN_SIZE = 2048 # Number of samples per audio token
|
||||
|
||||
self.infer("Hello world!") # warmup
|
||||
|
||||
def _preprocess_text(self, texts, min_length=30):
|
||||
'''
|
||||
Adds the prompt format and the sentence/part index to each sentence.
|
||||
Enforces a minimum sentence length by merging short sentences.
|
||||
'''
|
||||
res = []
|
||||
for text_idx, text in enumerate(texts):
|
||||
text = text.strip()
|
||||
cleaned_text = clean_text(text)
|
||||
sentences = split_and_recombine_text(cleaned_text)
|
||||
processed = []
|
||||
for sentence in sentences:
|
||||
processed.append({
|
||||
"text": sentence,
|
||||
"text_idx": text_idx,
|
||||
})
|
||||
|
||||
if min_length > 0 and len(processed) > 1:
|
||||
merged = []
|
||||
i = 0
|
||||
while i < len(processed):
|
||||
cur = processed[i]
|
||||
if len(cur["text"]) < min_length:
|
||||
if merged: merged[-1]["text"] = (merged[-1]["text"] + " " + cur["text"]).strip()
|
||||
else:
|
||||
if i + 1 < len(processed): processed[i + 1]["text"] = (cur["text"] + " " + processed[i + 1]["text"]).strip()
|
||||
else: merged.append(cur)
|
||||
else: merged.append(cur)
|
||||
i += 1
|
||||
processed = merged
|
||||
sentence_idxes = {}
|
||||
for item in processed:
|
||||
if item['text_idx'] not in sentence_idxes: sentence_idxes[item['text_idx']] = 0
|
||||
res.append((f'[STOP][TEXT]{item["text"]}[START]', item["text_idx"], sentence_idxes[item['text_idx']]))
|
||||
sentence_idxes[item['text_idx']] += 1
|
||||
return res
|
||||
|
||||
def hallucination_detector(self, hidden_state):
|
||||
'''
|
||||
Analyzes hidden states to find long runs of similar sequences.
|
||||
'''
|
||||
DIFF_THRESHOLD = 300 # consecutive hidden states whose summed absolute difference is below this count as similar
|
||||
MAX_RUNLENGTH = 16 # a hallucination is flagged once the similarity counter exceeds this value
|
||||
if len(hidden_state) <= MAX_RUNLENGTH: # hidden state not long enough
|
||||
return False
|
||||
aah_runlength = 0
|
||||
for i in range(len(hidden_state) - 1):
|
||||
current_sequences = hidden_state[i]
|
||||
next_sequences = hidden_state[i + 1]
|
||||
diffs = torch.abs(current_sequences - next_sequences)
|
||||
total_diff = diffs.sum(dim=0)
|
||||
if total_diff < DIFF_THRESHOLD:
|
||||
aah_runlength += 1
|
||||
elif aah_runlength > 0:
|
||||
aah_runlength -= 1
|
||||
if aah_runlength > MAX_RUNLENGTH:
|
||||
return True
|
||||
return False
|
||||
|
||||
def infer(self,
|
||||
text,
|
||||
out_path=None,
|
||||
top_p=0.95,
|
||||
temperature=0.0,
|
||||
repetition_penalty=1.2,
|
||||
retries=0):
|
||||
results = self.infer_batch([text],
|
||||
top_p=top_p,
|
||||
temperature=temperature,
|
||||
repetition_penalty=repetition_penalty,
|
||||
out_dir=None,
|
||||
retries=retries)[0]
|
||||
if out_path:
|
||||
wavfile.write(out_path, 32000, results.cpu().numpy())
|
||||
return results
|
||||
|
||||
def infer_batch(self,
|
||||
texts,
|
||||
out_dir=None,
|
||||
top_p=0.95,
|
||||
temperature=0.0,
|
||||
repetition_penalty=1.2,
|
||||
retries=0):
|
||||
sentence_data = self._preprocess_text(texts)
|
||||
prompts = list(map(lambda x: x[0], sentence_data))
|
||||
hidden_states = [None] * len(prompts)
|
||||
pending_indices = list(range(0, len(prompts)))
|
||||
tries_left = 1 + max(0, retries)
|
||||
while tries_left > 0 and pending_indices:
|
||||
current_prompts = [prompts[i] for i in pending_indices]
|
||||
responses = self.pipeline.infer(current_prompts,
|
||||
top_p=top_p,
|
||||
temperature=temperature,
|
||||
repetition_penalty=repetition_penalty)
|
||||
bad_indices = []
|
||||
for idx, response in enumerate(responses):
|
||||
hidden_state = response['hidden_state']
|
||||
hidden_states[pending_indices[idx]] = hidden_state
|
||||
if response['finish_reason'] != 'stop':
|
||||
print("Warning: A sentence did not complete generation, likely due to hallucination.")
|
||||
if retries > 0 and self.hallucination_detector(hidden_state):
|
||||
print("Warning: A sentence contained a hallucination.")
|
||||
bad_indices.append(pending_indices[idx])
|
||||
if not bad_indices:
|
||||
break
|
||||
else:
|
||||
pending_indices = bad_indices
|
||||
tries_left -= 1
|
||||
if tries_left > 0:
|
||||
print(f"Warning: {len(pending_indices)} sentence(s) will be regenerated.")
|
||||
combined = list(zip(hidden_states, sentence_data))
|
||||
combined.sort(key=lambda x: -x[0].size(0))
|
||||
hidden_states, sentence_data = zip(*combined)
|
||||
|
||||
num_texts = len(texts)
|
||||
audio_concat = [[] for _ in range(num_texts)]
|
||||
for sentence in sentence_data:
|
||||
audio_concat[sentence[1]].append(None)
|
||||
for idx in range(0, len(hidden_states), self.decoder_batch_size):
|
||||
batch_hidden_states = []
|
||||
lengths = list(map(lambda x: x.size(0), hidden_states[idx:idx+self.decoder_batch_size]))
|
||||
N = len(lengths)
|
||||
for i in range(N):
|
||||
batch_hidden_states.append(torch.cat([
|
||||
torch.zeros((1, 512, lengths[0]-lengths[i]), device=self.device),
|
||||
hidden_states[idx+i].unsqueeze(0).transpose(1,2).to(self.device).to(torch.float32),
|
||||
], dim=2))
|
||||
batch_hidden_states = torch.cat(batch_hidden_states)
|
||||
with torch.no_grad():
|
||||
audio = self.decoder(batch_hidden_states)
|
||||
|
||||
for i in range(N):
|
||||
text_id = sentence_data[idx+i][1]
|
||||
sentence_id = sentence_data[idx+i][2]
|
||||
audio_concat[text_id][sentence_id] = audio[i].squeeze()[-(lengths[i]*self.TOKEN_SIZE-self.TOKEN_SIZE):]
|
||||
audio_concat = [torch.cat(x).cpu() for x in audio_concat]
|
||||
|
||||
if out_dir:
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
for i in range(len(audio_concat)):
|
||||
wavfile.write(f"{out_dir}/{i}.wav", 32000, audio_concat[i].cpu().numpy())
|
||||
return audio_concat
|
||||
|
||||
def infer_stream(self,
|
||||
text,
|
||||
chunk_size=1,
|
||||
top_p=0.95,
|
||||
temperature=0.0,
|
||||
repetition_penalty=1.2):
|
||||
start_time = time.time()
|
||||
sentence_data = self._preprocess_text([text])
|
||||
|
||||
first_chunk = True
|
||||
for sentence, _, _ in sentence_data:
|
||||
responses = self.pipeline.stream_infer(sentence,
|
||||
top_p=top_p,
|
||||
temperature=temperature,
|
||||
repetition_penalty=repetition_penalty)
|
||||
hidden_states_buffer = []
|
||||
chunk_counter = chunk_size
|
||||
for token in responses:
|
||||
finished = token['finish_reason'] is not None
|
||||
if not finished: hidden_states_buffer.append(token['hidden_state'][-1])
|
||||
hidden_states_buffer = hidden_states_buffer[-(2*self.RECEPTIVE_FIELD+chunk_size):]
|
||||
if finished or len(hidden_states_buffer) >= self.RECEPTIVE_FIELD + chunk_size:
|
||||
if finished or chunk_counter == chunk_size:
|
||||
batch_hidden_states = torch.stack(hidden_states_buffer)
|
||||
inp = batch_hidden_states.unsqueeze(0).transpose(1, 2).to(self.device).to(torch.float32)
|
||||
with torch.no_grad():
|
||||
audio = self.decoder(inp)[0]
|
||||
if finished:
|
||||
audio_chunk = audio[-((self.RECEPTIVE_FIELD+chunk_counter-1)*self.TOKEN_SIZE-self.TOKEN_SIZE):]
|
||||
else:
|
||||
audio_chunk = audio[-((self.RECEPTIVE_FIELD+chunk_size)*self.TOKEN_SIZE-self.TOKEN_SIZE):-(self.RECEPTIVE_FIELD*self.TOKEN_SIZE-self.TOKEN_SIZE)]
|
||||
chunk_counter = 0
|
||||
if first_chunk:
|
||||
print(f"Streaming latency: {1000*(time.time()-start_time):.2f} ms")
|
||||
first_chunk = False
|
||||
yield audio_chunk.cpu()
|
||||
chunk_counter += 1
|
||||
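`hallucination_detector` above keeps a counter that rises when two consecutive hidden states are nearly identical and decays by one otherwise, so brief variation does not reset a long run. A minimal sketch of that counter on plain lists of floats follows; `has_long_similar_run` is a hypothetical name, and the defaults copy the constants in the method.

```python
def has_long_similar_run(frames, diff_threshold=300.0, max_runlength=16):
    """Return True if near-identical consecutive frames dominate a long stretch."""
    if len(frames) <= max_runlength:  # too short to exceed the run limit
        return False
    run = 0
    for cur, nxt in zip(frames, frames[1:]):
        l1 = sum(abs(a - b) for a, b in zip(cur, nxt))  # summed absolute difference
        if l1 < diff_threshold:
            run += 1
        elif run > 0:
            run -= 1  # decay instead of resetting to zero
        if run > max_runlength:
            return True
    return False


print(has_long_similar_run([[0.0] * 4] * 20))                     # True
print(has_long_similar_run([[i * 1000.0] for i in range(20)]))    # False
```

Because the counter decays rather than resets, a single dissimilar frame inside a drone-like stretch only delays detection by two steps.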
32
soprano_to_rvc/soprano/soprano/utils/auto_select.py
Normal file
@@ -0,0 +1,32 @@
|
||||
import torch
|
||||
|
||||
RECOGNIZED_DEVICES = ['auto', 'cuda', 'cpu', 'mps']
|
||||
RECOGNIZED_BACKENDS = ['auto', 'lmdeploy', 'transformers']
|
||||
|
||||
def select_device(device='auto'):
|
||||
if device == 'auto':
|
||||
if torch.cuda.is_available():
|
||||
device = 'cuda'
|
||||
elif torch.backends.mps.is_available():
|
||||
device = 'mps'
|
||||
else:
|
||||
device = 'cpu'
|
||||
|
||||
assert device in RECOGNIZED_DEVICES, f"unrecognized device {device}, device must be in {RECOGNIZED_DEVICES}"
|
||||
print(f"Using device {device}")
|
||||
return device
|
||||
|
||||
def select_backend(backend='auto', device='auto'):
|
||||
if backend == 'auto':
|
||||
if device == 'cpu':
|
||||
backend = 'transformers'
|
||||
else:
|
||||
try:
|
||||
import lmdeploy
|
||||
backend = 'lmdeploy'
|
||||
except ImportError:
|
||||
backend = 'transformers'
|
||||
|
||||
assert backend in RECOGNIZED_BACKENDS, f"unrecognized backend {backend}, backend must be in {RECOGNIZED_BACKENDS}"
|
||||
print(f"Using backend {backend}")
|
||||
return backend
|
||||
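The `auto` rule in `select_backend` above reduces to: CPU always uses `transformers`; any other device uses `lmdeploy` when it can be imported. A small sketch of that decision follows. `pick_backend` and its `has_lmdeploy` parameter are hypothetical, and `importlib.util.find_spec` is used as the availability probe in place of the `try: import` in the original.

```python
import importlib.util


def pick_backend(backend="auto", device="cpu", has_lmdeploy=None):
    """Resolve 'auto' to a concrete backend name."""
    if has_lmdeploy is None:
        # Probe for the package without importing it.
        has_lmdeploy = importlib.util.find_spec("lmdeploy") is not None
    if backend == "auto":
        if device == "cpu" or not has_lmdeploy:
            backend = "transformers"
        else:
            backend = "lmdeploy"
    if backend not in ("lmdeploy", "transformers"):
        raise ValueError(f"unrecognized backend {backend}")
    return backend


print(pick_backend("auto", "cuda", has_lmdeploy=True))   # lmdeploy
print(pick_backend("auto", "cpu", has_lmdeploy=True))    # transformers
```

An explicit `backend` value bypasses the probe entirely, which matches the original's behaviour of only resolving `auto`.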
34
soprano_to_rvc/soprano/soprano/utils/streaming.py
Normal file
@@ -0,0 +1,34 @@
|
||||
import sounddevice as sd
|
||||
import torch
|
||||
import time
|
||||
|
||||
|
||||
def play_stream(stream, sample_rate=32000):
|
||||
"""
|
||||
Play streamed audio chunks to speakers in real time.
|
||||
"""
|
||||
with sd.OutputStream(
|
||||
samplerate=sample_rate,
|
||||
channels=1,
|
||||
dtype='float32',
|
||||
blocksize=0
|
||||
) as out_stream:
|
||||
start = time.time()
|
||||
latency = None
|
||||
first = True
|
||||
for chunk in stream:
|
||||
if first:
|
||||
latency = time.time()-start
|
||||
first = False
|
||||
|
||||
if isinstance(chunk, torch.Tensor):
|
||||
chunk = chunk.detach().cpu()
|
||||
|
||||
# Ensure shape (N, 1)
|
||||
if chunk.dim() == 1:
|
||||
chunk = chunk.unsqueeze(1)
|
||||
elif chunk.dim() == 2 and chunk.shape[0] == 1:
|
||||
chunk = chunk.transpose(0, 1)
|
||||
|
||||
out_stream.write(chunk.numpy())
|
||||
return latency
|
||||
410
soprano_to_rvc/soprano/soprano/utils/text_normalizer.py
Normal file
@@ -0,0 +1,410 @@
|
||||
"""
|
||||
Normalize input text to a format that Soprano recognizes.
|
||||
Adapted from https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/tokenizer.py
|
||||
"""
|
||||
import re
|
||||
|
||||
import inflect
|
||||
from unidecode import unidecode
|
||||
|
||||
|
||||
_inflect = inflect.engine()
|
||||
|
||||
####################################################################################################
|
||||
# Abbreviations
|
||||
|
||||
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
|
||||
('mrs', 'misess'),
|
||||
('ms', 'miss'),
|
||||
('mr', 'mister'),
|
||||
('dr', 'doctor'),
|
||||
('st', 'saint'),
|
||||
('co', 'company'),
|
||||
('jr', 'junior'),
|
||||
('maj', 'major'),
|
||||
('gen', 'general'),
|
||||
('drs', 'doctors'),
|
||||
('rev', 'reverend'),
|
||||
('lt', 'lieutenant'),
|
||||
('hon', 'honorable'),
|
||||
('sgt', 'sergeant'),
|
||||
('capt', 'captain'),
|
||||
('esq', 'esquire'),
|
||||
('ltd', 'limited'),
|
||||
('col', 'colonel'),
|
||||
('ft', 'fort'),
|
||||
]]
|
||||
_cased_abbreviations = [(re.compile('\\b%s\\b' % x[0]), x[1]) for x in [
|
||||
('TTS', 'text to speech'),
|
||||
('Hz', 'hertz'),
|
||||
('kHz', 'kilohertz'),
|
||||
('KBs', 'kilobytes'),
|
||||
('KB', 'kilobyte'),
|
||||
('MBs', 'megabytes'),
|
||||
('MB', 'megabyte'),
|
||||
('GBs', 'gigabytes'),
|
||||
('GB', 'gigabyte'),
|
||||
('TBs', 'terabytes'),
|
||||
('TB', 'terabyte'),
|
||||
('APIs', 'a p i\'s'),
|
||||
('API', 'a p i'),
|
||||
('CLIs', 'c l i\'s'),
|
||||
('CLI', 'c l i'),
|
||||
('CPUs', 'c p u\'s'),
|
||||
('CPU', 'c p u'),
|
||||
('GPUs', 'g p u\'s'),
|
||||
('GPU', 'g p u'),
|
||||
('Ave', 'avenue'),
|
||||
('etc', 'et cetera'),
|
||||
('Mon', 'monday'),
|
||||
('Tues', 'tuesday'),
|
||||
('Wed', 'wednesday'),
|
||||
('Thurs', 'thursday'),
|
||||
('Fri', 'friday'),
|
||||
('Sat', 'saturday'),
|
||||
('Sun', 'sunday'),
|
||||
('and/or', 'and or'),
|
||||
]]
|
||||
|
||||
def expand_abbreviations(text):
|
||||
for regex, replacement in _abbreviations + _cased_abbreviations:
|
||||
text = re.sub(regex, replacement, text)
|
||||
return text
|
||||
|
||||
####################################################################################################
|
||||
# Numbers
|
||||
|
||||
_num_prefix_re = re.compile(r'#\d')
|
||||
_num_suffix_re = re.compile(r'\b\d+(K|M|B|T)\b', re.IGNORECASE)
|
||||
_num_letter_split_re = re.compile(r'(\d[a-z]|[a-z]\d)', re.IGNORECASE)
|
||||
|
||||
_comma_number_re = re.compile(r'(\d[\d\,]+\d)')
|
||||
_date_re = re.compile(r'(^|[^/])(\d\d?[/-]\d\d?[/-]\d\d(?:\d\d)?)($|[^/])')
|
||||
_phone_number_re = re.compile(r'(\(?\d{3}\)?[-.\s]\d{3}[-.\s]?\d{4})')
|
||||
_time_re = re.compile(r'(\d\d?:\d\d(?::\d\d)?)')
|
||||
_pounds_re = re.compile(r'£([\d\,]*\d+)')
|
||||
_dollars_re = re.compile(r'\$([\d\.\,]*\d+)')
|
||||
_decimal_number_re = re.compile(r'(\d+(?:\.\d+)+)')
|
||||
_multiply_re = re.compile(r'(\d\s?\*\s?\d)')
|
||||
_divide_re = re.compile(r'(\d\s?/\s?\d)')
|
||||
_add_re = re.compile(r'(\d\s?\+\s?\d)')
|
||||
_subtract_re = re.compile(r'(\d?\s?-\s?\d)') # also does negative numbers
|
||||
_fraction_re = re.compile(r'(\d+(?:/\d+)+)')
|
||||
_ordinal_re = re.compile(r'\d+(st|nd|rd|th)')
|
||||
_number_re = re.compile(r'\d+')
|
||||
|
||||
def _expand_num_prefix(m):
|
||||
match = m.group(0)
|
||||
return f"number {match[1]}"
|
||||
|
||||
def _expand_num_suffix(m):
|
||||
match = m.group(0)
|
||||
if match[-1].upper() == 'K': return f"{match[:-1]} thousand"
|
||||
elif match[-1].upper() == 'M': return f"{match[:-1]} million"
|
||||
elif match[-1].upper() == 'B': return f"{match[:-1]} billion"
|
||||
elif match[-1].upper() == 'T': return f"{match[:-1]} trillion"
|
||||
return match # unexpected format
|
||||
|
||||
def _split_alphanumeric(m):
|
||||
match = m.group(1)
|
||||
return f"{match[0]} {match[1]}"
|
||||
|
||||
def _remove_commas(m):
|
||||
return m.group(1).replace(',', '')
|
||||
|
||||
def _expand_date(m):
|
||||
match = m.group(2)
|
||||
match = re.split('[./-]', match)
|
||||
return m.group(1) + ' dash '.join(match) + m.group(3)
|
||||
|
||||
def _expand_phone_number(m):
|
||||
match = m.group(1)
|
||||
match = re.sub(r'\D', '', match)
|
||||
assert len(match) == 10
|
||||
match = f"{' '.join(list(match[:3]))}, {' '.join(list(match[3:6]))}, {' '.join(list(match[6:]))}"
|
||||
return match
|
||||
|
||||
def _expand_time(m):
|
||||
match = m.group(1)
|
||||
match = match.split(':')
|
||||
if len(match) == 2:
|
||||
hours, minutes = match
|
||||
if minutes == '00':
|
||||
if int(hours) == 0:
|
||||
return '0'
|
||||
elif int(hours) > 12: return f"{hours} minutes"
|
||||
return f"{hours} o'clock"
|
||||
elif minutes.startswith('0'):
|
||||
minutes = f'oh {minutes[1:]}'
|
||||
return f"{hours} {minutes}"
|
||||
else:
|
||||
hours, minutes, seconds = match
|
||||
if int(hours) != 0:
|
||||
return f"{hours} {'oh oh' if minutes == '00' else f'oh {minutes}' if minutes.startswith('0') else minutes} {'' if seconds == '00' else f'oh {seconds}' if seconds.startswith('0') else seconds}"
|
||||
elif minutes != '00':
|
||||
return f"{minutes} {'oh oh' if seconds == '00' else f'oh {seconds}' if seconds.startswith('0') else seconds}"
|
||||
else:
|
||||
return seconds
|
||||
|
||||
def _expand_dollars(m):
|
||||
match = m.group(1)
|
||||
parts = match.split('.')
|
||||
if len(parts) > 2:
|
||||
return match + ' dollars' # Unexpected format
|
||||
dollars = int(parts[0]) if parts[0] else 0
|
||||
cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
|
||||
if dollars and cents:
|
||||
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
|
||||
cent_unit = 'cent' if cents == 1 else 'cents'
|
||||
return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
|
||||
elif dollars:
|
||||
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
|
||||
return '%s %s' % (dollars, dollar_unit)
|
||||
elif cents:
|
||||
cent_unit = 'cent' if cents == 1 else 'cents'
|
||||
return '%s %s' % (cents, cent_unit)
|
||||
else:
|
||||
return 'zero dollars'
|
||||
|
||||
def _expand_decimal_point(m):
|
||||
match = m.group(1)
|
||||
match = match.split('.')
|
||||
return match[0] + ' point ' + ' point '.join(' '.join(list(match[i])) for i in range(1, len(match)))
|
||||
|
||||
def _expand_fraction(m):
|
||||
match = m.group(1)
|
||||
match = match.split('/')
|
||||
return ' over '.join(match) if len(match)==2 else ' slash '.join(match)
|
||||
|
||||
def _expand_multiply(m):
|
||||
return ' times '.join(m.group(1).split('*'))
|
||||
|
||||
def _expand_divide(m):
|
||||
return ' over '.join(m.group(1).split('/'))
|
||||
|
||||
def _expand_add(m):
|
||||
return ' plus '.join(m.group(1).split('+'))
|
||||
|
||||
def _expand_subtract(m):
|
||||
return ' minus '.join(m.group(1).split('-'))
|
||||
|
||||
def _expand_ordinal(m):
|
||||
return _inflect.number_to_words(m.group(0), andword='')
|
||||
|
||||
def _expand_number(m):
|
||||
num = int(m.group(0))
|
||||
if num > 1000 and num < 3000:
|
||||
if num == 2000:
|
||||
return 'two thousand'
|
||||
elif num > 2000 and num < 2010:
|
||||
return 'two thousand ' + _inflect.number_to_words(num % 100)
|
||||
elif num % 100 == 0:
|
||||
return _inflect.number_to_words(num // 100) + ' hundred'
|
||||
else:
|
||||
return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
|
||||
else:
|
||||
return _inflect.number_to_words(num, andword='')
|
||||
|
||||
def normalize_numbers(text):
|
||||
text = re.sub(_num_prefix_re, _expand_num_prefix, text)
|
||||
text = re.sub(_num_suffix_re, _expand_num_suffix, text)
|
||||
text = re.sub(_comma_number_re, _remove_commas, text)
|
||||
text = re.sub(_date_re, _expand_date, text)
|
||||
text = re.sub(_phone_number_re, _expand_phone_number, text)
|
||||
text = re.sub(_time_re, _expand_time, text)
|
||||
text = re.sub(_pounds_re, r'\1 pounds', text)
|
||||
text = re.sub(_dollars_re, _expand_dollars, text)
|
||||
text = re.sub(_decimal_number_re, _expand_decimal_point, text)
|
||||
text = re.sub(_multiply_re, _expand_multiply, text)
|
||||
text = re.sub(_divide_re, _expand_divide, text)
|
||||
text = re.sub(_add_re, _expand_add, text)
|
||||
text = re.sub(_subtract_re, _expand_subtract, text)
|
||||
|
||||
text = re.sub(_fraction_re, _expand_fraction, text)
|
||||
text = re.sub(_ordinal_re, _expand_ordinal, text)
|
||||
for _ in range(2): # need to do this twice to find all matches
|
||||
text = re.sub(_num_letter_split_re, _split_alphanumeric, text)
|
||||
text = re.sub(_number_re, _expand_number, text)
|
||||
return text
|
||||
|
||||
####################################################################################################
|
||||
# Special characters & other patterns
|
||||
|
||||
_preunicode_special_characters = [(re.compile(x[0]), x[1]) for x in [
|
||||
('—', ' - '),
|
||||
]]
|
||||
_special_characters = [(re.compile(x[0]), x[1]) for x in [
|
||||
('@', ' at '),
|
||||
('&', ' and '),
|
||||
('%', ' percent '),
|
||||
(':', '.'),
|
||||
(';', ','),
|
||||
(r'\+', ' plus '),
|
||||
(r'\\', ' backslash '),
|
||||
('~', ' about '),
|
||||
('(^| )<3', ' heart '),
|
||||
('<=', ' less than or equal to '),
|
||||
('>=', ' greater than or equal to '),
|
||||
('<', ' less than '),
|
||||
('>', ' greater than '),
|
||||
('=', ' equals '),
|
||||
('/', ' slash '),
|
||||
('_', ' '),
|
||||
(r'\*', ' '),
|
||||
]]
|
||||
_link_header_re = re.compile(r'(https?://)')
|
||||
_dash_re = re.compile(r'(. - .)')
|
||||
_dot_re = re.compile(r'([A-Z]\.[A-Z])', re.IGNORECASE)
|
||||
_parentheses_re = re.compile(r'[\(\[\{].*[\)\]\}](.|$)')
|
||||
|
||||
def expand_preunicode_special_characters(text):
|
||||
for regex, replacement in _preunicode_special_characters:
|
||||
text = re.sub(regex, replacement, text)
|
||||
return text
|
||||
|
||||
def expand_special_characters(text):
|
||||
for regex, replacement in _special_characters:
|
||||
text = re.sub(regex, replacement, text)
|
||||
return text
|
||||
|
||||
def _expand_link_header(m):
|
||||
return 'h t t p s colon slash slash ' if m.group(1).startswith('https') else 'h t t p colon slash slash '
|
||||
|
||||
def _expand_dash(m):
|
||||
match = m.group(0)
|
||||
return f"{match[0]}, {match[4]}"
|
||||
|
||||
def _expand_dot(m):
|
||||
match = m.group(0)
|
||||
return f"{match[0]} dot {match[2]}"
|
||||
|
||||
def _expand_parantheses(m):
|
||||
match = m.group(0)
|
||||
match = re.sub(r'[\(\[\{]', ', ', match)
|
||||
match = re.sub(r'[\)\]\}][^$.!?,]', ', ', match)
|
||||
match = re.sub(r'[\)\]\}]', '', match)
|
||||
return match
|
||||
|
||||
def normalize_special(text):
|
||||
text = re.sub(_link_header_re, _expand_link_header, text)
|
||||
text = re.sub(_dash_re, _expand_dash, text)
|
||||
text = re.sub(_dot_re, _expand_dot, text)
|
||||
text = re.sub(_parentheses_re, _expand_parantheses, text)
|
||||
return text
|
||||
|
||||
####################################################################################################
|
||||
# Misc
|
||||
|
||||
def lowercase(text):
|
||||
return text.lower()
|
||||
|
||||
def convert_to_ascii(text):
|
||||
return unidecode(text)
|
||||
|
||||
def normalize_newlines(text):
|
||||
text = text.split('\n')
|
||||
for i in range(len(text)):
|
||||
text[i] = text[i].strip()
|
||||
if not text[i]: continue
|
||||
if text[i][-1] not in '.!?':
|
||||
text[i] = f"{text[i]}."
|
||||
return ' '.join(text)
|
||||
|
||||
def remove_unknown_characters(text):
|
||||
text = re.sub(r"[^A-Za-z !\$%&'\*\+,-./0123456789<>\?_]", "", text)
|
||||
text = re.sub(r"[<>/_+]", "", text)
|
||||
return text
|
||||
|
||||
def collapse_whitespace(text):
|
||||
text = re.sub(r'\s+', ' ', text)
|
||||
text = re.sub(r' [.\?!,]', lambda m: m.group(0)[1], text)
|
||||
return text.strip()
|
||||
|
||||
def dedup_punctuation(text):
|
||||
text = re.sub(r"\.\.\.+", "[ELLIPSIS]", text)
|
||||
text = re.sub(r",+", ",", text)
|
||||
text = re.sub(r"[\.,]*\.[\.,]*", ".", text)
|
||||
text = re.sub(r"[\.,!]*![\.,!]*", "!", text)
|
||||
text = re.sub(r"[\.,!\?]*\?[\.,!\?]*", "?", text)
|
||||
text = re.sub(r"\[ELLIPSIS\]", "...", text)
|
||||
return text
|
||||
|
||||
def clean_text(text):
|
||||
text = expand_preunicode_special_characters(text)
|
||||
text = convert_to_ascii(text)
|
||||
text = normalize_newlines(text)
|
||||
text = normalize_numbers(text)
|
||||
text = normalize_special(text)
|
||||
text = expand_abbreviations(text)
|
||||
text = expand_special_characters(text)
|
||||
text = lowercase(text)
|
||||
text = remove_unknown_characters(text)
|
||||
text = collapse_whitespace(text)
|
||||
text = dedup_punctuation(text)
|
||||
return text
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
print(clean_text('1,2,3,456,176'))
|
||||
print(clean_text('123,456,789'))
|
||||
print(clean_text('123,456,789th'))
|
||||
print(clean_text('123-456-7890'))
|
||||
print(clean_text('111-111-1111'))
|
||||
print(clean_text('(111) 111-1111'))
|
||||
print(clean_text('A(111) 111-1111'))
|
||||
print(clean_text('A (111) 111-1111'))
|
||||
print(clean_text('$2.47'))
|
||||
print(clean_text('$247'))
|
||||
print(clean_text('$0.27'))
|
||||
print(clean_text('$1.00'))
|
||||
print(clean_text('£20'))
|
||||
for i in range(1990, 2030):
|
||||
print(clean_text(str(i)))
|
||||
print(clean_text('2656'))
|
||||
print(clean_text('1024'))
|
||||
print(clean_text('2.47023'))
|
||||
print(clean_text('20.47023'))
|
||||
print(clean_text('1.17.1.1'))
|
||||
print(clean_text('111.111.1111'))
|
||||
print(clean_text('1/1/2025'))
|
||||
print(clean_text('1-1-2025'))
|
||||
print(clean_text('1-1-25'))
|
||||
print(clean_text('A 1/1/11 A'))
|
||||
print(clean_text('A 1/1 A'))
|
||||
print(clean_text('1/1'))
|
||||
print(clean_text('1/10'))
|
||||
print(clean_text('1/1/10'))
|
||||
print(clean_text('11/1/1/10'))
|
||||
|
||||
print(clean_text('0:00'))
|
||||
print(clean_text('12:00'))
|
||||
print(clean_text('13:00'))
|
||||
print(clean_text('8:00'))
|
||||
print(clean_text('8:05'))
|
||||
print(clean_text('8:15'))
|
||||
print(clean_text('0:00:00'))
|
||||
print(clean_text('00:01:10'))
|
||||
print(clean_text('00:10:01'))
|
||||
print(clean_text('01:01:01'))
|
||||
print(clean_text('00:01:00'))
|
||||
print(clean_text('01:00:00'))
|
||||
|
||||
print(clean_text('-1 + 2 * 3 - 4 / 5'))
|
||||
print(clean_text('-1+2*3-5/4/25'))
|
||||
|
||||
print(clean_text('100x1'))
|
||||
print(clean_text('100k'))
|
||||
print(clean_text('100m'))
|
||||
print(clean_text('100b'))
|
||||
print(clean_text('100t'))
|
||||
|
||||
print(clean_text('#1'))
|
||||
|
||||
print(clean_text('12:00'))
|
||||
print(clean_text('11:59'))
|
||||
print(clean_text('01:00'))
|
||||
print(clean_text('0100'))
|
||||
|
||||
print(clean_text('1st 2nd 3rd 4th'))
|
||||
print(clean_text('1K 1M 1B 1T 1K1M1B1T'))
|
||||
print(clean_text('and/or'))
|
||||
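The special-character table in the file above is applied in order, so multi-character patterns such as `<=` have to be listed before the single characters they contain. The following is a minimal sketch of that idea, not the module itself: `_RULES` and `expand` are hypothetical names, and the table is a trimmed-down subset chosen for illustration.

```python
import re

# Hypothetical, trimmed-down rule table. Order matters: '<=' must be
# tried before '<' and '=', otherwise it would be split into two words.
_RULES = [(re.compile(pattern), replacement) for pattern, replacement in [
    ('<=', ' less than or equal to '),
    ('<', ' less than '),
    ('=', ' equals '),
    ('&', ' and '),
]]


def expand(text):
    """Apply each rule in turn, then collapse the extra spaces it introduced."""
    for regex, replacement in _RULES:
        text = regex.sub(replacement, text)
    return re.sub(r'\s+', ' ', text).strip()


print(expand('a<=b & c<d'))
```

If the `<` rule were moved to the top, `a<=b` would instead expand to "a less than equals b", which is why the ordering of the table is part of its behaviour.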
76
soprano_to_rvc/soprano/soprano/utils/text_splitter.py
Normal file
@@ -0,0 +1,76 @@
"""
Copied from https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py
"""
import re


def split_and_recombine_text(text, desired_length=1, max_length=300):
    """Split text into chunks of a desired length, trying to keep sentences intact."""
    # normalize text, remove redundant whitespace and convert non-ascii quotes to ascii
    text = re.sub(r'\n\n+', '\n', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[“”]', '"', text)

    rv = []
    in_quote = False
    current = ""
    split_pos = []
    pos = -1
    end_pos = len(text) - 1

    def seek(delta):
        nonlocal pos, in_quote, current
        is_neg = delta < 0
        for _ in range(abs(delta)):
            if is_neg:
                pos -= 1
                current = current[:-1]
            else:
                pos += 1
                current += text[pos]
            if text[pos] == '"':
                in_quote = not in_quote
        return text[pos]

    def peek(delta):
        p = pos + delta
        return text[p] if p < end_pos and p >= 0 else ""

    def commit():
        nonlocal rv, current, split_pos
        rv.append(current)
        current = ""
        split_pos = []

    while pos < end_pos:
        c = seek(1)
        # do we need to force a split?
        if len(current) >= max_length:
            if len(split_pos) > 0 and len(current) > (desired_length / 2):
                # we have at least one sentence and we are over half the desired length, seek back to the last split
                d = pos - split_pos[-1]
                seek(-d)
            else:
                # no full sentences, seek back until we are not in the middle of a word and split there
                while c not in '!?.\n ' and pos > 0 and len(current) > desired_length:
                    c = seek(-1)
            commit()
        # check for sentence boundaries
        elif not in_quote and (c in '!?\n' or (c == '.' and peek(1) in '\n ')):
            # seek forward if we have consecutive boundary markers but still within the max length
            while pos < len(text) - 1 and len(current) < max_length and peek(1) in '!?.':
                c = seek(1)
            split_pos.append(pos)
            if len(current) >= desired_length:
                commit()
        # treat end of quote as a boundary if it's followed by a space or newline
        elif in_quote and peek(1) == '"' and peek(2) in '\n ':
            seek(2)
            split_pos.append(pos)
    rv.append(current)

    # clean up, remove lines with only whitespace or punctuation
    rv = [s.strip() for s in rv]
    rv = [s for s in rv if len(s) > 0 and not re.match(r'^[\s\.,;:!?]*$', s)]

    return rv
||||
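The splitter above walks the text character by character so it can track quotes and back up to the last sentence boundary. As a rough illustration of how `desired_length` and `max_length` interact, here is a much simpler greedy stand-in. It is not the algorithm above: `pack_sentences` is a hypothetical helper that ignores quotes and never splits inside a sentence, so a single long sentence can still exceed `max_length`.

```python
import re


def pack_sentences(text, desired_length=60, max_length=120):
    """Greedy stand-in: split on sentence punctuation, then pack into chunks.

    A chunk is emitted as soon as it reaches desired_length, and a sentence
    starts a new chunk if appending it would exceed max_length.
    """
    sentences = re.findall(r'[^.!?]+[.!?]+|[^.!?]+$', text)
    chunks, current = [], ''
    for sentence in (s.strip() for s in sentences):
        if not sentence:
            continue
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_length:
            # Appending would overflow: flush what we have and start over.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
        if len(current) >= desired_length:
            chunks.append(current)
            current = ''
    if current:
        chunks.append(current)
    return chunks


print(pack_sentences('Hi. Ok.', desired_length=1))
print(pack_sentences('One. Two. Three.', desired_length=10, max_length=20))
```

With `desired_length=1`, the default in the file above, every sentence becomes its own chunk; larger values merge short sentences until the target length is reached.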
45
soprano_to_rvc/soprano/soprano/vocos/decoder.py
Normal file
@@ -0,0 +1,45 @@
import torch
from torch import nn

from .models import VocosBackbone
from .heads import ISTFTHead


class SopranoDecoder(nn.Module):
    def __init__(self,
                 num_input_channels=512,
                 decoder_num_layers=8,
                 decoder_dim=768,
                 decoder_intermediate_dim=None,
                 hop_length=512,
                 n_fft=2048,
                 upscale=4,
                 dw_kernel=3,
                 ):
        super().__init__()
        self.decoder_initial_channels = num_input_channels
        self.num_layers = decoder_num_layers
        self.dim = decoder_dim
        self.intermediate_dim = decoder_intermediate_dim if decoder_intermediate_dim else decoder_dim*3
        self.hop_length = hop_length
        self.n_fft = n_fft
        self.upscale = upscale
        self.dw_kernel = dw_kernel

        self.decoder = VocosBackbone(input_channels=self.decoder_initial_channels,
                                     dim=self.dim,
                                     intermediate_dim=self.intermediate_dim,
                                     num_layers=self.num_layers,
                                     input_kernel_size=1,#dw_kernel,
                                     dw_kernel_size=dw_kernel,
                                     )
        self.head = ISTFTHead(dim=self.dim,
                              n_fft=self.n_fft,
                              hop_length=self.hop_length)

    def forward(self, x):
        T = x.size(2)
        x = torch.nn.functional.interpolate(x, size=self.upscale*(T-1)+1, mode='linear', align_corners=True)
        x = self.decoder(x)
        reconstructed = self.head(x)
        return reconstructed
||||
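The `forward` method above interpolates `T` input frames to `upscale*(T-1)+1` frames before the ISTFT head. The sketch below works through the resulting audio length. It rests on two assumptions that are not stated in this file: that the head runs `torch.istft` with `center=True`, which returns `hop_length * (frames - 1)` samples, and that the output is played at the 32000 Hz `SAMPLE_RATE` used in `webui.py`. `decoder_output_samples` is a hypothetical helper, not part of the package.

```python
def decoder_output_samples(num_tokens, upscale=4, hop_length=512):
    """Number of audio samples produced for `num_tokens` input frames.

    Assumes linear upsampling to upscale*(T-1)+1 frames followed by an
    ISTFT with center=True, which yields hop_length*(frames-1) samples.
    """
    frames = upscale * (num_tokens - 1) + 1
    return hop_length * (frames - 1)


SAMPLE_RATE = 32000  # assumption: taken from webui.py

for tokens in (2, 10, 100):
    samples = decoder_output_samples(tokens)
    print(tokens, samples, samples / SAMPLE_RATE)
```

Under these assumptions each additional input frame adds `upscale * hop_length = 2048` samples, or 64 ms of audio at 32 kHz.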
50
soprano_to_rvc/soprano/soprano/vocos/heads.py
Normal file
@@ -0,0 +1,50 @@
import torch
from torch import nn
from .spectral_ops import ISTFT


class ISTFTHead(nn.Module):
    """
    ISTFT Head module for predicting STFT complex coefficients.

    Args:
        dim (int): Hidden dimension of the model.
        n_fft (int): Size of Fourier transform.
        hop_length (int): The distance between neighboring sliding window frames, which should align with
                          the resolution of the input features.
        padding (str, optional): Type of padding. Options are "center" or "same". Defaults to "center".
    """

    def __init__(self, dim: int, n_fft: int, hop_length: int, padding: str = "center"):
        super().__init__()
        out_dim = n_fft + 2
        self.out = torch.nn.Linear(dim, out_dim)
        self.istft = ISTFT(n_fft=n_fft, hop_length=hop_length, win_length=n_fft, padding=padding)

    @torch.compiler.disable
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the ISTFTHead module.

        Args:
            x (Tensor): Input tensor of shape (B, H, L), where B is the batch size,
                        H denotes the model dimension, and L is the sequence length.

        Returns:
            Tensor: Reconstructed time-domain audio signal of shape (B, T), where T is the length of the output signal.
        """
        x = self.out(x.transpose(1,2)).transpose(1, 2)
        mag, p = x.chunk(2, dim=1)
        mag = torch.exp(mag)
        mag = torch.clip(mag, max=1e2)  # safeguard to prevent excessively large magnitudes
        # wrapping happens here. These two lines produce real and imaginary value
        x = torch.cos(p)
        y = torch.sin(p)
        # recalculating phase here does not produce anything new
        # only costs time
        # phase = torch.atan2(y, x)
        # S = mag * torch.exp(phase * 1j)
        # better directly produce the complex value
        S = mag * (x + 1j * y)
        audio = self.istft(S)
        return audio
||||
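The comments in `forward` above argue that building the complex spectrogram as `mag * (cos(p) + 1j * sin(p))` is equivalent to recomputing the phase with `atan2` and exponentiating. A scalar sketch with the standard library makes that concrete. `polar_to_complex` and `via_atan2` are hypothetical names, and the clamp at `1e2` mirrors the `torch.clip` call.

```python
import cmath
import math


def polar_to_complex(log_mag, phase, max_mag=1e2):
    """Direct form: exponentiate and clamp the magnitude, then use cos/sin."""
    mag = min(math.exp(log_mag), max_mag)
    return mag * complex(math.cos(phase), math.sin(phase))


def via_atan2(log_mag, phase, max_mag=1e2):
    """Roundabout form: wrap the phase with atan2, then exponentiate."""
    mag = min(math.exp(log_mag), max_mag)
    wrapped = math.atan2(math.sin(phase), math.cos(phase))
    return mag * cmath.exp(1j * wrapped)


for log_mag, phase in [(0.0, 0.5), (1.0, 4.0), (10.0, -7.5)]:
    a = polar_to_complex(log_mag, phase)
    b = via_atan2(log_mag, phase)
    print(a, abs(a - b) < 1e-9)
```

Both forms agree up to rounding, so the direct version only saves the extra `atan2` and `exp` calls.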
61
soprano_to_rvc/soprano/soprano/vocos/models.py
Normal file
@@ -0,0 +1,61 @@
from typing import Optional

import torch
from torch import nn

from .modules import ConvNeXtBlock

class VocosBackbone(nn.Module):
    """
    Vocos backbone module built with ConvNeXt blocks.

    Args:
        input_channels (int): Number of input features channels.
        dim (int): Hidden dimension of the model.
        intermediate_dim (int): Intermediate dimension used in ConvNeXtBlock.
        num_layers (int): Number of ConvNeXtBlock layers.
        input_kernel_size (int, optional): Kernel size of the input embedding convolution. Defaults to 9.
        dw_kernel_size (int, optional): Kernel size of the depthwise convolution in each ConvNeXtBlock. Defaults to 9.
        layer_scale_init_value (float, optional): Initial value for layer scaling. Defaults to `1 / sqrt(num_layers)`.
        pad (str, optional): Padding mode of the input embedding convolution. Defaults to 'zeros'.
    """

    def __init__(
        self,
        input_channels: int,
        dim: int,
        intermediate_dim: int,
        num_layers: int,
        input_kernel_size: int = 9,
        dw_kernel_size: int = 9,
        layer_scale_init_value: Optional[float] = None,
        pad: str = 'zeros',
    ):
        super().__init__()
        self.embed = nn.Conv1d(input_channels, dim, kernel_size=input_kernel_size, padding=input_kernel_size//2, padding_mode=pad)
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.convnext = nn.ModuleList(
            [
                ConvNeXtBlock(
                    dim=dim,
                    intermediate_dim=intermediate_dim,
                    dw_kernel_size=dw_kernel_size,
                    layer_scale_init_value=layer_scale_init_value or 1 / num_layers**0.5,
                )
                for _ in range(num_layers)
            ]
        )
        self.final_layer_norm = nn.LayerNorm(dim, eps=1e-6)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, (nn.Conv1d, nn.Linear)):
            nn.init.trunc_normal_(m.weight, std=0.02)
            if m.bias is not None: nn.init.constant_(m.bias, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.embed(x)  # (B, C, L)
        x = self.norm(x.transpose(1, 2))
        x = x.transpose(1, 2)
        for conv_block in self.convnext:
            x = conv_block(x)
        x = self.final_layer_norm(x.transpose(1, 2))
        x = x.transpose(1, 2)
        return x
||||
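Both convolutions in this backbone use `padding=kernel_size//2`, which keeps the sequence length unchanged only for odd kernel sizes. The sketch below applies the standard `Conv1d` output-length formula for stride 1 and dilation 1. `conv1d_out_len` and `default_layer_scale` are hypothetical helpers written for illustration, not functions from the package.

```python
def conv1d_out_len(length, kernel_size, padding):
    """Output length of a stride-1, dilation-1 1D convolution."""
    return length + 2 * padding - (kernel_size - 1)


def default_layer_scale(num_layers):
    """Fallback used when layer_scale_init_value is None (or 0)."""
    return 1 / num_layers ** 0.5


for k in (1, 3, 9, 4):
    print(k, conv1d_out_len(100, k, k // 2))

print(default_layer_scale(8))
```

Kernel sizes 1, 3 and 9 (the values that appear in this commit) preserve the length of 100, while an even size such as 4 would grow it by one frame and break the residual additions in the ConvNeXt blocks.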
47
soprano_to_rvc/soprano/soprano/vocos/modules.py
Normal file
@@ -0,0 +1,47 @@
import torch
from torch import nn


class ConvNeXtBlock(nn.Module):
    """ConvNeXt Block adapted from https://github.com/facebookresearch/ConvNeXt to 1D audio signal.

    Args:
        dim (int): Number of input channels.
        intermediate_dim (int): Dimensionality of the intermediate layer.
        layer_scale_init_value (float): Initial value for the layer scale. Values <= 0 disable scaling.
        dw_kernel_size (int, optional): Kernel size of the depthwise convolution. Defaults to 9.
    """

    def __init__(
        self,
        dim: int,
        intermediate_dim: int,
        layer_scale_init_value: float,
        dw_kernel_size: int = 9,
    ):
        super().__init__()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size=dw_kernel_size, padding=dw_kernel_size//2, groups=dim)  # depthwise conv
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, intermediate_dim)  # pointwise/1x1 convs, implemented with linear layers
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(intermediate_dim, dim)
        self.gamma = (
            nn.Parameter(layer_scale_init_value * torch.ones(dim), requires_grad=True)
            if layer_scale_init_value > 0
            else None
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.transpose(1, 2)  # (B, C, T) -> (B, T, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        if self.gamma is not None:
            x = self.gamma * x
        x = x.transpose(1, 2)  # (B, T, C) -> (B, C, T)

        x = residual + x
        return x
||||
74
soprano_to_rvc/soprano/soprano/vocos/spectral_ops.py
Normal file
@@ -0,0 +1,74 @@
import torch
from torch import nn

class ISTFT(nn.Module):
    """
    Custom implementation of ISTFT since torch.istft doesn't allow custom padding (other than `center=True`) with
    windowing. This is because the NOLA (Nonzero Overlap Add) check fails at the edges.
    See issue: https://github.com/pytorch/pytorch/issues/62323
    Specifically, in the context of neural vocoding we are interested in "same" padding analogous to CNNs.
    The NOLA constraint is met as we trim padded samples anyway.

    Args:
        n_fft (int): Size of Fourier transform.
        hop_length (int): The distance between neighboring sliding window frames.
        win_length (int): The size of window frame and STFT filter.
        padding (str, optional): Type of padding. Options are "center" or "same". Defaults to "same".
    """

    def __init__(self, n_fft: int, hop_length: int, win_length: int, padding: str = "same"):
        super().__init__()
        if padding not in ["center", "same"]:
            raise ValueError("Padding must be 'center' or 'same'.")
        self.padding = padding
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
        window = torch.hann_window(win_length)
        self.register_buffer("window", window)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        """
        Compute the Inverse Short Time Fourier Transform (ISTFT) of a complex spectrogram.

        Args:
            spec (Tensor): Input complex spectrogram of shape (B, N, T), where B is the batch size,
                            N is the number of frequency bins, and T is the number of time frames.

        Returns:
            Tensor: Reconstructed time-domain signal of shape (B, L), where L is the length of the output signal.
        """
        if self.padding == "center":
            spec[:,0] = 0 # fixes some strange bug where first/last freqs don't matter when bs<16 which causes exploding gradients
            spec[:,-1] = 0
            # Fallback to pytorch native implementation
            return torch.istft(spec, self.n_fft, self.hop_length, self.win_length, self.window, center=True)
        elif self.padding == "same":
            pad = (self.win_length - self.hop_length) // 2
        else:
            raise ValueError("Padding must be 'center' or 'same'.")

        assert spec.dim() == 3, "Expected a 3D tensor as input"
        B, N, T = spec.shape

        # Inverse FFT
        ifft = torch.fft.irfft(spec, self.n_fft, dim=1, norm="backward")
        ifft = ifft * self.window[None, :, None]

        # Overlap and Add
        output_size = (T - 1) * self.hop_length + self.win_length
        y = torch.nn.functional.fold(
            ifft, output_size=(1, output_size), kernel_size=(1, self.win_length), stride=(1, self.hop_length),
        )[:, 0, 0, pad:-pad]

        # Window envelope
        window_sq = self.window.square().expand(1, T, -1).transpose(1, 2)
        window_envelope = torch.nn.functional.fold(
            window_sq, output_size=(1, output_size), kernel_size=(1, self.win_length), stride=(1, self.hop_length),
        ).squeeze()[pad:-pad]

        # Normalize
        assert (window_envelope > 1e-11).all()
        y = y / window_envelope

        return y
||||
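The "same" branch above overlap-adds the squared window, trims `pad` samples from each end, and divides by the resulting envelope. The pure-Python sketch below reproduces only that envelope computation on a tiny example, to show the trimmed length and why the `> 1e-11` assertion holds. `hann` and `same_padding_envelope` are hypothetical helpers; the window uses the periodic Hann formula, which is assumed to match the default of `torch.hann_window`.

```python
import math


def hann(n):
    """Periodic Hann window (assumed to match torch.hann_window's default)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def same_padding_envelope(num_frames, hop_length, win_length):
    """Overlap-add the squared window, then trim as the "same" branch does."""
    window_sq = [w * w for w in hann(win_length)]
    output_size = (num_frames - 1) * hop_length + win_length
    envelope = [0.0] * output_size
    for frame in range(num_frames):
        start = frame * hop_length
        for i, value in enumerate(window_sq):
            envelope[start + i] += value
    pad = (win_length - hop_length) // 2
    return envelope[pad:output_size - pad]


env = same_padding_envelope(num_frames=5, hop_length=2, win_length=8)
print(len(env))   # num_frames * hop_length samples survive the trim
print(min(env))   # strictly positive, so the division is safe
print(env[3])     # fully overlapped region
```

With a hop of a quarter window, the fully overlapped region of the envelope is constant, and the trimmed edges stay well above zero, so the normalization never divides by a vanishing value. The trimmed length is `num_frames * hop_length`, one hop longer than the `center=True` path.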
240
soprano_to_rvc/soprano/soprano/webui.py
Normal file
@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""
Gradio Web Interface for Soprano TTS
"""

import argparse
import socket
import time
import gradio as gr
import numpy as np
from soprano import SopranoTTS
from soprano.utils.streaming import play_stream


parser = argparse.ArgumentParser(description='Soprano Text-to-Speech Gradio WebUI')
parser.add_argument('--model-path', '-m',
                    help='Path to local model directory (optional)')
parser.add_argument('--device', '-d', default='auto',
                    choices=['auto', 'cuda', 'cpu', 'mps'],
                    help='Device to use for inference')
parser.add_argument('--backend', '-b', default='auto',
                    choices=['auto', 'transformers', 'lmdeploy'],
                    help='Backend to use for inference')
parser.add_argument('--cache-size', '-c', type=int, default=100,
                    help='Cache size in MB (for lmdeploy backend)')
parser.add_argument('--decoder-batch-size', '-bs', type=int, default=1,
                    help='Batch size when decoding audio')
args = parser.parse_args()

# Initialize model
print("Loading Soprano TTS model...")
model = SopranoTTS(
    backend=args.backend,
    device=args.device,
    cache_size_mb=args.cache_size,
    decoder_batch_size=args.decoder_batch_size,
    model_path=args.model_path
)
device = model.device
backend = model.backend
print("Model loaded successfully!")

SAMPLE_RATE = 32000


def generate_speech(
    text: str,
    temperature: float,
    top_p: float,
    repetition_penalty: float,
    chunk_size: int,
    streaming: bool,
):
    if not text.strip():
        yield None, "Please enter some text to generate speech."
        return

    try:
        if streaming:
            stream = model.infer_stream(
                text,
                temperature=temperature,
                top_p=top_p,
                repetition_penalty=repetition_penalty,
                chunk_size=chunk_size,
            )
            yield None, "⏳ Streaming..."

            latency = play_stream(stream)

            yield None, (
                f"✓ Streaming complete | "
                f"{latency*1000:.2f} ms latency"
            )
            return

        start_time = time.perf_counter()

        audio = model.infer(
            text,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
        )

        gen_time = time.perf_counter() - start_time

        audio_np = audio.cpu().numpy()
        audio_int16 = (audio_np * 32767).astype(np.int16)

        audio_seconds = len(audio_np) / SAMPLE_RATE
        rtf = audio_seconds / gen_time if gen_time > 0 else float("inf")

        status = (
            f"✓ Generated {audio_seconds:.2f} s audio | "
            f"Generation time: {gen_time:.3f} s "
            f"({rtf:.2f}x realtime)"
        )

        yield (SAMPLE_RATE, audio_int16), status
        return

    except Exception as e:
        yield None, f"✗ Error: {str(e)}"


# Create Gradio interface
with gr.Blocks(title="Soprano TTS") as demo:
    gr.Markdown(
        f"""# 🗣️ Soprano TTS

<div align="center">
  <img width="300" height="300" alt="soprano-github" src="https://github.com/user-attachments/assets/4d612eac-23b8-44e6-8c59-d7ac14ebafd1" />
</div>

**Device:** {device.upper()} | **Backend:** {backend}

**Model Weights:** https://huggingface.co/ekwek/Soprano-1.1-80M
**Model Demo:** https://huggingface.co/spaces/ekwek/Soprano-TTS
**GitHub:** https://github.com/ekwek1/soprano
"""
    )
    with gr.Row():
        with gr.Column(scale=2):
            text_input = gr.Textbox(
                label="Text to Synthesize",
                placeholder="Enter text here...",
                value="Soprano is an extremely lightweight text to speech model designed to produce highly realistic speech at unprecedented speed.",
                lines=5,
                max_lines=10,
            )
            streaming = gr.Checkbox(
                label="Stream Audio",
                value=False,
                info="Note: This bypasses the Gradio interface and streams audio directly to your speaker."
            )
            with gr.Accordion("Advanced Settings", open=False):
                temperature = gr.Slider(
                    minimum=0.0,
                    maximum=1.0,
                    value=0.0,
                    step=0.05,
                    label="Temperature",
                )
                top_p = gr.Slider(
                    minimum=0.5,
                    maximum=1.0,
                    value=0.95,
                    step=0.05,
                    label="Top P",
                )
                repetition_penalty = gr.Slider(
                    minimum=1.0,
                    maximum=2.0,
                    value=1.2,
                    step=0.1,
                    label="Repetition Penalty",
                )
                chunk_size = gr.Slider(
                    minimum=1,
                    maximum=10,
                    value=1,
                    step=1,
                    precision=0,
                    label="Chunk Size (Streaming only)",
                )
            generate_btn = gr.Button("Generate Speech", variant="primary", size="lg")
        with gr.Column(scale=1):
            audio_output = gr.Audio(
                label="Generated Speech",
                type="numpy",
                autoplay=True,
            )
            status_output = gr.Textbox(
                label="Status",
                interactive=False,
                lines=3,
                max_lines=10
            )
    gr.Examples(
        examples=[
            ["Soprano is an extremely lightweight text to speech model.", 0.0, 0.95, 1.2],
            ["Artificial intelligence is transforming the world.", 0.0, 0.95, 1.2],
            ["I'm so excited, I can't even wait!", 0.0, 0.95, 1.2],
            ["Why don't you go ahead and try it?", 0.0, 0.95, 1.2],
        ],
        inputs=[text_input, temperature, top_p, repetition_penalty],
        label="Example Prompts",
    )
    generate_btn.click(
        fn=generate_speech,
        inputs=[text_input, temperature, top_p, repetition_penalty, chunk_size, streaming],
        outputs=[audio_output, status_output],
    )
    gr.Markdown(
        f"""
### Usage tips:

- Soprano works best when each sentence is between 2 and 30 seconds long.
- Although Soprano recognizes numbers and some special characters, it occasionally mispronounces them.
  Best results can be achieved by converting these into their phonetic form.
  (1+1 -> one plus one, etc)
- If Soprano produces unsatisfactory results, you can easily regenerate it for a new, potentially better generation.
  You may also change the sampling settings for more varied results.
- Avoid improper grammar such as not using contractions, multiple spaces, etc.
"""
    )


def find_free_port(start_port=7860, max_tries=100):
    for port in range(start_port, start_port + max_tries):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(("", port))
                return port
        except OSError:
            continue
    raise OSError("Could not find a free port")

def main():
    # Start Gradio interface
    port = find_free_port(7860)
    print(f"Starting Gradio interface on port {port}")
    demo.launch(
        server_name="0.0.0.0",
        server_port=port,
        share=False,
        theme=gr.themes.Soft(primary_hue="green"),
        css="""
a {
    color: var(--primary-600);
}
a:hover {
    color: var(--primary-700);
}
"""
    )

if __name__ == "__main__":
    main()
||||
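`generate_speech` above converts float audio to 16-bit PCM with `(audio_np * 32767).astype(np.int16)` and reports speed as audio seconds divided by generation time. The sketch below restates both calculations with the standard library only. `float_to_int16` and `realtime_factor` are hypothetical helpers, and the clamp to [-1, 1] is an addition of this sketch, not something the web UI does; without it, samples outside that range would overflow the 16-bit range.

```python
def float_to_int16(samples):
    """Scale floats in [-1, 1] to 16-bit PCM, clamping out-of-range values.

    int() truncates toward zero, like casting a float array to int16.
    """
    return [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]


def realtime_factor(num_samples, sample_rate, gen_time):
    """Seconds of audio produced per second of generation time."""
    audio_seconds = num_samples / sample_rate
    return audio_seconds / gen_time if gen_time > 0 else float("inf")


print(float_to_int16([0.0, 0.5, 1.0, -1.0, 2.0]))
print(realtime_factor(64000, 32000, 0.5))
```

A factor above 1 means audio is generated faster than it plays back, which is what the "x realtime" figure in the status line expresses.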