
Spring AI Audio Models

news 2026/6/1 3:04:48

I. Introduction to Audio

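The Audio Models module in Spring AI has two parts: text-to-speech (TTS) and audio-to-text (transcription). At present both are supported mainly for OpenAI as the provider: OpenAI's TTS models (tts-1 / tts-1-hd) convert text to speech, and OpenAI's whisper-1 model converts audio to text. Both are exposed through unified configuration and auto-configured beans with synchronous calls, and speech synthesis can additionally be streamed, which gives an easy-to-use, extensible API.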

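An OpenAI API key can be purchased on the OpenAI website (accounts used from mainland China risk being banned). Alternatively, third-party sellers on Taobao offer relay keys, which also give access to the OpenAI models.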

II. Text-to-Speech Example

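Spring AI abstracts text-to-speech behind the SpeechModel interface so that additional model providers can be supported in the future. Currently the interface has only one implementation, OpenAiAudioSpeechModel, so text-to-speech is limited to OpenAI's tts-1 or tts-1-hd models.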

The following example creates a Spring Boot project and builds a Controller and a Service that perform text-to-speech.

OpenAi Audio

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.2.5</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.example</groupId>
    <artifactId>SpringAIAudio</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>SpringAIAudio</name>
    <description>SpringAIAudio</description>
    <url/>
    <licenses>
        <license/>
    </licenses>
    <developers>
        <developer/>
    </developers>
    <scm>
        <connection/>
        <developerConnection/>
        <tag/>
        <url/>
    </scm>
    <properties>
        <java.version>17</java.version>
    </properties>

    <!-- Import the Spring AI BOM to manage the versions of all Spring AI dependencies in one place.
         Spring AI modules can then be declared without a <version>; Maven uses the version
         recommended by the BOM. -->
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.ai</groupId>
                <artifactId>spring-ai-bom</artifactId>
                <version>1.0.0-SNAPSHOT</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-model-openai</artifactId>
        </dependency>
    </dependencies>

    <!-- Repositories used to resolve Spring AI and related pre-release versions -->
    <repositories>
        <repository>
            <id>spring-snapshots</id>
            <name>Spring Snapshots</name>
            <url>https://repo.spring.io/snapshot</url>
            <releases>
                <enabled>false</enabled>
            </releases>
        </repository>
        <repository>
            <name>Central Portal Snapshots</name>
            <id>central-portal-snapshots</id>
            <url>https://central.sonatype.com/repository/maven-snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>

application.properties

spring.application.name=SpringAIAudio

# Directory where the generated speech files are stored
tts.audio-output-dir=E:\\openSourceProject\\spring-ai-demo\\SpringAIAudio\\src\\main

# OpenAI connection settings, shared by text-to-speech and speech-to-text
spring.ai.openai.base-url=https://api.uchat.site/
spring.ai.openai.api-key=${open_api_key}

# Model used for text-to-speech
spring.ai.openai.audio.speech.options.model=tts-1
# Voice used for synthesis; can be alloy, echo, fable, onyx, nova or shimmer
spring.ai.openai.audio.speech.options.voice=alloy

TtsController

package com.example.springaiaudio.Controller;

import com.example.springaiaudio.service.TtsService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/tts")
public class TtsController {

    @Autowired
    private TtsService ttsService;

    // GET /tts/text?text=... converts the given text to speech and returns the saved file path
    @GetMapping("/text")
    public ResponseEntity<String> textToSpeech(@RequestParam("text") String text) {
        try {
            String outputFile = ttsService.textToSpeech(text);
            return ResponseEntity.ok("Text-to-speech succeeded, file saved at: " + outputFile);
        } catch (Exception e) {
            return ResponseEntity.status(500).body("Text-to-speech failed: " + e.getMessage());
        }
    }
}
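Because the endpoint receives the text as a query parameter, a caller has to URL-encode it, especially for Chinese text or text containing spaces. The following is a minimal sketch of building the request URI with the JDK only. The class name TtsRequestUri is hypothetical, and the base URL assumes the application runs locally on Spring Boot's default port 8080.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class TtsRequestUri {

    // Assumption: the application listens locally on Spring Boot's default port 8080.
    static final String BASE_URL = "http://localhost:8080";

    /** Builds the GET URI for /tts/text with the text URL-encoded as UTF-8. */
    static URI build(String text) {
        // URLEncoder performs form encoding and writes spaces as '+';
        // a literal '+' in the input is already encoded as %2B at this point,
        // so replacing '+' with %20 only affects spaces.
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(BASE_URL + "/tts/text?text=" + encoded);
    }

    public static void main(String[] args) {
        System.out.println(build("Hello Spring AI"));
    }
}
```

The resulting URI can be opened in a browser or passed to any HTTP client.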

TtsService

package com.example.springaiaudio.service;

import org.springframework.ai.openai.OpenAiAudioSpeechModel;
import org.springframework.ai.openai.OpenAiAudioSpeechOptions;
import org.springframework.ai.openai.api.OpenAiAudioApi;
import org.springframework.ai.openai.audio.speech.SpeechPrompt;
import org.springframework.ai.openai.audio.speech.SpeechResponse;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

@Service
public class TtsService {

    // Inject the OpenAI speech model
    @Autowired
    private OpenAiAudioSpeechModel openAiAudioSpeechModel;

    // Audio output directory, read from the configuration file
    @Value("${tts.audio-output-dir}")
    private String outputDir;

    public String textToSpeech(String text) {
        // Build the speech synthesis options
        OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
                //.model("tts-1")
                .voice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY) // use the Alloy voice
                .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3) // output format: MP3
                .speed(1.0f) // speaking speed: 1.0
                .build();

        // Build the speech synthesis request
        SpeechPrompt prompt = new SpeechPrompt(text, options);

        // Call the model to generate the speech
        SpeechResponse response = openAiAudioSpeechModel.call(prompt);
        System.out.println("Speech synthesis result: " + response);

        // Get the binary audio content
        byte[] audioContent = response.getResult().getOutput();

        // Make sure the output directory itself exists; create it if it does not
        File dir = new File(outputDir);
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IllegalStateException("Cannot create output directory: " + dir);
        }

        // Name the output file with a timestamp to avoid collisions
        File out = new File(dir, System.currentTimeMillis() + ".mp3");
        try {
            // Write the binary content to the file
            Files.write(out.toPath(), audioContent);
        } catch (IOException e) {
            // Propagate the failure so the controller does not report success for a missing file
            throw new UncheckedIOException("Failed to write audio file: " + out, e);
        }

        // Return the full path of the generated file
        return out.getAbsolutePath();
    }
}
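The file-saving step of the service does not depend on Spring AI, so it can be tried out in isolation. The following is a minimal sketch using java.nio, under the assumption that the audio bytes are already available. The class name AudioFileWriter and its method signature are hypothetical and not part of the project above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class AudioFileWriter {

    /**
     * Writes the audio bytes to dir/(timestampMillis).mp3.
     * Missing parent directories are created first.
     */
    static Path save(Path dir, byte[] audio, long timestampMillis) {
        try {
            // createDirectories is a no-op when the directory already exists
            Files.createDirectories(dir);
            Path out = dir.resolve(timestampMillis + ".mp3");
            // Files.write returns the path it wrote to
            return Files.write(out, audio);
        } catch (IOException e) {
            // Surface the failure to the caller instead of swallowing it
            throw new UncheckedIOException("Failed to save audio to " + dir, e);
        }
    }

    public static void main(String[] args) {
        Path dir = Path.of(System.getProperty("java.io.tmpdir"), "tts-demo");
        System.out.println(save(dir, new byte[] {1, 2, 3}, System.currentTimeMillis()));
    }
}
```

Passing the timestamp as a parameter keeps the helper deterministic, which makes it easy to check the resulting file name in a test.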

III. Audio-to-Text

Spring AI provides the OpenAiAudioTranscriptionModel class, which uses OpenAI's Whisper model to convert audio into text. The output format can be json, text, srt, verbose_json or vtt.

OpenAi Audio

pom.xml

Same as above.

application.properties

Add the following configuration:

# Model used for speech-to-text
spring.ai.openai.audio.transcription.options.model=whisper-1

TranscribeController

package com.example.springaiaudio.Controller;

import com.example.springaiaudio.service.TranscribeService;
import org.springframework.ai.openai.OpenAiAudioTranscriptionOptions;
import org.springframework.ai.openai.api.OpenAiAudioApi;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/transcribe")
public class TranscribeController {

    @Autowired
    private TranscribeService transcribeService;

    // Audio directory read from the application configuration; it holds the mp3 files to transcribe
    @Value("${tts.audio-output-dir}")
    private String audioOutputDir;

    // GET endpoint that returns the transcription results
    @GetMapping("/transcr")
    public ResponseEntity<Map<String, String>> transcribe() {
        // Create a File object pointing at the audio directory
        File dir = new File(audioOutputDir);
        if (!dir.exists() || !dir.isDirectory()) {
            return ResponseEntity.badRequest().body(Map.of("error", "Audio directory does not exist"));
        }

        // List all files with the .mp3 suffix
        File[] mp3Files = dir.listFiles((dir1, name) -> name.toLowerCase().endsWith(".mp3"));
        if (mp3Files == null || mp3Files.length == 0) {
            return ResponseEntity.badRequest().body(Map.of("error", "No mp3 files in the audio directory"));
        }

        // Build the OpenAI transcription options: response format, temperature, language
        OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.TEXT) // plain-text output
                .temperature(0.0f) // range 0-1; 0.0 disables random sampling, higher values are less deterministic
                .language("zh") // the audio language is Chinese
                .build();

        Map<String, String> result = new HashMap<>();
        for (File mp3File : mp3Files) {
            String relPath = mp3File.getName();
            try {
                // Call the service to transcribe the file
                String transcribedText = transcribeService.transcribeFromFile(mp3File.getAbsolutePath(), options);
                // Store the transcription under the file name
                result.put(relPath, transcribedText);
            } catch (Exception e) {
                result.put(relPath, "Transcription failed: " + e.getMessage());
            }
        }

        // Return the transcription results for all files
        return ResponseEntity.ok(result);
    }
}
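The controller selects files purely by name, so a directory whose name happens to end in .mp3 would also be picked up and only rejected later by the service. The following sketch shows a stricter listing step with java.nio that keeps regular files only and returns them in a stable order. The class name Mp3Lister is hypothetical and not part of the project above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class Mp3Lister {

    /** Returns the regular files in dir whose names end with .mp3 (case-insensitive), sorted by path. */
    static List<Path> listMp3Files(Path dir) {
        // Files.list holds an open directory handle, so close it with try-with-resources
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile) // skip directories, even if named *.mp3
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".mp3"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot list " + dir, e);
        }
    }

    public static void main(String[] args) {
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        listMp3Files(dir).forEach(System.out::println);
    }
}
```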

TranscribeService

package com.example.springaiaudio.service;

import org.springframework.ai.audio.transcription.AudioTranscriptionPrompt;
import org.springframework.ai.audio.transcription.AudioTranscriptionResponse;
import org.springframework.ai.openai.OpenAiAudioTranscriptionModel;
import org.springframework.ai.openai.OpenAiAudioTranscriptionOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.io.FileSystemResource;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

import java.io.File;

@Service
public class TranscribeService {

    // Inject the OpenAI transcription model
    @Autowired
    private OpenAiAudioTranscriptionModel openAiAudioTranscriptionModel;

    /**
     * Transcribes the audio content of a file.
     * @param audioFilePath path of the audio file
     * @param transcriptionOptions transcription options
     * @return the transcribed text
     */
    public String transcribeFromFile(String audioFilePath, OpenAiAudioTranscriptionOptions transcriptionOptions) {
        // Build a File object from the given path
        File file = new File(audioFilePath);

        // Verify that the file exists and is a regular file; otherwise throw
        if (!file.exists() || !file.isFile()) {
            throw new RuntimeException("File does not exist or is not a regular file: " + file.getAbsolutePath());
        }

        // Wrap the File in a Resource, which is the input type the model expects
        Resource audioFile = new FileSystemResource(file);

        // Create the transcription prompt from the audio resource and the options
        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, transcriptionOptions);

        // Call the model and obtain the response
        AudioTranscriptionResponse response = openAiAudioTranscriptionModel.call(prompt);
        System.out.println("Transcription result: " + response);

        // Return the text output of the first result
        return response.getResults().get(0).getOutput();
    }
}
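The existence check in the service can be extended with a size check, so that oversized or empty files are rejected before any network call is made. The following is only a sketch. The 25 MB limit is an assumption based on the upload limit OpenAI documents for its transcription endpoint and should be verified for the endpoint or relay service actually in use. The class name AudioUploadCheck is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class AudioUploadCheck {

    // Assumption: 25 MB upload limit for transcription requests; verify before relying on it.
    static final long MAX_UPLOAD_BYTES = 25L * 1024 * 1024;

    /** True when the size is non-zero and within the assumed limit. */
    static boolean isUploadable(long sizeBytes) {
        return sizeBytes > 0 && sizeBytes <= MAX_UPLOAD_BYTES;
    }

    /** Throws IllegalArgumentException if the file is missing, not a regular file, empty, or too large. */
    static void requireUploadable(Path file) {
        if (!Files.isRegularFile(file)) {
            throw new IllegalArgumentException("Not a regular file: " + file);
        }
        long size;
        try {
            size = Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read size of " + file, e);
        }
        if (!isUploadable(size)) {
            throw new IllegalArgumentException("Unsupported size (" + size + " bytes): " + file);
        }
    }

    public static void main(String[] args) {
        System.out.println(isUploadable(10L * 1024 * 1024));
    }
}
```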

IV. Summary

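Loose coupling (provider-agnostic): Spring AI's goal is for business code to depend on its unified interfaces, such as SpeechModel, so that switching the underlying vendor (OpenAI, Azure, Alibaba Cloud and so on) leaves the core logic largely unchanged. The examples in this article inject the OpenAI-specific classes directly, so they would need to be adapted first.
Minimal development effort: OpenAiAudioTranscriptionModel and OpenAiAudioSpeechModel are used directly through dependency injection, with no need to hand-assemble multipart/form-data or other complex audio API requests.
Standardized parameters: provider-specific parameters (such as Whisper's temperature or the TTS voice) are set through Options objects, which lowers the learning curve.
Multimodal cooperation: the design follows the same pattern as ImageModel and ChatModel, which makes it convenient to build a complete multimodal interaction flow of "speech input -> text understanding -> speech reply".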