بهبود ترنسکریپشن‌های Whisper: تکنیک‌های پیش و پس پردازش

این پست راهنمایی برای بهبود ترنسکریپشن‌های Whisper ارائه می‌دهد. ما داده‌های صوتی را از طریق برش و تقسیم‌بندی بهینه‌سازی می‌کنیم تا کیفیت ترنسکریپشن‌های Whisper را افزایش دهیم. پس از ترنسکریپشن، خروجی را با افزودن علائم نگارشی، تنظیم اصطلاحات محصول (مثلاً ‘five two nine’ به ‘529’) و کاهش مشکلات Unicode بهبود می‌بخشیم. این استراتژی‌ها به بهبود وضوح ترنسکریپشن‌ها کمک می‌کنند، اما به یاد داشته باشید که سفارشی‌سازی بر اساس یوزکیس خاص شما ممکن است مفید باشد.

راه‌اندازی #

برای شروع، بیایید چند کتابخانه مختلف را وارد کنیم:

کتابخانه‌ی PyDub یک کتابخانه ساده و آسان برای استفاده در پایتون برای وظایف پردازش صوتی مانند برش، ترکیب و صادرات فایل‌های صوتی است.
کلاس Audio از ماژول IPython.display به شما امکان می‌دهد یک کنترل صوتی ایجاد کنید که می‌تواند صدا را در نوت‌بوک‌های Jupyter پخش کند و راهی ساده برای پخش داده‌های صوتی مستقیماً در نوت‌بوک شما فراهم می‌کند.
برای فایل صوتی ما، از یک فایل صوتی که متنش توسط ChatGPT نوشته شده و توسط نویسنده خوانده شده است، استفاده خواهیم کرد. این فایل صوتی نسبتاً کوتاه است، اما امیدواریم ایده‌ای تصویری از چگونگی اعمال این مراحل پیش و پس پردازش به هر فایل صوتی به شما بدهد.

برای اجرای کدهای زیر ابتدا باید یک کلید API را از طریق پنل کاربری گیلاس تولید کنید. برای این کار ابتدا یک حساب کاربری جدید بسازید یا اگر صاحب حساب کاربری هستید وارد پنل کاربری خود شوید. سپس، به صفحه کلید API بروید و با کلیک روی دکمه “ساخت کلید API” یک کلید جدید برای دسترسی به Gilas API بسازید.

 1from openai import OpenAI
 2import os
 3import urllib
 4from IPython.display import Audio
 5from pathlib import Path
 6from pydub import AudioSegment
 7import ssl
 8
 9client = OpenAI(
10    api_key=os.environ.get(("GILAS_API_KEY", "<کلید API خود را اینجا بسازید https://dashboard.gilas.io/apiKey>")), 
11    base_url="https://api.gilas.io/v1/" # Gilas APIs
12)

1# set download paths
2earnings_call_remote_filepath = "https://cdn.openai.com/API/examples/data/EarningsCall.wav"
3
4# set local save locations
5earnings_call_filepath = "data/EarningsCall.wav"
6
7# download example audio files and save locally
8ssl._create_default_https_context = ssl._create_unverified_context
9urllib.request.urlretrieve(earnings_call_remote_filepath, earnings_call_filepath)

گاهی اوقات، فایل‌هایی با سکوت طولانی در ابتدا می‌توانند باعث شوند Whisper صدا را به اشتباه ترنسکرایب کند. ما از Pydub برای تشخیص و برش سکوت استفاده خواهیم کرد.

در اینجا، آستانه دسیبل را روی 20 تنظیم کرده‌ایم. می‌توانید این مقدار را بر اساس نیاز خود تغییر دهید.

 1# Function to detect leading silence
 2# Returns the number of milliseconds until the first sound (chunk averaging more than X decibels)
 3def milliseconds_until_sound(sound, silence_threshold_in_decibels=-20.0, chunk_size=10):
 4    trim_ms = 0  # ms
 5
 6    assert chunk_size > 0  # to avoid infinite loop
 7    while sound[trim_ms:trim_ms+chunk_size].dBFS < silence_threshold_in_decibels and trim_ms < len(sound):
 8        trim_ms += chunk_size
 9
10    return trim_ms

 1def trim_start(filepath):
 2    path = Path(filepath)
 3    directory = path.parent
 4    filename = path.name
 5    audio = AudioSegment.from_file(filepath, format="wav")
 6    start_trim = milliseconds_until_sound(audio)
 7    trimmed = audio[start_trim:]
 8    new_filename = directory / f"trimmed_{filename}"
 9    trimmed.export(new_filename, format="wav")
10    return trimmed, new_filename

1def transcribe_audio(file,output_dir):
2    audio_path = os.path.join(output_dir, file)
3    with open(audio_path, 'rb') as audio_data:
4        transcription = client.audio.transcriptions.create(
5            model="whisper-1", file=audio_data)
6        return transcription.text

گاهی اوقات، در ترنسکریپشن‌ها تزریق کاراکترهای Unicode مشاهده می‌شود، حذف هر کاراکتر غیر ASCII باید به کاهش این مشکل کمک کند.

به خاطر داشته باشید که اگر در حال ترنسکرایب به زبان‌های یونانی، سیریلیک، عربی، چینی و غیره هستید٬ نباید از این تابع استفاده کنید.

1# Define function to remove non-ascii characters
2def remove_non_ascii(text):
3    return ''.join(i for i in text if ord(i)<128)

این تابع فرمت‌بندی و علائم نگارشی را به ترنسکریپشن ما اضافه می‌کند. Whisper یک ترنسکریپشن با علائم نگارشی و بدون فرمت‌بندی تولید می‌کند.

برای آگاهی دقیق‌تر از نحوه پرامپت کردن مدل whisper پیشنهاد می‌دهیم این پست را مطالعه کنید.

 1# Define function to add punctuation
 2def punctuation_assistant(ascii_transcript):
 3
 4    system_prompt = """You are a helpful assistant that adds punctuation to text.
 5      Preserve the original words and only insert necessary punctuation such as periods,
 6     commas, capialization, symbols like dollar sings or percentage signs, and formatting.
 7     Use only the context provided. If there is no context provided say, 'No context provided'\n"""
 8    response = client.chat.completions.create(
 9        model="gpt-4o-mini",
10        temperature=0,
11        messages=[
12            {
13                "role": "system",
14                "content": system_prompt
15            },
16            {
17                "role": "user",
18                "content": ascii_transcript
19            }
20        ]
21    )
22    return response

فایل صوتی ما در مورد داده‌های مالی است که شامل بسیاری از محصولات مالی است. این تابع می‌تواند کمک کند تا اگر Whisper این نام‌های محصولات مالی را به اشتباه ترنسکرایب کرد، آنها تصحیح شوند.

 1# Define function to fix product mispellings
 2def product_assistant(ascii_transcript):
 3    system_prompt = """You are an intelligent assistant specializing in financial products;
 4    your task is to process transcripts of earnings calls, ensuring that all references to
 5     financial products and common financial terms are in the correct format. For each
 6     financial product or common term that is typically abbreviated as an acronym, the full term 
 7    should be spelled out followed by the acronym in parentheses. For example, '401k' should be
 8     transformed to '401(k) retirement savings plan', 'HSA' should be transformed to 'Health Savings Account (HSA)'
 9    , 'ROA' should be transformed to 'Return on Assets (ROA)', 'VaR' should be transformed to 'Value at Risk (VaR)'
10, and 'PB' should be transformed to 'Price to Book (PB) ratio'. Similarly, transform spoken numbers representing 
11financial products into their numeric representations, followed by the full name of the product in parentheses. 
12For instance, 'five two nine' to '529 (Education Savings Plan)' and 'four zero one k' to '401(k) (Retirement Savings Plan)'.
13 However, be aware that some acronyms can have different meanings based on the context (e.g., 'LTV' can stand for 
14'Loan to Value' or 'Lifetime Value'). You will need to discern from the context which term is being referred to 
15and apply the appropriate transformation. In cases where numerical figures or metrics are spelled out but do not 
16represent specific financial products (like 'twenty three percent'), these should be left as is. Your role is to
17 analyze and adjust financial product terminology in the text. Once you've done that, produce the adjusted 
18 transcript and a list of the words you've changed"""
19    response = client.chat.completions.create(
20        model="gpt-4",
21        temperature=0,
22        messages=[
23            {
24                "role": "system",
25                "content": system_prompt
26            },
27            {
28                "role": "user",
29                "content": ascii_transcript
30            }
31        ]
32    )
33    return response

این تابع یک فایل جدید با نام ’trimmed’ به نام فایل اصلی اضافه می‌کند.

1# Trim the start of the original audio file
2trimmed_audio = trim_start(earnings_call_filepath)

1trimmed_audio, trimmed_filename = trim_start(earnings_call_filepath)

فایل صوتی گزارش درآمدی ما نسبتاً کوتاه است، بنابراین بخش‌ها را به طور مناسب تنظیم خواهیم کرد. به خاطر داشته باشید که می‌توانید طول بخش‌ها را به دلخواه تنظیم کنید.

 1# Segment audio
 2trimmed_audio = AudioSegment.from_wav(trimmed_filename)  # Load the trimmed audio file
 3
 4one_minute = 1 * 60 * 1000  # Duration for each segment (in milliseconds)
 5
 6start_time = 0  # Start time for the first segment
 7
 8i = 0  # Index for naming the segmented files
 9
10output_dir_trimmed = "trimmed_earnings_directory"  # Output directory for the segmented files
11
12if not os.path.isdir(output_dir_trimmed):  # Create the output directory if it does not exist
13    os.makedirs(output_dir_trimmed)
14
15while start_time < len(trimmed_audio):  # Loop over the trimmed audio file
16    segment = trimmed_audio[start_time:start_time + one_minute]  # Extract a segment
17    segment.export(os.path.join(output_dir_trimmed, f"trimmed_{i:02d}.wav"), format="wav")  # Save the segment
18    start_time += one_minute  # Update the start time for the next segment
19    i += 1  # Increment the index for naming the next file

1# Get list of trimmed and segmented audio files and sort them numerically
2audio_files = sorted(
3    (f for f in os.listdir(output_dir_trimmed) if f.endswith(".wav")),
4    key=lambda f: int(''.join(filter(str.isdigit, f)))
5)

1# Use a loop to apply the transcribe function to all audio files
2transcriptions = [transcribe_audio(file, output_dir_trimmed) for file in audio_files]

1# Concatenate the transcriptions
2full_transcript = ' '.join(transcriptions)

1print(full_transcript)

اولین نسخه از ترنسکریپت تولید شده:

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99%... confidence level indicating that our maximum loss will not exceed 5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around 135 million and 8% quarter over quarter growth driven primarily by our cutting edge blockchain solutions and AI driven predictive analytics. We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise 200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.

حال می‌خواهیم کارکترهای اسکی را از متن تولید شده حذف کنیم:

1# Remove non-ascii characters from the transcript
2ascii_transcript = remove_non_ascii(full_transcript)

1print(ascii_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99%... confidence level indicating that our maximum loss will not exceed 5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around 135 million and 8% quarter over quarter growth driven primarily by our cutting edge blockchain solutions and AI driven predictive analytics. We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise 200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.

حال می‌خواهیم علامت‌های نگارشی صحیح را به متن اضافه کنیم:

1# Use punctuation assistant function
2response = punctuation_assistant(ascii_transcript)

1# Extract the punctuated transcript from the model's response
2punctuated_transcript = response.choices[0].message.content

1print(punctuated_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99% confidence level indicating that our maximum loss will not exceed $5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around $135 million and 8% quarter over quarter growth driven primarily by our cutting-edge blockchain solutions and AI-driven predictive analytics. We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise $200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.

و در نهایت می‌خواهیم به مدل کمک کنیم تا از اسم صحیح محصولات مالی در هنگام تولید ترنسکریپت استفاده کند.

1# Use product assistant function
2response = product_assistant(punctuated_transcript)

1# Extract the final transcript from the model's response
2final_transcript = response.choices[0].message.content

1print(final_transcript)

Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar second quarter (Q2) with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA) has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in second quarter (Q2) 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in Collateralized Debt Obligations (CDOs), and Residential Mortgage-Backed Securities (RMBS). We've also invested $25 million in AAA rated corporate bonds, enhancingour risk-adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our Debt-to-Equity (D/E) ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with Customer Acquisition Cost (CAC) dropping by 15% and Lifetime Value (LTV) growing by 25%. Our LTV to CAC (LTVCAC) ratio is at an impressive 3.5%. In terms of risk management, we have a Value at Risk (VaR) model in place with a 99% confidence level indicating that our maximum loss will not exceed $5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy Tier 1 Capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around $135 million and 8% quarter over quarter growth driven primarily by our cutting-edge blockchain solutions and AI-driven predictive analytics. We're also excited about the upcoming Initial Public Offering (IPO) of our FinTech subsidiary Pay Plus, which we expect to raise $200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful third quarter (Q3). Thank you so much.

Words Changed:
1. Q2 -> second quarter (Q2)
2. EBITDA -> Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA)
3. Q2 2022 -> second quarter (Q2) 2022
4. CDOs -> Collateralized Debt Obligations (CDOs)
5. RMBS -> Residential Mortgage-Backed Securities (RMBS)
6. D/E -> Debt-to-Equity (D/E)
7. CAC -> Customer Acquisition Cost (CAC)
8. LTV -> Lifetime Value (LTV)
9. LTVCAC -> LTV to CAC (LTVCAC)
10. VaR -> Value at Risk (VaR)
11. IPO -> Initial Public Offering (IPO)
12. Q3 -> third quarter (Q3)