Overview
Recently, the first phase of the Bank of Nanjing customer emotion recognition system project that I am in charge of was successfully completed. During this period, I have been fully engaged in audio signal processing, multimodal large model inference optimization, and adaptation to the bank's customer service quality inspection scenario. There are no fancy achievement displays here; it has been a pragmatic process of stepping on pitfalls, adjusting, and optimizing step by step. The project is now about to enter the key second phase: fine-tuning the model on Bank of Nanjing's real business data. Because of the bank's strict controls on sensitive data, however, details related to that data cannot be disclosed, and I will not elaborate on every technical detail of the fine-tuning for the time being. Drawing on the practice of the first phase, I will sort out the thoughts, adjustments, and insights from this R&D process, both as a review of past work and as groundwork for what comes next.
Initial Architecture Conception Guided by Business Pain Points
When taking over the Bank of Nanjing customer emotion recognition system project, the first core requirement to clarify was to solve the inherent shortcomings of traditional emotion recognition schemes in the bank's call center customer service quality inspection scenario. Traditional schemes either rely solely on speech acoustic analysis, which can only capture surface features such as intonation and speech rate, lacking in-depth understanding of dialogue semantics and failing to accurately judge the real demands behind customers' emotions; or they only perform pure text sentiment analysis, losing the most critical paralinguistic information in audio. For example, customers' dissatisfaction and anxiety hidden in their tone often cannot be accurately captured. In the bank's customer service quality inspection scenario, the requirements for the accuracy of emotion recognition are extremely high, which is directly related to customer satisfaction evaluation, customer service quality optimization, and even the prediction of potential risks. Therefore, how to effectively integrate speech acoustic features with semantic information has become the core direction of my initial thinking.
Based on this core requirement, I initially conceived a dual-path processing architecture, which later became Plan A and Plan B. The idea of Plan A is relatively straightforward: rely on VAD (Voice Activity Detection) technology to segment the call audio, then feed the effective speech segments into a multimodal large model for emotion inference, realizing a closed loop of "audio segmentation — semantic understanding — emotion recognition". The reason for prioritizing Plan A at the time was that VAD is widely used in audio segmentation and relatively easy to implement; my initial idea was to complete audio preprocessing quickly with a simple, efficient segmentation method and lay the foundation for subsequent model inference.
To implement Plan A, I built a dynamic threshold adjustment algorithm in the audio_splitter.py file based on the Librosa library. The core idea is to adjust the VAD segmentation thresholds dynamically by analyzing the audio's energy and zero-crossing rate in real time, avoiding the loss of key information that fixed thresholds cause: for example, when a customer expresses dissatisfaction at low volume, a fixed threshold may discard it as invalid speech, while a dynamic threshold can flexibly adapt the detection standard to the actual audio. What I did not expect was that in actual tests, VAD segmentation accuracy fell far short of my initial assumptions. In recordings with good quality and a single continuous speaker, VAD segmentation can barely meet the requirements, but real bank calls are full of overlapping speech and heavy background noise. There, VAD either splits a single emotionally coherent utterance into fragments, so the downstream model cannot capture the continuity of the emotion; or it misjudges background noise and interjections as effective speech, inflating the redundancy of subsequent inference; worse, it may miss low-volume but emotionally critical segments, directly hurting recognition accuracy.
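To make the dynamic-threshold idea concrete, here is a minimal sketch in plain Python. The function names, the median-based rule, and the numeric defaults are illustrative; the actual audio_splitter.py works on Librosa feature arrays (librosa.feature.rms, librosa.feature.zero_crossing_rate) rather than raw lists.

```python
def frame_features(samples, frame_len=400):
    """Short-time energy and zero-crossing rate per frame, in plain Python
    so the logic stays visible (Librosa computes the same vectorized)."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in f) / len(f) for f in frames]
    zcrs = [sum(1 for a, b in zip(f, f[1:]) if a * b < 0) / len(f) for f in frames]
    return energies, zcrs

def dynamic_threshold(energies, floor=1e-4, ratio=0.3):
    """Energy threshold that tracks the recording itself: a fraction of the
    median frame energy, clamped by an absolute floor, so a quiet but real
    complaint is not discarded the way a fixed threshold would discard it."""
    med = sorted(energies)[len(energies) // 2]
    return max(floor, ratio * med)

def voiced_mask(energies, zcrs, thr, zcr_max=0.35):
    """A frame counts as speech if it is energetic enough and not noise-like
    (a very high zero-crossing rate suggests hiss or static)."""
    return [e > thr and z < zcr_max for e, z in zip(energies, zcrs)]
```

The point of the median-relative threshold is that a uniformly quiet call lowers the bar instead of silencing the customer.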
Precisely because of the insufficient segmentation accuracy of VAD, and subsequent multiple optimizations of threshold parameters and adjustments of algorithm logic failed to fundamentally solve this problem, I gradually adjusted the focus of R&D and shifted all attention to Plan B — it is not that Plan A is completely infeasible, but that combined with the actual business scenario of Bank of Nanjing, Plan B is more suitable for the complex situation of call recordings and can better ensure the accuracy of emotion recognition. This is also an important idea adjustment in my R&D process, and almost all subsequent R&D work has been carried out around Plan B.
Plan B Implementation and Adaptation Optimization Under Hardware Constraints
The core idea of Plan B is to use the Speaker Diarization technology of Pyannote-audio 3.1, combined with the ASR (Automatic Speech Recognition) technology of Whisper, to first complete speaker separation and speech transcription, then realize audio segmentation through role alignment, and finally connect to a multimodal large model for emotion inference. Compared with Plan A, the process of Plan B is more complex, but its advantages are also very obvious: Speaker Diarization technology can accurately distinguish between customer service and customers in calls, avoiding segmentation confusion caused by multiple people interrupting; ASR technology can transcribe speech into text to provide support for subsequent semantic understanding, and at the same time, combine with speaker separation results to achieve accurate correspondence of "speaker — text — emotion", fundamentally solving the problem of insufficient VAD segmentation accuracy.
During the implementation of Plan B, the first challenge I encountered was hardware: the personal machine I used for R&D had limited GPU memory (VRAM) and could not load the three large models Pyannote, Whisper, and Qwen2-Audio at the same time. Loading them simultaneously caused out-of-memory errors and stalled the entire R&D process. It should be specially noted that the subsequent use of 4-bit quantization to shrink the models was entirely an adaptation to my personal machine's VRAM limit, not a business requirement of Bank of Nanjing, nor a pursuit of so-called "technical optimization highlights". It was purely to let Plan B run and be tested on my equipment: a helpless but necessary compromise.
To solve the shortage of GPU memory, I designed a memory management strategy in the orchestrator_plan_b.py file. The core idea is to load the models serially instead of in parallel: first load the Pyannote-audio 3.1 model, and after completing speaker diarization, immediately unload it and release the CUDA cache; then load the Whisper model to run ASR transcription on the separated speaker audio, unloading it and releasing the cache once transcription is done; finally load the Qwen2-Audio model to perform emotion inference over the transcribed text and original audio segments. Through this serial loading and timely release, peak GPU memory usage is effectively reduced, and the entire process runs normally on my personal consumer-grade graphics card without affecting the accuracy of any stage.
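The serial load-use-unload pattern can be sketched as follows. The loader callables are illustrative stand-ins; the real orchestrator_plan_b.py manages the actual Pyannote/Whisper/Qwen2-Audio objects, and the torch import is guarded so the skeleton runs anywhere.

```python
import gc

def release(model):
    """Drop the reference and force collection; on the GPU machine the
    CUDA cache is also emptied (torch is optional in this sketch)."""
    del model
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

def run_pipeline(audio_path, load_diarizer, load_asr, load_emotion_model):
    """Serial load -> use -> unload: at most one large model is resident,
    which is what keeps peak VRAM within a consumer card's budget."""
    diarizer = load_diarizer()
    segments = diarizer(audio_path)          # speaker-separated segments
    release(diarizer)

    asr = load_asr()
    transcripts = [asr(seg) for seg in segments]
    release(asr)

    emo = load_emotion_model()
    results = [emo(seg, txt) for seg, txt in zip(segments, transcripts)]
    release(emo)
    return results
```

The trade-off is wall-clock time (three load/unload cycles per call) for memory, which is acceptable for offline quality inspection.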
In addition to working around the hardware constraints, another key link in Plan B is speaker role alignment, i.e. accurately distinguishing "customer service" from "customer", which directly determines how targeted the subsequent emotion recognition can be. After all, in the bank's quality inspection scenario, our core focus is on the emotional changes of customers, not of the agents. For role alignment I did not rely solely on the voiceprint clustering of Pyannote-audio, because in actual tests I found that the voiceprint features of some agents and customers are quite similar, and clustering alone is prone to role confusion. Combined with the business characteristics of Bank of Nanjing, I designed a set of heuristic rules to assist: on the one hand, detect agent feature phrases in the transcribed text, such as "employee ID", "glad to serve you", or "how can I help you"; whenever such phrases appear, the speaker is initially judged to be customer service. On the other hand, use the distribution pattern of speaking durations: in bank calls, an agent's individual utterances are usually short, since they mostly listen to customer demands, while customers speak longer and their emotional expression is more coherent. Combining these two dimensions greatly improves role recognition accuracy and avoids emotion recognition errors caused by role confusion.
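A minimal sketch of the two-dimension heuristic. The cue list, score weights, and the 4-second turn-length threshold are illustrative placeholders, not the tuned production values.

```python
from collections import defaultdict

# Assumption: illustrative cue list; the real one holds the bank's agent scripts.
AGENT_CUES = ("employee id", "glad to serve you", "how can i help you")

def assign_roles(turns, cue_weight=2.0, short_turn_sec=4.0):
    """turns: list of (diarized_speaker, transcript, duration_sec).
    Score each diarized speaker on two signals: agent-script cue phrases
    in the transcript, and short average turn length (agents mostly listen).
    The highest-scoring speaker is labeled the agent."""
    score = defaultdict(float)
    durs = defaultdict(list)
    for spk, text, dur in turns:
        durs[spk].append(dur)
        low = text.lower()
        score[spk] += cue_weight * sum(cue in low for cue in AGENT_CUES)
    for spk, ds in durs.items():
        if sum(ds) / len(ds) < short_turn_sec:
            score[spk] += 1.0
    agent = max(durs, key=lambda s: score[s])
    return {spk: ("agent" if spk == agent else "customer") for spk in durs}
```

Combining both signals matters: a chatty agent or a terse customer can fool either rule alone, but rarely both at once.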
The R&D process around Plan B was genuinely tedious. Every day I dealt with details like model loading and role alignment, but it was exactly this experience that gave me a more practical understanding of technology implementation: technology is never just theory on paper, and there is no one-step, perfect solution. I originally thought the core problems were solved, only to be blocked by the seemingly trivial matter of my personal machine's GPU memory, and only slowly found a way out through serial loading and quantization. This taught me that in R&D we cannot focus only on the algorithm itself; external constraints such as hardware can directly determine whether a solution can be implemented at all. Targeted optimization not only solves the immediate trouble but also makes the whole solution fit the actual business better, which is the most real insight I gained during this period.
Multimodal Inference Optimization and Engineering Detail Polishing
After determining the core process of Plan B and solving the problems of hardware constraints and role alignment, the next core work was to optimize the multimodal inference logic, so that the Qwen2-Audio-7B model could truly "understand" the emotions in the audio, instead of only judging the emotion of each segment in isolation. In my opinion, the biggest shortcoming of traditional Speech Emotion Recognition (SER) models is that they process each sentence in isolation, ignoring the continuity of emotions in human communication — customers' emotions never arise or disappear suddenly. They may gradually become dissatisfied from initial calm, then calm down later, or suddenly burst into anger because of a sentence from the customer service. Therefore, simply judging the emotion of each segment in isolation often cannot accurately reflect the real emotional state of customers, nor can it meet the needs of Bank of Nanjing's quality inspection scenario.
Based on this thinking, I designed a context-aware multimodal inference logic in the inference_engine.py file. The core idea is that when constructing the model input Prompt, not only the current audio segment and the corresponding transcribed text are passed in, but also the emotional history and key semantic summary of the previous dialogue are dynamically injected, so that the model can capture the continuity and mutation points of emotions. For example, in the Prompt template I designed, it will explicitly include context information such as Emotion of the previous sentence: {prev_emotion} and Key summary of the previous dialogue: {prev_summary}. At the same time, combined with the system prompt "As a bank customer service quality inspection expert, accurately identify the customer's current emotional state based on the call context and audio intonation, focus on the customer's negative emotions such as dissatisfaction and anxiety, and make objective and rigorous judgments", guide the model to make more accurate emotion judgments based on the context.
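A sketch of the context injection described above. The template wording follows the description in this section; the production template in inference_engine.py may differ in detail.

```python
SYSTEM_PROMPT = (
    "As a bank customer service quality inspection expert, accurately identify "
    "the customer's current emotional state based on the call context and audio "
    "intonation, focus on the customer's negative emotions such as "
    "dissatisfaction and anxiety, and make objective and rigorous judgments."
)

def build_prompt(transcript, prev_emotion, prev_summary):
    """Context injection: the previous turn's emotion and a key summary ride
    along with the current utterance, so the model sees emotional continuity
    rather than an isolated sentence."""
    return (
        f"Emotion of the previous sentence: {prev_emotion}\n"
        f"Key summary of the previous dialogue: {prev_summary}\n"
        f"Current utterance (transcript): {transcript}\n"
        'Answer in JSON: {"emotion": "...", "confidence": 0.0-1.0}'
    )
```

The audio segment itself is passed to the model alongside this text; only the textual side of the input is shown here.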
Take an actual test case as an example: in a section of call recording, the customer's previous sentence was "It's too troublesome to handle this business; I've run several times and haven't done it well", and the model identified it as "dissatisfied" emotion; the customer's subsequent sentence was "I just want to know now, can it be done after all, and when can it be done". Semantically, this sentence does not directly complain, but the intonation is significantly higher and the speech rate is faster. At this time, combined with the previous "dissatisfied" emotional context, the model can accurately identify that this sentence is still a continuation of the customer's dissatisfaction, rather than an isolated "neutral" emotion; without the injection of context information, the model may misjudge it as "neutral" due to the plain semantics, thereby affecting the accuracy of the quality inspection results. Through this context-aware design, the accuracy of the model's emotion recognition has been significantly improved, and it is more suitable for the actual quality inspection needs of banks.
However, even with the optimization of context awareness, multimodal large models still have their limitations — especially when processing non-verbal acoustic cues, they occasionally have "hallucination" phenomena. For example, when some customers speak, there is no negative expression in the semantics, but their tone is obviously roaring and impatient, which is a typical "angry" emotion, but the model judges it as "calm" because of the plain semantics; on the contrary, some customers have slight complaints in the semantics, but their tone is calm, and the model may misjudge it as "dissatisfied". To make up for this problem, I designed a dual-factor weighting algorithm to fuse the results of traditional signal processing acoustic features and the semantic understanding results of large models, so as to improve the accuracy of emotion recognition.
Specifically, I first defined acoustic feature templates for each emotion (calm, dissatisfied, anxious, angry, etc.). For example, the angry template corresponds to an audio energy value greater than 0.08, or a fundamental frequency standard deviation greater than 40 Hz; the anxious template corresponds to a speech rate above 4.5 characters per second, or an energy fluctuation range greater than 0.05; the calm template corresponds to an energy value between 0.02 and 0.05 and a speech rate between 2.5 and 3.5 characters per second. I extract the acoustic features of each audio segment with the Librosa library, match them against the corresponding emotion template, and compute an acoustic matching degree M_acoustic; at the same time, I take the emotion confidence C_model output by the multimodal large model and compute the final emotion confidence as C_final = 0.6 × C_model + 0.4 × M_acoustic, i.e. a semantic weight of 0.6 and an acoustic weight of 0.4. In this way, when the model's judgment is biased, the acoustic matching result can pull the final confidence up or down, and a low fused confidence triggers a manual review warning, avoiding misjudgments caused by relying on the model alone and making the emotion recognition results more reliable.
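The fusion step can be sketched as follows. The thresholds are copied from the paragraph above, with one simplification: where the text combines template conditions with "or", this sketch scores the fraction of satisfied dimensions; the review threshold of 0.5 is also an assumption.

```python
INF = float("inf")

# Per-emotion acoustic templates as (low, high) ranges, using the thresholds
# given above (energy and rate in the units measured by the pipeline).
TEMPLATES = {
    "angry":   {"energy": (0.08, INF), "f0_std": (40.0, INF)},
    "anxious": {"rate": (4.5, INF), "energy_swing": (0.05, INF)},
    "calm":    {"energy": (0.02, 0.05), "rate": (2.5, 3.5)},
}

def acoustic_match(emotion, features):
    """Fraction of template dimensions the measured features fall inside
    (a graded simplification of the OR-combined conditions in the text)."""
    tpl = TEMPLATES.get(emotion)
    if not tpl:
        return 0.5  # no template for this label: stay neutral
    checks = [(features[k], lo, hi) for k, (lo, hi) in tpl.items() if k in features]
    if not checks:
        return 0.5
    return sum(lo <= v <= hi for v, lo, hi in checks) / len(checks)

def fuse(model_conf, match, w_sem=0.6, w_ac=0.4, review_below=0.5):
    """C_final = 0.6 * C_model + 0.4 * M_acoustic; flag for manual review
    when the fused confidence sinks below a threshold (0.5 assumed here)."""
    final = w_sem * model_conf + w_ac * match
    return final, final < review_below
```

When the acoustic evidence contradicts a confident model judgment, the fused score drops and the segment is routed to a human reviewer instead of silently passing.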
In addition to the optimization of the core inference logic, the polishing of details in engineering implementation is also an indispensable part of this R&D process — after all, a stable and robust system requires not only accurate algorithms, but also perfect detail design. Especially in scenarios such as banks that have high requirements for system stability, any oversight of details may lead to the interruption of the entire Pipeline, affecting the R&D progress and subsequent business implementation.
For example, the output format of large models is unstable, an inherent problem of generative AI. During testing, the emotion recognition results output by the Qwen2-Audio model often came back as illegal JSON or contained extra characters (such as explanatory text generated by the model), and once JSON parsing fails, the entire processing pipeline is interrupted. To solve this, I built a JSON repairer based on regular expressions and fuzzy matching in the inference_engine.py file. The core idea is to match the key nodes of the JSON (such as emotion, confidence, etc.) with regular expressions, strip redundant explanatory characters, and repair illegal formats (such as missing quotation marks or commas). Even when the model's output is mixed with large amounts of irrelevant text, the repairer can recover the effective emotion recognition data to the greatest extent, keeping the pipeline from being interrupted and improving the robustness of the system.
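A simplified stand-in for the repairer: the production version also fixes missing quotes and commas, while the regexes here only cover the two key fields.

```python
import json
import re

def repair_json(raw):
    """Best-effort recovery of {"emotion": ..., "confidence": ...} from messy
    model output: first try to parse the outermost {...} span as-is, then
    fall back to pulling the two key fields out with regexes."""
    m = re.search(r"\{.*\}", raw, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to field extraction
    emo = re.search(r'"?emotion"?\s*[:=]\s*"?([A-Za-z]+)', raw)
    conf = re.search(r'"?confidence"?\s*[:=]\s*([0-9]*\.?[0-9]+)', raw)
    if emo and conf:
        return {"emotion": emo.group(1), "confidence": float(conf.group(1))}
    return None  # unrecoverable: caller decides (retry or flag for review)
```

Returning None rather than raising keeps a single bad generation from killing the whole pipeline; the orchestrator can retry or escalate.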
In addition, the Bank of Nanjing quality inspection scenario has high requirements for the reproducibility of emotion recognition results: for the same call audio, inference must output the same emotion judgment and confidence score no matter when it is run, and results must not vary because of the randomness of generative AI, since this bears directly on the fairness and reliability of quality inspection. Therefore, in the inference engine I strictly controlled every parameter that could introduce randomness: forcing do_sample=False to enable greedy decoding and avoid sampling-induced differences, and explicitly setting the sampling parameters temperature, top_p, and top_k to None, so that every step of model inference is deterministic and the same audio input always produces bit-identical output, meeting the reproducibility requirements of the bank's quality inspection scenario.
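The pinned decoding configuration looks roughly like this (Hugging Face-style generation kwargs; num_beams=1 is an added assumption here, to rule out any beam-search variation on top of what the text specifies):

```python
# Decoding settings pinned for reproducibility, in the style of
# transformers `generate(**kwargs)` as used with Qwen2-Audio.
DETERMINISTIC_GENERATION = dict(
    do_sample=False,   # greedy decoding: same input -> same tokens
    temperature=None,  # explicitly neutralize every sampling knob so a
    top_p=None,        # library default can never reintroduce randomness
    top_k=None,
    num_beams=1,       # assumption: single beam keeps decoding deterministic
)
```

Setting the sampling knobs to None (rather than leaving them unset) guards against a future library default silently re-enabling sampling behavior.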
In addition, in the final emotion scoring step I did not adopt a simple average: averaging the emotion confidence of all segments would ignore the differing importance of different segments. For example, the customer's emotion at the end of the call usually best reflects their final satisfaction and matters most to the bank's quality inspection, while emotional fluctuations at the beginning of the call are relatively less important. Therefore, I implemented a non-linear confidence weighting + exponential time decay algorithm in the aggregator.py file. The core formula is a weighted average: FinalScore = Σ_i (s_i × c_i² × λ^i) / Σ_i (c_i² × λ^i). Here s_i is the emotion score of the i-th segment and c_i is that segment's confidence; the square term sharply reduces the weight of low-confidence data (a confidence of 0.5 contributes a weight of only 0.25), effectively "muting" uncertain emotional feedback so it does not distort the final score. λ is set to 0.9, and i is the reverse index of the segment (the segment at the end of the call is i = 0, increasing toward the beginning of the call). This design reflects the "recency effect": the emotion score at the end of the call carries a higher weight and better captures the customer's real feeling when hanging up, which matches the core demand of Bank of Nanjing's quality inspection work to focus on final customer satisfaction.
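The aggregation logic translates almost line-for-line into code. One reading choice: the weighted sum is normalized by the total weight so the result stays on the same scale as the per-segment scores; segments are passed newest-first.

```python
def aggregate(scores, confidences, lam=0.9):
    """Non-linear confidence weighting + exponential time decay.
    Inputs are ordered newest-first (end of call at i = 0), so the final
    utterance gets the full weight lam**0 = 1 and earlier ones decay.
    Squaring the confidence 'mutes' uncertain segments (0.5 -> 0.25)."""
    weights = [(c * c) * (lam ** i) for i, c in enumerate(confidences)]
    total = sum(weights)
    if total == 0:
        return 0.0  # nothing trustworthy to aggregate
    return sum(s * w for s, w in zip(scores, weights)) / total
```

With lam = 0.9, a segment ten turns before hang-up still carries about 35% of the final segment's base weight, so early outbursts influence but do not dominate the score.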
R&D Review and Subsequent Direction Conception
Looking back at the first phase of R&D on the Bank of Nanjing customer emotion recognition system, there were no major technical breakthroughs worth showing off, but rather a process of exploring, stepping on pitfalls, and adjusting step by step. From initially expecting VAD to handle audio segmentation simply and efficiently, to finding its accuracy insufficient in actual tests and reluctantly shifting all focus to Plan B; from being stuck by the limited GPU memory of my personal machine, hitting out-of-memory errors when loading models, to slowly working out serial loading and 4-bit quantization to make the solution run; from worrying about inaccurate multimodal recognition, to gradually optimizing the context-aware logic and designing the dual-factor weighting algorithm; from initially ignoring engineering details, leading to frequent pipeline interruptions, to gradually adding JSON repair and controlling the randomness parameters: every step was taken solidly, without any falseness. This experience also deepened my understanding of "technology implementation": it is not about stacking high-end algorithms or pursuing superficial sophistication, but about making the most practical choices and solving the most practical problems given the actual business scenario, one's own hardware conditions, and the data characteristics. That is the core significance of R&D.
Regarding the R&D of the second phase, I currently have some preliminary ideas, which are not very mature. They are only simple plans based on the experience of the first phase, and there are also some divergent conceptions, mainly focusing on the direction of model fine-tuning and the possibility of optimizing existing algorithms. These are still in the conception stage, and will be gradually implemented or adjusted in combination with Bank of Nanjing's real business data and actual needs in the future.
First of all, for the core direction of fine-tuning, combined with the multimodal architecture of the first phase and the actual business of the bank, I initially conceived two basic technical routes, giving priority to the direction that is most in line with the existing architecture and has low implementation difficulty. The first is basic fine-tuning based on dialogue text data. The core is to use the historical dialogue text of Bank of Nanjing's call center (after removing sensitive information) to fine-tune the text understanding branch of the Qwen2-Audio model, focusing on optimizing the model's ability to understand the semantics of professional bank scenario terms, common customer demands, and industry-specific expressions — for example, customer-specific terms such as "wealth management product redemption" and "loan approval progress". The general model in the first phase is not accurate enough in capturing such scenario-based semantics. Text fine-tuning can enable the model to adapt to the bank scenario faster, helping to improve the accuracy of emotion recognition. This fine-tuning method has low data processing cost, low requirements on hardware resources, and can quickly see optimization results. The second is multimodal fine-tuning based on speech, which is more in line with the existing system architecture. After all, the core of the first phase is the integration of audio and text. This fine-tuning will combine speech segments with corresponding text and emotion labels to fine-tune the audio feature extraction branch and multimodal fusion layer of Qwen2-Audio, focusing on optimizing the model's ability to capture audio features corresponding to different intonations, speech rates, and emotions in the bank scenario. For example, customers' low-volume complaints when expressing dissatisfaction and fast speech rate when anxious. The general model has limited adaptability to such scenario-based audio features. 
Multimodal fine-tuning can enable the model to better combine speech and text information, further reducing the deviation of emotion recognition. However, this fine-tuning has higher requirements on data quality and hardware resources. In the future, it will be gradually promoted according to actual conditions, possibly starting with basic text fine-tuning and then gradually transitioning to multimodal fine-tuning.
In addition to the basic fine-tuning routes, I am also thinking about the optimization space of the existing algorithms, especially the weight distribution and scoring formula designed in the first phase. The semantic weight (0.6) and acoustic weight (0.4) of the dual-factor weighting algorithm, as well as the non-linear confidence weighting + exponential time decay scoring algorithm, are all fixed parameters and formulas set from my experience. They meet the needs of the first phase, but they lack data-driven justification and cannot adapt to differences across call scenarios and emotion types. Therefore, I envision trying a simple neural network to replace the existing fixed formulas and adjust these weights and scoring logic dynamically. For example, build a lightweight fully connected network whose input is each segment's acoustic matching degree, model confidence, context emotion history, and other features, and whose output is the dynamically adjusted semantic weight, acoustic weight, and related scoring parameters (such as the λ value), so that the weights and scoring logic adapt to different call scenarios and emotion types instead of staying fixed. This would make the emotion score fit actual business needs better and address the poor adaptability of fixed formulas under varying call quality. Later, combined with fine-tuning data, I will test the feasibility of replacing the fixed formulas with this network, focusing on keeping model complexity low to avoid adding inference pressure.
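A toy sketch of the envisioned weight predictor: a single untrained linear layer stands in for the lightweight network, and the feature choice and output squashing are my assumptions, not a settled design. The softmax over the first two outputs keeps the semantic and acoustic weights summing to 1, and a sigmoid confines λ to a band around the hand-set 0.9.

```python
import math
import random

random.seed(0)  # reproducible demo weights

def dynamic_params(features, W=None, b=None):
    """Map per-segment features (e.g. [M_acoustic, C_model, prev_emotion_id])
    to (w_sem, w_ac, lam). Random weights stand in for training; in practice
    W and b would be fit on manually annotated calls."""
    n_out = 3
    if W is None:
        W = [[random.uniform(-1, 1) for _ in features] for _ in range(n_out)]
    if b is None:
        b = [0.0] * n_out
    z = [sum(w * x for w, x in zip(row, features)) + bi for row, bi in zip(W, b)]
    # softmax over the first two logits: w_sem + w_ac = 1 by construction
    e0, e1 = math.exp(z[0]), math.exp(z[1])
    w_sem = e0 / (e0 + e1)
    # sigmoid confines the decay factor to (0.8, 1.0) around the hand-set 0.9
    lam = 0.8 + 0.2 / (1.0 + math.exp(-z[2]))
    return w_sem, 1.0 - w_sem, lam
```

Constraining the outputs this way keeps a badly trained network from producing degenerate weights, which matters given the scoring feeds directly into quality inspection results.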
In addition, I am also thinking about optimization ideas beyond conventional fine-tuning, such as the possible application of reinforcement learning. This is a relatively divergent conception with no clear implementation details yet, but combined with the business characteristics of bank quality inspection and the pain points of the first phase, I do have some concrete design directions rather than purely theoretical assumptions. The core idea is to bind reinforcement learning deeply to the bank's manual quality inspection process, take manual quality inspection annotation results as the core reward signal, design an implementable Agent training logic, and let the model optimize its emotion recognition capability through continuous interaction instead of relying solely on static-data fine-tuning.
Specifically, the reinforcement learning framework I envision will be divided into three core modules: Agent, Environment, and Reward Function, which are fully in line with the existing system architecture, avoiding the introduction of overly complex new modules and reducing the difficulty of implementation. First of all, the definition of the Agent: directly use the fine-tuned Qwen2-Audio multimodal model as the core Agent, and its Action is to output the emotion recognition result (such as "calm", "dissatisfied", "anxious", "angry") and corresponding confidence for the input call audio segment (combined with text context); the State is defined as "the acoustic features of the current audio segment, transcribed text, emotional history of the previous dialogue, and the deviation feedback between the previous recognition result and manual annotation", allowing the Agent to adjust its own recognition logic based on historical interaction information.
The design of the Environment is completely in line with the quality inspection scenario of Bank of Nanjing. I will take "call audio segments with sensitive information removed, corresponding text transcription, and standard emotion labels annotated by manual quality inspection" as the environment input, and simulate the real quality inspection process to set two interaction scenarios: "single segment recognition" and "full call emotion summary". In the single segment scenario, after the Agent outputs the emotion recognition result of a single segment, the environment immediately returns the deviation feedback of manual annotation; in the full call scenario, after the Agent completes the emotion recognition of all segments of the entire call and outputs the final score, the environment returns the overall deviation feedback (such as whether key emotional segments are missing, the difference between the final emotion score and manual annotation), allowing the Agent to not only optimize the accuracy of single segment recognition, but also take into account the continuity judgment of the full call emotion, which is in line with the "emotion continuity" pain point concerned in the first phase.
The most critical part is the design of the Reward Function, which is the core of whether reinforcement learning can be implemented and meet business needs. I will abandon the simple logic of "reward for accurate recognition" and design a multi-dimensional weighted reward function combined with the actual needs of bank quality inspection to avoid the model falling into the misunderstanding of "only pursuing accuracy and ignoring business priorities". The specific formula will not be refined for the time being, but the core weight distribution will focus on three points: first, the accuracy of core emotion recognition (weight 0.4). If the Agent's recognition result is completely consistent with the manual annotation and the confidence is ≥0.8, a positive reward of +10 will be given; if the recognition is wrong (such as misjudging "anxious" as "dissatisfied"), a negative reward of -15 will be given; if the recognition is consistent but the confidence is <0.6, only a positive reward of +3 will be given to guide the model to prioritize accuracy under high confidence. Second, the capture of key emotional segments (weight 0.3). In bank quality inspection, segments where customers burst into negative emotions (anger, strong dissatisfaction) are key concerns. If the Agent successfully captures such segments and the recognition is accurate, an additional positive reward of +8 will be given; if missing or misjudged, a negative reward of -20 will be given to meet the core business demands. Third, recency effect adaptation (weight 0.3). Continuing the scoring idea of the first phase, the weight of the emotion recognition accuracy of the later stage of the call (the last 3 segments) will be increased by 1.2 times. 
If the later recognition is accurate and consistent with the manual annotation, an additional reward will be given to guide the model to pay more attention to the real emotion of the customer when hanging up, which is in line with the demand of bank quality inspection to focus on the final customer satisfaction.
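The three reward dimensions can be sketched as one function. The per-event magnitudes (+10/-15/+3/+8/-20, 1.2×) come straight from the notes above; combining the 0.4/0.3/0.3 dimension weights multiplicatively with those magnitudes, and the +5 value for the unspecified middle confidence band, are my own interpolations.

```python
def reward(pred, gold, confidence, is_key_segment, is_late_segment):
    """Multi-dimensional reward sketch: accuracy (0.4), key negative-emotion
    capture (0.3), recency adaptation (0.3, realized as a 1.2x boost)."""
    r = 0.0
    # 1) core accuracy (weight 0.4)
    if pred == gold:
        if confidence >= 0.8:
            r += 0.4 * 10
        elif confidence < 0.6:
            r += 0.4 * 3   # right label, but too hesitant
        else:
            r += 0.4 * 5   # assumed value for the unspecified middle band
    else:
        r += 0.4 * -15
    # 2) key segments: bursts of anger / strong dissatisfaction (weight 0.3)
    if is_key_segment:
        r += 0.3 * (8 if pred == gold else -20)
    # 3) recency: correct calls on the last few segments are boosted 1.2x
    if is_late_segment and pred == gold:
        r *= 1.2
    return r
```

Note how a missed key segment (-20 at weight 0.3) outweighs the base accuracy penalty, encoding the business priority that bursts of negative emotion must not be missed.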
In addition, considering the feasibility of implementation, I also envision a phased training idea to avoid model instability caused by one-step complex training. The first phase is the "exploration period", reducing the punishment intensity of the Reward Function, allowing the Agent to freely output recognition results, focusing on collecting "types of recognition deviations" (such as acoustic cue misjudgment, semantic understanding deviation, context connection error) to provide a basis for subsequent adjustment of the Reward Function weight; the second phase is the "optimization period", adjusting the weight of the Reward Function, strengthening the reward and punishment for business priorities (key emotions, later emotions), and allowing the Agent to gradually adapt to the bank's quality inspection logic; the third phase is the "stable period", introducing a small amount of new call data to allow the Agent to verify the optimization effect in new data, and at the same time fine-tune the Reward Function parameters to ensure the stability of the model in different call scenarios.
Of course, this reinforcement learning idea also has clear limitations. First, it needs a large amount of high-quality "audio — text — manual emotion annotation" data as environment input, yet manual quality inspection annotation in banks is relatively expensive, so it will be necessary to build on the bank's existing quality inspection data and prioritize existing annotation resources to avoid extra cost. Second, training complexity is high: Agent training must remain compatible with the existing system and must not slow down model inference, so I would adopt "offline training + online fine-tuning", completing the reinforcement learning training offline and loading only the trained model parameters online, keeping operational efficiency intact. For these reasons this idea will not be the priority of the second phase. We will first land the conventional fine-tuning; once it is complete and data resources are sufficient, we will gradually explore the feasibility of reinforcement learning, focusing on verifying its effect on key emotion capture and emotion continuity recognition to see whether it can make up for the shortcomings of conventional fine-tuning.
In general, the first phase of the Bank of Nanjing project has been a challenging but rewarding journey for me. During this period, I not only accumulated more practical operational experience in technical aspects such as audio processing, multimodal large model inference, and engineering optimization, but more importantly, I learned how to make reasonable technical choices under various constraints, how to face various problems in R&D without evasion or perfunctoriness, and find solutions little by little. At the same time, I will continue to record the thoughts and insights during this R&D process, which is not only an account of my every exploration and effort, but also hopes to provide some valuable references for peers engaged in R&D in similar fields. Even if it only helps everyone avoid one or two small pitfalls, it is enough.