Speechocean (海天瑞声, 688787) - Investor Relations Activity Record (September 21, 2023)

Data Requirements and Differences
- The data requirements for pre-training in large models are fundamentally similar to traditional deep learning, involving text, speech, and images, but differ in scale, quality, and sources [3]
- Pre-training data typically involves trillions of tokens, compared to around 1 billion tokens for traditional models [3]
- Large models require more diverse data sources, including copyrighted and public data, in addition to traditional targeted collection [4]
- The engineering capabilities for data cleaning are more critical in large model pre-training due to the massive scale of raw data [4]

Multi-Modal Data Demand
- The development of multi-modal large models will generate new data demands, such as text-to-image models that require mapping text semantics to image tags [4]
- High-quality multi-modal training datasets will become increasingly important as models expand into multi-modal capabilities [5]
- The growth of multi-modal capabilities will drive the data services industry into a larger incremental space [5]

Financial Performance and Outlook
- The company's revenue declined in the first half of the year due to a decrease in overseas income, influenced by customer layoffs, business adjustments, and data export regulations [5]
- Domestic revenue showed year-on-year growth in Q2, driven by the intelligent driving business [5]
- The company expects overseas revenue to recover as customer adjustments conclude and data export assessments normalize [6]
- The company will focus on emerging strategic businesses like intelligent driving and large models, while exploring the data element market to achieve steady performance recovery [6]