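A Multidimensional Analysis of Linguistic Feature Differences between Human and Machine Translation Based on a Self-Built Parallel Corpus: The English Translation of “Legends of the Condor Heroes” as a Case Study (A Parallel-Corpora Comparative Study of Machine Translation and Human Translation: Based on “Legends of the Condor Heroes”), Graduation Thesis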
2020-02-15 19:15:33
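Abstract (translated from Chinese)
Machine translation of literary works is more difficult than that of other genres, and how to optimize translation models through linguistic theory, and thereby improve the accuracy of literary translation, has drawn wide attention from researchers at home and abroad. This paper builds a bilingual parallel corpus of the Chinese literary classic “Legends of the Condor Heroes” and uses multidimensional (MD) analysis to measure and compare the linguistic features of human translation and machine translation. The author selected samples of the Chinese literary work that met the research conditions and, after preprocessing, fed the Chinese text into Google Translate to obtain an English machine translation. The Multidimensional Analysis Tagger (MAT) software was then used to tag the English human translation and the English machine translation along multiple dimensions, and the results were analyzed statistically. The results show that human translation and machine translation differ significantly on Dimension 1 “involved versus informational production”, Dimension 2 “narrative versus non-narrative concerns”, and Dimension 3 “explicit versus situation-dependent reference”, and show no significant difference on Dimension 4 “overt expression of persuasion”, Dimension 5 “abstract versus non-abstract information”, and Dimension 6 “online informational elaboration”. A comparison of 67 linguistic features across the two corpora found significant differences in 45 of them (more than two thirds). Representative examples are then analyzed and explained. The study is an attempt at a more scientific turn in language research and is of some value for optimizing the machine translation of literary works.
Keywords: human translation; machine translation; parallel corpus; linguistic features; multidimensional analysis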
Abstract
Translating literary works is considerably harder than translating other text types, so the question of how linguistic theory can improve current translation models has attracted wide attention among researchers around the world. This paper uses Multidimensional (MD) analysis on a self-built parallel corpus to compare the linguistic features of human translation and machine translation. Samples of the Chinese novel “Legends of the Condor Heroes” were preprocessed and translated into English with Google Translate, which applies neural network technology. The human translation and the machine translation output were then tagged with the Multidimensional Analysis Tagger (MAT) software and analyzed for comparison. The results show that human translation and machine translation differ significantly on Dimension 1 “involved versus informational production”, Dimension 2 “narrative versus non-narrative concerns”, and Dimension 3 “explicit versus situation-dependent reference”, and show no significant difference on Dimension 4 “overt expression of persuasion”, Dimension 5 “abstract versus non-abstract information”, and Dimension 6 “online information elaboration”. A comparison of the 67 linguistic features across the two corpora finds that 45 of them differ significantly (more than two thirds). Representative cases are discussed with examples. This study is an instance of computational linguistics and may help optimize the machine translation of literary texts.
Key Words: Human Translation; Machine Translation; Parallel corpus; Linguistic features; Multidimensional Analysis
CONTENTS
1 Introduction
1.1 Background
1.2 Objectives
2 Literature Review
2.1 Previous Studies about Machine Learning and Human Translation
2.2 Biber’s Analytical Framework
2.3 Previous Studies Using the MD Analysis
3 Methodology
3.1 Data Source
3.2 Procedures
3.3 Analytical Tools
4 Results and Discussion
4.1 Overall
4.2 Involved versus Informational discourse
4.3 Narrative versus Non-Narrative
4.4 Contexts-Independent Discourse versus Context-Dependent Discourse
5 Conclusion
5.1 Major Findings
5.2 Limitation
Reference
Appendix
Acknowledgements
A Parallel-Corpora Comparative Study of Machine Translation and Human Translation: Based on “Legends of the Condor Heroes”
1 Introduction
1.1 Background
Machine Translation (MT) is the use of computer techniques to convert text from one natural language into another. As Machine Learning (ML) algorithms have become more sophisticated and efficient, Natural Language Processing (NLP) techniques have been widely applied. Technology giants such as Google and Baidu have developed advanced MT systems that perform well. Although it remains controversial whether MT will ultimately surpass human translation, the trend toward their cooperation in serving the translation industry appears irreversible.
At present, most relevant studies concentrate on how Machine Translation algorithms iterate and how they can improve translation quality, while the role of linguistic theory, especially corpus linguistics, has largely been neglected. In addition to algorithm optimization, therefore, more attention should be paid to linguistics and the help it can offer translation work.
1.2 Objectives
This paper compares the linguistic and literary features of human translation and machine translation through Multidimensional (MD) analysis and quantitative linguistics. It is based on a parallel corpus of translations of “Legends of the Condor Heroes” built and optimized by the author. The machine translation corpus was produced with Google Translate, which applies neural network technology. The comparison aims to identify discrepancies in linguistic features between the two kinds of translation.
2 Literature Review
2.1 Previous Studies about Machine Learning and Human Translation
Most MT assessment tools use statistical methods to measure how closely a translation matches one or more reference translations: the closer the match, the better the translation is judged to be. This approach has one prerequisite: sufficient high-quality reference translations must be available. Comparability is assessed in one of two ways, based either on strings or on grammatical structures, and the string-based method is used more frequently. Wolk (2015) states that Machine Translation based on neural networks has made great progress compared with traditional rule-based translation models and that its performance is constantly improving.
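The string-based matching idea can be illustrated with a small sketch. The paragraph does not name a specific metric, so the code below assumes clipped n-gram precision, the building block of BLEU-style scores, as a representative example. The function names and the example sentences are hypothetical and are not taken from the thesis corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, references, n=1):
    """Share of candidate n-grams that also occur in a reference translation.

    Each candidate n-gram is credited at most as many times as it appears
    in the single reference that contains it most often ("clipping").
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0
    # For every n-gram, keep its highest count across all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


# Made-up sentences for illustration only.
machine_output = "the cat sat on the mat"
human_reference = ["the cat is on the mat"]
print(clipped_ngram_precision(machine_output, human_reference, n=1))
print(clipped_ngram_precision(machine_output, human_reference, n=2))
```

A score of this kind is only as reliable as the reference translations it is computed against, which is the prerequisite noted above.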
2.2 Biber’s Analytical Framework
Multidimensional (MD) analysis was created by Biber (1988). Its initial purpose was to compare spoken and written English, and it has gradually developed into an analytical method for textual comparison. It comprises 6 dimensions and 67 linguistic features:
- Involved versus informational production
- Narrative versus nonnarrative
- Explicit versus situation-dependent references
- Overt expression of persuasion
- Abstract versus non-abstract information
- Online informational elaboration
The linguistic features that co-occur on a dimension share a common function. Each dimension includes a set of positive features (features with positive loadings) and a set of negative features (features with negative loadings). According to Biber (1988), the loading of a feature indicates the extent to which that feature is representative of the dimension; that is, it reflects how strongly variation in the frequency of the feature correlates with the overall variation of the dimension. The sign of a loading does not affect its importance. Rather, positive and negative loadings identify groups of features that are distributed in texts in a complementary pattern: when a text has many occurrences of the negative features, it will likely have few of the positive features, and vice versa. In interpreting a factor, both the positive and the negative cluster of features must be taken into consideration. Following the MD approach, tagged linguistic features whose loadings have an absolute value of no less than 0.30 were retained for further explanation in each dimension. The tags of the variables are used in the tables discussed in the following sections; the list of variables can be found in the appendix.
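The role of loadings can be made concrete with a minimal sketch of how a dimension score is typically computed in the MD approach: the standardized frequencies (z-scores) of the positive features are summed and those of the negative features are subtracted. The feature names, loading values, and frequencies below are illustrative placeholders, not Biber's published figures or data from this study. Standardizing within the supplied texts is also an assumption; a tagger may standardize against a fixed reference corpus instead.

```python
from statistics import mean, stdev

LOADING_CUTOFF = 0.30  # threshold stated in the text


def z_scores(values):
    """Standardize a list of frequencies (zeros if there is no variation)."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def dimension_scores(frequencies, loadings, cutoff=LOADING_CUTOFF):
    """Compute one dimension score per text.

    frequencies: {feature: [normalized frequency in text 0, text 1, ...]}
    loadings:    {feature: loading of that feature on the dimension}
    Features with |loading| below the cutoff are ignored; the others add
    their z-score (positive loading) or subtract it (negative loading).
    """
    n_texts = len(next(iter(frequencies.values())))
    scores = [0.0] * n_texts
    for feature, loading in loadings.items():
        if abs(loading) < cutoff:
            continue
        sign = 1.0 if loading > 0 else -1.0
        for i, z in enumerate(z_scores(frequencies[feature])):
            scores[i] += sign * z
    return scores


# Hypothetical loadings and per-text frequencies, for illustration only.
example_loadings = {"verbs": 0.9, "nouns": -0.8, "weak_feature": 0.1}
example_frequencies = {
    "verbs": [10, 20, 30],
    "nouns": [30, 20, 10],
    "weak_feature": [5, 9, 1],
}
print(dimension_scores(example_frequencies, example_loadings))
```

In this toy data the third text is rich in the positive feature and poor in the negative one, so it receives the highest score, which mirrors the complementary distribution described above.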
Dimension 1 is the contrast between Involved and Informational discourse. A low score indicates that the text is informationally dense, while a high score indicates that the text is affective and interactive. A high score on this dimension means that the text has many verbs and pronouns, and a low score means that it contains many nouns, long words, and adjectives.
Dimension 2 is the contrast between Narrative and Non-Narrative Concerns. A low score indicates that the text is not narrative, while a high score indicates that it is narrative. A high score on this dimension means that the text has many past tense verbs and third-person pronouns.
Dimension 3 is the contrast between Context-Independent Discourse and Context-Dependent Discourse. A low score indicates that the text depends on its context, while a high score indicates that it does not. A high score on this dimension means that the text contains many nouns, while a low score means that it contains many adverbs.