Widely used automatic evaluation metrics cannot adequately reflect the fluency of translations. N-gram-based metrics such as BLEU cap the length of matched fragments at n and cannot capture matched fragments longer than n, so they reflect fluency only indirectly. M