Fi: Käännösparit ilmaiseksi internetistä? En: Translation pairs for free from the Internet? Filip Ginter & Jenna Kanerva TurkuNLP bionlp.utu.fi
Motivation
- MT systems are trained on parallel data: EU texts, movie subtitles, software localization data, in-domain data of any kind
- More data means a better MT system
- But where to get more data? On the Internet, maybe? :)
Parallel data from corpora
- Methods exist for getting data from comparable corpora: same domain, some parallelism on document level, (relatively) small size
- So how about the whole of the Internet?
Parallel data from the Internet
- Cheap and plentiful monolingual data
- But no comparable corpus: anything and everything, and then some
- Plan: build a method which finds candidates for every sentence, sort by confidence, pick the best
Fi-En: casting the net
- Finnish: 270M sentences
- English: 420M sentences
- Filter by length and remove duplicates: 170M + 300M left
- Only a meager 51,000,000,000,000,000 candidate pairs to consider
How?
- Computers are (un)surprisingly good at multiplying numbers together
- Candidate pairs to consider: 51,000,000,000,000,000
- Taito at CSC can do 180,000,000,000,000 multiplications a second (180 TFlop/s)
Taito is here.
How?
- Must turn the problem into number operations!
- Turn every sentence into a vector
- To compare two sentences, take the dot product of their vectors
- If the vectors are 100-long, that's 7 hours on Taito (in theory, of course)
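The run-time estimate can be rechecked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one dot product costs exactly as many multiplications as the vector is long, and that the quoted 180 TFlop/s can be read as multiplications per second. The variable names are illustrative.

```python
# Figures taken from the slides; everything else is plain arithmetic.
finnish_sentences = 170_000_000
english_sentences = 300_000_000
vector_length = 100                 # multiplications per dot product (assumption)
taito_mults_per_second = 180e12     # 180 TFlop/s read as multiplications/s (assumption)

candidate_pairs = finnish_sentences * english_sentences
total_multiplications = candidate_pairs * vector_length
hours = total_multiplications / taito_mults_per_second / 3600

print(f"candidate pairs: {candidate_pairs:,}")
print(f"theoretical run time: {hours:.1f} hours")
```

Under these assumptions the result lands between 7 and 8 hours, the same ballpark as the figure on the slide.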
Training
- +1 for translation pairs (positive examples)
- -1 for random pairs (negative examples)
Network (one half per language)
- Input: sequence of character n-grams
- Fi: Juon viskiä. gives "Juon", "uon ", "on v", "n vi", " vis", "visk", "iski", "skiä", "kiä."
- En: I drink whiskey. gives "I dr", " dri", "drin", "rink", "ink ", "nk w", "k wh", " whi", "whis", "hisk", "iske", "skey", "key."
- Recursive or convolutional NN gizmo turns each sequence into one vector
- Output: cosine similarity of the two vectors, in [-1, 1]
Notes
- Languages isolated in their own half of the net
- Translation pairs get similar representation
- Forced to comply with cosine similarity
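The input and output ends of the network can be sketched without the network itself. The snippet below is a minimal illustration, not the authors' code: `char_ngrams` and `cosine` are hypothetical helper names, the n-gram length 4 is read off the example on the slide, and the vectors passed to `cosine` are made-up stand-ins for what each half of the network would produce.

```python
import math
import unicodedata

def char_ngrams(sentence, n=4):
    """Return the overlapping character n-grams of a sentence, left to right."""
    # Normalize so that letters such as "ä" count as a single character.
    text = unicodedata.normalize("NFC", sentence)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors; lies in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

fi_ngrams = char_ngrams("Juon viskiä.")
en_ngrams = char_ngrams("I drink whiskey.")
print(fi_ngrams)
print(en_ngrams)

# Made-up 3-dimensional sentence vectors, standing in for the network outputs.
print(cosine([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]))
```

Training then pushes this cosine toward +1 for translation pairs and toward -1 for random pairs.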
Training data
- OPUS (opus.lingfil.uu.se) FI-EN
- Some 17M pairs, fully de-duplicated (no sentence appears twice)
- Negative examples: random sentence pairs
- Trains in a day or so (GPU, keras.io)
Example output (NN)
source: Tiedät, etten tunne niin.
reference: You would know that I do n't feel it.
candidates out of 1000 random:
- You would know that I do n't feel it. (sim:0.606)
I do n't know. That 's what worries me. (sim:0.577)
Okay. Look, I know it 's not my place to say it, but (sim:0.555)
Nobody 's telling me nothing. (sim:0.537)
It 's not like me at all. (sim:0.532)
Are you sure no one has seen me? (sim:0.502)
I 'm glad I did n't take your advice about not coming along. (sim:0.494)
I WAS AN ASSHOLE, OK? BUT I CA N'T GO BACK AND CHANGE THAT. (sim:0.488)
- I do n't know why you 're not taking this `` I 'm out of here '' seriously, but I am out of here, seriously. (sim:0.481)
source: Tämä asetus tulee voimaan 10 päivänä helmikuuta 2011.
reference: This Regulation shall enter into force on 10 February 2011.
candidates out of 1000 random:
- It shall expire on 28 May 2010. (sim:0.619)
- This Regulation shall enter into force on 10 February 2011. (sim:0.615)
- The Annex to Decision 2008/377/EC is replaced by the text in the Annex to this Decision. (sim:0.594)
- FLAVOURING SUBSTANCES REFERRED TO IN ARTICLE 9 ( a ) OF REGULATION ( EC ) No 1334/2008 (sim:0.579)
- Commission Regulation ( EC ) No 1156/2009 (sim:0.575)
- Member States referred to in Article 5 ( 4 ) of Regulation ( EC ) No 479/2008 shall not have an obligation to fill point C and F. (sim:0.572)
- OJ L 261, 6.8.2004, p. 28. (sim:0.553)
Does it work?
- Out of 1000 randomly chosen candidates, the correct one is ranked on average in ~26th place
- Not good enough :(
- Something learned, but not the words
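The "average place of the correct candidate" metric can be sketched as follows. The function names are hypothetical, and the two score lists are just the top candidates shown in the two example outputs (correct candidate first in one, second in the other), so the result says nothing about the full evaluation.

```python
def rank_of_correct(scores, correct_index):
    """1-based rank of the correct candidate when sorted by descending score."""
    correct_score = scores[correct_index]
    return 1 + sum(1 for s in scores if s > correct_score)

def mean_rank(examples):
    """Average rank of the correct candidate over (scores, correct_index) pairs."""
    ranks = [rank_of_correct(scores, idx) for scores, idx in examples]
    return sum(ranks) / len(ranks)

# Similarities copied from the two example outputs; the remaining candidates
# out of 1000 score lower and do not change the rank of the correct one.
subtitle_example = ([0.606, 0.577, 0.555, 0.537, 0.532, 0.502, 0.494, 0.488, 0.481], 0)
regulation_example = ([0.619, 0.615, 0.594, 0.579, 0.575, 0.572, 0.553], 1)

print(mean_rank([subtitle_example, regulation_example]))
```

Ties are resolved in favour of the correct candidate here; a stricter evaluation could count tied candidates as ranked above it.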
Dictionary overlap
- Proportion of words in the source sentence which can be translated to a word in the target sentence, and vice versa
- Dictionary easy to get via OPUS+MT
- Correct candidate ranked ~70th out of 1000
- Even worse than the NN! :(
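A dictionary-overlap score along these lines might look like the sketch below. It is an illustration only: the function names, the toy dictionaries, the lowercased whitespace tokens, and the choice to average the two directions are all assumptions, since the slides do not spell out these details.

```python
def overlap(source_tokens, target_tokens, dictionary):
    """Fraction of source tokens with at least one dictionary translation in the target."""
    if not source_tokens:
        return 0.0
    target_set = set(target_tokens)
    hits = sum(1 for tok in source_tokens if dictionary.get(tok, set()) & target_set)
    return hits / len(source_tokens)

def dictionary_score(fi_tokens, en_tokens, fi_en, en_fi):
    """Average of the Fi->En and En->Fi overlaps (averaging is an assumption)."""
    return (overlap(fi_tokens, en_tokens, fi_en) + overlap(en_tokens, fi_tokens, en_fi)) / 2

# Toy dictionaries, invented for this example.
fi_en = {"juon": {"drink"}, "viskiä": {"whiskey", "whisky"}}
en_fi = {"i": {"minä"}, "drink": {"juon", "juoda"}, "whiskey": {"viskiä", "viski"}}

score = dictionary_score(["juon", "viskiä"], ["i", "drink", "whiskey"], fi_en, en_fi)
print(round(score, 3))
```

In the toy example both Finnish words are covered, while the English "i" has no counterpart in the Finnish sentence, so the two directions give 2/2 and 2/3.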
Join the forces
- NN: correct sentence type
- Dictionary: correct vocabulary
- Individually both suck
- Combined: correct candidate ranked ~1.06th out of 1000
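The slides do not say how the two scores were combined, so the following is purely a hypothetical illustration of why two weak rankers can complement each other. It uses an unweighted sum, and all candidate names and scores are made up.

```python
# Hypothetical candidates: (NN similarity, dictionary overlap). All values invented.
candidates = {
    "A (correct translation)": (0.615, 0.90),
    "B (right sentence type, wrong words)": (0.619, 0.10),
    "C (right words, wrong sentence type)": (0.300, 0.95),
}

def best(candidates, key):
    """Name of the candidate that maximizes the given scoring function."""
    return max(candidates, key=lambda name: key(candidates[name]))

best_by_nn = best(candidates, key=lambda s: s[0])
best_by_dictionary = best(candidates, key=lambda s: s[1])
# Assumption: unweighted sum of the two scores; the real combination may differ.
best_combined = best(candidates, key=lambda s: s[0] + s[1])

print(best_by_nn)
print(best_by_dictionary)
print(best_combined)
```

Each score alone prefers a different wrong candidate, while only the correct one does well on both.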
Does it work at scale?
- 37M Finnish sentences finished
- Estimated total run-time for all 170M sentences: ~50K core hours, ~1 year on a 6-core CPU
- ...that's nothing on a cluster system!
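The conversion from core hours to wall-clock time is simple to verify. The sketch below assumes perfect scaling across cores; the 1000-core figure is an arbitrary illustration of "a cluster system", not a number from the slides.

```python
core_hours = 50_000              # estimate from the slide
finnish_sentences = 170_000_000  # sentences to process, from the slide

def wall_clock_days(core_hours, cores):
    """Wall-clock days needed, assuming perfect scaling across cores."""
    return core_hours / cores / 24

days_on_desktop = wall_clock_days(core_hours, cores=6)
days_on_cluster = wall_clock_days(core_hours, cores=1000)  # hypothetical allocation
core_seconds_per_sentence = core_hours * 3600 / finnish_sentences

print(f"6 cores: {days_on_desktop:.0f} days")
print(f"1000 cores: {days_on_cluster:.1f} days")
print(f"per Finnish sentence: {core_seconds_per_sentence:.2f} core seconds")
```

This works out to roughly one core second per Finnish sentence, each of which is compared against all English candidates.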
Does it work at scale?
- After first runs, data quality not as good as hoped
- 1.06/1000 does not cut it!
- Final filtering: align the candidate pairs
- Expensive, but not that many needed
Top-10
Fi: Yksi, kaksi, kolme, neljä, viisi. En: One, two, three, four, five assholes.
Fi: tai ehkä jopa kaksi tai kolme. En: maybe even two or three 2.
Fi: Osa 2 täällä ja osa 3 täällä. En: Find Part 2 here and Part 3 here.
Fi: Hän on sama eilen, tänään ja huomenna. En: He is the same yesterday, today and tomorrow.
Fi: Ja jos on, miten paljon? En: And if it is, how much?
Fi: Michael Jackson kuoli viisi vuotta sitten. En: Michael Jackson died almost five years ago.
Fi: Kun on liian paljon liian paljon? En: When is too much, too much?
Fi: Hän on sama, eilen, tänään ja iankaikkisesti. En: He is the same, yesterday, today and forever.
Fi: Hän oli myös historiantutkija ja grafologi. En: He was also an historian and archeologist.
Fi: Paljon on muuttunut ja paljon ei. En: Much has changed and much has not.
Top-100,000
Fi: Casino Royale on yksi menestyneimmistä ja suosituimmista Bond-elokuvista. En: The ambience of the Casino is one of glamour and sophistication.
Fi: Se on vielä helppoa kun olet noin nuori. En: It 's easy to learn languages when you are young.
Fi: Työikäisten sepelvaltimotautikuolleisuus on pienentynyt yli 80 prosenttia 30 vuodessa. En: AIG 's stock has fallen nearly 80 % this year.
Fi: Olen henkisesti nyt paljon vahvempi kuin koskaan. En: " I 'm really a lot stronger mentally than I was.
Fi: Tulokset ovat yllättäviä ja usein ratkiriemukkaita. En: The results are quite profound and often surprising.
Fi: Hallitukseen on kuuluttava puheenjohtaja, sihteeri ja rahastonhoitaja. En: The Chairman, Secretary and Treasurer must be Directors.
Fi: Ja hän kertoi minulle mitä eriskummallisimman tarinan. En: And then he told me a fascinating story.
Fi: Myös Koulutuksen ja tutkimuksen kehittämissuunnitelmassa ohjauksen merkitys tunnustetaan. En: The Group also has an important education and training role.
Fi: Asioissa on aina monta, ainakin kaksi puolta. En: But there 's always at least two sides to a story.
Fi: Mutta hän ei halua otella minua vastaan. En: I want him but he does not want me.
Top-300,000
Fi: Se merkitsee ristiä ja monenlaista ahdistusta. En: It signifies fear and anguish of mind.
Fi: Raamia on joko liikaa tai liian vähän. En: Either the bow is too heavy or the stern 's too heavy.
Fi: Tämä on pääteltävissä jo sopimuksen 2 artiklasta. En: The Convention 's basic rule is set out in Article 2.
Fi: Tästä johtuen hänen ennustuksiaan on tulkittu useimmiten väärin. En: This is why his work is so often misinterpreted.
Fi: Ja jonkun muun on syytä nähdä myös. En: And it 's always somebody ELSE 's fault.
Fi: Ja Lauri oli tohkeissaan kuin pieni lapsi. En: He was married and had a small child.
Fi: Myös kielen on taivuttava suomen lisäksi englanniksi ja ruotsiksi. En: Finland has two " national " languages, Finnish and Swedish.
Fi: Ja jos se on käyttötarpeen ilmentyessä saatavilla. En: And that 's if there is anyone available.
Fi: Luonnossa, ihmiset ovat ainoa itsekeskeisyyden laji. En: Humans are only one of many migratory species.
Fi: Vuosi 2011 on ollut vilkas verkkokeskusteluiden osalta. En: 2011 has been an amazing year for music.
Typical classes
- Travel and hotel reviews
- Books and movies
- Food recipes
- Religion
- Phrases
- Facts
- General topics
- Machine translations
Typical classes (cont.)
Fi: Michael Jackson luotti henkensä Conrad Murrayn lääkinnällisten taitojen varaan. En: Michael Jackson trusted his life to the medical skills of Conrad Murray.
Fi: Savua ja myrkyllisiä kaasuja tappaa enemmän ihmisiä kuin liekit. En: Smoke and toxic gases kill more people than flames do.
Fi: Itse keikka oli jopa parempi kuin etukäteen olin kuvitellut. En: It was even better than I thought it would be.
Fi: Siskoni kuoli viime yönä, kun kauppa oli pimeänä. En: My sister died last night, when the store was dark.
Wrapping it all up
- It is technically perfectly possible to compare 170M Finnish sentences with 300M English sentences
- Estimated gain: a few million translation pairs with decent accuracy to train MT systems
- No manual work, just number crunching
Where next?
- Can we move to document level?
- If two documents share a number of sentence hits, maybe the rest of the two documents are translations of each other as well?
Fi: Michael Jackson luotti henkensä Conrad Murrayn lääkinnällisten taitojen varaan. En: Michael Jackson trusted his life to the medical skills of Conrad Murray.
Fi: Siskoni kuoli viime yönä, kun kauppa oli pimeänä. En: My sister died last night, when the store was dark.
Thanks!
- Jenna Kanerva / Turku
- Robert Östling / Helsinki
- Jörg Tiedemann / Helsinki
- KONE Foundation / Internet Parsebank Project
- CSC Centre for Scientific Computing