Nine years ago I asked the question “Can we localize entire libraries?” In the wake of the National Library of Norway’s (Nasjonalbiblioteket) digitization of its holdings and its collaboration with Nigeria on materials in the latter’s languages, it seems a good time to revisit what mass digitization could mean for the translation of knowledge into diverse languages.
My original question in 2008 came from looking at trends in digitizing books – notably Google Books – and machine translation (MT). It elicited some interesting responses, including Kirti Vashee’s mention of an Asia Online project planning to link these two trends.
In the ensuing years, Google Books’ digitization program – the biggest and most promising book digitization effort – ran into controversy beginning in 2008 over rights to reproduce copyrighted materials, which ultimately put its entire vision of digital access to a vast library of works in doubt. And the unrelated Asia Online project, which used statistical MT to translate 3.5 million pages of the English Wikipedia into Thai, was halted in mid-2011 amid a changed political situation in Thailand and funding problems. (Asia Online has since become Omniscien Technologies.)
So while the technologies for digitization and for MT – the two pieces needed to localize libraries of information – are established and improving, each has encountered some combination of legal, political, or funding issues limiting its use individually for mass expansion of access to knowledge, as well as their potential use in tandem.
However, could the Norwegian program, announced in 2013, and its project with Nigeria, announced earlier this year, introduce a new dynamic, at least for mass digitization? Could and should large national libraries take the lead in this area?
The idea of digitizing libraries has generally been advocated in terms of access to knowledge, without particular reference to the languages in which publications are written. But languages are critical not only for access to knowledge, but also for facilitating scholarship and the interfacing of ways of knowing. Hence the need to associate mass digitization and MT.
At least one proposed project mentions the potential to translate digitized books – the Internet Archive’s initiative to digitize 4 million books (a semifinalist in the MacArthur Foundation’s 100&Change grant competition).
Any such digital text produced by the Nasjonalbiblioteket, Google Books, the Internet Archive, or any other organization could be machine translated into other languages, with a few caveats: the quality of the optical character recognition (OCR), how well resourced a particular target language is, and of course the accuracy of the MT itself. This means that potentially any mass digitization could be mass translated into a large number of languages, given legal cover and sufficient funding.
What about the accuracy of MT, and how useful could mass translation of mass digitization be if it contains inaccuracies? These are critical questions for any project using MT to translate digitized texts. One response is domain-specific MT, which is generally more accurate than general-purpose MT, provided of course that the material matches the domain the system was trained on. Another is to devise some system for post-editing the raw output.
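To make the caveats above concrete, the overall workflow can be sketched as a simple pipeline: gate pages on OCR quality, check whether the target language has an MT model at all, and prefer a domain-specific model over a general one. Everything here – the function names, the confidence threshold, and the model registry – is a hypothetical illustration, not the API of any real OCR or MT system.

```python
# Hypothetical sketch of a digitize-then-translate pipeline.
# The threshold and all names below are illustrative assumptions.

OCR_CONFIDENCE_THRESHOLD = 0.90  # hypothetical cutoff for usable OCR text

def ocr_is_usable(confidence):
    """Gate pages whose OCR confidence is too low to translate reliably."""
    return confidence >= OCR_CONFIDENCE_THRESHOLD

def pick_model(domain, target_lang, models):
    """Prefer a domain-specific MT model; fall back to a general-purpose one."""
    return models.get((domain, target_lang),
                      models.get(("general", target_lang)))

def translate_page(page_text, ocr_confidence, domain, target_lang, models):
    """Translate one digitized page, or return None when a caveat applies."""
    if not ocr_is_usable(ocr_confidence):
        return None  # poor OCR: flag for re-scanning or manual review
    model = pick_model(domain, target_lang, models)
    if model is None:
        return None  # target language not resourced with any MT model
    return model(page_text)  # raw MT output; may still need post-editing
```

A page thus falls through to human attention whenever OCR quality or language resourcing blocks translation, which is where a post-editing workflow would pick up.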
This is an exciting area that needs more attention and policy support. Books and other print production can be digitized on a mass scale, making the knowledge in them more widely available. Digitized text can be machine translated into other languages, and the quality of that translation can be made high enough for use by speakers of the target languages. As much as the printing press revolutionized access to knowledge in its age, so too the potential to digitize and translate what is in print promises another revolution, one benefiting more people directly.