Education South Africa Project #18036

Together for Burundi!

by Kamusi Project USA

Project Report | Jul 27, 2017
Q2 2017: Kamusi Labs in motion

By Martin Benjamin | Executive Director

Mobile knowledge for 5 South African languages

2017 is shaping up as the watershed year in which many of the claims that Kamusi has been making about our potential to document "every word in every language" become demonstrable. While the goal will always remain unreachable, our recent activities show energetic progress toward it. We have released public tools that already provide much better vocabulary translation than Google Translate for the languages we include. Our open search is on track to cover around 60 languages by the end of the year, yielding 3500 highly accurate bilingual dictionaries. With lower accuracy, we are importing data from about 7000 languages, integrated with tools that let users participate in precision alignment; some of the data sets have hundreds of thousands of terms, while a few thousand languages have only a smattering. By year's end we will have as many as 30 million terms, and several free new tools for sharing that knowledge.

In terms of data, our latest news is the inclusion of 5 languages from South Africa within the system. 18 languages from India will be online very soon.

Our progress is due to a new approach to moving the project forward. After years of fruitless efforts to find funds to pay for basic development, we decided to finish proving the concept first, and find funding later. (In industry, this is called reaching "minimum viable product", while in non-profit management it is called "insanity".) Kamusi Labs is now an international "virtual" laboratory for computational linguistics. Graduate and undergraduate students from near and far join the project for summer or term-time internships. The students gain experience, satisfy credit obligations from their home universities, and see immediate results from their work. This summer we have 20 students from nine countries, with slots filling rapidly for the autumn.

Several of the students are working on core development (language data, database design, input and output systems), but many are pushing forward on new elements that we could not undertake if we first sank months into seeking grants to pay for them.

The transliterator will convert phonetically among dozens of written alphabets, solving a fundamental communications problem for places like India by enabling a speaker of, for example, Hindi, to plausibly read text in, say, Tamil, without needing to recognize the characters from that other script. The transliterator will be built into dictionary search results and also made available as a stand-alone web app for users to convert free-form text.
"EatUp" will soon be the app that takes the guesswork out of ordering in a foreign restaurant. This solves a huge problem in Europe, where people speak dozens of languages and travel frequently, but menus only have space for one or two languages. Google Translate is hopeless in this domain, while Kamusi's approach should enable diners to confidently order the food they want.
"Pre-D" is our system to reliably determine the correct vocabulary for sophisticated machine translation via user-managed source-side pre-disambiguation, including a comprehensive new approach to multiword expressions. This is more complicated than can be described in this space, but will make more sense when the prototype is online. The aim is eventually to achieve significantly improved translation among numerous languages.
We are upgrading KamusiTERMS, a system we designed for the African Network for Localization for participatory community terminology development across languages and domains. The new TERMS will open the millions of domain-specific terms produced for the European Union to systematic extension to non-European languages, with a special target of terms that can improve students' scores in their technical courses.
Language wheels are a new graphic method we have devised for identifying and selecting languages on websites and software, providing each known language variety with a distinctive multi-colored icon. We will propose the wheels as an international standard within the ISO after a comment period from language specialist communities.
"EmojiWorldBot" was introduced last year on the Telegram chat platform as a dictionary between emojis and 72 languages. Now we are greatly expanding its features so that it can be a full-fledged dictionary using all our multilingual data. The bot is now being integrated with Facebook Messenger. WeChat development is planned to start in August, to extend the bot to the Chinese market.
We are developing a variety of games to elicit language data from members of the public, via the web and mobile apps. We will open the first games for play when we resolve some complex networking and data coordination issues.

Most of our recent focus has been more technical than linguistic. As the bolts are tightened on our tools for collecting, managing, and sharing data, we anticipate concerted activities to enhance the resources for individual languages. Several of these are in the pipeline for autumn, with computer science students tasked to work with data sets and language communities for their mother tongues. Opportunities are legion to use our platform for the development of any particular language. We especially invite linguistically-oriented faculty to have their students talk with Kamusi Labs about interesting projects for their thesis research.

None of this activity can continue indefinitely without funding, of course. One intern is working on a platform for supporters to sponsor individual words, and we are exploring other ways to generate revenue within our non-profit mission. We are also continuing to pursue grants. Hopefully, funding will be easier to come by now that we have exciting results to show, but so far we have not located any government agency or philanthropy that finds language infrastructure worthwhile. If you have any ideas or contacts, please let us know.

Meanwhile, we will keep pressing on. We invite you to use our free services:

Multilingual Dictionaries:

Swahili Dictionary:

iPhone: https://is.gd/kD5zlg
Android: https://is.gd/fzHI2R

72-language Emoji Dictionary:

http://telegram.me/emojiworldbot
