Together for Burundi!

by Kamusi Project USA

Project Report | Jan 23, 2018
Launching Yiddish and French

By Martin Benjamin | Executive Director

The biggest Kamusi news of the quarter is that we have received a grant, from a foundation that wishes to remain anonymous, to launch a project called "Digital Yiddish". This will fund about 20% of our operating costs for the next three years - so we are still scrambling to finance other languages, but at least we know we'll be able to make headway in one direction.

We are also about to launch service in French. The data is imperfect, but we've decided that it is better to make it public and improve it as we go, rather than keep it offline until it is great, because not having French has been preventing too many people from using our resources in Africa and Europe.

A few projects in the lab look like they should launch in the near future, but I've learned not to predict release dates. As teasers:
• A picture is worth a thousand words
• WeChat is used by 900 million people in China. Kamusi has 100,000 terms in Chinese. WeChat supports bot technology that is similar to what we have launched on Facebook.

I'll leave the news brief this quarter, and invite you to see what's going on behind the scenes by looking at our whiteboard: http://bit.ly/kamusilabs

Happy Year of the Dog!
Martin

Links:

KamusiLabs working whiteboard


Oct 25, 2017
The Smallest Biggest Dictionary - bot, bot, bot

By Martin Benjamin | Executive Director

The highlight of this quarter has been the v1.0 completion of our bot on Facebook. This makes Kamusi the "smallest biggest dictionary" - smallest because students can access it with the least possible effort at the least possible cost, and biggest because we've now got the most precise links in a matrix of 43 languages and counting.

To use the new service, just go to Facebook Messenger and send a message to kamusiproject, as you would send any other chat. A message such as go/spanish/zulu/coche will set your languages and search for your word, whereas simply sending a message such as "coche" will look for that word using your previous settings. No bookstore, no library, no website, not even an app - you just type your word in Facebook, and presto, full info! We have students in Kamusi Labs who are working on porting the bot to several other platforms over the coming months.
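For readers curious how the two message styles described above might be told apart, here is a minimal sketch in Python. It is not Kamusi's actual bot code: the `Session` class and the `parse_message` function are hypothetical names invented for illustration, and the only thing taken from the report is the message format itself (`go/spanish/zulu/coche` versus a bare word such as `coche`).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Session:
    """Per-user language settings remembered between messages (hypothetical)."""
    source: Optional[str] = None
    target: Optional[str] = None


def parse_message(text: str, session: Session) -> Tuple[str, str, str]:
    """Return (source, target, word), updating the session for go/ commands."""
    text = text.strip()
    parts = text.split("/")
    if len(parts) == 4 and parts[0].lower() == "go":
        # Full command: go/<source>/<target>/<word> sets languages and searches.
        _, source, target, word = (p.strip() for p in parts)
        session.source, session.target = source.lower(), target.lower()
        return session.source, session.target, word
    if session.source is None or session.target is None:
        # Assumed behaviour: a bare word cannot be looked up before any go/ command.
        raise ValueError("no languages set; send go/<source>/<target>/<word> first")
    # Bare word: reuse the languages from the previous go/ command.
    return session.source, session.target, text


session = Session()
print(parse_message("go/spanish/zulu/coche", session))
print(parse_message("coche", session))
```

Both calls resolve to the same Spanish-to-Zulu lookup, the second one because the session remembers the earlier settings. How the real bot stores settings per user is not described in this report.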

In the last quarterly report, I promised we'd have 18 languages from India online "very soon". Promise fulfilled! Before we make a big deal about this, though, we are working to complete a unique universal transliteration system among alphabets, because it isn't much use to know that "coche" is in Malayalam if you can't sound out the letters of that script. Indian languages are written in many different scripts, so transliteration is really a key to making a socially useful dictionary for the sub-continent. As is common with Kamusi Labs projects, the reason a universal transliteration system hasn't been tackled before is that its relentless complexity is too insane to even contemplate. Look for our first implementation very soon.

Another exciting recent development has been the spontaneous emergence of a vibrant group of young users for the Fon language of Benin. Unfortunately, the group is using the WhatsApp messaging platform, which does not support bots, so we have to transfer their enthusiasm to Facebook when we've added data collection features to the current bot. This could happen soon, or could drag out for a while. This group will be a model for many other languages. Right now we are focused on expanding in the West Africa region, and then hopefully we can bring the model back to Burundi and the Swahili zone.

I'll look forward to seeing which of the fun things coming down the pike I can tell you about next time. Meanwhile, I'll share this piece of fan mail, which I think gives some insight about our persistent difficulties in attracting funding:

Subject: Regarding Kamusi
I have seen your project Kamusi Gold. I am just wondering about this.
It's mentioned that there are 7000 languages spoken and your vision is to bring most of the content online.
I personally feel like most of the languages should die fast because lots of things can be made easier. The languages issues like working from different cultures, trade related issues etc., will be gone.
Thanks and regards, MS


Jul 27, 2017
Q2 2017: Kamusi Labs in motion

By Martin Benjamin | Executive Director

Mobile knowledge for 5 South African languages

2017 is shaping up as the watershed year in which many of the claims that Kamusi has been making about our potential to document "every word in every language" become demonstrable. While the goal will always remain unreachable, our recent activities show energetic progress in moving toward it. We have released tools for the public that already provide much better vocabulary translation than Google Translate among included languages. Our open search is on track to cover around 60 languages by the end of the year, yielding 3500 highly accurate bilingual dictionaries. At lesser accuracy, we are importing data from about 7000 languages, integrated with tools for users to participate in precision alignment; some of the data sets have hundreds of thousands of terms, while a few thousand languages have only a smattering. By year's end we will have as many as 30 million terms, and several free new tools for sharing that knowledge.
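The jump from about 60 languages to about 3500 bilingual dictionaries can be checked with a little arithmetic. The sketch below assumes that each direction of a language pair (say, Spanish to Zulu and Zulu to Spanish) counts as its own dictionary; that reading is an assumption made here for illustration, not something stated in the report, and the function name is hypothetical.

```python
def dictionary_counts(n_languages: int) -> tuple:
    """Count bilingual dictionaries obtainable from a matrix of linked languages.

    Returns (unordered_pairs, directed_pairs). Whether the report's figure
    counts directed pairs is an assumption, not a documented fact.
    """
    directed = n_languages * (n_languages - 1)  # A->B and B->A counted separately
    unordered = directed // 2                   # A<->B counted once
    return unordered, directed


unordered, directed = dictionary_counts(60)
print(unordered, directed)  # 1770 3540
```

With 60 languages the directed count is 3540, close to the 3500 quoted above. Every additional language adds a dictionary to and from each language already in the matrix, which is why the total grows so quickly.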

In terms of data, our latest news is the inclusion of 5 languages from South Africa within the system. 18 languages from India will be online very soon.

Our progress is due to a new approach toward moving the project. After years of fruitless efforts to find funds to pay for basic development, we decided to finish proving the concept first, and find funding later. (In industry, this is called reaching "minimum viable product", while in non-profit management it is called "insanity".) Kamusi Labs is now an international "virtual" laboratory for computational linguistics. Graduate and undergraduate students from near and far join the project for summer or term-time internships. The students gain experience, satisfy credit obligations from their home universities, and see immediate results from their work. This summer we have 20 students from nine countries, with slots filling rapidly for the autumn.

Several of the students are working on core development (language data, database design, input and output systems), but many are pushing forward on new elements that we could not undertake if we first sank months into seeking grants to pay for them.

• The transliterator will convert phonetically among dozens of written alphabets, solving a fundamental communications problem for places like India by enabling a speaker of, for example, Hindi, to plausibly read text in, say, Tamil, without needing to recognize the characters from that other script. The transliterator will be built into dictionary search results and also made available as a stand-alone web app for users to convert free-form text.
• "EatUp" will soon be the app that takes the guesswork out of ordering in a foreign restaurant. This solves a huge problem in Europe, where people speak dozens of languages and travel frequently, but menus only have space for one or two languages. Google Translate is hopeless in this domain, while Kamusi's approach should enable diners to confidently order the food they want.
• "Pre-D" is our system to reliably determine the correct vocabulary for sophisticated machine translation via user-managed source-side pre-disambiguation, including a comprehensive new approach to multiword expressions. This is more complicated than can be described in this space, but will make more sense when the prototype is online, and is aimed to eventually result in significantly improved translation among numerous languages.
• We are upgrading KamusiTERMS, a system we designed for the African Network for Localization for participatory community terminology development across languages and domains. The new TERMS will open the millions of domain-specific terms produced for the European Union to systematic extension to non-European languages, with a special target of terms that can improve students' scores in their technical courses.
• Language wheels are a new graphic method we have devised for identifying and selecting languages on websites and software, providing each known language variety with a distinctive multi-colored icon. We will propose the wheels as an international standard within the ISO after a comment period from language specialist communities.
• "EmojiWorldBot" was introduced last year as a dictionary on the Telegram chat platform between Emojis and 72 languages. Now we are greatly expanding the features so that it can be a full-fledged dictionary using all our multilingual data. The bot is now being integrated with Facebook Messenger. WeChat development is planned to start in August, to extend the bot to the Chinese market.
• We are developing a variety of games to elicit language data from members of the public, via the web and mobile apps. We will open the first games for play when we resolve some complex networking and data coordination issues.

Most of our recent focus has been more technical than linguistic. As the bolts are tightened on our tools for collecting, managing, and sharing data, we anticipate concerted activities to enhance the resources for individual languages. Several of these are in the pipeline for autumn, with computer science students tasked to work with data sets and language communities for their mother tongues. Opportunities are legion to use our platform for the development of any particular language. We especially invite linguistically-oriented faculty to have their students talk with Kamusi Labs about interesting projects for their thesis research.

None of this activity can continue indefinitely without funding, of course. One intern is working on a platform for supporters to sponsor individual words, and we are exploring other ways to generate revenue within our non-profit mission. We are also continuing to pursue grants. Hopefully, funding will be easier to come by now that we have exciting results to show, but so far we have not located any government agency or philanthropy that finds language infrastructure worthwhile. If you have any ideas or contacts, please let us know.

Meanwhile, we will keep pressing on. We invite you to use our free services:

Multilingual Dictionaries:

Swahili Dictionary:

iPhone: https://is.gd/kD5zlg
Android: https://is.gd/fzHI2R

72-language Emoji Dictionary:

http://telegram.me/emojiworldbot



About Project Reports

Project reports on GlobalGiving are posted directly to globalgiving.org by Project Leaders as they are completed, generally every 3-4 months. To protect the integrity of these documents, GlobalGiving does not alter them; therefore you may find some language or formatting issues.
