Introduction to Machine Learning in Natural Language Computing

[Tech talk delivered by Mahendrarajan Chandrasekaran at PSG Tech in August 2021]

[Transcript of the talk]

Good morning. I would like to give you an introduction to machine learning in natural language processing. I will give you an overview of what machine learning is, and more specifically of how it applies to natural language computing. I also plan to talk a bit about the current state of Indian languages with respect to machine learning. Next we will talk about machine translation, and then I’ll give you a small demo using one of the open source machine learning libraries. We will then announce the programming contest for this year.

We’re all aware that software is at the center of everything in the world and is becoming as ubiquitous as electricity. But not all problems have been solved by computing. There are certain classes of problems that have traditionally been very hard for computers, and that software still doesn’t handle very well. Image recognition and speech recognition are two of them. Then there is natural language processing: if you give the system a text, can it understand what that text is about? There is autonomous driving, where the software has to drive in real time in real-world conditions. And there is medical image processing, which typically needs specialists to look at CT scans and other medical images.

One common theme across these challenges is that they are not simple number crunching; doing well in these areas involves some sort of cognition. Making sense of the data requires a kind of intelligence that has typically been the realm of humans. Machine learning has made huge advances in these areas in recent years.

Here is a chart that shows the improvement in the areas I have mentioned, like image classification, machine translation and medical image processing. I have highlighted the areas where you see a huge change; in fact it’s a step change in terms of improvement. These are results on samples where the system is compared against humans. Take image classification: when you show a picture of a dog, it’s very simple for humans to identify but a hard challenge for computers. If you look at this chart, it achieves almost 90 to 95%. Medical imaging requires a very special skill: a doctor or radiologist has to look at CT scans and X-ray images and identify cases like tumors. Even here it achieves 90 to 95%, which is close to human performance. It is about as good as a specially trained human.

Autonomous driving used to be a dream, but on this chart it has reached about 30%. You might all be aware that computers beat the best human players at chess more than twenty years ago, but even then scientists claimed that the game of Go would not be cracked any time soon. It is very hard for computers to beat humans at Go because the game play is very complex, and it takes years and years for a human expert to train. Yet a program called AlphaGo, from Google, has recently beaten a top-notch Go player.

Another problem, in a different domain, is protein folding. It has been a well-known open problem for about 50 years that couldn’t be solved due to its complexity, and it is very critical for new drug discovery. Recently it has largely been solved by a program called AlphaFold.

Next, in the field of natural language processing, there’s a company called OpenAI that has created a language model called GPT-3. It is built by processing text data from dictionaries, books and Wikipedia into a 175 billion parameter model, and training it takes a lot of computing power running for weeks and weeks. Using this model, one can generate human-like text: given a seed sentence, it creates new sentences, and at times it is hard to tell whether they were written by the program or by a human.
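To make the idea of "seed in, generated text out" concrete, here is a minimal sketch in Python. It is not how GPT-3 works internally: it swaps in a toy bigram (Markov chain) model that only remembers which word followed which in a tiny made-up corpus. The function names and the corpus are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that followed it in the text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, seed_word, length, rng):
    """Extend the seed word by repeatedly sampling a word seen after the last one."""
    output = [seed_word]
    for _ in range(length - 1):
        candidates = model.get(output[-1])
        if not candidates:
            # The last word never had a successor in the corpus, so stop early.
            break
        output.append(rng.choice(candidates))
    return output


# Tiny made-up corpus, purely illustrative.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_bigram_model(corpus)
generated = generate(model, "the", 6, random.Random(0))
print(" ".join(generated))
```

A large model like GPT-3 follows the same outer loop of predicting the next piece of text from what came before, but it replaces the lookup table with a neural network that has billions of learned parameters.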

So, what changed? Change happened on two fronts. From a hardware perspective, there is of course the continuous decrease in the cost of computing power, and also the rise of highly powerful GPUs (graphics processing units). Over the last few years GPUs have grown tremendously, even compared with CPUs, and their performance for the price has improved. Solid state drives, which are more compact and offer better performance for the price, are becoming the norm.

On the software front, there’s a step change in the approach, which is called a convolutional neural network. Earlier, to solve a problem you had to create a very smart algorithm, because the cost of storing huge amounts of data was prohibitive and data was also harder to collect and annotate. In recent years it has become very easy to store voluminous data efficiently, so your algorithm can be very simple: you capture the data, describe the problem, and basically ask the computer to use the data to figure out the solution. While there might be processes involved in cleaning the data, the solution itself is simple. You can use a very simple algorithm to get your output, as at its core it builds on something as simple as linear regression.
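As a rough illustration of "give the computer the data and let it figure out the solution", here is a minimal sketch of linear regression in plain Python. The data points are made up (they follow the rule y = 2x + 1), and the function name `fit_line` is an assumption for this example. The program is never told the rule; it recovers it from the examples.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data generated by the rule y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Use the learned line to predict the output for an unseen input.
prediction = slope * 10.0 + intercept
print(slope, intercept, prediction)
```

Neural networks stack many such simple fitted units and learn their parameters from far larger data sets, which is why cheap storage and fast GPUs matter so much.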

Let’s now look at the current state of natural languages in this chart. I want to share the data on the state of Indic languages. This graph shows the number of native speakers against the number of Wikipedia articles for each language, a very simple measure of the state of a language. At the top, even though there are fewer speakers, the availability of Wikipedia articles is very high for English, French, German and other western languages. In fact, for English I had to trim the bar because it is so far off the chart. In the bottom portion you see what are called low resource languages. Hindi is spoken by many more people than Japanese, yet its number of articles is lower. For Tamil, even though the number of native speakers is almost the same as for German or French, the number of articles is far smaller.

What it means is that if you speak a high resource language, you already have a huge text corpus available to you, annotated with various morphological analyses. You can use that data to do machine translation or run other enhanced processing on top of it. Similarly, for speech-to-text conversion you need to capture speech from various native speakers, covering different accents and regions, and then it needs to be annotated: someone has to transcribe it manually so that a speech model can be built, which can then take any new speech and transcribe it automatically. This is what is used in Google Docs and in other places. The same goes for optical character recognition, where you need data to convert printed text to electronic text. So now you can pretty much see the difference between high resource and low resource languages.

Even though India has a very high and specific need for resources that cater to multiple languages, we don’t quite have them. But there are some significant projects focused on natural language processing: some IITs have set up research groups, and some government institutes work on it too. Recently I stumbled upon some open source initiatives that supply the missing corpora for Indic languages and also provide code to work with them. Here is why open source matters: today you can translate text with Google Docs, but you cannot deploy the model on your own phone; you have to use what the service provides. You cannot improve on it either, even if you have the means to do so, because you don’t have access to the model. AI4Bharat is an initiative started by two IIT professors and led by a volunteer group of programmers. It currently supports 11 Indian languages, each paired with English. They have built a machine learning model that translates from English to Indian languages as well as from Indian languages to English. The model has been trained using 47 million pairs across all these languages, which makes a kind of model previously available only from commercial sources available as open source.

Now I want to show a quick demo. In natural language processing, one of the important tasks is machine translation, which enables people to communicate across the globe and is important for understanding other cultures. Traditionally, machine translation has been used in very confined domains, for example legal or medical transcription. The literary domain has always been one of the hardest to crack because literary text has its own subtleties and nuances to capture. Here I’m showing you a snippet: you can see the source text, and below it a manual translation. I’m going to compare this against the existing state of the art and against open source. I took the same text and ran it through the Google and Microsoft translation services and one of the open source libraries I talked about.

When I did this exercise, I expected that the open source library would have no chance against the state-of-the-art systems, because those are built by multi-billion dollar companies. But I was pleasantly surprised to see it perform as well as, if not better than, them. The translation is not perfect, but it is syntactically correct, which is better than what was available even a couple of years ago. It also produces a sensible, grammatically correct translation without any major issues, even though English and Tamil have quite different sentence structures.

Now I’m going to show you a quick demo using the AI4Bharat library. I’m taking the BBC English version of a news item and comparing it against the same article in the Tamil version. If you look at the output, you can see that it pretty much matches the English version.

Thank you all for joining.
