Webinars – Venmurasu Programming Team

[Tech Talk delivered by Mahendrarajan Chandrasekaran at PSG Tech on Aug, 2021]

[Transcript of the talk]

Good morning. I would like to give you an introduction about machine learning in natural language processing. I will give you an overview about what machine learning is about and more specifically in terms of natural language computing. I also planned to talk a bit about the current state of Indian languages with respect to machine learning. Next we will talk about machine translation and then I’ll be giving you a small demo using one of the open source machine learning libraries. We will then announce the programming contest for this year .

We’re all aware that software is in the center of everything in the world and is becoming as ubiquitous as electricity. But not all the problems have been solved by computing. There are certain classes of problems that used to be traditionally very hard for computing to solve. There are still problems that software doesn’t handle very well like image recognition and speech recognition. Then there is natural language processing, for example if you give a text can the system understand what that text is about and then autonomous driving, where in realtime the software does driving in the real world scenario and then there is a medical image processing which typically involves specialized people to look at CT scans and other medical images.

One of the common themes across these challenges is that it’s not a simple number crunching but rather it involves some sort of a cognition to perform really well in these areas. It requires some sort of intelligence to make sense out of the data which is typically in the realm of human intelligence. Machine learning has made huge advances in these areas and currently there is a huge Improvement.

Here is the chart that shows improvement on the areas that I have mentioned like image classification, Machine translation and Medical image processing. I have highlighted the areas where you see a huge change, in fact it’s a step change in terms of improvement. These are results for samples where it is compared against humans. For example, image classification – when you show a picture of an image of a dog, it’s very simple for humans to identify but a hard challenge for computers. If you look at this chart, it does almost achieve 90 to 95 %. Medical imaging requires very special skill, for example like a doctor or radiologist, who has to look at the CT scans, x-ray images and then identify the cases like tumors etc., even in this it achieves 90 to 95 percent which is closer to human efficiency. It is as good as any specially trained humans.

In the case of autonomous driving, it used to be a dream but these days it has achieved about 30%. You might all be aware that chess has been a solved problem as computers beat humans in chess some twenty years before, but even then scientists claimed that the game Go cannot be solved any time soon. It is very hard for computers to beat humans in the game of Go because Go as a game has very complex game play. Because it takes years and years for an expert to get trained on Go but a program called AlphaGo by Google has beat a top notch Go player recently.

Another problem in a different domain is called protein folding, this is a well understood problem for about 50 years that couldn’t be solved due to the complexity of the solution. This problem is very critical for new drug discovery. Recently it has been solved by one of the program called Alpha Fold.

Next in the field of natural language interpretation, there’s a company called as OpenAI, that has created a data model called GPT-3 which is built from indexing text data from dictionaries, books, Wikipedia and processed them to create a 175 billion parameter model which takes a lot of computing power running for for weeks and weeks. Using this model, one can then generate a human-like text by providing a seed sentence, it can create a new sentence which at times would be hard to decipher whether it has been generated by the program or by a human.

So, what changed? Change happened on two fronts. From a hardware perspective, of course there is a continuous decrease in cost of computing power and also the rise of high powerful GPUs ( graphical processing units). If you look in the last few years, you could notice that the GPUs have grown tremendously even when compared against the CPUs, the price per performance has increased. Solid state drives which are more compact and better price per performance are becoming the norm. On the software front, there’s a step change in the approach which is called a convoluted neural network. Earlier when you used to solve a problem you had to create a very smart algorithm, because the cost to save huge data is prohibitive and also data is harder to collect and annotate. But in recent years, it is very easy to store voluminous data efficiently, so your algorithm can be very simple, it is like, you capture the data and explain the problem and then basically ask the computer to use the data and figure out the solution. While there might be processes involved to clean out the data, the solution itself is simple. You can have a very simple algorithm to achieve your output as it is based on a simpler linear progression algorithm.

Let’s now look at the current state in this chart on natural languages. I want to share the data on the state of Indic languages. If you look at this graph it shows the number of native speakers against the number of articles by language in Wikipedia. So a very simple measure to show the state of the languages. On the top, even though there are less number of speakers, the availability of Wikipedia articles is very high in English, French, German, western languages. In fact, for English I need to trim the bar because it’s so off the charts, but if you look at the bottom portion you see what are called low resource languages. If you consider Hindi which is spoken by many more than Japanese, the no. of articles is less. For Tamil, even though the number of native speakers are almost the same as German or French, the availability of articles is completely disproportionate. What it means is if you speak a high resource language, you already have a huge text corpus that is made available for you. These are all annotated with various morphological analysis and available for you. You can then use that data and do your machine translation or run other enhanced processing on top of it. Similarly for text to speech conversion, you need to capture the speech from various native speakers which cater for different accents, regions etc., and then it needs to be annotated. ie., someone would have manually transcribed and built a speech model so that you can take any other new speech and do the automatic transcription. This is what is being used in Google Docs and in other areas. So now you can pretty much see the difference between high resource and low resource languages and the same goes for optical character recognition where you need data to convert printed text to electronic text.

Even though India has a very high and specific need to have resources to cater for multi-language we don’t quite possess it. But there are some significant projects that have been focussing on natural language processing set up like some IIT’s that have created some research groups that do it and there are some government institutes that do it. Recently I have stumbled upon some open source initiatives which cater for a missing corpus for Indic languages and also provide code to work with it. One of the best parts about open-source is, today even though you can translate text with Google Docs you cannot deploy on your phone you have to use what has been provided by the service. you cannot even improve upon that even though you might have means to do it because you don’t have access to the model. This AI4Bharath is an initiative started by 2 IIT professors and led by a volunteer group of programmers. It currently supports 11 Indian languages along with it’s pairs. They have built a machine learning model that translates from English to other Indian languages as well as from Indian languages to English. They have created this model that has been trained on all these Indian languages using 47 million pairs across all languages thus making the model which is previously available only from the commercial sources as open source.

Now I wanted to show a quick demo. In natural language processing one of the important processes is Machine translation that enables people to communicate in a globalized manner and it’s important to understand other cultures. Traditionally, machine translation has been used in a very confined domain for example legal or medical transcription. Literary domain has always been one of the hardest to crack because the literary text has its own subtleties and nuances to capture. I’m trying to show you a snippet here and compare it against a text with a translated version. So you could see a source text and below that we see its manual translation. I’m going to compare this against the existing state-of-the-art and open source. I’ve used the same text shown here and used Google, Microsoft translation services and one of the open-source libraries I talked about.

When I did this exercise I was hoping that it has no chance to stand against the state-of-the-art systems because this has been done by multi-billion dollar companies. ButI was pleasantly surprised to see it perform as good as if not better than the state-of-the-art systems. The translation is not perfect but it is much better. For example it is syntactically correct. This is better than what was available even from the last couple of years ago. It also does a sensible translation which is grammatically correct without any major issues because English has its own separate structure and Tamil has separate structure.

Now, I’m going to go show you a quick demo using the AI4Bharat Library. I’m taking the BBC English version of the news item and then trying to compare against the same article in the Tamil version. So if you look at the output you could see that it pretty much matches with the English version.

Thank you all for joining.

Webinars

Startups:

November 24, 2021 Viswanathan

Talk by Viswanathan Mahalingam to PSG Tech students in July 2020. Talk intended to Information Technology/Computer Science engineering college grads @ India

Slides: https://docs.google.com/presentation/d/138282uMWaILZx7ydF3jLZjs58TjstXXAOjKYTcmWHOI/edit#slide=id.p

My father is a mechanical engineer and works in a sugar factory as General Manager in Karnataka. For a long time (pre- IT boom), marine engineering jobs were one of the highest paid jobs in India. My father actually wanted to become a marine engineer, but my farmer grandfather overruled him stating that seafaring as a risky job. So, my father made my elder brother a marine engineer. Oh, why else will you have kids, if not for your dreams! My brother is a marine engineer, who works for Bernhard Schulte Ship management company. BSM is one of the largest ship management company in the world. Ship owners lease their ships to BSM. Brother works in contract mode, where he sails for 6-8 months a year, transporting iron ore and other natural resources from all parts of the world (Australia, Brazil, etc) to China.

Sugar, shipping, oil, banking industries are what I call ‘old world Industries’. How much ever my father or brother worked hard in their lifetime, they will be only paid salaries and bonuses, but not stocks or equities. Forget about buying your own ship or building a sugar factory. Also, these industries are heavily regulated by government and entry barrier to start a business is very high. You need license, land, lots of money, etc. I simply call businesses / companies where the owners / founders are willing to share their wealth in terms of equity to their employees, as ‘New world industries’.

I suggest to read Paul Graham’s essay on startups, mainly to highlight that every year thousands of cafes and barber shops are opened across the world, but not all entrepreneurship can considered startups. Once I was tricked by a frenemy to meet an IIT educated ‘entrepreneur’ in sweltering Chennai heat, who turned out to be a multi-marketing amway agent.

Also, I want to put a disclaimer that in contrast to popular opinion, you will be financially better off (most of the time) to work in a well established company (FAANG, Salesforce, Paypal, etc) than to work in a startup. You will have a relatively stable, less stressful job. The common reasons on working in a startup are faster learning and career growth, even if your startup is not successful, your resume and future career prospects will be better. But the most important reason I think to work in a startup is it gives you space to try out your ideas. A friend of mine working in Google, Mountain View complained that his bosses are not listening to his ideas. He exactly knew why Google+ is failing compared to Facebook and how they can make Google+ win. I was telling him, well, Google became Google without your ideas in first place and they hired you not for ‘your ideas’, but to execute ‘their ideas’. If you really want to try your ideas, then quit Google, go find your own company or join a early stage startup, where you add value. He moved to Bangalore from Mountain View few years later and indeed quit Google to start https://crio.do with another friend, who worked in FlipKart.

Working and starting a startup, brings back the concept of ‘Ecosystem’. Vasco Dagama, first European to find a sea route to India, didn’t wake up one day from sleep and sail to India straight away. It took multiple generations of Kings and Sea Farers, it is an ONE HUNDRED YEAR dream project to collectively find the sea route to India. Starting from Prince the Henry the Navigator time, Sailor Alvaro Fernandes reached Siearra Leone in 1446, Diago Cam reached Congo in 1482, Bartolomeu Dias circled the cape of good hope – 1488 and finally Vasco da Gama – reached India in 1498 and future seafarers sailed further east finding routes to South East Asia and Far East(Spice islands then). The successes and failures of each sailor, helped the next one to take a step forward. The present era venture capitalists are similar to those Kings and the entrepreneurs are akin to the sea-farers, with the architects, product managers, et.all to the sailing crew. FlipKart’s successful exit has created the startup ecosystem in Bangalore. Coimbatore has a good manufacturing and entrepreneurial base. Let’s hope that Coimbatore gets its own FlipKart sometime in the future.

Webinars

Higher Studies

November 24, 2021 Viswanathan

Talk by Viswanathan Mahalingam to PSG Tech students in July 2020. The talk’s intended audience was students @ PSG Tech, India. But the content more or less applies to a wider audience.

Higher Studies:

Slides: https://docs.google.com/presentation/d/1-vb2kc9ZaRfVnuli-HZWGUAaN-JFovudahXMbI2HN00/edit?usp=sharing

Summary:

I studied in 5 different schools, as my father got transfers every 3-4 years. I did my 10th std at Ramakrishna Vidyaalaya Matriculation HSS (RVM), Villupuram and did my 12th std from MSP Solai Nadar MHSS, Dindigul. RVM is an english medium matriculation school, while MSP is primarily a tamil medium government aided school under private management with few english medium sections. During my 12th std (2001), we had entrance exams for engineering and medical courses in Tamil Nadu. 300 was the total cutoff marks with TNPCEE exams accounting for 100 and 12th board exams accounting for 200 marks. The school fees at RVM is 5X more than MSP and the student quality is more or less same in both the schools and I used to get around 5th or 6th rank in both of them. Dindigul is twice as populous and wealthier than Villupuram. As most of them in Tamil Nadu know that PSG Tech is one of the top Engineering colleges in Tamil Nadu, let me ask a question: How many students joined PSG Tech from each of these schools in 2001?

0 out of 200 students from RVM joined PSG Tech.

8 out of 300 students, from MSP including me, joined PSG Tech.

The numbers get even more skewed towards MSP, when you consider all the top engineering colleges & medical colleges in Tamil Nadu. What explains this puzzle ? A more expensive school with similar student quality could not send any student to colleges like PSG Tech. The numbers cuts across caste and class. The answer lies in the single word called ‘Ecosystem’.

The teachers in MSP school knew how to crack TNPCEE. They had the last 10 year question papers of TNPCEE. As soon as our 11th grade exams were over, MSP teachers conducted private tuitions, where they started to coach students for TNPCEE. RVM students prepared for TNPCEE only after their 12th grade public exams. So, MSP students had a head start of 1 year over RVM students’ 1 month prep time. MSP teachers also had a feedback loop of how well their coaching is working, as year after year, more MSP students went to good colleges [1]. While MSP sent students to top engineering colleges in Tamil Nadu, no one went to IITs, as there was no exposure or coaching on how to crack IIT JEEs.

A relative asked my “Why do we need to study abroad ? What do we not have in India?”. Many of my PSG batch mates are well settled in their life without doing any higher studies (in India & Abroad). Also, compared to 15 years ago, now a lot of well paying product companies and startups jobs are available in Indian cities – Bangalore, Hyderabad, Chennai, Pune, etc. So, you do not need to do higher studies for better job prospects, go abroad or earn wealth. Then why should one study abroad ? The only reason I would say is the “ecosystem” the top universities provide for research opportunities and starting your startups.

Let us quickly look at the operational expense (Staff salaries, maintenance of hostel, library, cost of conducting Conference, workshops) etc for the year 2018-19:

PSG Tech: ~$11M [One of the top engineering college in Tamil Nadu.]

IIT Madras: ~$110 M [Ranked 1 in all of India]

Georgia Institute of Technology: ~$1.6B

I am just listing these numbers to highlight the difference in scale in terms of money and not on student quality. One of the primary reason for scale of Gatech is the research funding the institute and its faculty receive from the US Government agencies like NSF, etc. In India, research funding is primarily given to CSIR, HAL, DRDO, BARC and other central government research institutions. Few drops trickle down to Indian universities and colleges, which evaporate like morning mist. Another point is in top US universities, a professor will get a permanent job (‘tenure’) only if they attract certain amount of funds to the institutes. Professor’s salary are highly competitive as well – if not the FAANG salary levels, they are on-par with the industry. Well funded professors help top notch talented students to start their own companies. Prof.Steve Dickerson of Gatech gave the initial seed funding to Gatech grad students Aarjav Trivedi & Arun Elangovan, which helped them to found Ridecell, where I currently work for the past eight years. Another company is Pindrop, founded by Vijay Balasubramaniam, PhD graduate student at Gatech (who was my mentor at Gatech) and Prof.Mustaque Ahamad, who funded my graduate studies at Gatech. One can hope that the ‘Center of Eminence’ initiatives, potential “defense corridor” projects change the equation in India and send the research funds towards the colleges.

My personal advice to any one who want to study abroad is to work for few years in India, figure out what you want to do, stay in touch with your classmates, even those who you ‘think’ are not your friends, because you can learn from the success and failures of everyone in your network. Some of my not-so-close friends in college, helped me shortlist my university applications, review statement of purpose, gave good guidance with respect to choosing the best university, helped in placement preps later and eventually became my best friends. From my side, all I have to do is to let go off my ego, acknowledge their smartness and ask for help.

Notes:

Thanks to Jacque Camie, Chief of Staff, Ridecell and Arun Gomathinayagam, Engineering Manager, NYTimes (and classmate at PSG Tech) on reviewing the content and listening through a dry run.

For further reading on thoughts of improving Indian Universities, I recommend – ‘Institutions of Sand’ chapter in ‘Imagining India‘ by Nandan Nilekani. One of his ideas was to allow foreign universities to open up in India. Around 2010, Indian government relaxed regulations to allow foreign universities to setup shop in India. Georgia Tech had a plan to open a remote campus in Hyderabad, but postponed the idea when they learnt that they cannot transfer the money earned out of India back to US. I am not sure of the present state and I still think the top tech universities of the world (MIT, Harvard, Stanford, Berkeley, Gatech) should be allowed to setup remote campuses in India, the benefits far outweigh the monetary loss. Opening up of Indian economy in 1990’s did not wipe out Indian Industry as feared by titans of the industry then, but made them resilient and compete with the global corporations, improving the quality of product far better.

As a counterpoint to the above arguments, the future of learning could turn out to be completely outside formal educational systems and become online / skill based learnings – ex: https://www.zohoschools.com/ or https://lambdaschool.com/ or https://byjus.com/us/ (of-course Zohoschools differ from others as the motives are different). ‘Universities are dead’ has been a favorite argument and the pandemic is fundamentally changing many of our assumptions and accelerating new ideas.

I am currently working with quite a few smart folks, who are college dropouts and its been an humbling experience for me to learn from them that college degrees do not matter and changed my value systems forever – I hired 6+ college dropouts for my teams in SF & Pune and some of them turned out to be the best and in-turn are leading teams now in SF and Pune. (I will cover in separate topic).

While, I welcome the new online learning institutes/newer systems with whole heart, many questions remain as how will the new systems scale, how do we find teachers in scale, provide best pedigree to the top students so they invent new things, incentivize faculty to start their own companies while teaching in parallel – provide market based pay, provide enough opportunities to everyone so we as a society are inclusive. My father studied in Tamil medium in school, scored 98 out of 100 in Math from his SSLC (11th grade) but he barely knew english, struggled in college (academically & financially) and literally scrapped through the college degree with minimum marks. Many in my extended family benefited from the reservation (affirmative actions) programs of Tamil Nadu government, but me and my brother are well off and didn’t need affirmative actions. I do understand the fear of “commercializing” education, importance of affirmative actions, competitive space for everyone to thrive. My point here is, if the new online education systems scale, then many of the existing problems will resurface in a different way — much like centuries old plague, Spanish flu and other epidemics re-surfacing after a hundred years in the new scientifically advanced world. So, I do think its important that we improve the existing formal education sector (universities / colleges) and the old & the new are not mutually exclusive. There are other important benefits like how students learn from one another much faster when they are under one roof, re-use the vast infrastructure we already have, etc.. More personal stories and my other thoughts in a later post.

Category: Webinars

Introduction to Machine learning in Natural Language Computing

Startups:

Higher Studies