I sat down with a former school rugby captain whose rugby career was cut short by a shoulder injury while playing for Black Blad at Kenyatta University. It is always a great pleasure to talk to someone who is extremely passionate about what he does, and his passion for data science was evident during my chat with “BlackOrwa” at the iHub Nairobi offices. He is the Data Lab Manager at iHub and, in another life, he would have been a military man. He avoids over-hyped movies (he has not watched any of the Star Wars films) and is more inclined towards movies set in one location, like “Phone Booth”, “Identity” and “Buried”. Read on to learn more about his startup Kwetha, IBM’s Watson and Bluemix, parallel processing, using GPUs, PCA and Mantel’s test, among other things.
You have a very interesting mantra, “Nerdistic intent with delusions of grandeur”, tell us more about it.
This came about several years ago. I co-founded a data mining startup called Kwetha (meaning “to find”). My partner was the business guy and I was the crazy guy coming up with ideas. He used to say I was deluded, and I was like, yeah, but the ideas work, right? Around the same time, while watching music videos on YouTube, I came across a Kenyan rock band called “Narcissistic Tendencies with Delusions of Grandeur” and, being the nerd that I am, I flipped the phrase to simply explain what I do: a mad-scientist kind of feel.
Your blog is also quite interesting. What is the motivation behind it?
After campus I did not have a proper CV, so I thought of starting a blog to showcase my data mining skills, in addition to documenting my ideas and experiences. It was later, when I wrote a blog post on ‘Breaking the Safaricom Scratch Card Code’ (more on this below), that I focused on data science posts. It was insane: I got about 2,500 mentions on Twitter and the blog stats were on steroids; you can read about this here. I guess this was me living up to my mantra.
Data science is a relatively unknown field for most Kenyans. How did you venture into the field and what motivated you?
That is kind of an interesting story. At Kenyatta University, 4th year BSc Computer Science students are required to take specialization units. After checking the department’s notice board, I opted for Advanced Artificial Intelligence. At the time, I used to play a lot of computer games and did a lot of computer modeling, and I thought that if I did this course, I could figure out how to make intelligent characters for my games. So I signed up for the class. Funny enough, only 6 people signed up. When we started the class, the lecturer said we were going to focus on data mining, and initially I thought to myself, this is not what I anticipated. At the time, I did not know what data mining was all about, so I felt a bit let down. But the lecturer had just come from the U.S., and he explained how he used his data mining knowledge to predict stock prices, and I was like, ‘so you can actually do that?’ In short, I found it very interesting and started learning more about data mining algorithms. After the course, I knew I wanted to explore this more.
Tell us about your first Data science project.
My first data science project was studying consumer spending patterns on mobile phones by reverse engineering scratch card serial numbers. I would go to town to Safaricom scratch card vendors and collect used cards from the bowls they kept for discarding them. I would then manually enter the scratch card data into a spreadsheet and run various analyses to explore patterns and correlations. Later on, I came to iHub when the research arm was just being set up and met Angela Okune, who was running a workshop on research methodologies. I pitched the idea of studying consumer spending patterns from scratch card data. She thought it was interesting and promised to seek funding for the project. I couldn’t wait for funding, so I partnered with a friend, Elvis Bando, to push the analysis further. He cracked the serial number, which made it possible to track how many scratch cards were produced, zonal spending patterns and profit projections. We wrote a blog post about it and asked ourselves: what more could we do?
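The kind of exploratory analysis described above can be sketched with a few lines of pandas. The serial-number format and column names here are invented placeholders for illustration, not the actual Safaricom encoding that was cracked:

```python
# A minimal sketch of exploring scratch-card data in a spreadsheet-like
# table. The serial format ("batch-sequence") and values are hypothetical.
import pandas as pd

cards = pd.DataFrame({
    "serial": ["0012-4471", "0012-4472", "0013-0098", "0013-0101"],
    "denomination": [50, 100, 50, 20],  # card value in KSh
})

# Treat the serial prefix as a (hypothetical) production-batch code and
# aggregate spending per batch to look for zonal/batch-level patterns.
cards["batch"] = cards["serial"].str.split("-").str[0]
spend_per_batch = cards.groupby("batch")["denomination"].sum()
print(spend_per_batch.to_dict())
```

With enough cards collected, the same grouping idea extends to estimating production volumes and spending patterns per region.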
Tell us more about the work you do at iHub, what a typical day looks like and the major tools you use to do Data Science.
I started off as a consultant, later became a full-time data scientist, and I am currently the Data Lab Manager. My work involves writing code, developing analysis methodologies, managing people and looking for business opportunities where data science can be used to solve problems.
The whole iHub ethos is about open communities and connecting people. In line with that, we embrace open source tools: our servers run Linux, and our core analysis languages are R and Python, primarily because they are good and because open source makes our projects accessible to more people.
iHub is a for-profit organization and we have a wide range of clients. Essentially, we solve problems for our clients. One project we did during the elections involved identifying newsworthy information on Twitter. Currently, our most interesting project is code-named “Umati”: it entails tracking hate speech online. Research by an American professor produced a 7-point methodology for identifying offline hate speech. We are combining different machine learning methods to fit her methodology to the online world. The translation process poses challenging questions, such as how to measure intent, how to measure influence and how to measure susceptibility. So we experimented a lot with sentiment analysis, subjectivity analysis, topic modelling, network analysis, classification and clustering, among others, to develop a robust tool for identifying hate speech. These tools are open source and can be found here.
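One of the building blocks mentioned above, supervised text classification, can be sketched in a few lines. The training examples and labels below are invented toy placeholders; the actual Umati pipeline combines several such methods (sentiment, topic modelling, network analysis) rather than a single classifier:

```python
# A toy sketch of a supervised text classifier: TF-IDF features feeding a
# logistic regression. Data and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["we welcome everyone", "peace and unity",
         "attack them all", "drive them out"]
labels = [0, 0, 1, 1]  # 0 = benign, 1 = flag for human review

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
prediction = model.predict(["unity and peace"])[0]
print(prediction)
```

In practice a classifier like this would only flag candidates for human review, since questions of intent and influence are hard to settle automatically.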
There are a lot of exciting things happening in the DS field like recently Google open sourced its ML system TensorFlow. What else has captured your eye?
At the bleeding edge, I would say IBM’s Watson, which is best known for beating people at Jeopardy!. I am looking forward to its applications in health, transport and infrastructure. IBM also launched Bluemix, a cloud infrastructure platform similar to Microsoft’s Azure, but with added machine learning and data mining capabilities. So if you want to build an application to, say, track traffic, you don’t have to worry about which algorithm to use or what the implementation factors are. It’s a drag-and-drop setup: you have the data source and the problem, you just plug them in, and it solves the problem for you.
Microsoft has also bought Revolution Analytics and is building a much better version of R. They are adding multi-threading capabilities to R and aim to make it easier to do data mining work. It is sort of like Dreamweaver, where you design a website and it generates the code for you. Similarly, here you have the logical process on one side, and on the other it generates R code, spins up servers and handles most of the back-end processes. Good news for those who are afraid of R.
Something on parallel processing, Hadoop and GPUs…
Hadoop is good, but you can also use GPUs to do complex simulations, calculations, image manipulation and so on. We do have GPUs here, and we are finding ways to incorporate them into our analysis processes. A good example is when we were removing closely similar tweets from a corpus of 2.5 million tweets. This involves measuring each tweet against the rest with a ‘similarity’ test, so the computer has to perform 6.25 trillion computations to exhaust the whole dataset, which might take months if not years. GPUs offer a much higher level of parallelism than Hadoop, enabling billions of computations per second and reducing our problem to a 30-minute wait.
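The batched similarity idea sketched above comes down to one matrix multiplication, which is exactly the operation GPUs parallelise well. NumPy stands in here for a GPU array library such as CuPy (the calls are essentially identical); the toy vectors are illustrative, not real tweet features:

```python
# Pairwise cosine similarity as a single matrix product. On a GPU library
# (e.g. CuPy) the same code runs the all-pairs comparison in parallel.
import numpy as np

# Toy "tweet" vectors (e.g. TF-IDF rows): 4 documents, 5 features.
X = np.array([[1, 0, 1, 0, 0],
              [1, 0, 1, 0, 0],   # duplicate of row 0
              [0, 1, 0, 1, 1],
              [1, 1, 0, 0, 0]], dtype=float)

Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
sim = Xn @ Xn.T                                    # all-pairs cosine similarity

# Flag near-duplicate pairs (i < j) above a threshold.
i, j = np.triu_indices(len(X), k=1)
dupes = [(a, b) for a, b in zip(i, j) if sim[a, b] > 0.95]
print(dupes)
```

Batching the 2.5 million vectors into GPU-sized chunks of this product is what turns trillions of scalar comparisons into a tractable number of matrix multiplications.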
Let’s channel back to your presentation on finding deep structures in data. Why do we need to find deep structures in data and how do we go about it?
Clients normally give you data and expect you to come up with interesting patterns and relationships. Standardized methods and tools limit how much information can be squeezed from a dataset, so I became interested in deep structures as a way to uncover non-obvious relationships between data points. I thought this would be useful when exploring questionnaire data to identify dynamic patterns. The presentation focuses on combining different statistical concepts to reveal deep structures. Read how he solved this here.
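One of the statistical building blocks named in the reading list below, the Mantel test, can be sketched briefly: it asks whether two distance matrices over the same items are correlated, using permutations of one matrix to assess significance. The data here is randomly generated purely for illustration:

```python
# A minimal Mantel test sketch: correlate the upper triangles of two
# distance matrices, then permute one matrix's rows/columns to get a
# permutation p-value. Data below is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def mantel(d1, d2, permutations=999):
    n = d1.shape[0]
    iu = np.triu_indices(n, k=1)
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]
    count = 0
    for _ in range(permutations):
        p = rng.permutation(n)
        r = np.corrcoef(d1[p][:, p][iu], d2[iu])[0, 1]
        if abs(r) >= abs(r_obs):
            count += 1
    return r_obs, (count + 1) / (permutations + 1)

# Two distance matrices built from the same random point set, so they
# should be strongly correlated.
pts = rng.random((10, 2))
d1 = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
d2 = d1 + rng.normal(0, 0.01, d1.shape)
d2 = (d2 + d2.T) / 2  # keep the noisy matrix symmetric
r, p = mantel(d1, d2)
print(round(r, 3), p)
```

A high correlation with a small p-value suggests the two distance structures reflect the same underlying relationships between items, which is the kind of non-obvious structural link the presentation explores.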
Any reading assignments before the presentation?
- Deep learning
- Statistical background on Principal Component Analysis
- Mantel’s test
- Group theory
More Reference Material
iHub Data Lab Work
- Building a Company
- Insight from Safaricom Trash
- Zahanati: Malaria Mobile App
- Building Automated Filters for Elections
- The Headitors
Data Science Meet-ups
- Data Science Meetup with Wayo
- Data Science Meetup with MobiAds
- Data Science Meetup with HDX
- Data Science Meetup with Abacus
Summer Data Jam