Posts by Category

visualization

Modified readthedown RMarkdown template for stylish analytical documents

2 minute read

This is a modified readthedown R Markdown template, inspired by and based on the juba/rmdformats package. readthedown offers a Sphinx-like style, which is commonly used in the documentation of Python packages. I like the readthedown style very much, so I dug into the source code to find ways to make it easier to customize.

Deploy deep learning models in browser using Tensorflow.js

5 minute read

A brief guide on how to deploy a deep learning model in the browser using TensorFlow.js. In this post, a MobileNet model was trained to predict BMI, age and gender. The model takes input (either from a webcam or from uploaded files) and makes predictions in the browser. This deployment has the obvious advantage of reduced upload traffic compared to a RESTful API approach.

Build an API App backed by FastAPI and Vue.js

10 minute read

Presenting an API is never going to be attractive. In this post, I document my approach to developing a web page on top of an existing API using the FastAPI + Vue.js technology stack.

Released a DataFrame summarytool for Jupyter Notebook

less than 1 minute read

Want to include a data summary as a quick reference in your Jupyter notebooks? I used to rely on the summarytools package in R for this, and I miss it when I’m working on Python projects. So I developed a similar Python function with some additional widgets. Please check out this post if you are interested.

Set up Superset on ubuntu 16.04 LTS

1 minute read

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application. Compared with business-focused BI tools like Tableau, Superset is more technology-oriented. It supports more types of visualization and can work in a distributed manner to boost query performance. Most importantly, it is free of charge! An example dashboard: Let’s go and set it up. create a virtualenv Assume Anaconda is installed for Python management. # create a virtualenv with python ...

Shiny + shinydashboard + googleVis = Powerful Interactive Visualization

4 minute read

If you are a data scientist who has spent several weeks developing a fantastic model, you’d like to have an equally awesome way to visualize and demo your results. For R users, ggplots are a good option, but no longer sufficient. R-shiny + shinydashboard + googleVis could be a wonderful combination for a quick demo application. For the purpose of illustration, I just downloaded a random sample data file, test.csv, from one of Kaggle’s latest competitions: https://www.kaggle.com/c/new-york-city-taxi-fare-pre...

Tableau Intersection Filter Tutorial

less than 1 minute read

If you have used Tableau before, you will know that filters in Tableau are union/OR selections. Let’s take the table below for example. If you create a filter and select products a and b, Tableau will show clients A, B, C and E instead of A and C. That is because the filter shows the list of clients who purchased product a or b, rather than product a and b. the idea Firstly, create a variable to count the selection of products. Then create another variable to count the selection...
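The difference between the union and intersection behaviour can be sketched outside Tableau in a few lines of Python. This is only an illustration: the purchase records below are hypothetical (the original table is not shown in this excerpt) and were chosen so that the results match the clients named above. The counting approach is one possible reading of the truncated idea, not the post's actual Tableau calculation.

```python
# Hypothetical (client, product) purchase records; not the table from the post.
purchases = [
    ("A", "a"), ("A", "b"),
    ("B", "a"),
    ("C", "a"), ("C", "b"), ("C", "c"),
    ("D", "c"),
    ("E", "b"),
]
selected = {"a", "b"}


def union_filter(purchases, selected):
    """Clients who bought ANY selected product (default filter behaviour)."""
    return sorted({client for client, product in purchases if product in selected})


def intersection_filter(purchases, selected):
    """Clients who bought ALL selected products.

    Count how many distinct selected products each client bought and keep
    the clients whose count equals the number of selected products.
    """
    counts = {}
    for client, product in set(purchases):  # set() drops repeat purchases
        if product in selected:
            counts[client] = counts.get(client, 0) + 1
    return sorted(c for c, n in counts.items() if n == len(selected))


union_clients = union_filter(purchases, selected)
intersection_clients = intersection_filter(purchases, selected)
```

Comparing a per-client count against the number of selected items is what turns an OR filter into an AND filter here.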

Getting Started With Tableau

2 minute read

Intro to Tableau Inspired by the course ‘Data Visualization’ offered by the University of Illinois on Coursera, I have worked on interactive data visualization using Tableau. A free version, Tableau Public, is available, and you can upload visualizations online for sharing. Tableau is one of the Business Intelligence tools that make it easier to plot aesthetic charts and generate interactive reports. There are 3 main components used in Tableau: Worksheet, Dashboard an...

Spatial Visualization with ggmap R package

3 minute read

ggmap, an R package built for visualizing data on maps, is very similar to ggplot2. Its output is of class ‘ggplot’, which means it also supports layered visualization just like ggplot2. I will demonstrate with two examples: “Crimes in San Francisco” and “Taxi in Porto”. The data for both examples are taken from Kaggle. some basics of ggmap: get_map: is the function to download a map from a source (e.g. google/openstreetmap). some parameters to play with: location: the longitude and the...


deep-learning

Deploy deep learning models in browser using Tensorflow.js

5 minute read

A brief guide on how to deploy a deep learning model in the browser using TensorFlow.js. In this post, a MobileNet model was trained to predict BMI, age and gender. The model takes input (either from a webcam or from uploaded files) and makes predictions in the browser. This deployment has the obvious advantage of reduced upload traffic compared to a RESTful API approach.

Implement DeepFM model in Keras

8 minute read

Introduction The wide and deep architecture has proven to be one of the deep learning applications that combine memorization and generalization in areas such as search and recommendation. Google released its Wide & Deep learning in 2016. wide part: helps to memorize past behaviour for specific choices deep part: embeds into a low dimension, helping to discover new user-product combinations Later, on top of Wide & Deep learning, DeepFM was developed, combining a DNN model and Factorization machi...

Not so basic Keras tutorial for R

3 minute read

The basic Keras tutorial for R is provided by keras here, and it is simple and fast to get started with. But I soon realized that this basic tutorial no longer met my needs once I wanted to train on larger datasets. In this tutorial, I’m going to discuss Keras generators, callbacks and TensorBoard. Keras Installation If you haven’t got keras in R yet, just follow the steps below: devtools::install_github("rstudio/keras") library(keras) install_keras() MNIST handwriting recogniti...

Digit Recognition with Tensor Flow

7 minute read

This time I am going to continue with the Kaggle 101 level competition – digit recogniser – using the deep learning tool TensorFlow. In the previous post, I used PCA and pooling methods to reduce the dimensions of the dataset, and trained a linear SVM. Due to the limited efficiency of the R SVM package, I only sampled 500 records and performed a 10-fold cross validation. The resulting accuracy was about 82.7% 1. This time with TensorFlow we can address the problem differently: Deep Lea...


data-engineering

Build an API App backed by FastAPI and Vue.js

10 minute read

Presenting an API is never going to be attractive. In this post, I document my approach to developing a web page on top of an existing API using the FastAPI + Vue.js technology stack.

Introduction of renv package

2 minute read

R users have been complaining about package version control for a long time. We admire Python users, who can use simple commands to save and restore packages with the correct versions. The good news is that RStudio recently introduced the renv package to manage local dependencies and environments, filling the gap between R and Python. renv resembles the conda / virtualenv concept in Python.

It’s time to upgrade your scheduler to Airflow

4 minute read

Airflow is an open source scheduling tool, incubated by Airbnb. It is becoming popular, and more tech companies are starting to use it. Compared with our company’s existing scheduling tool, crontab, it provides advantageous features such as a user-friendly web UI, multi-process/distributed execution, and notifications on failure/retry. In this post, I’m going to record my journey of setting up Airflow. Content 1. Install Airflow 2. Configure Airflow 3. Choices of Executors 4. Final Note...

Revisit Titanic Data using Apache Spark

5 minute read

This post mainly demonstrates the PySpark API (Spark 1.6.1) using the Titanic dataset, which can be found here (train.csv, test.csv). Another post analysing the same dataset using R can be found here. Content Data Loading and Parsing Data Manipulation Feature Engineering Apply Spark ml/mllib models 1. data loading & parsing data loading sc is the SparkContext launched together with pyspark. Using sc.textFile, we can read a csv file as text in RDD data format and data is sep...


recsys

Implement DeepFM model in Keras

8 minute read

Introduction The wide and deep architecture has proven to be one of the deep learning applications that combine memorization and generalization in areas such as search and recommendation. Google released its Wide & Deep learning in 2016. wide part: helps to memorize past behaviour for specific choices deep part: embeds into a low dimension, helping to discover new user-product combinations Later, on top of Wide & Deep learning, DeepFM was developed, combining a DNN model and Factorization machi...

Implementation of Model Based Recommendation System in R

1 minute read

The most straightforward recommendation systems are either user-based CF (collaborative filtering) or item-based CF, which are categorized as memory-based methods. User-based CF recommends products based on the behaviour of similar users, while item-based CF recommends products similar to those the user has purchased. Whichever method is used, the user-user or item-item similarity matrix, which could be sizable, has to be computed. In contrast, a model based app...
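To make the size of those similarity matrices concrete, here is a minimal Python sketch. The purchase matrix is hypothetical, and cosine similarity is an assumption chosen for illustration; the excerpt does not say which similarity measure the post uses.

```python
from math import sqrt

# Hypothetical user-item purchase matrix (rows: users, columns: products).
ratings = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(rows):
    """Pairwise similarity of every row with every other row."""
    return [[cosine(r, s) for s in rows] for r in rows]


# User-based CF compares rows: an n_users x n_users matrix.
user_sim = similarity_matrix(ratings)

# Item-based CF compares columns: an n_items x n_items matrix.
item_sim = similarity_matrix([list(col) for col in zip(*ratings)])
```

Both matrices grow quadratically with the number of users or items, which is why they become sizable on real data.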

Job Hunting Like A Data Analyst (Part III)

6 minute read

Continuing from the previous post – Explore the Job Market – this week I am going to develop a simple recommender system to find a suitable job. Recommender Let’s cover some background on recommendation systems. A typical example of recommendation is the products recommended in the sidebar at Amazon or the people you may know on Facebook. Usually we can categorise recommenders into two types: 1. Content Based Recommendation: Content-based could mean user-based or product-based and the choice is de...


kaggle

Recognize the Digits

2 minute read

This time I am going to demonstrate the Kaggle 101 level competition - digit recogniser. In this competition, we are asked to train a model to recognize digits from pixel data. The data set is available here. description of the data: label: the integers from 0 - 9; features: pixel001-pixel784, which are rolled out from a 28x28 digit image; pixel data ranges from 0 to 255, indicating the brightness of the pixel in grey scale; Visualize the digit: Let’s randomly look at 100 dig...

Tree based models in R on Titanic Data

5 minute read

This is the first time I blog my journey of learning data science, which starts from the first kaggle competition I attempted - the Titanic. In this competition, we are asked to predict the survival of passengers onboard, with some information given, such as age, gender, ticket fare… Translated letter reveals first hand account of the “unforgettable scenes where horror mixed with sublime heroism” as the Titanic sank Photo: Getty Images How bad is this tragedy? Let’s take some exploratory d...


notes

Write Your Own R Packages

2 minute read

In this post, I write my own utility package to wrap all my UDFs (user-defined functions) with neat documentation.

Review on Stanford Machine Learning Course

5 minute read

I signed up for this course a long time ago, and last week I finally managed to complete it with a good score. This course is taught by Andrew Ng, who is also a co-founder of Coursera. The content of the course spans from supervised learning to unsupervised learning, as well as some advice on improving ML models and special topics on pipeline setup and implementation on large-scale data. I personally feel this course is very beneficial for beginners like me. The f...


web-scraping

Web Scraping of JavaScript website

2 minute read

In this post, I’m using selenium to demonstrate how to scrape a JavaScript-enabled page. If you have some experience of using Python for web scraping, you have probably already heard of beautifulsoup and urllib. By using the following code, we will be able to see the HTML and then use HTML tags to extract the desired elements. However, if the web page has embedded JavaScript, you will notice that some of the HTML elements can’t be seen from beautiful soup, because they are rendered by the JavaS...

Job Hunting Like A Data Analyst (Part I)

6 minute read

Motivation: I’m currently having a tough time looking for a data analyst job. Instead of doing it the traditional way, I thought: why not do the job hunting like a data analyst, making use of the advantages of data science? As always, the first step in fighting a battle is to know your enemy. As I’m looking for a job in Singapore/China, the first thing I would like to explore is the job market in these areas. I’m interested to know about: Who is hiring data analyst Which ...


exploratory analysis

Job Hunting Like A Data Analyst (Part II)

4 minute read

Continuing from the previous post, I’ve added some additional lines of code to fetch the job description of each job post. This takes a bit longer, about 1.5 hours for me, because I set a delay of ~10 seconds between requests. This week I will continue with an overview of the Data Analyst job market and develop a simple recommender based on skill and experience requirements. 0. Tools python 2.7 python package: pandas python package: re 1. Job Market Overv...


NLP
