Review a few important neural network architectures, including VGG, ResNet, GoogLeNet (Inception), and MobileNet.
Since AlexNet was published in 2012, many architectures have been developed to significantly improve accuracy, increase the depth of neural networks, and reduce model size as well as the number of computational operations. Here I study and review a few important developments.
Let’s first get a big picture of these architectures in terms of accuracy, size, operations, inference time, and power usage. The comparison comes from a 2016 paper, so it doesn’t include MobileNet and other more recent developments.
A brief review of boosting, gradient boosting, gradient boosting decision tree (GBDT), and XGBoost
Boosting is a statistical ensemble method, in contrast to bagging (bootstrap aggregating). Bagging trains each base classifier independently and averages the predictions. Boosting trains the base classifiers sequentially and uses the “residuals” from the previous classifier to train the next classifier. The generic framework of boosting consists of additive models and forward stagewise learning.
Although Step 2.a just mathematically represents the goal in this step with a single equation, it is the essential step to actually train a new classifier. Step 2.a depends on the loss function…
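To make the idea of fitting each new learner to the residuals concrete, here is a minimal sketch for regression with squared loss. It is not the algorithm from the post: the one-split "stump" base learner, the learning rate, the toy data, and all function names are assumptions chosen only for illustration.

```python
# Minimal sketch: boosting with squared loss, where each new base learner
# (a one-split stump on a single feature) is fit to the current residuals.
# All names and data below are hypothetical.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:  # a split must leave points on both sides
            continue
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Forward stagewise fitting: start from the mean, then add stumps."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def predict(model, x):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model_1 = boost(xs, ys, n_rounds=1)
model_20 = boost(xs, ys, n_rounds=20)
print(mse(model_1, xs, ys), mse(model_20, xs, ys))
```

Each round can only lower the training error here, because the stump is the least-squares fit to the residuals and only a fraction of it is added.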
Choosing the unit of diversion and the population is an iterative process: try out some decisions, see what they imply for both the size and the duration of the experiment, then revisit the decisions and iterate depending on the results.
The unit of diversion basically answers the question of how to assign events to either the control or the experiment group. Even though the metric is computed based on the events (e.g. page view), the unit of diversion decides how these page…
Use A/A tests to
For example, run 20 A/A experiments with 50 users per group, with the click-through probability (CTP) computed in each experiment from its 50 + 50 users. The following table shows the 20 experiments (20 rows). Take the first row for example. Based on the clicks and pageviews of the 50 users in each of Group 1 and Group 2, the CTPs are 0.1 and 0.04. The difference is…
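The setup described above can be sketched as a small simulation. The post's actual table is not reproduced here, so the data below is simulated: the true CTP of 0.1, the fixed seed, the one-click-or-not-per-user simplification, and all function names are assumptions for illustration only.

```python
import math
import random

def simulate_aa_differences(n_experiments=20, users_per_group=50,
                            true_ctp=0.1, seed=0):
    """Simulate A/A experiments and return the CTP difference of each one.

    Assumption: each user either clicks or not, with the same probability
    in both groups (which is what makes it an A/A test).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_experiments):
        clicks_1 = sum(rng.random() < true_ctp for _ in range(users_per_group))
        clicks_2 = sum(rng.random() < true_ctp for _ in range(users_per_group))
        diffs.append(clicks_1 / users_per_group - clicks_2 / users_per_group)
    return diffs

def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def analytic_se(p, n1, n2):
    """Standard error of a difference of two proportions with common p."""
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

diffs = simulate_aa_differences()
empirical_sd = sample_std(diffs)
print("empirical SD of differences:", empirical_sd)
print("analytic SE:", analytic_se(0.1, 50, 50))
```

Comparing the empirical spread of the A/A differences with the analytic standard error is one way to sanity-check the variability estimate before running a real A/B test.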
A mixture from multiple textbooks and online resources
A typical way of solving classification is to find a hyperplane in the feature space. The algorithms that use this approach include SVM and logistic regression. For logistic regression, the hyperplane is the set of points where the predicted probability is y = 0.5, and it is found by fitting the logistic function to the data points.
Given a point x0 and a line wT*x + b = 0, the functional margin between the point and the line is
functional margin = wT*x0 + b
geometric margin = (wT*x0 + b) / ||w||
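As a quick numeric check of the two definitions above, here is a small sketch. The function names and the example values of w, b, and x0 are made up for illustration.

```python
import math

def functional_margin(w, b, x0):
    """w^T x0 + b, following the definition above (no label term)."""
    return sum(wi * xi for wi, xi in zip(w, x0)) + b

def geometric_margin(w, b, x0):
    """Functional margin divided by the Euclidean norm of w."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x0) / norm_w

w, b, x0 = [3.0, 4.0], -5.0, [3.0, 4.0]
fm = functional_margin(w, b, x0)
gm = geometric_margin(w, b, x0)

# Rescaling (w, b) describes the same line but changes the functional margin.
w2, b2 = [2 * wi for wi in w], 2 * b
fm_scaled = functional_margin(w2, b2, x0)
gm_scaled = geometric_margin(w2, b2, x0)
print(fm, gm, fm_scaled, gm_scaled)
```

The geometric margin stays the same when w and b are rescaled together, which is why it is the quantity that corresponds to the actual distance from the point to the line.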
A/B testing consists of choosing a metric, reviewing statistics, designing experiments, and analyzing results. It is a general control/experiment methodology used online to test out a new product or feature. For example, two groups of users act on two versions of a website; their activities are recorded, some metrics are computed from those activities, and the metrics are used to evaluate the two versions. A variety of things can be tested, from new features and additions to your UI to a different look for your website. Examples:
These are some notes for reviewing statistics that I took while studying lesson 1 of the Udacity A/B testing course. Specifically, they cover the binomial distribution converging to the normal distribution when n is large. Here is a more basic note I wrote previously on the intuition behind the CLT and confidence intervals, mostly assuming a normal distribution.
In lesson 1 of the Udacity A/B testing course, the instructors reviewed how to compute a confidence interval for the estimated probability p of a binomial distribution. When n is very large, the binomial distribution tends to converge to a normal distribution. Thus, the same formula to estimate the mean…
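A minimal sketch of that interval, assuming the standard normal-approximation formula p_hat ± z * sqrt(p_hat * (1 - p_hat) / n). The function name and the example counts (100 clicks out of 1000 users) are hypothetical and not taken from the course.

```python
import math
from statistics import NormalDist

def binomial_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a binomial proportion.

    Only reasonable when n is large; a common rule of thumb (an assumption
    here) is n * p_hat > 5 and n * (1 - p_hat) > 5.
    """
    p_hat = successes / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

low, high = binomial_confidence_interval(100, 1000)
print(low, high)
```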
Vertex (V), Edge (E)
Undirected and directed graphs: for an undirected graph, the handshaking lemma holds: sum(degree(v)) = 2|E|
Adjacency list: O(|V|+|E|) * w space, where w is the word size. Advantages: 1) good for sparse graphs; 2) multiple graphs can use the same nodes
Adjacency matrix: O(|V|²) * 1 bit, good for dense graphs
OOP: one graph uses one set of nodes, good for clean code
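The first two representations above can be sketched as follows for a small undirected graph, together with a check of the handshaking lemma. The helper names and the example edges are made up for illustration.

```python
from collections import defaultdict

def build_adjacency_list(edges):
    """Undirected adjacency list: each edge is stored in both directions."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def build_adjacency_matrix(n, edges):
    """Undirected n x n adjacency matrix with 0/1 entries."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = build_adjacency_list(edges)
matrix = build_adjacency_matrix(4, edges)

# Handshaking lemma: every edge contributes 1 to the degree of two vertices.
degree_sum = sum(len(neighbors) for neighbors in adj.values())
print(degree_sum, 2 * len(edges))
```

The list stores only the edges that exist, while the matrix always takes |V|² entries, which is the trade-off between sparse and dense graphs noted above.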
Breadth-first search (BFS)
Graph representation: adjacency list
Goal: traverse the connected component of a graph from one starting node, level by level
Application: find the shortest path from a starting…
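A minimal BFS sketch over an adjacency list, recording the level of each reached node and a parent pointer so that a shortest path (in number of edges) can be read back. The function names and the example graph are assumptions for illustration.

```python
from collections import deque

def bfs_levels(adj, start):
    """Visit the connected component of `start` level by level.

    Returns (level, parent): the distance in edges from `start` to each
    reached node, and the node from which it was first discovered.
    """
    level = {start: 0}
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in level:  # first visit is along a shortest path
                level[v] = level[u] + 1
                parent[v] = u
                queue.append(v)
    return level, parent

def shortest_path(parent, target):
    """Follow parent pointers back to the start; None if unreachable."""
    if target not in parent:
        return None
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
level, parent = bfs_levels(adj, 0)
path = shortest_path(parent, 4)
print(level, path)
```

Node 5 is in a different connected component, so it never appears in `level`, and `shortest_path(parent, 5)` returns `None`.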
The null hypothesis: H0
The alternative hypothesis: Ha
Normal distribution and Z statistic vs. t distribution and t statistic
For inference on one mean, suppose we sample from a normal distribution.
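A small sketch of the two statistics for a single mean, assuming the usual textbook definitions: the Z statistic when the population standard deviation sigma is known, and the t statistic (with n - 1 degrees of freedom) when it is estimated from the sample. The sample values, the hypothesized mean, and sigma = 0.2 are made up for illustration.

```python
import math
from statistics import mean, stdev

def z_statistic(sample, mu0, sigma):
    """Z = (x_bar - mu0) / (sigma / sqrt(n)), with sigma known."""
    return (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))

def t_statistic(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)), with s the sample std deviation."""
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(len(sample)))

sample = [5.1, 4.9, 5.3, 5.0, 5.2]
t = t_statistic(sample, 5.0)           # compare against t with n - 1 = 4 df
z = z_statistic(sample, 5.0, sigma=0.2)  # compare against standard normal
print(t, z)
```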
Create an AWS instance.
Save the .pem file, cd to its folder, and run ‘ssh -i xxx.pem email@example.com’. The ssh information can be found by clicking “Connect” on the instance.
sudo dpkg --configure -a
sudo apt install docker.io
sudo usermod -a -G docker $USER (https://techoverflow.net/2017/03/01/solving-docker-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket/)
find $USER using whoami
log out and log in again
test using “docker run hello-world”
After setting up Docker, Anaconda, and Clipper on AWS, run the Clipper deployment. Because the port for the Clipper application is 1337, create a Security Group with a Custom TCP Rule for port 1337.
Follow Getting Started Guide to…