# Basic

• https://en.wikipedia.org/wiki/Lasso_(statistics). It includes a discussion of what happens when two covariates are highly correlated. For example, if gene $j$ and gene $k$ are identical, then the values of $\beta _{j}$ and $\beta _{k}$ that minimize the lasso objective function are not uniquely determined. Elastic net was designed to address this shortcoming.
• The tendency of strongly correlated covariates to have similar regression coefficients is referred to as the grouping effect. From the Wikipedia page: "one would like to find all the associated covariates, rather than selecting only one from each set of strongly correlated covariates, as lasso often does. In addition, selecting only a single covariate from each group will typically result in increased prediction error, since the model is less robust (which is why ridge regression often outperforms lasso)".
• Glmnet Vignette. It tries to minimize $RSS(\beta )+\lambda [(1-\alpha )\|\beta \|_{2}^{2}/2+\alpha \|\beta \|_{1}]$ . The elastic-net penalty is controlled by $\alpha$ , and bridges the gap between lasso ($\alpha =1$ ) and ridge ($\alpha =0$ ). The code below produces CV curves, including one for the adaptive lasso, using the example from glmnet(). Two vertical lines are drawn on each curve: the left one is lambda.min (the value that gives the minimum mean cross-validated error) and the right one is lambda.1se (the most regularized model such that the error is within one standard error of the minimum). For the tuning parameter $\lambda$ ,
• The larger the $\lambda$ , the more coefficients become zero (think of coefficient path plots) and thus the simpler (more regularized) the model.
• If $\lambda$ is zero, the fit reduces to ordinary (unpenalized) regression; as $\lambda$ goes to infinity, the coefficients shrink to zero.
• In terms of the bias-variance tradeoff, the larger the $\lambda$ , the higher the bias and the lower the variance of the coefficient estimators.
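The effect of $\lambda$ described in the bullets above can be illustrated with a minimal sketch, assuming an orthonormal design, where the lasso solution is the soft-thresholded least-squares estimate. The function name `soft_threshold`, the vector `beta_ols` and the threshold grid are illustrative assumptions and not part of glmnet; the threshold `t` stands in for a quantity proportional to $\lambda$ (the exact constant depends on how the RSS is scaled).

```r
# Soft-thresholding: the lasso solution for one coefficient under an
# orthonormal design (an assumption made for this sketch only).
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Hypothetical least-squares estimates.
beta_ols <- c(3.0, -1.5, 0.8, -0.3, 0.1)

# Thresholds proportional to lambda, from no penalty to a heavy penalty.
thresholds <- c(0, 0.2, 0.5, 1, 2, 4)

# One column of shrunken coefficients per threshold.
path <- sapply(thresholds, function(t) soft_threshold(beta_ols, t))
colnames(path) <- paste0("t=", thresholds)

# Number of nonzero coefficients at each threshold.
n_nonzero <- colSums(path != 0)
print(round(path, 2))
print(n_nonzero)
```

At t = 0 the estimates are unchanged (ordinary regression), and once t exceeds the largest absolute estimate every coefficient is zero.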
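library(glmnet)
set.seed(1010)
n=1000;p=100
nzc=trunc(p/10)                         # number of nonzero coefficients
x=matrix(rnorm(n*p),n,p)
beta=rnorm(nzc)
fx= x[,seq(nzc)] %*% beta               # linear predictor from the first nzc columns
eps=rnorm(n)*5
y=drop(fx+eps)                          # continuous response (not used below)
px=exp(fx)
px=px/(1+px)                            # logistic success probabilities
ly=rbinom(n=length(px),prob=px,size=1)  # binary response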

## Full lasso
set.seed(999)
cv.full <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE)  # parallel=TRUE needs a registered foreach backend (e.g. doParallel)
plot(cv.full)  # cross-validation curve and two lambda's
plot(glmnet(x, ly, family='binomial', alpha=1), xvar="lambda", label=TRUE) # coefficient path plot
plot(glmnet(x, ly, family='binomial', alpha=1))  # L1 norm plot
log(cv.full$lambda.min) # -4.546394
log(cv.full$lambda.1se) # -3.61605
sum(coef(cv.full, s=cv.full$lambda.min) != 0) # 44

## Ridge Regression to create the Adaptive Weights Vector
set.seed(999)
cv.ridge <- cv.glmnet(x, ly, family='binomial', alpha=0, parallel=TRUE)
wt <- 1/abs(matrix(coef(cv.ridge, s=cv.ridge$lambda.min)[, 1][2:(ncol(x)+1)] ))^1 ## Using gamma = 1, exclude intercept
## Adaptive Lasso using the 'penalty.factor' argument
set.seed(999)
cv.lasso <- cv.glmnet(x, ly, family='binomial', alpha=1, parallel=TRUE, penalty.factor=wt)
# default type.measure="deviance" for logistic regression
plot(cv.lasso)
log(cv.lasso$lambda.min) # -2.995375
log(cv.lasso$lambda.1se) # -0.7625655
sum(coef(cv.lasso, s=cv.lasso$lambda.min) != 0) # 34

## Lambda

• A list of potential lambdas: see Linear Regression case. The λ sequence is determined by lambda.max and lambda.min.ratio. The latter (default is ifelse(nobs<nvars,0.01,0.0001)) is the ratio of the smallest value of the generated λ sequence (say lambda.min) to lambda.max. The program then generates nlambda values, linear on the log scale, from lambda.max down to lambda.min. lambda.max is not given, but is easily computed from the input x and y; it is the smallest value of lambda such that all the coefficients are zero.
• Choosing hyper-parameters (α and λ) in penalized regression by Florian Privé
• lambda.min vs lambda.1se
• lambda.1se is the value of λ in the search that gives a simpler model than the best model (lambda.min) but has error within 1 standard error of the best model. In other words, using lambda.1se as the selected value for λ results in a model that is slightly simpler than the best model but cannot be distinguished from it in terms of error, given the uncertainty in the k-fold CV estimate of the error of the best model.
• lambda.min is the value of λ with the lowest CV error. The error at this value of λ is the average of the errors over the k folds, and hence this estimate of the error is uncertain.
• https://www.rdocumentation.org/packages/glmnet/versions/2.0-10/topics/glmnet
• glmnetUtils: quality of life enhancements for elastic net regression with glmnet
• Mixing parameter: alpha=1 is the lasso penalty, alpha=0 is the ridge penalty, and anything strictly between 0 and 1 is the elastic net.
• Ridge regression uses the squared Euclidean (L2) norm of the coefficients as the penalty. It won't remove any variables.
• Lasso uses the L1-norm as the penalty. Some of the coefficients may be shrunk exactly to zero.
• In ridge regression and lasso, what is lambda?
• Lambda is a penalty coefficient. A large lambda shrinks the coefficients more strongly.
• cv.glmnet()$lambda.1se gives the most regularized model such that the error is within one standard error of the minimum
• A deep dive into glmnet: penalty.factor, standardize, offset
• Lambda sequence is not affected by the "penalty.factor"
• How "penalty.factor" used by the objective function may need to be corrected
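The construction of the λ sequence and the lambda.min / lambda.1se rules described above can be sketched in base R. This is a minimal illustration under stated assumptions, not glmnet's own code: the helper names `lambda_sequence` and `pick_lambda` are hypothetical, and the cross-validation means and standard errors are made-up numbers.

```r
# Hypothetical helper: nlambda values, equally spaced on the log scale,
# from lambda_max down to lambda_max * lambda_min_ratio.
lambda_sequence <- function(lambda_max, lambda_min_ratio = 1e-4, nlambda = 100) {
  exp(seq(log(lambda_max), log(lambda_max * lambda_min_ratio),
          length.out = nlambda))
}

# Hypothetical helper: given a CV curve, return the lambda with the lowest
# mean error and the largest lambda whose mean error is within one
# standard error of that minimum.
pick_lambda <- function(lambda, cvm, cvsd) {
  i_min  <- which.min(cvm)
  cutoff <- cvm[i_min] + cvsd[i_min]
  list(lambda.min = lambda[i_min],
       lambda.1se = max(lambda[cvm <= cutoff]))
}

lam <- lambda_sequence(lambda_max = 1, lambda_min_ratio = 0.01, nlambda = 5)

# Made-up CV results, ordered like lam (largest lambda first).
cvm  <- c(1.40, 1.12, 1.05, 1.08, 1.20)
cvsd <- c(0.10, 0.09, 0.08, 0.08, 0.09)

sel <- pick_lambda(lam, cvm, cvsd)
print(lam)
print(sel)
```

Because larger λ means a more regularized model, taking the largest λ under the cutoff picks the simplest model whose CV error is still within one standard error of the best one.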

## Underfitting, overfitting and relaxed lasso

• Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties, Fan & Li (2001) JASA
• Adaptive Lasso: What it is and how to implement in R. Adaptive lasso seeks to minimize $RSS(\beta )+\lambda \sum _{j=1}^{p}{\hat {\omega }}_{j}|\beta _{j}|$ where $\lambda$ is the tuning parameter, ${\hat {\omega }}_{j}={\frac {1}{(|{\hat {\beta }}_{j}^{ini}|)^{\gamma }}}$ is the adaptive weights vector and ${\hat {\beta }}_{j}^{ini}$ is an initial estimate of the coefficients obtained through ridge regression. Adaptive lasso ends up penalizing more heavily those coefficients with smaller initial estimates. $\gamma$ is a positive constant that adjusts the adaptive weights vector, and the authors suggest the possible values 0.5, 1 and 2.
• When n goes to infinity, ${\hat {\omega }}_{j}|\beta _{j}|$ converges to $I(\beta _{j}\neq 0)$ . So the adaptive Lasso procedure can be regarded as an automatic implementation of best-subset selection in some asymptotic sense.
• What is the oracle property of an estimator? An oracle estimator must be consistent in both 1) variable selection and 2) parameter estimation.
• (Linear regression) The adaptive lasso and its oracle properties Zou (2006, JASA)
• (Cox model) Adaptive-LASSO for Cox's proportional hazard model by Zhang and Lu (2007, Biometrika)
• When the LASSO fails???. Adaptive lasso is used to demonstrate its usefulness.
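The adaptive weights formula ${\hat {\omega }}_{j}=1/|{\hat {\beta }}_{j}^{ini}|^{\gamma }$ given above can be made concrete with a small base R sketch. The helper name `adaptive_weights` and the initial estimates in `beta_ini` are hypothetical values chosen for illustration, not results from any fit.

```r
# Adaptive lasso weights: w_j = 1 / |beta_ini_j|^gamma.
adaptive_weights <- function(beta_ini, gamma = 1) 1 / abs(beta_ini)^gamma

# Hypothetical initial (e.g. ridge) estimates, intercept already excluded.
beta_ini <- c(2.0, -0.5, 0.05)

# Weights for the gamma values mentioned above.
w <- sapply(c(0.5, 1, 2), function(g) adaptive_weights(beta_ini, g))
colnames(w) <- c("gamma=0.5", "gamma=1", "gamma=2")
print(round(w, 3))
```

Coefficients with smaller initial estimates receive larger weights, and a larger $\gamma$ widens the gap. A vector like one column of `w` is what the earlier code passes to cv.glmnet() through the penalty.factor argument.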