Juan Jose Carin

Data Scientist

Welcome

I'm Juan Jose (or just Juanjo). Here I'd like to share a little about me and some of the projects I've worked on to earn my last two master's degrees: an academic one in Statistical & Computational Information Processing from Universidad Politécnica de Madrid, and a professional one in Information and Data Science from University of California, Berkeley.

Bio

I am originally from Madrid, Spain (I moved to the SF Bay Area in February 2015). While I was earning my master's in Telecommunication Engineering, I worked at Ericsson and later as a tutor of differential and integral calculus, physics, and digital electronic circuits (I love teaching!). After that, I worked for 9 years in the Test & Measurement industry: first as a sales engineer at a distributor of T&M solutions (for all kinds of telecom networks: mobile, Ethernet, CaTV, WiFi, ...), and later at Yokogawa (whose oscilloscopes, DAQ systems, and power analyzers are aimed at industries such as energy, transportation, and power electronics; hence, I had to move from bits to volts and from dBm to megawatts), where I soon became sales director of the T&M division for Spain and Portugal. My work mainly involved helping organizations get data (from telecom networks first, and then from wind farms, solar parks, nuclear plants, transformers, electric motors, and so on) to check that everything worked properly and conformed to standards and, more importantly, to find any problem and its cause. Above all, it meant caring about customers' needs.

I really liked my job, but I also wanted to be in an even more technical position. I've loved math problems^[1] and puzzles since I was a kid, and I didn't even have to look for them: if I was at Starbucks and saw 14 people using a laptop, 9 of them (64.3%) using a Mac, I found myself wondering whether it could be said that the number of Macs was significantly higher than the number of PCs^[2]. And I was always trying to get valuable insights from the reduced amount of sales data we had (it was a small market). I eventually realized that I wanted to dig deeper into the data (lots of it and of all kinds!) and be able to extract unobserved patterns, infer causality through experiments, and make reliable predictions and well-informed decisions; I wanted to be a data scientist. That's why I went back to my alma mater to earn another Master's degree, this time in Statistics, and right after that another one in Data Science from UC Berkeley.

[1] Like the "Seattle rain" problem: click here to read my solution.

[2] It turns out that it can only be said at the 20% significance level. For such a small sample, more than 72.0% of the 14 laptops (that is, at least 11 of them) would have to be Macs for their proportion to be significantly higher than 0.5 at the usual 5% level.
In any case, that sample, apart from being very small, is not representative of the whole population! With a random sample of a thousand laptops, just over 52.6% Macs would be enough to say that they are significantly more prevalent than PCs at the 5% level; with a sample of a million laptops, fewer than 50.1% would be enough. If you are curious about the minimum number of positive cases (as I was), this is the expression I obtained (click here to read how):
$$\begin{aligned} N_1 &\geq \left \lceil \frac{N}{2} \cdot \left(1 + \frac{t_{.95,N-1}}{\sqrt{N+t_{.95,N-1}^2-1}}\right) \right \rceil \\ &\simeq \left \lceil \frac{N}{2} + 0.822 \cdot \sqrt{N} \right \rceil\end{aligned}$$
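As a rough check of the second (approximate) line of the expression above, here is a minimal Python sketch. The function name `min_positive_cases` is hypothetical, and the sketch assumes the standard normal quantile in place of the Student's t quantile, so it mirrors only the large-sample form.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_positive_cases(n: int, alpha: float = 0.05) -> int:
    """Approximate minimum number of positive cases (out of n) for their
    proportion to be significantly higher than 0.5 (one-sided, level alpha).

    Assumption: the normal quantile z replaces the t quantile, i.e. this
    computes ceil(n/2 + (z/2) * sqrt(n)), the large-sample approximation.
    """
    z = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    return ceil(n / 2 + (z / 2) * sqrt(n))


for n in (14, 1_000, 1_000_000):
    k = min_positive_cases(n)
    print(f"N = {n}: at least {k} positive cases ({k / n:.2%})")
```

Because of the ceiling, the count returned sits slightly above the threshold proportion itself: for 14 laptops the threshold is about 72%, but the smallest whole number of Macs that clears it is 11.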

What has happened since I created the first version of this website, after graduating from Berkeley? In December 2016 I joined Yahoo, now Oath / Verizon Media, where I have had the opportunity to work with very talented people on many exciting projects, mainly for Web Search and also, lately, for Mail and Verizon (returning, after several years, to the telecom industry).

Portfolio

Master of Information and Data Science
University of California, Berkeley – School of Information

SmartCam

Website – Backend Code – Frontend Code

Course: Capstone
Computer Vision, Scalable, Classification – Python, OpenCV, TensorFlow, AWS (ECS, S3, DynamoDB), Flask, JavaScript, Google Charts

A scalable cloud-based video monitoring system (with motion detection, face counting, and image recognition) using Raspberry Pis. Most of the processing is done in the cloud, so the computing power of the Raspberry Pi is not a limitation.

Shortest path and PageRank algorithms applied to the Wikipedia graph dataset

Shortest Path – PageRank

Course: Machine Learning at Scale
Shortest path, PageRank – MrJob, Python, AWS EC2, AWS S3

Two of the assignments of the course, using a graph dataset of almost half a million nodes.

Forest Cover Type Prediction

IPython Notebook – GitHub

Course: Machine Learning
kNN, NB, Random Forests, SVMs, SGD, GMMs, Feature engineering – Python, Scikit-Learn, Matplotlib

A Kaggle competition: predictions of the predominant kind of tree cover in four wilderness areas located in the Roosevelt National Forest of northern Colorado, from strictly cartographic variables such as elevation and soil type.

Redefining the job search process

GitHub – Paper – Website

Course: Storing and Retrieving Data
Information retrieval – Hadoop, Hive, Spark, Python, AWS EC2, Tableau

A pipeline that combines data from Indeed API and the U.S. Census Bureau to select the best locations for data scientists based on the number of job postings, housing cost, etc.

A fresh perspective on Citi Bike

Website – GitHub

Course: Data Visualization and Communication
Data visualization – Tableau, SQLite

An interactive website to visualize NYC's Citi Bike bicycle-sharing service.

Investigating the Effect of Competition on the Ability to Solve Arithmetic Problems

Paper – GitHub

Course: Field Experiments
Causal inference (Experiments, Heterogeneous treatment effects, Attrition) – R

A randomized controlled trial in which 300+ participants were assigned to a control group or one of two test groups to evaluate the effect of competition (being compared to no one or someone better or worse).

M.S. in Statistical & Computational Information Processing
Universidad Politécnica de Madrid – School of Telecom. Eng.

Experimental design to measure the impact of online advertising on the sales of a car manufacturer’s dealer network

Paper – Slides – GitHub

Course: Capstone
Research design, Experiments, Wavelets, Clustering – R

A project for Google (working at Conento) to measure the effect of YouTube ads. Responsible for the whole project (excluding the econometric model to estimate the increase in advertising spend in the test group): designed a matched-pair, cluster-randomized experiment, which involved selecting the test and control groups from a sample of 50+ cities in Spain based on their sales-wise similarity over time.

Prediction of customer churn for a mobile network carrier

Paper

Course: Data Mining
Neural networks, Decision trees, Logistic regression – SAS Enterprise Miner

Predictions from a sample of 45,000+ customers.

Different models of Harmonized Index of Consumer Prices (HICP) in Spain

Paper

Course: Time Series
ARIMA, Transfer functions – SPSS, Demetra+

Forecasts based on exponential smoothing, ARIMA, and transfer function (using petrol price as independent variable) models.

Other Projects

Master of Information and Data Science
University of California, Berkeley – School of Information

Machine Learning at Scale

Parallel computing for unsupervised & supervised learning of Big Data using Hadoop, MrJob, and Spark (all in Python).
k-means clustering, gradient descent, shortest path, SVM, PageRank...
13 assignments (including the 2 listed in the Portfolio).

Machine Learning

Nearest neighbors, naive Bayes, decision trees, logistic regression, gradient descent, neural networks, (k-means & hierarchical) clustering, GMMs, dimensionality reduction, network analysis, ..., using Python (mainly Scikit-Learn).
3 assignments (and the final project listed in the Portfolio).

Applied Regression and Time Series Analysis

Classical Linear Regression (assumptions, causality, instrumental variables, ...) and Time Series models (ARIMA and GARCH).
8 weekly assignments and 3 Labs, using R.

Field Experiments (a.k.a. Experiments and Causal Inference)

Blocking, clustering, covariates, heterogeneous treatment effects, one-sided non-compliance, natural experiments, difference in differences, regression discontinuity, attrition, mediation, ...
5 problem sets (and the final project listed in the Portfolio), using R.

Exploring and Analyzing Data

Introduction to different types of quantitative research methods and statistical techniques for analyzing data (using R).
3 lab assignments and a final exam.

Research Design and Applications for Data Analysis

An introduction to the data sciences landscape, with a particular focus on learning how to apply data science techniques to uncover, enrich, and answer the questions typically found in industry.
Final project: "Field Experiments in Online Advertising."

Paper
Slides

Referrals

From my LinkedIn profile

Data Scientist – Conento

«Juanjo's main strengths were his ability to solve complex problems by developing innovative ideas, combined with his smart focus on time and resources to execute them. He worked hard to make sure results were achieved, adapting very well to our work environment and building an excellent relationship with his mates. He is intelligent, positive, and rigorous, and a very generous person that is just a pleasure to work with.»

Macarena Estevez, CEO.

Master of Information and Data Science – UC Berkeley

«Juanjo was an excellent student in my course on experiments and causal inference. He asked terrific questions, turned in great problem sets, and did a very conscientious job on a team project experimenting with how perceptions of competition affect productivity in math problems.»

David Reiley, professor.

«Juanjo was my student at UC Berkeley in a master's-level class on the design and analysis of randomized experiments for corporate research. He was without a doubt one of the stand-out students. I would highly recommend his skills. His performance was excellent.»

David Broockman, professor.

«I was impressed by the presentation Juanjo made in my class on research design and applications for data analysis. Juanjo demonstrated great storytelling skills, giving a convincing demonstration of the need for his proposal. He can articulate a good research question – simple, direct, specific, and yet still challenging – and can handle tough questions from an audience. He has a strong grasp of research design, with really great description of why he recommended the adoption of an experimental design and good discussion of the method itself.»

Peter Norlander, professor.

«Juanjo is that wonderful combination of capable, smart, hard-working guy who also happens to be highly personable. I'm happy to recommend him.»

Annette Greiner, professor.

M.S. in Statistical and Computational Information Processing – Univ. Politecnica de Madrid

«Juanjo was an excellent student in my Master's course "Neural Networks and Statistical Learning" at the Technical University of Madrid. He showed a very high motivation and capability not only to grasp fundamental aspects but also to address real problems. In addition, Juanjo is a very open a communicative person which makes him easy to share ideas and organize tasks.»

Pedro J. Zufiria, professor.

Sales Director, T&M – Yokogawa

«I had the opportunity to work with Juan Jose and it was really easy to choose and run the installation. Not a lot of people understand the difficulties to choose an appropriated equipment to do a specialized job. Juan Jose does! Congratulations!»

Ramon Santos Yus, client.

«Juan Jose is easy to work with, has a strategic mind, a genuine interest in people and other cultures. He easily understands complex matters and is a real team player.»

Johan Waldelius, colleague.

Seattle Rain

You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?

Facebook interview question posted on Glassdoor.

Most answers posted on the link above are:

$\left(\frac{2}{3}\right)^3 = \frac{8}{27}$

which is the probability that all friends say it is raining provided that it’s true… but we are interested in the opposite, the probability that it is raining provided that all friends say that.

Another common answer is based on a correct idea:

$\begin{aligned} \Pr\left(H \mid E\right) & = \frac{\Pr\left(H, E\right)}{\Pr\left(E\right)} \\ & = \frac{\Pr\left(H, E\right)}{\displaystyle\sum_{h \in \mathbb{H}} \Pr\left(h, E\right)} \end{aligned}$

where

$\begin{aligned} H & : \text{it is raining in Seattle} \\ E & : \text{all friends say it is raining} \end{aligned}$

$h$ is a binary event, so:

$\Pr\left(H \mid E\right) = \frac{\Pr\left(H, E\right)}{\Pr\left(H, E\right) + \Pr\left(\bar{H}, E\right)}$

But those answers then mix up joint and conditional probabilities, and still reach a wrong result (one that is correct only if the prior probability of rain in Seattle is $0.5$ ):

$\frac{8/27}{8/27 + 1/27} = \frac{8}{9}$

$\left(2/3\right)^3 = 8/27$ is not the joint probability that all friends say the same thing (it is raining) and that is true, but the probability that all friends say that it is raining conditional on the fact that it is raining.
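To make the difference between the joint and the conditional probability concrete, the following Python sketch enumerates every possible outcome exactly. It assumes a prior probability of rain `p` (not given in the problem statement); the function name `exact_probabilities` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def exact_probabilities(p, q=Fraction(2, 3), n=3):
    """Enumerate every (rain, answers) outcome and return the tuple
    (Pr(H, E), Pr(E | H), Pr(H | E)), where H = "it is raining" and
    E = "all n friends say it is raining"."""
    joint_h_e = Fraction(0)  # Pr(H, E)
    prob_e = Fraction(0)     # Pr(E)
    for raining in (True, False):
        prior = p if raining else 1 - p
        for truthful in product((True, False), repeat=n):
            # Probability of this particular pattern of truths and lies
            pattern = Fraction(1)
            for t in truthful:
                pattern *= q if t else 1 - q
            # A friend says "raining" when truthful and it rains,
            # or when lying and it does not rain
            if all(t == raining for t in truthful):
                prob_e += prior * pattern
                if raining:
                    joint_h_e += prior * pattern
    cond_e_given_h = joint_h_e / p if p > 0 else None
    posterior = joint_h_e / prob_e if prob_e > 0 else None
    return joint_h_e, cond_e_given_h, posterior


for p in (Fraction(1, 4), Fraction(1, 2)):
    joint, conditional, post = exact_probabilities(p)
    print(f"p = {p}: Pr(H, E) = {joint}, Pr(E | H) = {conditional}, "
          f"Pr(H | E) = {post}")
```

For any prior, the conditional probability stays at 8/27 while the joint probability scales with `p`, which is why plugging 8/27 in as a joint probability only works when both hypotheses are equally likely.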

Let’s consider the two extreme cases to confirm that the solution above is not always correct. If it is always raining in Seattle,

$\begin{aligned}H = \Omega & \Rightarrow \Pr(H) = 1 \\ & \Rightarrow \left\{\begin{matrix} \Pr(H, E) & = & \Pr(E) & = & 8/27 \\ \Pr\left(\bar{H}, E \right) & = & 0 \end{matrix}\right. \\ & \Rightarrow \Pr(H \mid E) = 1 \end{aligned}$

If it never rains in Seattle,

$\begin{aligned}H = \varnothing & \Rightarrow \Pr(H) = 0 \\ & \Rightarrow \left\{\begin{matrix} \Pr(H, E) & = & 0 \\ \Pr\left(\bar{H}, E \right) & = & \Pr(E) & = & 1/27 \end{matrix}\right. \\ & \Rightarrow \Pr(H \mid E) = 0 \end{aligned}$

The tricky part of the problem, since we may be biased to think that the solution is fixed, is that the solution depends on an unknown parameter ( $p$ , the prior probability of rain in Seattle). We need to apply Bayes’ theorem, which states that the posterior probability (the probability of a hypothesis $H$ conditional on a given body of data, the evidence $E$ ) can be calculated as:

$\begin{aligned} \Pr\left(H \mid E\right) & = \frac{\Pr\left(E \mid H\right) \cdot \Pr\left(H\right)}{\Pr\left(E\right)} \\ & = \frac{\Pr\left(E \mid H\right) \cdot \Pr\left(H\right)}{\displaystyle\sum_{h \in \mathbb{H}} \Pr\left(E \mid h\right) \cdot \Pr\left(h\right)} \\ & = \frac{\Pr\left(E \mid H\right) \cdot \Pr\left(H\right)}{\Pr\left(E \mid H\right) \cdot \Pr\left(H\right) + \Pr\left(E \mid \bar{H}\right) \cdot \Pr\left(\bar{H}\right)} \end{aligned}$

Let $q$ be the probability that one friend tells the truth. Then, if there are $N$ friends:

$\Pr\left(H \mid E\right) = \frac{q^N \cdot p}{q^N \cdot p + (1 - q)^N \cdot (1 - p)}$
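The step from Bayes’ theorem to the expression above relies on the friends answering independently (as the problem states), so the likelihood of the evidence factorizes. Writing $E_i$ for “friend $i$ says it is raining” (a label introduced here only for illustration):

$\begin{aligned} \Pr\left(E \mid H\right) & = \prod_{i=1}^{N} \Pr\left(E_i \mid H\right) = q^N \\ \Pr\left(E \mid \bar{H}\right) & = \prod_{i=1}^{N} \Pr\left(E_i \mid \bar{H}\right) = (1 - q)^N \\ \Pr\left(H\right) & = p, \quad \Pr\left(\bar{H}\right) = 1 - p \end{aligned}$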

We are told that the number of friends is $N = 3$ and that $q = 2/3$ . Hence:

$\begin{aligned} \Pr\left(H \mid E\right) & = \frac{\left(\frac{2}{3}\right)^3 \cdot p}{\left(\frac{2}{3}\right)^3 \cdot p + \left(\frac{1}{3}\right)^3 \cdot (1 - p)} \\ & = \frac{8p}{8p + (1-p)} \\ & = \mathbf{\frac{8p}{7p + 1}} \end{aligned}$

Let’s see what the posterior probability would be for a few possible values of the prior probability:

$prior = \begin{Bmatrix} 0 \\ 1/4 \\ 1/2 \\ 3/4 \\ 1 \end{Bmatrix} \Rightarrow posterior = \left\{\begin{matrix} 0 \\ 8/11 & \approx & 0.73 \\ 8/9 & \approx & 0.89 \\ 24/25 & = & 0.96 \\ 1 \end{matrix}\right.$

Also, note that the posterior reaches a value of $1/2$ for a prior as low as $1/9 \approx 0.11$ .
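The values above can be reproduced with exact rational arithmetic. This is a minimal Python sketch; the function name `posterior` and its default arguments are choices made here for illustration.

```python
from fractions import Fraction


def posterior(p, q=Fraction(2, 3), n=3):
    """Pr(rain | all n friends say it rains), from Bayes' theorem."""
    numerator = q**n * p
    return numerator / (numerator + (1 - q)**n * (1 - p))


priors = (Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1))
for prior in priors:
    post = posterior(prior)
    # The general formula agrees with the closed form 8p / (7p + 1)
    assert post == 8 * prior / (7 * prior + 1)
    print(f"prior = {prior}, posterior = {post} (about {float(post):.2f})")

# Prior at which the posterior reaches one half
print(posterior(Fraction(1, 9)))  # 1/2
```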

Let’s confirm that the solution above is right by simulating the problem in R (for all possible values of $p$ from $0$ to $1$ , in steps of $0.05$ ):

library(dplyr)
set.seed(12345)
# Simulate N cases for each probability of rain in Seattle (p)
# in the whole [0, 1] range, in steps of 0.05
N <- 5e3 # simulations (per value of p)
q <- 2/3 # prob of a friend telling the truth
data <- data.frame(p = rep(seq(from = 0, to = 1, by = 0.05), each = N))
data <- data %>% rowwise() %>% mutate(r = as.logical(rbinom(1, 1, p))) %>% 
  mutate(A = as.logical(ifelse(r == 1, rbinom(1, 1, q), rbinom(1, 1, 1 - q))), 
         B = as.logical(ifelse(r == 1, rbinom(1, 1, q), rbinom(1, 1, 1 - q))), 
         C = as.logical(ifelse(r == 1, rbinom(1, 1, q), 
                               rbinom(1, 1, 1 - q)))) %>% ungroup()
# p: probability that it is raining in Seattle
# r: it is raining in Seattle
# A, B, C: friend A, B, C says it is raining
  # (not to be confused with he or she is telling the truth)
data %>% sample_n(10) %>% print(n = 10) # show 10 of the 21*N rows

Source: local data frame [10 x 5]

       p     r     A     B     C
   <dbl> <lgl> <lgl> <lgl> <lgl>
1   0.70  TRUE  TRUE  TRUE  TRUE
2   0.65  TRUE  TRUE FALSE  TRUE
3   0.85  TRUE  TRUE  TRUE  TRUE
4   0.90  TRUE FALSE  TRUE FALSE
5   0.80  TRUE FALSE  TRUE  TRUE
6   0.75  TRUE  TRUE  TRUE  TRUE
7   0.95  TRUE  TRUE  TRUE FALSE
8   0.70  TRUE FALSE  TRUE  TRUE
9   0.55  TRUE  TRUE FALSE  TRUE
10  0.60 FALSE  TRUE FALSE  TRUE

# Check proportions of r (compared to p)
data %>% group_by(p) %>% summarize(Real_Prob_Rain = mean(r)) %>% 
  rename(Prob_Rain = p) %>% print(n = Inf)

Source: local data frame [21 x 2]

   Prob_Rain Real_Prob_Rain
       <dbl>          <dbl>
1       0.00         0.0000
2       0.05         0.0460
3       0.10         0.0978
4       0.15         0.1494
5       0.20         0.1938
6       0.25         0.2466
7       0.30         0.2956
8       0.35         0.3390
9       0.40         0.3934
10      0.45         0.4524
11      0.50         0.4994
12      0.55         0.5430
13      0.60         0.5900
14      0.65         0.6586
15      0.70         0.6908
16      0.75         0.7406
17      0.80         0.8004
18      0.85         0.8504
19      0.90         0.8942
20      0.95         0.9492
21      1.00         1.0000

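# Check the share of friends saying it rains: close to q when it is raining
# and close to 1 - q when it is not (despite its name, the column below is
# that share, not the share of cases where all 3 friends say it rains)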
data %>% rename(Raining = r) %>% group_by(Raining) %>% 
  summarize("%Cases_All_Friends_Say_Raining" = mean(A + B + C) / 3) %>% 
  print(n = Inf)
data %>% rename(Prob_Rain = p, Raining = r) %>% 
  group_by(Prob_Rain, Raining) %>% 
  summarize("#Cases" = n(), 
            "%Cases_All_Friends_Say_Raining" = mean(A + B + C) / 3) %>% 
  print(n = 11)

Source: local data frame [2 x 2]

  Raining %Cases_All_Friends_Say_Raining
    <lgl>                          <dbl>
1   FALSE                      0.3322547
2    TRUE                      0.6649218

Source: local data frame [40 x 4]
Groups: Prob_Rain [?]

   Prob_Rain Raining #Cases %Cases_All_Friends_Say_Raining
       <dbl>   <lgl>  <int>                          <dbl>
1       0.00   FALSE   5000                      0.3332667
2       0.05   FALSE   4770                      0.3243885
3       0.05    TRUE    230                      0.6826087
4       0.10   FALSE   4511                      0.3302298
5       0.10    TRUE    489                      0.6659850
6       0.15   FALSE   4253                      0.3359981
7       0.15    TRUE    747                      0.6604195
8       0.20   FALSE   4031                      0.3304391
9       0.20    TRUE    969                      0.6584107
10      0.25   FALSE   3767                      0.3342182
11      0.25    TRUE   1233                      0.6580157
..       ...     ...    ...                            ...

# Filter the evidence: all 3 say it rains
# And group by each value of the prior Prob(it rains), p
data_interest <- data %>% filter(A * B * C == TRUE) %>% group_by(p)
# For each value of the prior (p), what is the posterior?
# (i.e., the proportion of cases where it rains, r == TRUE)
data_interest %>% summarize(Posterior = round(mean(r), 3)) %>% 
  mutate(Posterior_Theoretical = round(8*p / (1 + 7*p), 3)) %>% 
  rename(Prior = p) %>% print(n = Inf)

Source: local data frame [21 x 3]
    
       Prior Posterior Posterior_Theoretical
       <dbl>     <dbl>                 <dbl>
    1   0.00     0.000                 0.000
    2   0.05     0.279                 0.296
    3   0.10     0.498                 0.471
    4   0.15     0.587                 0.585
    5   0.20     0.643                 0.667
    6   0.25     0.705                 0.727
    7   0.30     0.778                 0.774
    8   0.35     0.814                 0.812
    9   0.40     0.845                 0.842
    10  0.45     0.850                 0.867
    11  0.50     0.900                 0.889
    12  0.55     0.903                 0.907
    13  0.60     0.928                 0.923
    14  0.65     0.949                 0.937
    15  0.70     0.933                 0.949
    16  0.75     0.952                 0.960
    17  0.80     0.974                 0.970
    18  0.85     0.977                 0.978
    19  0.90     0.986                 0.986
    20  0.95     0.995                 0.993
    21  1.00     1.000                 1.000

Now let’s consider a more general case, where not only $p$ , but also the number of friends, $N$ , and the probability that any of them tells you the truth, $q$ , are not fixed. This analysis will give us the moral of this problem, which could be: “Always trust your friends (especially the more you have!), provided that they tell the truth more often than not.”

Without loss of generality, let’s assume that $q$ is a positive rational number, and hence can be written as $a/b$ , where $a,b \in \mathbb{N}; b \geq a$ . We want to focus on the case where our friends are more likely to tell the truth than to lie, so the following condition must hold:

$q > \frac{1}{2} \Rightarrow \frac{b}{a} < 2 \Rightarrow \frac{b}{a} - 1 < 1$

We can write the posterior probability that we want to calculate as:

$\begin{aligned}\Pr\left(H \mid E\right) & = \frac{\left(\frac{a}{b}\right)^N \cdot p}{\left(\frac{a}{b}\right)^N \cdot p + \left(1 - \frac{a}{b}\right)^N \cdot (1 - p)} \\ & = \frac{a^N \cdot p}{a^N \cdot p + (b - a)^N \cdot (1-p)} \\ & = \frac{1}{1 + \left(\frac{b}{a} - 1\right)^N \cdot \frac{1-p}{p}} \end{aligned}$

Since $0 \leq b/a - 1 < 1$ , the term $\left(b/a - 1\right)^N$ vanishes as $N$ approaches infinity, so the limit of the posterior probability is $1$ , regardless of the value of $p$ (for $p > 0$ ). (If $q$ were exactly $1/2$ , the answers would carry no information and the posterior would equal the prior for any $N$ .)

I.e., if a sufficiently large number of friends tell you that it is raining in the desert, and the chances that each one of them is messing with you are less than $1/2$ , bring an umbrella with you. For $N = 10$ (and $q = 2/3$ ), the posterior is greater than $0.9$ for a prior as low as $0.009$ .
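The threshold quoted above can be checked by solving the posterior expression for the prior. The sketch below is in Python with exact fractions; `min_prior` is a hypothetical helper obtained by rearranging the posterior formula, not something taken from the original analysis.

```python
from fractions import Fraction


def posterior(p, q, n):
    """Pr(H | E) when all n friends (each truthful with prob q) say it rains."""
    return q**n * p / (q**n * p + (1 - q)**n * (1 - p))


def min_prior(target, q, n):
    """Prior p at which the posterior equals `target`
    (obtained by solving posterior(p, q, n) = target for p)."""
    lie = (1 - q)**n
    return target * lie / (target * lie + (1 - target) * q**n)


q = Fraction(2, 3)
threshold = min_prior(Fraction(9, 10), q, 10)
print(threshold, float(threshold))  # exact value, then its decimal form

# How quickly the evidence overwhelms a sceptical prior of 1%
for n in (1, 3, 5, 10, 20):
    print(n, float(posterior(Fraction(1, 100), q, n)))
```

Any prior above the value returned by `min_prior` yields a posterior above the target, since the posterior increases with the prior.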

Let’s finish by plotting the posterior against the prior for a few possible values of $q$ and $N$ (our case of interest, $N=3, q=2/3$ corresponds to the light blue line in the upper-right graph). As expected, if $N = 0$ (i.e., we have no evidence), the posterior equals the prior.

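library(MASS)    # for fractions()
library(dplyr)   # for %>% and mutate()
library(ggplot2)
# Posterior probability of rain given that all N friends say it is raining
# q: prob of a friend telling the truth; N: number of friends; p: prior
posterior_prob <- function(q, N, p) {
  q^N * p / (q^N * p + (1 - q)^N * (1 - p))
}
p <- seq(0, 1, 0.01)
# 24 rows (6 values of N x 4 values of q) for each value of the prior;
# data.frame() recycles N and q to match the length of Prior
df <- data.frame(Prior = rep(p, each = 24), 
                 N = rep(c(0:3, 5, 10), each = 4), 
                 q = c(0.51, 2/3, 3/4, 4/5)) %>% 
  mutate(Posterior = posterior_prob(q, N, Prior), 
         Friends = as.factor(N), 
         Q = factor(as.character(fractions(q)), 
                    levels = as.character(fractions(unique(q)))))
# Set the plot size (in a Jupyter/IRkernel notebook) before drawing the plot
options(repr.plot.width = 8, repr.plot.height = 8)
ggplot(data = df, aes(x = Prior, y = Posterior, colour = Friends)) + 
  geom_line() + 
  scale_color_hue(c = 240) + 
  labs(title = paste('Posterior vs. Prior for 4 possible\nvalues of', 
                     'the (individual) Likelihood')) + 
  facet_wrap( ~ Q, nrow = 2) + coord_fixed()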

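[Figure: Posterior vs. Prior, one panel per value of q (51/100, 2/3, 3/4, 4/5), one line per number of friends N (0, 1, 2, 3, 5, 10)]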
