Juan Jose Carin
Data Scientist
Welcome

I'm Juan Jose (or just Juanjo). Here I'd like to share a little about me and some of the projects I've worked on to earn my last two master's degrees: an academic one in Statistical & Computational Information Processing from Universidad Politécnica de Madrid, and a professional one in Information and Data Science from University of California, Berkeley.

Bio

I am originally from Madrid, Spain, but I moved to the SF Bay Area in February 2015. While I was earning my master's in Telecommunication Engineering, I worked at Ericsson and later as a tutor of differential and integral calculus, physics, and digital electronic circuits (I love teaching!). After that, I worked for 9 years in the Test & Measurement industry: first as a sales engineer at a distributor of T&M solutions (for all kinds of telecom networks: mobile, Ethernet, CaTV, WiFi, ...), and later at Yokogawa, where I soon became head of T&M sales for Spain and Portugal. Yokogawa's oscilloscopes, DAQ systems, and power analyzers are aimed at industries such as energy, transportation, and power electronics, so I had to move from bits to volts and from dBm to megawatts. My work mainly involved helping organizations get data (from telecom networks first, and then from wind farms, solar parks, nuclear plants, transformers, electric motors, and so on) to check that everything worked properly and conformed to standards and, more importantly, to find any problem and its cause. Above all, it meant caring about customers' needs.

I really liked my job, but I also wanted to be in an even more technical position. I've loved math problems[1] and puzzles since I was a kid, and I didn't even have to look for them: if I was at the local Starbucks and saw 14 people using a laptop, 9 of them (64.3%) on a Mac, I found myself wondering whether the number of Macs was significantly higher than the number of PCs[2]. And I was always trying to get valuable insights from the limited amount of sales data we had (it was a small market). I eventually realized that I wanted to dig deeper into the data (lots of it and of all kinds!) and be able to extract unobserved patterns, infer causality through experiments, and make reliable predictions and well-informed decisions; I wanted to be a data scientist. That's why I went back to my alma mater to earn another master's degree, this time in Statistics, and right after that another one in Data Science from UC Berkeley.

  1. Like the "Seattle rain" problem: click here to read my solution.

  2. It turns out that it is, but only at the 20% significance level. For such a small sample, more than 72.0% of the 14 laptops (that is, at least 11) would have to be Macs for their proportion to be significantly higher than 0.5 at the usual 5% level.
    In any case, that sample, apart from being very small, is not representative of the whole population! With a random sample of a thousand laptops, 52.6% Macs would be enough to say that they are significantly more prevalent than PCs at the 5% level; with a sample of a million laptops, fewer than 50.1% Macs would be enough. If you are curious about the minimum number of positive cases (as I was), this is the expression I obtained (click here to read how):
    $$\begin{aligned} N_1 &\geq \left \lceil \frac{N}{2} \cdot \left(1 + \frac{t_{.95,N-1}}{\sqrt{N+t_{.95,N-1}^2-1}}\right) \right \rceil \\ &\simeq \left \lceil \frac{N}{2} + 0.822 \cdot \sqrt{N} \right \rceil\end{aligned}$$
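As a rough check of the expression above, here is a minimal Python sketch. The function names are hypothetical, and the critical value t_{.95,13} ≈ 1.771 is taken from a standard t table rather than computed, so that the example needs only the standard library.

```python
import math


def min_positive_cases(n: int, t_crit: float) -> int:
    """Evaluate the exact expression for N_1, given t_crit = t_{.95, n-1}."""
    ratio = t_crit / math.sqrt(n + t_crit ** 2 - 1)
    return math.ceil(n / 2 * (1 + ratio))


def min_positive_cases_approx(n: int) -> int:
    """Evaluate the large-sample approximation N/2 + 0.822 * sqrt(N)."""
    return math.ceil(n / 2 + 0.822 * math.sqrt(n))


# Assumption: t_{.95,13} is about 1.771 (looked up in a t table, not computed here).
T_95_DF13 = 1.771

print(min_positive_cases(14, T_95_DF13))  # 11
print(min_positive_cases_approx(14))      # 11
```

For the 14-laptop example both forms give 11 Macs. The two can differ by one when the threshold falls very close to an integer, so the approximation is best read as a quick estimate.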

Portfolio

Master of Information and Data Science
University of California, Berkeley – School of Information

SmartCam

Website – Backend Code – Frontend Code

Course: Capstone
Computer Vision, Scalable, Classification
Python, OpenCV, TensorFlow, AWS (ECS, S3, DynamoDB), Flask, JavaScript, Google Charts

A scalable cloud-based video monitoring system (with motion detection, face counting, and image recognition) using Raspberry Pis. Most of the processing is done in the cloud, so the computing power of the Raspberry Pi is not a limitation.

Shortest path and PageRank algorithms applied to the Wikipedia graph dataset

Shortest Path – PageRank

Course: Machine Learning at Scale
Shortest path, PageRank
MrJob, Python, AWS EC2, AWS S3

Two of the assignments of the course, using a graph dataset of almost half a million nodes.

Forest Cover Type Prediction

IPython Notebook – GitHub

Course: Machine Learning
kNN, NB, Random Forests, SVMs, SGD, GMMs, Feature engineering
Python, Scikit-Learn, Matplotlib

A Kaggle competition: predictions of the predominant kind of tree cover in four wilderness areas located in the Roosevelt National Forest of northern Colorado, from strictly cartographic variables such as elevation and soil type.

Redefining the job search process

GitHub – Paper – Website

Course: Storing and Retrieving Data
Information retrieval
Hadoop, Hive, Spark, Python, AWS EC2, Tableau

A pipeline that combines data from the Indeed API and the U.S. Census Bureau to select the best locations for data scientists based on the number of job postings, housing cost, etc.

A fresh perspective on Citi Bike

Website – GitHub

Course: Data Visualization and Communication
Data visualization
Tableau, SQLite

An interactive website to visualize NYC's Citi Bike bicycle-sharing service.

Investigating the Effect of Competition on the Ability to Solve Arithmetic Problems

Paper – GitHub

Course: Field Experiments
Causal inference (Experiments, Heterogeneous treatment effects, Attrition)
R

A randomized controlled trial in which 300+ participants were assigned to a control group or one of two test groups to evaluate the effect of competition (being compared to no one, or to someone better or worse).

M.S. in Statistical & Computational Information Processing
Universidad Politécnica de Madrid – School of Telecom. Eng.

Experimental design to measure the impact of online advertising on the sales of a car manufacturer’s dealer network

Paper – Slides – GitHub

Course: Capstone
Research design, Experiments, Wavelets, Clustering
R

A project for Google (working at Conento) to measure the effect of YouTube ads. I was responsible for the whole project (excluding the econometric model to estimate the increase in advertising spend in the test group): I designed a matched-pair, cluster-randomized experiment, which involved selecting the test and control groups from a sample of 50+ cities in Spain based on their sales-wise similarity over time.

Prediction of customer churn for a mobile network carrier

Paper

Course: Data Mining
Neural networks, Decision trees, Logistic regression
SAS Enterprise Miner

Predictions from a sample of 45,000+ customers.

Different models of Harmonized Index of Consumer Prices (HICP) in Spain

Paper

Course: Time Series
ARIMA, Transfer functions
SPSS, Demetra+

Forecasts based on exponential smoothing, ARIMA, and transfer function (using petrol price as independent variable) models.

Other Projects

Master of Information and Data Science
University of California, Berkeley – School of Information

Machine Learning at Scale

  • Parallel computing for unsupervised & supervised learning of Big Data using Hadoop, MrJob, and Spark (all in Python).
  • k-means clustering, gradient descent, shortest path, SVM, PageRank...
  • 13 assignments (including the 2 listed in the Portfolio).

Machine Learning

  • Nearest neighbors, naive Bayes, decision trees, logistic regression, gradient descent, neural networks, (k-means & hierarchical) clustering, GMMs, dimensionality reduction, network analysis, ..., using Python (mainly Scikit-Learn).
  • 3 assignments (and the final project listed in the Portfolio).

Applied Regression and Time Series Analysis

  • Classical Linear Regression (assumptions, causality, instrumental variables, ...) and Time Series models (ARIMA and GARCH).
  • 8 weekly assignments and 3 Labs, using R.

Field Experiments (a.k.a. Experiments and Causal Inference)

  • Blocking, clustering, covariates, heterogeneous treatment effects, one-sided non-compliance, natural experiments, difference in differences, regression discontinuity, attrition, mediation, ...
  • 5 problem sets (and the final project listed in the Portfolio), using R.

Exploring and Analyzing Data

  • Introduction to different types of quantitative research methods and statistical techniques for analyzing data (using R).
  • 3 lab assignments and a final exam.

Research Design and Applications for Data Analysis

  • An introduction to the data sciences landscape, with a particular focus on learning how to apply data science techniques to uncover, enrich, and answer the questions typically found in industry.
  • Final project: "Field Experiments in Online Advertising."

Referrals

From my LinkedIn profile

Data Scientist – Conento

«Juanjo's main strengths were his ability to solve complex problems by developing innovative ideas, combined with his smart focus on time and resources to execute them. He worked hard to make sure results were achieved, adapting very well to our work environment and building an excellent relationship with his mates. He is intelligent, positive, and rigorous, and a very generous person that is just a pleasure to work with.»

Macarena Estevez, CEO.


Master of Information and Data Science – UC Berkeley

«Juanjo was an excellent student in my course on experiments and causal inference. He asked terrific questions, turned in great problem sets, and did a very conscientious job on a team project experimenting with how perceptions of competition affect productivity in math problems.»

David Reiley, professor.

«Juanjo was my student at UC Berkeley in a master's-level class on the design and analysis of randomized experiments for corporate research. He was without a doubt one of the stand-out students. I would highly recommend his skills. His performance was excellent.»

David Broockman, professor.

«I was impressed by the presentation Juanjo made in my class on research design and applications for data analysis. Juanjo demonstrated great storytelling skills, giving a convincing demonstration of the need for his proposal. He can articulate a good research question – simple, direct, specific, and yet still challenging – and can handle tough questions from an audience. He has a strong grasp of research design, with really great description of why he recommended the adoption of an experimental design and good discussion of the method itself.»

Peter Norlander, professor.

«Juanjo is that wonderful combination of capable, smart, hard-working guy who also happens to be highly personable. I'm happy to recommend him.»

Annette Greiner, professor.


M.S. in Statistical and Computational Information Processing – Univ. Politécnica de Madrid

«Juanjo was an excellent student in my Master's course "Neural Networks and Statistical Learning" at the Technical University of Madrid. He showed a very high motivation and capability not only to grasp fundamental aspects but also to address real problems. In addition, Juanjo is a very open and communicative person which makes him easy to share ideas and organize tasks.»

Pedro J. Zufiria, professor.


Head of Sales, T&M, Spain and Portugal – Yokogawa

«I had the opportunity to work with Juan Jose and it was really easy to choose and run the installation. Not a lot of people understand the difficulties to choose an appropriated equipment to do a specialized job. Juan Jose does! Congratulations!»

Ramon Santos Yus, client.

«Juan Jose is easy to work with, has a strategic mind, a genuine interest in people and other cultures. He easily understands complex matters and is a real team player.»

Johan Waldelius, colleague.