I'm Juan Jose (or just Juanjo). Here I'd like to share a little about me and some of the projects I worked on to earn my last two master's degrees: an academic one in **Statistical & Computational Information Processing** from *Universidad Politécnica de Madrid*, and a professional one in **Information and Data Science** from the University of California, Berkeley.

I am originally from Madrid, Spain, but I moved to the SF Bay Area in February 2015. While I was earning my master's in Telecommunication Engineering, I worked at Ericsson and later as a tutor of differential and integral calculus, physics, and digital electronic circuits (I love teaching!). After that, I worked for 9 years in the Test & Measurement industry: first as a sales engineer at a distributor of T&M solutions (for all kinds of telecom networks: mobile, Ethernet, CaTV, WiFi, ...), and later at Yokogawa, where I soon became sales director of the T&M division for Spain and Portugal. Yokogawa's oscilloscopes, DAQ systems, and power analyzers are aimed at industries such as energy, transportation, and power electronics, so I had to move from bits to volts and from dBm to megawatts. My work mainly involved helping organizations get data (from telecom networks first, and then from wind farms, solar parks, nuclear plants, transformers, electric motors, and so on) to check that everything worked properly and conformed to standards and, more importantly, to find any problem and its cause. Above all, it meant caring about customers' needs.

I really liked my job, but I also wanted to be in an even more technical position. I've loved math problems^{[1]} and puzzles since I was a kid, and I didn't even have to look for them: if I was at Starbucks and saw 14 people using a laptop, 9 of them (64.3%) a Mac, I found myself wondering whether the number of Macs could be said to be significantly higher than the number of PCs^{[2]}. I was also always trying to get valuable insights from the limited sales data we had (it was a small market). I eventually realized that I wanted to dig deeper into the data (lots of it and of all kinds!) and be able to extract unobserved patterns, infer causality through experiments, and make reliable predictions and well-informed decisions; I wanted to be a data scientist. That's why I went back to my *alma mater* to earn another master's degree, this time in Statistics, and right after that another one in Data Science from UC Berkeley.

^{[1]} Like the "Seattle rain" problem: click **here** to read my solution.

^{[2]} It turns out that it is, but only at the 20% significance level. For such a small sample, at least 72.0% of the 14 laptops (that is, 11 of them) would have to be Macs for their proportion to be significantly higher than 0.5 at the *usual* 5% level.

In any case, that sample, apart from being very small, is not representative of the whole population! With a **random** sample of a thousand laptops, 52.6% of Macs would be enough to say that they are significantly more prevalent than PCs at the 5% level; with a sample of a million laptops, fewer than 50.1% of Macs would be enough. If, like me, you are curious about the minimum number of positive cases, this is the expression I obtained (click **here** to read how):

$$\begin{aligned} N_1 &\geq \left \lceil \frac{N}{2} \cdot \left(1 + \frac{t_{.95,N-1}}{\sqrt{N+t_{.95,N-1}^2-1}}\right) \right \rceil \\ &\simeq \left \lceil \frac{N}{2} + 0.822 \cdot \sqrt{N} \right \rceil\end{aligned}$$
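As a rough check of these figures, here is a minimal Python sketch. It assumes a one-sided, one-sample t-test of the Mac proportion against 0.5 on 0/1 data, which reproduces the expression above but may not be exactly how the linked derivation proceeds. The function names `t_statistic` and `min_positive` are illustrative, the value 1.771 for $t_{.95,13}$ is taken from a standard t table, and for very large samples the normal quantile is used as a stand-in for the t quantile.

```python
import math
from statistics import NormalDist


def t_statistic(n: int, n1: int) -> float:
    """One-sample t statistic for H0: p = 0.5, with n1 positives out of n."""
    p = n1 / n
    # The sample variance of 0/1 data is p * (1 - p) * n / (n - 1),
    # so the standard error of the mean is sqrt(p * (1 - p) / (n - 1)).
    return (p - 0.5) / math.sqrt(p * (1 - p) / (n - 1))


def min_positive(n: int, t_crit: float) -> int:
    """Smallest count satisfying the inequality above for a given critical value."""
    return math.ceil(n / 2 * (1 + t_crit / math.sqrt(n + t_crit**2 - 1)))


T_95_13 = 1.771  # assumed: tabulated t quantile (0.95, 13 degrees of freedom)
Z_95 = NormalDist().inv_cdf(0.95)  # assumed large-sample stand-in for the t quantile

print(t_statistic(14, 9) < T_95_13)   # True: 9 of 14 is not significant at 5%
print(min_positive(14, T_95_13))      # 11
print(min_positive(1_000_000, Z_95) / 1_000_000)  # just under 0.501
```

With 10 Macs out of 14 the statistic is still below the critical value, and with 11 it is above it, which agrees with the minimum of 11 given by the expression.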

University of California, Berkeley – School of Information

**Website** – **Backend Code** – **Frontend Code**

Course: *Capstone* – **Computer Vision, Scalable, Classification** – **Python, OpenCV, TensorFlow, AWS (ECS, S3, DynamoDB), Flask, JavaScript, Google Charts**

A scalable cloud-based video monitoring system (with motion detection, face counting, and image recognition) using Raspberry Pis. Most of the processing is done in the cloud, so the computing power of the Raspberry Pi is not a limitation.

Course: *Machine Learning at Scale* – **Shortest path, PageRank** – **MrJob, Python, AWS EC2, AWS S3**

Two of the assignments of the course, using a graph dataset of almost half a million nodes.

Course: *Machine Learning* – **kNN, NB, Random Forests, SVMs, SGD, GMMs, Feature engineering** – **Python, Scikit-Learn, Matplotlib**

A Kaggle competition: predictions of the predominant kind of tree cover in four wilderness areas located in the Roosevelt National Forest of northern Colorado, from strictly cartographic variables such as elevation and soil type.

Course: *Storing and Retrieving Data* – **Information retrieval** – **Hadoop, Hive, Spark, Python, AWS EC2, Tableau**

A pipeline that combines data from the Indeed API and the U.S. Census Bureau to select the best locations for data scientists based on the number of job postings, housing cost, etc.

Course: *Data Visualization and Communication* – **Data visualization** – **Tableau, SQLite**

An interactive website to visualize the NYC Citi Bike bicycle-sharing service.

Course: *Field Experiments* – **Causal inference (Experiments, Heterogeneous treatment effects, Attrition)** – **R**

A randomized controlled trial in which 300+ participants were assigned to a control group (compared to no one) or to one of two test groups (compared to someone better or to someone worse) to evaluate the effect of competition.

Course: *Capstone* – **Research design, Experiments, Wavelets, Clustering** – **R**

A project for Google (working at Conento) to measure the effect of YouTube ads. Responsible for the whole project (excluding the econometric model to estimate the increase in advertising spend in the test group): designed a matched-pair, cluster-randomized experiment, which involved selecting the test and control groups from a sample of 50+ cities in Spain based on their sales-wise similarity over time.

Course: *Data Mining* – **Neural networks, Decision trees, Logistic regression** – **SAS Enterprise Miner**

Predictions from a sample of 45,000+ customers.

Course: *Time Series* – **ARIMA, Transfer functions** – **SPSS, Demetra+**

Forecasts based on exponential smoothing, ARIMA, and transfer function (using petrol price as independent variable) models.

University of California, Berkeley – School of Information

- Parallel computing for unsupervised & supervised learning of Big Data using **Hadoop**, **MrJob**, and **Spark** (all in **Python**). *k*-means clustering, gradient descent, shortest path, SVM, PageRank...
- 13 assignments (including the two listed in the Portfolio).

- Nearest neighbors, naive Bayes, decision trees, logistic regression, gradient descent, neural networks, (*k*-means & hierarchical) clustering, GMMs, dimensionality reduction, network analysis, ..., using **Python** (mainly **Scikit-Learn**).
- 3 assignments (and the final project listed in the Portfolio).

- Classical Linear Regression (assumptions, causality, instrumental variables, ...) and Time Series models (ARIMA and GARCH).
- 8 weekly assignments and 3 labs, using **R**.

- Blocking, clustering, covariates, heterogeneous treatment effects, one-sided non-compliance, natural experiments, difference in differences, regression discontinuity, attrition, mediation, ...
- 5 problem sets (and the final project listed in the Portfolio), using **R**.

- Introduction to different types of quantitative research methods and statistical techniques for analyzing data (using **R**).
- 3 lab assignments and a final exam.

**«**Juanjo's main strengths were his ability to solve complex problems by developing innovative ideas, combined with his smart focus on time and resources to execute them. He worked hard to make sure results were achieved, adapting very well to our work environment and building an excellent relationship with his mates. He is intelligent, positive, and rigorous, and a very generous person that is just a pleasure to work with.**»**

**Macarena Estevez**, CEO.

Master of Information and Data Science – UC Berkeley

**«**Juanjo was an excellent student in my course on experiments and causal inference. He asked terrific questions, turned in great problem sets, and did a very conscientious job on a team project experimenting with how perceptions of competition affect productivity in math problems.**»**

**David Reiley**, professor.

**«**Juanjo was my student at UC Berkeley in a master's-level class on the design and analysis of randomized experiments for corporate research. He was without a doubt one of the stand-out students. I would highly recommend his skills. His performance was excellent.**»**

**David Broockman**, professor.

**«**I was impressed by the presentation Juanjo made in my class on research design and applications for data analysis. Juanjo demonstrated great storytelling skills, giving a convincing demonstration of the need for his proposal. He can articulate a good research question – simple, direct, specific, and yet still challenging – and can handle tough questions from an audience. He has a strong grasp of research design, with really great description of why he recommended the adoption of an experimental design and good discussion of the method itself.**»**

**Peter Norlander**, professor.

**«**Juanjo is that wonderful combination of capable, smart, hard-working guy who also happens to be highly personable. I'm happy to recommend him.**»**

**Annette Greiner**, professor.

M.S. in Statistical and Computational Information Processing – Universidad Politécnica de Madrid

**«**Juanjo was an excellent student in my Master's course "Neural Networks and Statistical Learning" at the Technical University of Madrid. He showed a very high motivation and capability not only to grasp fundamental aspects but also to address real problems. In addition, Juanjo is a very open and communicative person which makes him easy to share ideas and organize tasks.**»**

**Pedro J. Zufiria**, professor.

Sales Director, T&M – Yokogawa

**«**I had the opportunity to work with Juan Jose and it was really easy to choose and run the installation. Not a lot of people understand the difficulties to choose an appropriated equipment to do a specialized job. Juan Jose does! Congratulations!**»**

**Ramon Santos Yus**, client.

**«**Juan Jose is easy to work with, has a strategic mind, a genuine interest in people and other cultures. He easily understands complex matters and is a real team player.**»**

**Johan Waldelius**, colleague.