May 09, 2019

8:00 AM - 9:00 AM

Registration, Breakfast & Opening Remarks

9:00 AM - 5:00 PM

Machine Learning with Caret
Max Kuhn, RStudio

Join Max Kuhn on a tour through Machine Learning in R. You'll learn about data preparation, model fitting, model assessment and predictions. Prior experience with lm is enough to get started and learn advanced modeling techniques.
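For readers who want a feel for the workshop material, a minimal caret workflow (data split, cross-validated training, prediction) might look like the sketch below; the random forest method and iris data are illustrative assumptions, not the workshop's actual curriculum:

```r
library(caret)

# Split the built-in iris data into training and testing sets
set.seed(42)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# Fit a random forest with 5-fold cross-validation
fit <- train(
  Species ~ ., data = training,
  method    = "rf",
  trControl = trainControl(method = "cv", number = 5)
)

# Assess on held-out data
confusionMatrix(predict(fit, newdata = testing), testing$Species)
```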

9:00 AM - 5:00 PM

Geospatial Statistics and Mapping in R
Kaz Sakamoto, Lander Analytics

Geospatial expert and Columbia Professor Kaz Sakamoto is leading this class on all things GIS. You'll learn about map projections, spatial regression, plotting interactive heatmaps with leaflet and working with shapefiles.
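As a taste of the interactive mapping covered here, a minimal leaflet map might look like the sketch below; the coordinates and marker are illustrative, not part of the course materials:

```r
library(leaflet)

# A basic interactive map: base tiles, a view over Manhattan, and one marker
leaflet() %>%
  addTiles() %>%
  setView(lng = -73.96, lat = 40.78, zoom = 12) %>%
  addMarkers(lng = -73.9626, lat = 40.8075, popup = "Columbia University")
```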

9:00 AM - 5:00 PM

Introduction to Survival Analysis
Elizabeth Sweeney, Weill Cornell

Time-to-event outcomes are common in a variety of statistical applications, but the statistical techniques needed to appropriately analyze data in the presence of censoring, or when predictor variables are not observed at baseline, are not always taught as part of a standard statistics curriculum. This workshop will introduce the statistical techniques needed to address common questions in the context of time-to-event outcomes. Topics covered will include types of censoring, the Kaplan-Meier estimator of the survival function, Cox proportional hazards regression, analysis of time-dependent covariates, and competing risks methods to handle situations where more than one type of event is possible. All common statistical analyses will be demonstrated in R, including use of the survival package and the ggsurvplot function from the survminer package.
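As a flavor of the techniques listed above, a Kaplan-Meier fit and a Cox model on the survival package's built-in lung dataset might look like this sketch (the dataset choice is illustrative, not part of the workshop):

```r
library(survival)

# Kaplan-Meier estimator of the survival function, stratified by sex
km <- survfit(Surv(time, status) ~ sex, data = lung)
summary(km, times = c(90, 180, 365))

# Cox proportional hazards regression with age and sex as covariates
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox)
```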

9:00 AM - 5:00 PM

Git for Data Science
Dan Chen, Virginia Tech

Daniel Chen, author of Pandas for Everyone, has given multiple talks at the New York R Conference about the data science workflow. In this workshop he'll teach how to use Git and project management for better organization and faster iteration.

May 10, 2019

8:00 AM - 8:50 AM

Breakfast & Open Registration

8:50 AM - 9:00 AM

Opening Remarks

9:00 AM - 9:20 AM

Building the Tidyverse From Scratch: Teaching Data Cleaning and Visualization with R-inspired Custom Scratch Blocks
Ludmila Janda, Amplify

At Amplify, we are developing a series of middle school computer science lessons that will teach students to clean data and make graphs using a new visual programming interface based on Scratch by MIT. In the lessons, students will use manipulable blocks rather than written code. We have developed a new set of code blocks that are inspired by both the verbs from the dplyr package and the grammar of graphics approach used by ggplot. This talk will discuss how I have helped a diverse team of stakeholders draw from the principles of the tidyverse and develop this unique approach to teaching data science practices.

9:25 AM - 9:45 AM

Leveraging Player Tracking Data to Contextualize the Difficulty of Passes in the NFL
Mike Band, NFL Next Gen Stats & Lander Analytics

9:50 AM - 10:10 AM

From Tangled Lassos to Boosted Trees: Iterative Research in Practice
Emily Dodwell, AT&T Labs Research

The development of a machine learning-based media targeting strategy for television advertising campaigns introduces computational challenges inherent in the scale of training data. Features derived from customer viewership records necessitate a robust and scalable solution. Emily will discuss potential solutions her team considered to tackle this business problem in R, as well as the theoretical intuition for the final two machine learning algorithms they chose to compare for implementation.

10:10 AM - 10:40 AM

Break & Networking

10:40 AM - 11:00 AM

Everything You Wanted to Know About Making R Packages but Were Afraid to Ask
Emily Robinson, DataCamp

What tools can help you write an R package? Should you write documentation and tests or just focus on the functions? How can you encourage people to use and trust your package? How should you handle bugs and feature requests? Should you submit it to CRAN, and if so, when? Do you need to have a hex sticker? Whether you’re thinking about making your first package, sharing one on GitHub, or wondering what’s changed about making packages in the last few years, this talk is for you.

11:05 AM - 11:25 AM

R: Then and Now
Jared P. Lander, Lander Analytics

R has changed a lot since the meetup was founded 10 years ago. Back then we were using base graphics (or lattice) and the apply family of functions, and we didn't have pipes. At the time there were an impressive 1,800 packages on CRAN; now there are over 15,000, extending R's reach far beyond its traditional domain of statistics and machine learning into publishing, website building and video generation. The community has grown and changed dramatically during that time, with the New York meetup alone going from 25 to over 10,000 members. During this talk we'll go through a then-and-now of R code and community to palpably see how everything has changed.
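The then-and-now contrast might be sketched like this: a split/apply idiom from the meetup's early days next to today's piped dplyr equivalent (the mtcars example is illustrative, not taken from the talk):

```r
# Then: base R with the apply family
sapply(split(mtcars$mpg, mtcars$cyl), mean)

# Now: dplyr with pipes
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))
```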

11:30 AM - 11:50 AM

Building Reproducible and Replicable Projects
Dan Chen, Virginia Tech

An organized project makes a happy data scientist and data science team. How should your projects be organized? What about the reports your project creates? We hear a lot about pipelines, but how do you make them? How do you start to automate your pipelines, and how do you make your datasets, figures, and reports replicable and reproducible as new or updated datasets come in? We'll talk about the various ways you can organize your project, for everyone from new R users to seasoned R users, and how to use build systems to keep track of the various parts of your pipeline.

11:50 AM - 1:00 PM

Lunch & Networking

1:00 PM - 1:20 PM

Using Statistical Methods to Estimate Coefficients in Allometric Models
Krista Watts, United States Military Academy

Allometric models are relevant to a variety of fields; for instance, they are used to classify overfatness in children, to determine energy needs for the Army, and to determine how much of a drug is required to dose different-sized patients. We will examine three recent studies that used statistical methods to estimate coefficients from allometric models with applications in diverse fields. Body Mass Index (BMI) uses the fact that in adult European populations weight is generally proportional to height squared. It has long been recognized that BMI is highly correlated with percent body fat in adult European populations; however, it is not well tested in other populations. Using data from a nationwide study, we examine the appropriate scaling relationship in children. With data from a similar nationwide study in India, we consider whether weight scales to height squared in tribal populations and the general Indian population. Finally, we investigate the relationship between a variety of biometric measurements (head circumference, leg length, etc.) and height, and how those measurements might differ based on gender.

1:25 PM - 2:05 PM

Solve All Your Statistics Problems Using P-Values
Andrew Gelman, Columbia

There's been a lot of hype in recent years about Bayes, machine learning, etc., using statistics to solve problems from protein folding to survey weighting, from reading CAT scans to recognizing cat pictures, prediction and causal inference. But can we really trust any of these claims? Only if p < 0.05. In this series of slides, we present a method for determining statistical significance for any problem in statistics or machine learning, and we discuss how the so-called replication crisis in science could be resolved, if people would just treat all statistically significant results as real, and all non-significant results as zero.

2:05 PM - 2:35 PM

Break & Networking

2:35 PM - 2:55 PM

Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney, Ursa Labs

3:00 PM - 3:20 PM

Artificial Intelligence Driven Drug Discovery
Michelle Gill, BenevolentAI

At BenevolentAI, we use machine learning to facilitate drug discovery. This talk will introduce the drug discovery process and explain how machine learning maps to these stages. We will then cover challenges specifically related to using machine learning for scientific discovery, and conclude with a specific application of reinforcement learning to generate novel compounds in silico.

3:25 PM - 3:45 PM

Towards systematic evidence generation from real-world healthcare data
David Madigan, Columbia

In practice, our learning healthcare system relies primarily on observational studies generating one effect estimate at a time using customized study designs with unknown operating characteristics and publishing – or not – one estimate at a time. When we investigate the distribution of estimates that this process has produced, we see clear evidence of its shortcomings, including an apparent over-abundance of estimates where the confidence interval does not include one (i.e., statistically significant effects). We propose a standardized process for performing observational research that can be evaluated, calibrated and applied at scale to generate a more reliable and complete evidence base than previously possible, fostering a truly learning healthcare system. We demonstrate this new paradigm by generating evidence about all pairwise comparisons of treatments for hypertension for a relevant set of health outcomes, using nine large electronic healthcare record databases from three continents. In total, we estimate more than 1 million hazard ratios, each using a comparative effectiveness study design and propensity score stratification on par with current state-of-the-art, albeit one-off, observational studies. Moreover, the process enables us to employ negative and positive controls to evaluate and calibrate estimates, ensuring, for example, that the 95% confidence interval includes the true effect size approximately 95% of the time. The result set consistently reflects current established knowledge where known, and its distribution shows no evidence of the faults of the current process.

3:45 PM - 4:15 PM

Break & Networking

4:15 PM - 4:35 PM

Personalizing mental healthcare at scale
Adam Chekroud, Spring Health

Depression is the world's leading cause of disability, and almost 1 in 4 people will suffer some kind of mental illness each year. However, most people don’t get a diagnosis, don’t get treatment, or don’t fully recover. Adam Chekroud will talk about how Spring Health uses data to improve mental healthcare at scale, and how statistics help guide clinical decisions throughout the process.

4:40 PM - 5:00 PM

Deep Learning Isn’t Hard, I Promise
Jacqueline Nolis, Nolis, LLC

Deep learning sounds complicated and difficult, but it’s really not. Thanks to packages like Keras, you can get started with only a few lines of R code. Once you understand the basic concepts, you will be able to use deep learning to make AI-generated humorous content! In this talk I’ll give an introduction to deep learning by showing how you can use it to make a model that generates weird pet names like: Shurper, Tunkin Pike, and Jack Odins. If you understand how to make a linear regression in R, you can understand how to create fun deep learning projects.
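To illustrate the "few lines of R code" claim, a tiny Keras model definition might look like the sketch below; the layer sizes are arbitrary assumptions, and this is not the pet-name model from the talk:

```r
library(keras)

# A small feed-forward network defined in a few lines
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(10)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss      = "binary_crossentropy",
  metrics   = "accuracy"
)
```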

5:00 PM - 5:10 PM

Closing Remarks

May 11, 2019

9:00 AM - 9:50 AM

Breakfast & Open Registration

9:50 AM - 10:00 AM

Opening Remarks

10:00 AM - 10:20 AM

I’ll Have What She’s Having (and Other Models of Consumer Behavior)
Gabriela Hempfling, Chop't

This talk will focus on predictive models in R and, more broadly, their use in industry. As our experience online gets more personalized, organizations are increasingly making assumptions about who we are and how to engage us. Meanwhile, consumers have limited access to the context that drives their online experience. Gabriela sets out to use that context to build some personal analytics and ask whether we can use our understanding of data science to become more self-aware.

10:25 AM - 10:45 AM

Rn’t u glad u put R in prod
Heather Nolis, T-Mobile

Congrats! You built a model that the business wants to run in production! But what even is production? How will your model get there? What potential pitfalls will you hit? How can you advocate for putting your R model in production, instead of rewriting it in a different language? How can you do this all yourself as a data scientist? In this talk, I will walk through the steps involved in preparing an R model for production using containers (Docker) and container orchestration (Kubernetes), whether at a big company like T-Mobile or for a side project you desperately want to share with the world in a scalable, fault-tolerant manner. You’ll go from having R code that runs in RStudio on your laptop to that same code running safely on a server in the cloud.

10:45 AM - 11:15 AM

Break & Networking

11:15 AM - 11:35 AM

Hockey Analysis in R: Public and Private Perspectives
Namita Nandakumar, Philadelphia Eagles

There have been many catalysts for data-driven research in sports. Fans want to know if a certain player is secretly bad, if their team has a real shot at winning a championship, and if their front office is making good decisions. Teams want to know how to optimize in-game and player personnel decision-making throughout the year, from the regular season and playoffs to free agency and the amateur draft. We will explore and discuss both perspectives by analyzing publicly available NHL data in R to create a simple win probability model as well as identify team draft tendencies.

11:40 AM - 12:00 PM

Neuroimaging Analysis in R
Elizabeth Sweeney, Weill Cornell

Brain structural magnetic resonance imaging (sMRI) is a tool that uses a magnetic field to produce detailed images of the brain. Brain sMRI is most commonly used to diagnose disease and monitor disease progression, and is critically important for disease research. I will introduce the basics of working with brain sMRI data in R. The R packages for these processing steps are housed on Neuroconductor, an open-source platform for rapid testing and dissemination of reproducible computational imaging software. To conclude, I will introduce two packages that I authored that are also housed on Neuroconductor, sublime and oasis. These packages are both used on brain sMRI in patients with multiple sclerosis to identify, or ‘segment’, areas of the brain that contain white matter lesions.

12:05 PM - 12:25 PM

An Introduction to Statistical Decision Theory, or: #ABYLFOYPE - Always Be Integrating Your Loss Function Over Your Posterior Estimate
Jim Savage, Schmidt Futures

Making sound choices can be thought of as choosing between competing uncertain forecasts, each being the consequence of a choice. Statistical decision theory offers a principled method for making such choices: we simply compare how we feel about each forecast in a formal setting. In this talk, Jim walks through the ingredients required for conducting formal decision analysis in R and Stan. He provides examples from his career making frontier-market and social impact investments.

12:25 PM - 1:35 PM

Lunch & Networking

1:35 PM - 1:55 PM

Reproducibility in an Office World: Tools for Crossing the Abyss
Noam Ross, ROpenSci & EcoHealth Alliance

Many data scientists operate at the interface between two cultures and workflows: programmatic data science and WYSIWYG office applications. This noisy interface impedes reproducibility and is often maddening to practitioners in both camps. I will discuss failures and successes of crossing this uncanny valley and present a series of new packages for working collaboratively in mixed teams for reproducibility, joy and harmony.

2:00 PM - 2:20 PM

Becoming a better finance practitioner
Soumya Kalra, R-Ladies NYC

There are a number of different ways academic work can be applied in the field of finance. In this talk we will explore how to understand and replicate results from academia in finance. I will share the techniques that have worked well for me and provide some working examples. Because we will be using R, we will also explore tools that give us both model flexibility and reproducibility.

2:25 PM - 2:45 PM

parsnip: a tidy interface for models
Max Kuhn, RStudio

The tidyverse has mostly been about data ingestion, manipulation, and visualization. The tidymodels packages are designed to bring modern interfaces to modeling. parsnip is a package that can be used to create models in R that generalize to different computational engines (e.g., R, Python, TensorFlow, Spark, etc.) using a common syntax.
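A minimal parsnip example of the engine-swapping idea might look like this sketch (the random forest specification and iris data are illustrative assumptions, not material from the talk):

```r
library(parsnip)

# Define the model once...
rf_spec <- rand_forest(mode = "classification", trees = 500)

# ...then fit the same specification with different computational engines
fit_ranger <- fit(set_engine(rf_spec, "ranger"),
                  Species ~ ., data = iris)
fit_rf     <- fit(set_engine(rf_spec, "randomForest"),
                  Species ~ ., data = iris)
```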

2:45 PM - 3:15 PM

Break & Networking

3:15 PM - 3:35 PM

Using R to defend immigrants' rights at the ACLU
Brooke Watson, ACLU

The ACLU uses litigation and advocacy to protect and expand the civil rights of immigrants. Often, this involves extracting, cleaning, and analyzing data from federal agencies to fact-check government statements, detect patterns of civil rights abuse, and ensure that immigration agencies are following the law. Through the lens of 3 immigration cases, this talk will walk through 10 functions that enable R users to extract, validate, and visualize patterns and anomalies in data.

3:40 PM - 4:00 PM

This Talk is on Fire: Using Twitter and Google to Track Fires in NYC
Amanda Dobbyn, EarlyBird Software

Fires happen. In this talk we'll use the rtweet (Twitter) and ggmap (Google Maps) API packages to find out where and when they occur in NYC, all wrapped into a drake pipeline.

4:05 PM - 4:25 PM

Cooking Up Statistics: The Science & The Art
Letisha Smith, Teachers Pay Teachers

After adopting the New Year's Resolution to be healthier, I immediately recognized a common flaw with recommended meal plans. Daily dishes often contain their own set of ingredients that are rarely reused and consequently go to waste. Traditionally, this problem has been solved by finding recipes that use leftover ingredients. However, this approach is reactive, not proactive. Therefore, I began to ponder if machine learning could be used to optimize meal prep. In this talk, I will share lessons learned while using R to maximize the number of meals made with a minimal amount of ingredients.

4:25 PM - 4:35 PM

Closing Remarks