Speakers

Andrew Gelman

Professor,
Department of Statistics and Department of Political Science, Columbia University
@StatModeling

Asmae Toumi

Director of Analytics and Research,
PursueCare
@asmae_toumi

Max Kuhn

Scientist,
RStudio
@topepos

Caitlin Hudon

Principal Data Scientist,
OnlineMedEd
@beeonaposy

Jared P. Lander

Chief Data Scientist,
Lander Analytics
@jaredlander

Megan Robertson

Senior Data Scientist,
Nike
@leggomymeggo4

David Robinson

Principal Data Scientist,
Heap
@drob

Danielle Oberdier

Founder,
DiKayo Data
@dikayodata

Alexa Fredston

Postdoctoral Associate,
Ecology, Evolution, and Natural Resources, Rutgers University
@AFredston

Wes McKinney

CEO & Founder,
Ursa Computing
@wesmckinn

Reenah Nahum Muldavski

Chief Data Scientist,
Data Science Services

Jonathan Bratt

Senior Data Scientist,
Macmillan Learning

COL Krista Watts

Vice Dean for Operations,
United States Military Academy

Adam Chekroud

President & Co-founder,
Spring Health
@itschekkers

Sonia Ang

Sr. Cloud Solutions Architect, Advanced Analytics and AI,
Microsoft
@galleontrade

Mike Band

Analyst,
NFL Next Gen Stats
@MBandNFL

Rachael Tatman

Senior Developer Advocate,
Rasa
@rctatman

Jonathan Hersh

Assistant Professor of Economics & Management Science,
Chapman University Argyros School of Business
@DogmaticPrior

Chrys Wu

Consultant & Community Builder,
Matchstrike
@MacDiva

Daniel Chen

Doctoral Candidate,
Virginia Tech
@chendaniely

Sarah Catanzaro

Partner,
Amplify Partners
@sarahcat21

Jeroen Janssens

CEO & Principal Trainer,
Data Science Workshops
@jeroenhjanssens

Mayari Montes de Oca

Research Scientist,
NYU Global TIES for Children
@Mayari_MOca

Bernardo Lares

Marketing Science Partner,
Facebook
@LaresDJ

Igor Skokan

Marketing Science Partner,
Facebook

Workshops

Workshops will be held online on Wednesday, September 1st, one week prior to the conference.

Join Max Kuhn on a tour through Machine Learning in R, with an emphasis on using the software rather than on general explanations of model building. This workshop is an abbreviated introduction to the tidymodels framework for modeling. You'll learn about data preparation, model fitting, model assessment, and prediction. The focus will be on data splitting and resampling, data pre-processing and feature engineering, and model creation, evaluation, and tuning. Prerequisites: some experience with modeling in R and the tidyverse (you don't need to be an expert); prior experience with lm is enough to get started and learn more advanced modeling techniques. In case participants can't install the packages on their machines, RStudio Server Pro instances pre-loaded with the appropriate packages and GitHub repository will be available.
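
For a flavor of the workflow the workshop covers, here is a minimal tidymodels sketch (assuming the built-in mtcars data and a plain linear model rather than the workshop's actual examples):

    library(tidymodels)

    # Split the data and set up cross-validation resamples
    set.seed(123)
    car_split <- initial_split(mtcars, prop = 0.8)
    folds <- vfold_cv(training(car_split), v = 5)

    # A small preprocessing recipe and a model specification
    rec <- recipe(mpg ~ ., data = training(car_split)) %>%
      step_normalize(all_predictors())
    spec <- linear_reg() %>% set_engine("lm")

    # Combine into a workflow, estimate performance by resampling,
    # then fit on the training set and predict on the held-out test set
    wf <- workflow() %>% add_recipe(rec) %>% add_model(spec)
    fit_resamples(wf, resamples = folds) %>% collect_metrics()
    final_fit <- fit(wf, data = training(car_split))
    predict(final_fit, new_data = testing(car_split))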

Geospatial expert and Columbia professor Kaz Sakamoto is leading this class on all things GIS. You'll learn about map projections, spatial regression, plotting interactive heatmaps with leaflet, and working with shapefiles. This course is designed for those who are familiar with R and want to incorporate spatial data into their work. The AM session will be an introduction to Geographic Information Systems (GIS), simple features (the sf package), Coordinate Reference Systems (CRS), and map-making basics. The PM session will introduce spatial operations, geometric operations, statistical geography, spatial point pattern analysis, and geostatistics. By the end of the day participants should be able to read and work with spatial data, understand projections, apply geoprocessing techniques, and have a basic grasp of spatial statistics.
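
A minimal sketch of the kinds of operations covered, assuming a hypothetical shapefile path and the sf and leaflet packages:

    library(sf)
    library(leaflet)

    # Read a shapefile (hypothetical path) and inspect its coordinate reference system
    shp <- st_read("data/neighborhoods.shp")
    st_crs(shp)

    # Reproject to WGS84 (EPSG:4326) so leaflet can display it
    shp_wgs84 <- st_transform(shp, crs = 4326)

    # Interactive map of the reprojected polygons
    leaflet(shp_wgs84) %>%
      addTiles() %>%
      addPolygons(weight = 1, fillOpacity = 0.6)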

Daniel Chen, author of Pandas for Everyone, has given multiple talks at the New York R Conference about the data science workflow. In this workshop he'll teach how to use Git and project management for better organization and faster iteration. The workshop has four parts: 1) Git on Your Own, 2) Working with Remotes, 3) Git with Branches, and 4) Collaborating with Git. Part I covers creating a Git repository, adding and committing files, looking at differences between files, looking at your history, moving around your history, reverting changes, and undeleting files. Part II covers moving from your computer to a remote (e.g., GitHub, Bitbucket, GitLab), syncing your files by pushing and pulling, and resolving conflicts. Part III covers creating branches, moving between branches, making commits in branches, merging branches, using branches with remotes, pull requests (also known as merge requests), merging pull requests, and syncing up with your remote. In Part IV, we will discuss how the skills you learned apply directly to collaborating with other people.
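
As a rough illustration of the Part I commands (a sketch only; the workshop works directly at the command line, and the calls are wrapped in system2() here just to keep the example in R):

    # Initialize a repository, stage a file, and commit it
    system2("git", c("init"))
    system2("git", c("add", "analysis.R"))
    system2("git", c("commit", "-m", shQuote("Add analysis script")))

    # Inspect history and differences
    system2("git", c("log", "--oneline"))
    system2("git", c("diff", "HEAD~1"))

    # Undo the most recent commit by adding a new commit that reverses it
    system2("git", c("revert", "--no-edit", "HEAD"))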

The tidyverse is a powerful collection of packages following a standard set of principles for usability. During this workshop David will demonstrate an exploratory data analysis in R using tidy tools such as dplyr and ggplot2 for data transformation and visualization, bringing in other tidyverse packages as they're needed. He'll narrate his thought process as attendees follow along and offer their own solutions. The workshop expects some familiarity with dplyr and ggplot2—enough to work with data using functions like mutate, group_by, and summarize and to create graphs like scatterplots or bar plots in ggplot2. These concepts will be re-introduced to ensure a smooth workshop, but it isn't designed for brand new R programmers. The workshop is designed to be interactive and participants are expected to type along on their own keyboards.
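
The assumed level is roughly the following (a small sketch using the built-in mtcars data, not the dataset used in the workshop):

    library(dplyr)
    library(ggplot2)

    # Summarize fuel efficiency by number of cylinders
    mtcars %>%
      group_by(cyl) %>%
      summarize(mean_mpg = mean(mpg), n = n())

    # Scatterplot of weight versus fuel efficiency, colored by cylinder count
    ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
      geom_point() +
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")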

In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, we’ll teach the essential elements of answering causal questions in R through causal diagrams, and causal modeling techniques such as propensity scores, inverse probability weighting, and matching. We’ll also show that by distinguishing predictive models from causal models, we can better take advantage of both tools; prediction modeling plays a role in establishing many causal models, such as propensity scores. You’ll be able to use the tools you already know--the tidyverse, regression models, and more--to answer the questions that are important to your work. (Virtual Only Workshop)
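
To make the modeling pieces concrete, here is a minimal base R sketch of propensity scores and inverse probability weighting (assuming a hypothetical data frame df with a binary treatment, an outcome, and confounders age and income; the workshop's own packages and examples may differ):

    # Propensity score: probability of treatment given confounders
    ps_model <- glm(treatment ~ age + income, data = df, family = binomial())
    df$ps <- predict(ps_model, type = "response")

    # Inverse probability weights for the average treatment effect
    df$ipw <- ifelse(df$treatment == 1, 1 / df$ps, 1 / (1 - df$ps))

    # Weighted outcome model; the coefficient on treatment estimates the effect
    outcome_model <- lm(outcome ~ treatment, data = df, weights = ipw)
    summary(outcome_model)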

The Unix command line, although invented decades ago, is an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful, command-line tools (like grep, jq, and parallel), you can quickly obtain, scrub, and explore your data. This hands-on workshop is based on the second edition of the book Data Science at the Command Line (coming out in September), written by instructor Jeroen Janssens. You'll learn how to build fast data pipelines, how to leverage R at the command line (and vice versa), and how to create ad-hoc data visualizations. There'll be a Docker image available with all the command-line tools pre-installed, so you can follow along regardless of which operating system you're running. (Virtual Only Workshop)

Agenda

Registration & Opening Remarks: 8:30 AM - 9:00 AM EST

Open Registration: 8:30 AM - 8:50 AM EST
Opening Remarks: 8:50 AM - 9:00 AM EST

A data scientist writes code throughout every stage of a project, from exploratory data analysis to evaluating models and summarizing results. Once you've developed a proof of concept or minimum viable product, it can be a daunting task to put it into production. How do you organize and adapt all the code that you created? What can you do to make sure the code catches errors and alerts you to them? Do you feel overwhelmed by everything you need to do? By attending this presentation you will learn tips and strategies for organizing your own code during a project to make creating production code easier. You will also learn how to optimize your code to catch errors and create effective documentation.
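
One common pattern along these lines is validating inputs and documenting functions as you write them (a small illustrative sketch, not code from the talk):

    #' Compute month-over-month revenue growth
    #'
    #' @param revenue Numeric vector of monthly revenue, oldest first.
    #' @return Numeric vector of growth rates, one per month after the first.
    compute_growth <- function(revenue) {
      # Fail early with an informative error instead of returning silent NAs
      stopifnot(is.numeric(revenue), length(revenue) >= 2, all(revenue > 0))
      diff(revenue) / head(revenue, -1)
    }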

If your data analyses involve coding, then you know how liberating it is to use and create functions. They hide complexity, improve testability, and enable reusability. In this talk I explain how you can really set your R code free: by turning it into a command-line tool. The command line can be a very flexible and efficient environment for working with data. It specializes in combining tools written in all sorts of languages (including R and Python), running them in parallel, and applying them to massive amounts of (streaming) data. Although the command line itself has quite a learning curve, turning your existing R code into a tool is, as I demonstrate, a matter of a few steps. I discuss how your new tool can be combined with existing tools in order to obtain, scrub, explore, and model data at the command line. Finally, I share some best practices regarding interface design.
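
A bare-bones version of the idea looks like this (a sketch using only base R; the talk covers a more complete interface design):

    #!/usr/bin/env Rscript
    # Save as summarize.R, make it executable (chmod +x summarize.R),
    # then run it as: ./summarize.R data.csv

    args <- commandArgs(trailingOnly = TRUE)
    if (length(args) < 1) stop("Usage: summarize.R <csv-file>")

    dat <- read.csv(args[1])
    # Print to standard output so the tool composes with other command-line tools
    print(summary(dat))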

Whether you're trapped in a quiet place with no headphones or simply don't want to pause your favorite TV show, written R tutorials often rival video tutorials in efficiency and, of course, in how quickly you can transfer code to your own environment. But what makes a written tutorial stand out among all the resources out there? In this talk, I will show you how to determine the right length for a given tutorial, which R packages are best suited to this type of teaching, and most importantly, how to make your written tutorials personal and engaging.

Break & Networking: 10:10 AM - 10:40 AM EST

In the rapidly growing field of environmental data science, R is the language of choice for many researchers seeking to forecast the impacts of extreme weather events, chronicle global biodiversity loss, or map injustice in environmental health. In this talk, I’ll showcase some of the challenges frequently encountered by environmental data scientists, and the R tools we use to solve them — from data wrangling to spatial analysis and Bayesian models. Through a series of real examples that required fitting models to messy data on biodiversity, oceans, and climate change, I’ll demonstrate how ecological and environmental researchers are leveraging R to help save the planet.

Parallel computing has become easier and easier in R over the years thanks to packages like parallel and future. But CPUs have core counts in the single or double digits, while GPUs give us access to thousands of cores, significantly speeding up our work. Taking advantage of the GPU for machine learning has never been easier thanks to torch, xgboost, catboost, and Stan. We'll look at how to fit these models on the GPU and how to use lower-level code to perform custom operations on the GPU.
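
As one example, xgboost can be pointed at the GPU with a single parameter (a sketch assuming a GPU-enabled build of xgboost and the built-in mtcars data):

    library(xgboost)

    x <- as.matrix(mtcars[, -1])
    y <- mtcars$mpg

    # tree_method = "gpu_hist" builds the trees on the GPU instead of the CPU
    fit <- xgboost(
      data = x, label = y, nrounds = 50, verbose = 0,
      params = list(objective = "reg:squarederror", tree_method = "gpu_hist")
    )
    head(predict(fit, x))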

Finding data science learning and teaching materials is not what educators and learners struggle with these days. Rather, the current challenge is finding domain-specific materials that resonate with learners. In the medical sciences, many of our learners only know about spreadsheets and treat our data as a visualization, using colors, spaces, one-off tables, and side calculations. They lack the vocabulary to talk about and work with data in a programmatic manner that lets them integrate with other data scientists.

This is a talk intended for data science educators and the education community. We adapted surveys from The Carpentries, “How Learning Works”, and “Teaching Tech Together” to create a learner self-assessment survey to discover learner personas in the biomedical sciences by clustering survey results. These personas and findings were used to create a data science curriculum that is grounded in data literacy topics around spreadsheets and good data practices.

Lunch & Networking: 11:50 AM - 1:00 PM EST

Data scientists are uniquely empowered to solve big business problems, and analyzing and understanding churn is one area which lends itself to impactful analysis. Using my recent work on churn as a case study, this talk will cover:

- How to get buy-in for “big business problem” projects, and how to structure analysis projects such that you’re adding and delivering value at multiple checkpoints
- How to turn exploratory data analysis into useful deliverables
- How to tackle subscription churn through holistic analysis and thoughtful, actionable recommendations

Before any given 4th down play, an NFL head coach must decide between keeping the offense on the field and going for it, or calling for the special teams unit to attempt a field goal or punt. Nearly every team has at least one staff member crunching the numbers in these situations over the course of a game. Our team at Next Gen Stats, in collaboration with Amazon Web Services, is taking 4th down and two-point decision analytics to the next level. Powered by a series of machine learning models, the Next Gen Stats Decision Guide analyzes crucial coaching decisions in real time. Should the team go for it, or kick? Let’s see what the numbers say…
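
The underlying logic can be sketched as an expected-value comparison (a toy illustration with made-up numbers, not the Next Gen Stats models):

    # Hypothetical 4th-and-2 near midfield; every number below is made up
    p_convert    <- 0.55   # probability of converting the 4th down
    ep_converted <- 4.2    # expected points if converted
    ep_failed    <- -0.5   # expected points if the attempt fails
    p_fg         <- 0.60   # probability of making the field goal
    ep_missed    <- -0.7   # expected points after a missed kick

    ev_go   <- p_convert * ep_converted + (1 - p_convert) * ep_failed
    ev_kick <- p_fg * 3 + (1 - p_fg) * ep_missed
    c(go_for_it = ev_go, kick = ev_kick)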

How does a group of strangers with minimal knowledge of football strategy go on to win sports’ biggest data science competition during a pandemic? Asmae Toumi will share her team’s process, lessons learned, and how they leveraged the tidyverse, the tidymodels ecosystem, and other R packages to gain a competitive edge.

Break & Networking: 2:10 PM - 2:40 PM EST

In recent years, data science models have increased the efficiency and value of products across many industries. In many cases these models affect the product’s backend logic in a fundamental way (such as directing customers in queues), and classic “split testing” is hard or even impossible to implement. A/B testing methodology for frontend features is solid and well defined; however, the methodology for testing the value of backend enhancements is not as firm or as widely applied. In this talk we will cover experimental design best practices for testing and measuring the value of data science models.
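
As a concrete design step, standard power calculations still apply to backend experiments (a minimal base R sketch with hypothetical conversion rates):

    # Sample size needed per arm to detect a lift from 10% to 11% conversion
    power.prop.test(p1 = 0.10, p2 = 0.11, sig.level = 0.05, power = 0.8)

    # Analyzing a finished two-arm test: conversions and exposures per arm
    prop.test(x = c(1050, 1150), n = c(10000, 10000))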

Depression is the world’s leading cause of disability, and almost 1 in 4 people will suffer some kind of mental illness each year. However, most people don’t get a diagnosis, don’t get treatment, or don’t fully recover. Adam Chekroud will talk about how Spring Health uses data to improve mental healthcare at scale, and how statistics help drive better outcomes throughout the process.

Nearly every day, data teams and venture capitalists implicitly express their priorities and outlook by deciding how to allocate budget or capital to different technology initiatives. In the past five years, both groups have prioritized machine learning and business intelligence initiatives by investing in the tools and platforms that support these projects. They have not, however, invested in tools and platforms to advance causal inference. In this talk, we will discuss why investments in causal inference may have a higher ROI. We’ll then study the evolution of the MLOps stack to identify opportunities to unlock increased investment in causal inference and expand its adoption in industry.

Break & Networking: 3:50 PM - 4:20 PM EST

Language is fundamentally different from other types of data, and it’s inevitable that you’ll run into some language-specific issues. This talk will cover some of the most common types of errors I’ve seen data analysts and machine learning engineers make with language data, from ignoring the differences between text genres to treating text as written speech to assuming that all languages work like English. We’ll also talk about ways to avoid these common mistakes (and recover gracefully if you’ve already made them).

Modern language models use tokenizers based on subword-level vocabularies. Words not present in the vocabulary are broken into subword tokens. This subword tokenization is generally unrelated to the morphological structure of the word.

It is intuitively appealing to consider a tokenizer that uses a morpheme-level vocabulary to split words into meaningful units. Implementing such a tokenizer, while conceptually straightforward, presents a number of practical challenges.

We present an approach to solving these challenges and introduce {morphemepiece}, an R package that implements a new tokenization algorithm for breaking down (most) words into their smallest units of meaning.
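
To illustrate the general idea of subword tokenization, here is a toy greedy longest-match tokenizer in base R with a made-up vocabulary (this is not the {morphemepiece} algorithm itself):

    # Continuation pieces are marked with "##", as in WordPiece-style vocabularies
    vocab <- c("un", "##believ", "##able", "do", "##ing")

    tokenize_word <- function(word, vocab, unk = "[UNK]") {
      tokens <- character(0)
      start <- 1
      n <- nchar(word)
      while (start <= n) {
        end <- n
        match <- NA_character_
        while (end >= start) {
          piece <- substr(word, start, end)
          if (start > 1) piece <- paste0("##", piece)
          if (piece %in% vocab) {
            match <- piece
            break
          }
          end <- end - 1
        }
        if (is.na(match)) return(unk)  # no valid segmentation found
        tokens <- c(tokens, match)
        start <- end + 1
      }
      tokens
    }

    tokenize_word("unbelievable", vocab)  # "un" "##believ" "##able"
    tokenize_word("doing", vocab)         # "do" "##ing"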

Closing Remarks: 5:05 PM - 5:15 PM EST
Happy Hour with Cointreau & The Botanist Gin: 5:30 PM - 6:15 PM EST
Open Registration: 9:30 AM - 9:50 AM EST
Opening Remarks: 9:50 AM - 10:00 AM EST

The nature of how classes are executed at the United States Military Academy makes it a unique opportunity to assess the effectiveness of different pedagogical approaches. Core classes generally have dozens of small sections, often with students randomly assigned to sections, facilitating the opportunity for cluster randomized trials. I will discuss several recent studies including sectioning by demonstrated aptitude for a subject and use of technology in the classroom.

Microsoft is championing how to bring change to society through the use and implementation of AI, and its impact has been felt from a socio-technological point of view. A stark reminder came when it launched the Twitter chatbot Tay, which ended up spouting bigoted rhetoric, a realization that the human element must be considered when designing AI systems. Along with innovation comes a responsibility to make sure that the future is secure. We need to take a thoughtful approach to ensure we create a future we want to see and not one we fear. This presentation revolves around an ethical framework with five core principles of fairness, reliability, safety, privacy and security, underpinned by transparency and accountability.

Break & Networking: 10:45 AM - 11:15 AM EST

This talk explores the many facets of K-pop, starting with groups like BTS and Blackpink and diving further into fandom to see trends and influences. In this “fun with R” / “hobby R” talk we’ll also explore things like release and promotion schedules, connections, and influences so we can enjoy and understand more about how this genre has been shaping popular culture.

My beloved Chicago train ridership data, like so many other things, was severely impacted by the pandemic. If I want to build models, what should I do? This talk will describe some approaches for mitigating the effect that the pandemic had on L train ridership.

Wes McKinney shares an update about recent developments in Apache Arrow, the multi-language toolbox for accelerated data interchange and in-memory processing. Wes introduces some new directions for the Arrow project and discusses why Ursa Computing, which he founded last year, has joined forces with BlazingSQL and the pioneers of RAPIDS and other open source projects to form Voltron Data.

9/11 Tribute: 12:25 PM - 12:30 PM EST
Lunch & Networking: 12:30 PM - 1:35 PM EST

If your organization uses a database, then you have a lot to gain by building an R package to make interfacing with that database easy and intuitive. Packages like DBI and odbc handle the creation of the database connection, and dbplyr lets you translate dplyr syntax to SQL. But there’s a missing layer in between, such that working with a connection object still doesn’t feel like exploring and joining tables in memory. In this talk I’ll introduce the dbcooper package, which wraps any database connection in a set of R functions, making it easy to create a database-specific R package. dbcooper makes the management of connections transparent so that you engage with the database through prefixed functions, and it generates autocomplete-friendly accessors for each table for fast exploratory data analysis. I’ve used this general approach as the foundation of a data science ecosystem at several companies, and I’ll show an example of using the package to explore a public BigQuery database.
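
For context, the layers dbcooper builds on look roughly like this (a sketch using an in-memory SQLite database; dbcooper’s own prefixed-function interface is not shown here):

    library(DBI)
    library(dplyr)

    # DBI handles the connection; here an in-memory SQLite database for illustration
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "mtcars", mtcars)

    # dbplyr translates dplyr verbs into SQL that runs inside the database
    tbl(con, "mtcars") %>%
      group_by(cyl) %>%
      summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
      collect()

    dbDisconnect(con)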

One of the benefits of a long career is that it gives us an opportunity to reflect upon all the ways our thinking has changed. In this talk I’ll go over several places where my thinking has changed, for each considering why I previously took a stance that I currently disagree with, and where I anticipate my views might change further. I hope this discussion will be useful in helping each of you to introspect on your own past and future intellectual development.

Break & Networking: 2:40 PM - 3:10 PM EST

After decades spent trying to teach computers to think, we now face the problem that AI and ML models often know more than they can communicate to us about why they make certain predictions. Interpretable machine learning, such as LIME or Shapley values, tries to shift that balance by presenting a view into the inner workings of our complex models. My collaborator Selina Carter built some machine learning models and I did some interpretable AI, and I somehow convinced her to let me run a randomized controlled trial with 685 employees at a large firm, with half of them receiving an interpretable AI treatment. Now before you go all Andrew Gelman, I want to say that YES of course I used Bayes to analyze the data. I hadn’t used Bayes since the JAGS days and I want to say rstanarm is fantastic and the people who created it should be showered with praise. Why am I talking here? They’re the ones who should be celebrated.
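
The Bayesian analysis piece looks roughly like this (a sketch with a hypothetical data frame trial containing a binary improved outcome and a treatment indicator, not the actual study code):

    library(rstanarm)

    # Bayesian logistic regression for the effect of the interpretable-AI treatment
    fit <- stan_glm(
      improved ~ treatment,
      data = trial,
      family = binomial(link = "logit"),
      prior = normal(0, 2.5)
    )
    summary(fit)
    posterior_interval(fit, prob = 0.95)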

Mayarí will talk about the importance of addressing missing data and about her practical experience handling non-response in a longitudinal study with Syrian refugee populations. After a quick overview of multiple imputation, she will discuss some of the main challenges of addressing non-response comprehensively for data that serves multiple analyses and researchers, in contexts where little is known about the data-generating process of each variable, missing data is prevalent, and hundreds of potentially important features are collected. She will share some of the software and settings that were helpful throughout the process, such as the Boruta, randomForest, mice, and miceadds packages. Lastly, she will illustrate the importance of transparency in analysis results regarding the uncertainty introduced by the missing data.
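
A minimal multiple-imputation workflow with mice looks like this (a sketch using the package’s built-in nhanes example data, not the study data):

    library(mice)

    # Impute the missing values m times, fit the model on each completed dataset,
    # then pool the estimates so the uncertainty from imputation is carried through
    imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)
    fits <- with(imp, lm(chl ~ age + bmi))
    summary(pool(fits))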

Robyn is an experimental, semi-automated, open-source Marketing Mix Modeling (MMM) package from Facebook Marketing Science. It uses various machine learning techniques (ridge regression with cross-validation, a multi-objective evolutionary algorithm for hyperparameter optimisation, time-series decomposition for trend and season, gradient-based optimisation for budget allocation, etc.) to measure media channel efficiency and effectiveness and to explore adstock rates and saturation curves. It’s built for granular datasets with many independent variables and is therefore especially suitable for digital and direct-response advertisers with rich data sources.

Closing Remarks: 4:20 PM - 4:30 PM EST

Jon Krohn interviews Drew Conway live in this special post-conference event.

Sponsors

Gold

Spring Health
RStudio

Silver

R Consortium

Bronze

Visiting Nurse Service of New York

Supporting

Pearson
Springer
Manning
Chapman & Hall/CRC, Taylor & Francis Group

Vibe

Cointreau
The Botanist

More sponsors to be announced.

If you are interested in being a sponsor for the 2021 New York R Conference, please contact us at info@landeranalytics.com