Lise Getoor’s Responsible Data Science (2/26/2019)

Her goals for this talk are to teach, excite, caution, and give tools to navigate the field of data science.

What is data science? Is it an emerging discipline, or is it just a fad?

A quote from Cathryn Carson of UC Berkeley suggests that it is a new field, emerging from the ubiquity of data collection that began with the rise of the internet and accelerated with the spread of smartphones.

UC Berkeley is going all-in, building a large program around data science. The end goal is for all students to eventually take an introductory data science class to gain an understanding of the basics of data: their collection and their utility.

1. Basics

Types of algorithms:

  i) Straight-line algorithms: The most common. Input > program > output.
  ii) Rule-based algorithms: Include conditional if-then statements. These are sometimes known as expert systems.
  iii) Data-driven algorithms: Use data collected from an arbitrary source to make inferences about how the world works. These are sometimes called machine learning algorithms; one well-known type is the recommender system.
  iv) Randomized algorithms: Involve some form of randomization. Think MCMC.
  v) Deep learning: Sometimes called deep neural networks. Data-driven algorithms designed for large data sets, in which inference is sacrificed for predictive power. These are something of a black box because the model structure is abstract, with no direct connection between inputs and outputs.
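To make the first three types concrete, here is a minimal Python sketch (all names and data invented for illustration) contrasting a straight-line computation, a hand-written rule, and a rule fit from data:

```python
# i) Straight-line: a fixed sequence of steps, no branching on the input.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

# ii) Rule-based: behavior encoded as hand-written if-then rules.
def triage(temp_c):
    if temp_c >= 38.0:
        return "fever"
    elif temp_c <= 35.0:
        return "hypothermia"
    return "normal"

# iii) Data-driven: the decision rule is fit from examples, not hand-coded.
def fit_threshold(samples):
    """Learn a cutoff as the midpoint between the two class means."""
    pos = [x for x, label in samples if label]
    neg = [x for x, label in samples if not label]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

cutoff = fit_threshold([(39.1, True), (38.5, True), (36.6, False), (36.9, False)])
print(cutoff)  # the learned rule now classifies new inputs against this cutoff
```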

2. Research

Data are complex and may be structured in many different ways. Often, data are taken out of whatever natural structure they start in (think tensors) and flattened into a data table. This can lead to incorrect assumptions, such as inferential independence between records. One way to combat the shortcomings arising from these assumptions is collective reasoning, which attempts to account for inferential dependence.
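For intuition, here is a small, hypothetical Python sketch of collective reasoning on a toy friendship graph: rather than scoring each node independently from its own evidence, each pass updates a node's score using its neighbors' current scores, so linked predictions inform one another.

```python
# Toy collective inference: scores propagate along graph edges.
edges = {"ann": ["bob"], "bob": ["ann", "cal"], "cal": ["bob"]}
scores = {"ann": 0.9, "bob": 0.5, "cal": 0.2}  # initial per-node evidence

for _ in range(10):  # iterate until the joint scores stabilize
    scores = {
        node: 0.5 * scores[node]
        + 0.5 * sum(scores[nbr] for nbr in nbrs) / len(nbrs)
        for node, nbrs in edges.items()
    }

print(scores)  # bob's score now reflects both neighbors, not just his own evidence
```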

One application of collective reasoning is information integration. Many data science projects pull data from many different sources and in many different formats. Two of the biggest challenges in using relational information across data sources are the uncertainty inherent in the data-generating processes, and the need to work at scale in order to harness all pertinent information.

Example: Inferring participant positions on a topic in an online debate. One challenge is to maintain the privacy of the participants.

Example: Article recommendation. Challenge: Similarity between users and between articles. There are lots of complex interdependencies and uncertainty here.

Project: Probabilistic Soft Logic (PSL): a programming language that combines logic and probability. Weighted rules capture soft dependencies and hard constraints. The resulting code is legible, and inference reduces to convex optimization, which makes it possible to write scalable models that support inference and prediction over degrees of similarity.
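The talk did not show PSL code, but the core idea behind its weighted rules can be sketched in a few lines of Python, assuming the Łukasiewicz relaxation PSL is built on (this is the underlying math, not PSL's actual API): truth values live in [0, 1], and each grounded rule contributes a convex hinge penalty equal to its distance to satisfaction.

```python
# Sketch of how a PSL-style weighted rule becomes a convex penalty.
# Example rule:  Friends(a,b) & Likes(a,x) -> Likes(b,x)

def and_luk(a, b):
    return max(0.0, a + b - 1.0)           # Lukasiewicz soft conjunction

def rule_loss(weight, body, head):
    return weight * max(0.0, body - head)  # hinge: penalize body truth > head truth

friends_ab, likes_ax, likes_bx = 1.0, 0.9, 0.3
body = and_luk(friends_ab, likes_ax)
print(rule_loss(2.0, body, likes_bx))      # 2.0 * max(0, 0.9 - 0.3) = 1.2

# MAP inference minimizes the weighted sum of such hinges over all rule
# groundings; each term is convex in the truth values, which is what
# makes inference scalable.
```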

Project: Cyber-bullying detection (done by Sabina Tomkins): The model read Twitter data and identified cyber-bullying better than the state-of-the-art model.

Project: Social Trust Models: These infer trust relationships between individuals. One model classified social groups based on who was a friend and who was antagonistic; another classified social status in an ordinal context; a third inferred trust between individuals.

Several other projects were mentioned, which involved many different types of online interactions using social media, games, and message boards.

3. Caution

What can go wrong? (A lot)

Example: Amazon Prime Now delivery areas. An analysis of the areas offered the Now service showed that predominantly white neighborhoods received the service and predominantly black neighborhoods did not. The pattern held in Atlanta, Chicago, and Boston.

Example: Amazon trained a resume-sorting model on the resumes of current employees, who are predominantly male. This skewed the model and resulted in gender discrimination.

Example: Google Translate gender-associates words, e.g., male with doctor and female with nurse.

Example: A recidivism risk model discriminated against black defendants: recidivism was underpredicted for white defendants and overpredicted for black defendants.
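To make the asymmetry concrete, here is a toy Python sketch (the records and group labels are hypothetical, not the actual audit data) that computes error rates per group: over-prediction shows up as a high false positive rate, under-prediction as a high false negative rate.

```python
# Each record: (group, predicted_high_risk, actually_reoffended)
records = [
    ("A", True, False), ("A", True, True), ("A", False, True), ("A", True, False),
    ("B", False, True), ("B", False, False), ("B", True, True), ("B", False, True),
]

def rates(group):
    rows = [r for r in records if r[0] == group]
    fp = sum(1 for _, pred, actual in rows if pred and not actual)
    fn = sum(1 for _, pred, actual in rows if not pred and actual)
    neg = sum(1 for _, _, actual in rows if not actual)
    pos = sum(1 for _, _, actual in rows if actual)
    return fp / neg, fn / pos  # (false positive rate, false negative rate)

print(rates("A"))  # group A: flagged too often (high FPR) -> over-predicted
print(rates("B"))  # group B: missed too often (high FNR) -> under-predicted
```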

Why did these things happen?

  1. Biased data: Selection bias, societal bias, institutional bias. Garbage in, garbage out.
  2. Automation bias: Humans trust algorithms disproportionately, even in the face of contradictory evidence.

These can result in algorithmic discrimination. Algorithms can amplify, operationalize, and legitimize biases.

Fairness is now a huge topic in data science, and research articles on fairness have been rapidly proliferating in the past few years. There is even some legislation that helps combat this by assigning humans to review algorithmic decisions.

What else is wrong?

  1. Predictive accuracy for many of these models is very low.
  2. Magical thinking: People think of these models as magical and infallible
  3. Frame problem: The models are typically crude, so they lose the complexity of the real world, leading to big errors.
  4. Values: These models are optimizing a metric. Who is in control of this metric? What about their biases?
  5. Buggy software: Programmers make mistakes. An incorrect line of code can have damaging repercussions.
  6. Algorithms shape people.

4. Advice

To CS: There are things to keep in mind. Data are not objective. Technology is not neutral. There is a moral imperative to fully understand what your model is doing and why it is doing it. New research is needed to boost understanding of how many of these algorithms work. People should be educated about the flaws in many of these models.

To collaborators of CS: Computer scientists love abstractions and simplifications. They like logic. They may not have high social acuity. They are not likely to be trained in ethics. By and large, they want to make the world a better place.

5. What is responsible data science?

Responsibility means that we should be working on hard societal problems, like the so-called wicked problems: problems that are hard even to formulate, let alone solve.

6. Take-aways

We need to improve statistical literacy; we need to be able to design and critique these systems; and we need to collaborate to make sure these methods work well, not just to explain how the world works, but to shape it in a way that promotes equality and a better quality of life for all humans.