An edition of Doing Data Science (2013)

Doing Data Science

  • 28 Want to read
  • 2 Currently reading
  • 1 Have read

Last edited by MARC Bot
October 6, 2024

Publish Date
Language
English
Pages
408

Previews available in: English

Edition Availability
Doing Data Science
2016, O'Reilly Media, Incorporated
in English
Doing Data Science: Straight Talk from the Frontline
2013, O'Reilly Media, Incorporated
in English
Doing Data Science
2013, O'Reilly Media, Inc, USA
in English

Book Details


Table of Contents

Preface
Page xiii
1. Introduction: What Is Data Science?
Page 1
Big Data and Data Science Hype
Page 1
Getting Past the Hype
Page 3
Why Now?
Page 4
Datafication
Page 5
The Current Landscape (with a Little History)
Page 6
Data Science Jobs
Page 10
A Data Science Profile
Page 10
Thought Experiment: Meta-Definition
Page 13
OK, So What Is a Data Scientist, Really?
Page 14
In Academia
Page 14
In Industry
Page 15
2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Page 17
Statistical Thinking in the Age of Big Data
Page 17
Statistical Inference
Page 18
Populations and Samples
Page 19
Populations and Samples of Big Data
Page 21
Big Data Can Mean Big Assumptions
Page 24
Modeling
Page 26
Exploratory Data Analysis
Page 34
Philosophy of Exploratory Data Analysis
Page 36
Exercise: EDA
Page 37
The Data Science Process
Page 41
A Data Scientist's Role in This Process
Page 43
Thought Experiment: How Would You Simulate Chaos?
Page 44
Case Study: RealDirect
Page 46
How Does RealDirect Make Money?
Page 47
Exercise: RealDirect Data Strategy
Page 48
3. Algorithms
Page 51
Machine Learning Algorithms
Page 52
Three Basic Algorithms
Page 54
Linear Regression
Page 55
k-Nearest Neighbors (k-NN)
Page 71
k-means
Page 81
Exercise: Basic Machine Learning Algorithms
Page 85
Solutions
Page 86
Summing It All Up
Page 90
Thought Experiment: Automated Statistician
Page 91
4. Spam Filters, Naive Bayes, and Wrangling
Page 93
Thought Experiment: Learning by Example
Page 93
Why Won't Linear Regression Work for Filtering Spam?
Page 95
How About k-Nearest Neighbors?
Page 96
Naive Bayes
Page 98
Bayes' Law
Page 98
A Spam Filter for Individual Words
Page 99
A Spam Filter That Combines Words: Naive Bayes
Page 101
Fancy It Up: Laplace Smoothing
Page 103
Comparing Naive Bayes to k-NN
Page 105
Sample Code in bash
Page 105
Scraping the Web: APIs and Other Tools
Page 106
Jake's Exercise: Naive Bayes for Article Classification
Page 109
Sample R Code for Dealing with the NYT API
Page 110
5. Logistic Regression
Page 113
Thought Experiments
Page 114
Classifiers
Page 115
Runtime
Page 116
You
Page 117
Interpretability
Page 117
Scalability
Page 117
Media 6 Degrees Logistic Regression Case Study
Page 118
Click Models
Page 118
The Underlying Math
Page 120
Estimating α and β
Page 122
Newton's Method
Page 124
Stochastic Gradient Descent
Page 124
Implementation
Page 124
Evaluation
Page 125
Media 6 Degrees Exercise
Page 128
Sample R Code
Page 129
6. Time Stamps and Financial Modeling
Page 135
Kyle Teague and GetGlue
Page 135
Timestamps
Page 137
Exploratory Data Analysis (EDA)
Page 138
Metrics and New Variables or Features
Page 142
What's Next?
Page 142
Cathy O'Neil
Page 144
Thought Experiment
Page 144
Financial Modeling
Page 145
In-Sample, Out-of-Sample, and Causality
Page 146
Preparing Financial Data
Page 148
Log Returns
Page 149
Example: The S&P Index
Page 151
Working out a Volatility Measurement
Page 153
Exponential Downweighting
Page 155
The Financial Modeling Feedback Loop
Page 156
Why Regression?
Page 158
Adding Priors
Page 158
A Baby Model
Page 159
Exercise: GetGlue and Timestamped Event Data
Page 162
Exercise: Financial Data
Page 163
7. Extracting Meaning from Data
Page 165
William Cukierski
Page 165
Background: Data Science Competitions
Page 166
Background: Crowdsourcing
Page 167
The Kaggle Model
Page 170
A Single Contestant
Page 170
Their Customers
Page 172
Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
Page 174
Feature Selection
Page 176
Example: User Retention
Page 177
Filters
Page 181
Wrappers
Page 181
Embedded Methods: Decision Trees
Page 184
Entropy
Page 186
The Decision Tree Algorithm
Page 187
Handling Continuous Variables in Decision Trees
Page 188
Random Forests
Page 190
User Retention: Interpretability Versus Predictive Power
Page 192
David Huffaker: Google's Hybrid Approach to Social Research
Page 193
Moving from Descriptive to Predictive
Page 194
Social at Google
Page 196
Privacy
Page 196
Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
Page 197
8. Recommendation Engines: Building a User-Facing Data Product at Scale
Page 199
A Real-World Recommendation Engine
Page 200
Nearest Neighbor Algorithm Review
Page 202
Some Problems with Nearest Neighbors
Page 202
Beyond Nearest Neighbor: Machine Learning Classification
Page 204
The Dimensionality Problem
Page 206
Singular Value Decomposition (SVD)
Page 207
Important Properties of SVD
Page 208
Principal Component Analysis (PCA)
Page 209
Alternating Least Squares
Page 211
Fix V and Update U
Page 212
Last Thoughts on These Algorithms
Page 213
Thought Experiment: Filter Bubbles
Page 213
Exercise: Build Your Own Recommendation System
Page 214
Sample Code in Python
Page 214
9. Data Visualization and Fraud Detection
Page 217
Data Visualization History
Page 217
Gabriel Tarde
Page 218
Mark's Thought Experiment
Page 219
What Is Data Science, Redux?
Page 220
Processing
Page 221
Franco Moretti
Page 221
A Sample of Data Visualization Projects
Page 222
Mark's Data Visualization Projects
Page 227
New York Times Lobby: Moveable Type
Page 227
Project Cascade: Lives on a Screen
Page 230
Cronkite Plaza
Page 231
eBay Transactions and Books
Page 232
Public Theater Shakespeare Machine
Page 234
Goals of These Exhibits
Page 235
Data Science and Risk
Page 235
About Square
Page 236
The Risk Challenge
Page 237
The Trouble with Performance Estimation
Page 240
Model Building Tips
Page 244
Data Visualization at Square
Page 248
Ian's Thought Experiment
Page 249
Data Visualization for the Rest of Us
Page 250
Data Visualization Exercise
Page 251
10. Social Networks and Data Journalism
Page 253
Social Network Analysis at Morningside Analytics
Page 254
Case-Attribute Data versus Social Network Data
Page 254
Social Network Analysis
Page 255
Terminology from Social Networks
Page 256
Centrality Measures
Page 257
The Industry of Centrality Measures
Page 258
Thought Experiment
Page 259
Morningside Analytics
Page 260
How Visualizations Help Us Find Schools of Fish
Page 262
More Background on Social Network Analysis from a Statistical Point of View
Page 263
Representations of Networks and Eigenvalue Centrality
Page 264
A First Example of Random Graphs: The Erdos-Renyi Model
Page 265
A Second Example of Random Graphs: The Exponential Random Graph Model
Page 266
Data Journalism
Page 269
A Bit of History on Data Journalism
Page 269
Writing Technical Journalism: Advice from an Expert
Page 270
11. Causality
Page 273
Correlation Doesn't Imply Causation
Page 274
Asking Causal Questions
Page 274
Confounders: A Dating Example
Page 275
OK Cupid's Attempt
Page 276
The Gold Standard: Randomized Clinical Trials
Page 279
A/B Tests
Page 280
Second Best: Observational Studies
Page 283
Simpson's Paradox
Page 283
The Rubin Causal Model
Page 285
Visualizing Causality
Page 286
Definition: The Causal Effect
Page 287
Three Pieces of Advice
Page 289
12. Epidemiology
Page 291
Madigan's Background
Page 291
Thought Experiment
Page 292
Modern Academic Statistics
Page 293
Medical Literature and Observational Studies
Page 293
Stratification Does Not Solve the Confounder Problem
Page 294
What Do People Do About Confounding Things in Practice?
Page 295
Is There a Better Way?
Page 296
Research Experiment (Observational Medical Outcomes Partnership)
Page 298
Closing Thought Experiment
Page 302
13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
Page 303
Claudia's Data Scientist Profile
Page 304
The Life of a Chief Data Scientist
Page 304
On Being a Female Data Scientist
Page 305
Data Mining Competitions
Page 305
How to Be a Good Modeler
Page 307
Data Leakage
Page 307
Market Predictions
Page 308
Amazon Case Study: Big Spenders
Page 308
A Jewelry Sampling Problem
Page 309
IBM Customer Targeting
Page 309
Breast Cancer Detection
Page 310
Pneumonia Prediction
Page 311
How to Avoid Leakage
Page 313
Evaluating Models
Page 313
Accuracy: Meh
Page 315
Probabilities Matter, Not 0s and 1s
Page 315
Choosing an Algorithm
Page 318
A Final Example
Page 319
Parting Thoughts
Page 320
14. Data Engineering: MapReduce, Pregel, and Hadoop
Page 321
About David Crawshaw
Page 322
Thought Experiment
Page 323
MapReduce
Page 324
Word Frequency Problem
Page 325
Enter MapReduce
Page 328
Other Examples of MapReduce
Page 329
What Can't MapReduce Do?
Page 330
Pregel
Page 331
About Josh Wills
Page 331
Thought Experiment
Page 332
On Being a Data Scientist
Page 332
Data Abundance Versus Data Scarcity
Page 332
Designing Models
Page 333
Economic Interlude: Hadoop
Page 333
A Brief Introduction to Hadoop
Page 334
Cloudera
Page 334
Back to Josh: Workflow
Page 335
So How to Get Started with Hadoop?
Page 335
15. The Students Speak
Page 337
Process Thinking
Page 337
Naive No Longer
Page 339
Helping Hands
Page 340
Your Mileage May Vary
Page 342
Bridging Tunnels
Page 345
Some of Our Work
Page 345
16. Next-Generation Data Scientists, Hubris, and Ethics
Page 347
What Just Happened?
Page 347
What Is Data Science (Again)?
Page 348
What Are Next-Gen Data Scientists?
Page 350
Being Problem Solvers
Page 350
Cultivating Soft Skills
Page 351
Being Question Askers
Page 352
Being an Ethical Data Scientist
Page 354
Career Advice
Page 359
Index
Page 361

Edition Identifiers

Open Library
OL26184456M
ISBN 13
9781449358655
LCCN
2015301588
OCLC/WorldCat
827841776, 868083954, 904949420

Work Identifiers

Work ID
OL17581302W
