Coursera: Introduction to Data Science
Instructor: Bill Howe, University of Washington
Period: 30 June 2014 to 10 September 2014
Status: currently taking it, including exams and certificate
Note: Introduction to Data Science was, overall, quite a nice course. The lectures are rather abstract without drifting into the incomprehensible. The assignments, by contrast, are very practical and, if you do not know the technology in question at all, a plunge into the deep end, though a manageable one if you know at least the basics. The technologies used are virtual machines (there is one for the course), GitHub, Python, SQL, SQLite, MapReduce, Pig, Elastic MapReduce, R, Kaggle, and Tableau. With that many, it is clear that none of them is covered in depth, but you get your feet wet and can build on it yourself.
My biggest criticism would be that a bit more structure in releasing the lectures and assignments could have spared the students a lot of confusion and uncertainty. A page with an overview, plus a predictable simultaneous release of lecture and assignment every Tuesday, would really have been enough.
Course Syllabus
Part 0: Introduction
Examples, data science articulated, history and context, technology landscape
- Flavor network and the principles of food pairing
- The Expression of Emotions in 20th Century Books
- Google Flu Trends
- Google Flu Trends: The Limits of Big Data
- Italy scientists guilty of manslaughter
- Data Science Venn Diagram
- What is data science?
- The Seven Secrets of Successful Data Scientists
- Deja VVVu: Others Claiming Gartner’s Construct for Big Data
- The Fourth Paradigm: Data-Intensive Scientific Discovery
- The End of Theory: The Data Deluge Makes the Scientific Method Obsolete
- Responses to the end of theory
Part 1: Data Manipulation at Scale
Databases and the relational algebra
Parallel databases, parallel query processing, in-database analytics
MapReduce, Hadoop, relationship to databases, algorithms, extensions, languages
Key-value stores and NoSQL; tradeoffs of SQL and NoSQL
- How Vertica Was the Star of the Obama Campaign, and Other Revelations
- 1981 Turing Award Lecture, Relational Database: A Practical Foundation for Productivity
- MAD Skills: New Analysis Practices for Big Data (pdf)
- Mining of Massive Datasets, Chapter 3
- MapReduce and Parallel DBMSs: Friends or Foes? (pdf)
- MapReduce: A Flexible Data Processing Tool
- Scalable SQL and NoSQL Data Stores (pdf)
- The Hadoop Distributed File System
- Record Linkage: Similarity Measures and Algorithms (pdf)
Part 2: Analytics
Topics in statistical modeling: basic concepts, experiment design, pitfalls
Topics in machine learning: supervised learning (rules, trees, forests, nearest neighbor, regression), optimization (gradient descent and variants), unsupervised learning
- A Handbook of Statistical Analyses Using R, Chapter 3 (pdf)
- Gregory Park on overfitting to the leaderboard in a Kaggle Competition
- Why Most Published Research Findings Are False
- Benford’s Law
- Frequentism and Bayesianism: A Practical Introduction
- Frequentism and Bayesianism II: When Results Differ
- Frequentism and Bayesianism III: Confidence, Credibility, and why Frequentism and Science do not Mix
- Frequentism and Bayesianism IV: How to be a Bayesian in Python
- Top 10 Algorithms in Data Mining, Knowledge and Information Systems
- Mining of Massive Datasets, Chapter 1
- A Few Useful Things to Know about Machine Learning
Part 3: Communicating Results
Visualization, data products, visual data analytics
Provenance, privacy, ethics, governance
- The Joy of Stats
- Tools for Data Enthusiasts (vimeo)
- A Tour through the Visualization Zoo
- Big Ethics for Big Data
- Unreported Side Effects of Drugs Are Found Using Internet Search Data
- Data Skepticism
- Eight, No, Nine, Problems With Big Data
- Big data: are we making a big mistake?
- The backlash against big data
- Gartner Hype Cycle
- New Truths That Only One Can See
- Why Most Published Research Findings Are False
- Whom the Gods Would Destroy, they First Give Real-Time Analytics
Part 4: Special Topics
Graph Analytics: structure, traversals, analytics, PageRank, community detection, recursive queries, semantic web
Quizzes
There will be eight assignments in total, two of which are optional.
There will be four structured programming assignments: two in Python, one in SQL, and one in R.
There will also be two open-ended assignments graded by peer assessment: one in visualization, and one in which you will participate in a Kaggle competition.
Finally, there will be two optional assignments: one involving an open-ended real-world project submitted by external organizations with real needs, and one involving processing a large dataset on AWS.