Building a team to follow the new Coursera Data Science courses together, starting next week



Hendrik D’Oosterlinck is taking the lead to organize this initiative.


Coursera and Johns Hopkins University are launching a full Data Science training program. Why don’t we build a team and do this together? Please comment on this post if you want to participate.


Let’s do this together!

Learn Data Science from one of the world’s top universities.
Johns Hopkins professors developed the Data Science Specialization to guide you from fundamental principles to advanced competency.
  • Gain hands-on Data Science experience with a Capstone Project
  • Showcase your knowledge with a Verified Certificate on your LinkedIn profile and resume
  • Adapt to your schedule with courses repeating monthly
  • Have unlimited retries for up to two years while available
Happy learning,
Coursera Team

Coursera – Social Network Analysis – University of Michigan

University of Michigan

The Social Network Analysis MOOC started this week on Coursera.
The course is given by Lada Adamic, an associate professor at the University of Michigan who took a sabbatical year to work at Facebook. A year later she’s back with this inspiring course.
Lada Adamic will introduce you to social network mechanics and concepts. The tool of choice in this case is Gephi, a free-to-use graph/network visualisation tool.
This 8-week course combines video lectures with homework assignments in which you will learn to use Gephi and apply the freshly acquired knowledge to real data sets.
The course offers the possibility to apply for a certificate.

As a personal note from Glenn Vanderlinden:

I have already gone through the first couple of units and it looks rather interesting. It makes use of Gephi, which is to an extent an alternative to Neo4j. It might be interesting for people who attended the last Meetup or who are interested in graph/network analysis. I hope this is useful for the community.


Lada Adamic


Coursera – Process Mining – TU Eindhoven – starts Nov 12th



Process Mining: Data science in Action

Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy-to-use software, the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.

Course at a Glance

4-6 hours of work / week
English subtitles

Datasciencebe comes second in the study from @marc_smith using NodeXL SNA Map

Selection criteria: data science OR #datascience Twitter NodeXL SNA Map and Report for Tuesday, 08 July 2014 at 17:54 UTC

From: marc_smith,  Uploaded on: July 08, 2014
  • The graph represents a network of 6,564 Twitter users whose tweets in the requested range contained “data science OR #datascience”, or who were replied to or mentioned in those tweets. The network was obtained from the NodeXL Graph Server on Tuesday, 08 July 2014 at 17:59 UTC.
  • The requested start date was Tuesday, 08 July 2014 at 23:59 UTC and the maximum number of tweets (going backward in time) was 10,000.
  • The tweets in the network were tweeted over the 17-day, 1-hour, 40-minute period from Friday, 20 June 2014 at 21:48 UTC to Monday, 07 July 2014 at 23:28 UTC.
  • There is an edge for each “replies-to” relationship in a tweet, an edge for each “mentions” relationship in a tweet, and a self-loop edge for each tweet that is not a “replies-to” or “mentions”.
  • The graph is directed.
  • The graph’s vertices were grouped by cluster using the Clauset-Newman-Moore cluster algorithm.
  • The graph was laid out using the Harel-Koren Fast Multiscale layout algorithm.
  • The edge colors, widths, and opacities are based on edge weight values. The vertex sizes and opacities are based on followers values.
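
The edge rules listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tweet records and their field names (author, reply_to, mentions) are hypothetical and are not the NodeXL data format, and how NodeXL treats a tweet that both replies to and mentions the same user is not stated in the report.

```python
from collections import Counter

# Hypothetical tweet records; the field names are assumptions for illustration.
tweets = [
    {"author": "alice", "reply_to": "bob", "mentions": []},
    {"author": "alice", "reply_to": None, "mentions": ["bob", "carol"]},
    {"author": "carol", "reply_to": None, "mentions": []},
    {"author": "alice", "reply_to": "bob", "mentions": []},
]


def tweet_edges(tweet):
    """Return the directed (source, target, relationship) edges for one tweet."""
    edges = []
    if tweet["reply_to"]:
        edges.append((tweet["author"], tweet["reply_to"], "replies-to"))
    for user in tweet["mentions"]:
        edges.append((tweet["author"], user, "mentions"))
    if not edges:
        # A tweet that neither replies nor mentions becomes a self-loop.
        edges.append((tweet["author"], tweet["author"], "tweet"))
    return edges


edges = [edge for tweet in tweets for edge in tweet_edges(tweet)]
self_loops = [edge for edge in edges if edge[0] == edge[1]]

# Count how often each (source, target) pair occurs; a count above 1
# corresponds to what the report calls edges with duplicates.
pair_counts = Counter((source, target) for source, target, _ in edges)

print(len(edges), len(self_loops), pair_counts[("alice", "bob")])
```

In this toy data the pair (alice, bob) occurs three times, which is the kind of repetition the edge weight values in the report capture.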

Overall Graph Metrics:

  • Vertices: 6564
  • Unique Edges: 7487
  • Edges With Duplicates: 4294
  • Total Edges: 11781
  • Self-Loops: 5169
  • Reciprocated Vertex Pair Ratio: 0.0284219703574542
  • Reciprocated Edge Ratio: 0.0552729738894541
  • Connected Components: 2411
  • Single-Vertex Connected Components: 1890
  • Maximum Vertices in a Connected Component: 3054
  • Maximum Edges in a Connected Component: 7070
  • Maximum Geodesic Distance (Diameter): 19
  • Average Geodesic Distance: 5.396585
  • Graph Density: 0.000136909565312827
  • Modularity: 0.537045
  • NodeXL Version:
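
Several of these metrics can be cross-checked against each other. The sketch below uses only the figures from the list above. Two points are assumptions rather than statements from the report: that the two reciprocity ratios are related by e = 2r / (1 + r), which holds when every reciprocated vertex pair contributes two edges and every other connected pair contributes one, and that the density is computed over distinct directed edges without self-loops.

```python
import math

# Figures copied from the Overall Graph Metrics list.
vertices = 6564
unique_edges = 7487
edges_with_duplicates = 4294
total_edges = 11781
connected_components = 2411
single_vertex_components = 1890
reciprocated_vertex_pair_ratio = 0.0284219703574542
reciprocated_edge_ratio = 0.0552729738894541
graph_density = 0.000136909565312827

# Unique edges plus edges with duplicates should give the total.
edge_sum = unique_edges + edges_with_duplicates

# Components that contain more than one vertex, and the vertices in them.
multi_vertex_components = connected_components - single_vertex_components
vertices_in_multi = vertices - single_vertex_components

# Assumed identity: with r the reciprocated vertex pair ratio, the
# reciprocated edge ratio is 2r / (1 + r).
r = reciprocated_vertex_pair_ratio
derived_edge_ratio = 2 * r / (1 + r)

# Assumed density definition: distinct non-loop edges / (n * (n - 1)).
implied_distinct_edges = graph_density * vertices * (vertices - 1)

print(edge_sum == total_edges)
print(multi_vertex_components, vertices_in_multi)
print(math.isclose(derived_edge_ratio, reciprocated_edge_ratio, rel_tol=1e-9))
print(round(implied_distinct_edges))
```

Under these assumptions the figures are mutually consistent: the density implies roughly 5,898 distinct non-loop edges, a number inferred here and not given in the report.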

Top 10 Vertices, Ranked by Betweenness Centrality:

  1. kirkdborne
  2. datasciencebe
  3. kdnuggets
  4. analyticbridge
  5. jackwmson
  6. wsj
  7. datasciencedojo
  8. coursera
  9. zeynep
  10. data_nerd
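
For readers unfamiliar with the ranking criterion: betweenness centrality counts how many shortest paths between other vertices pass through a given vertex. The following is a minimal sketch of Brandes' algorithm for an unweighted directed graph, run on a made-up toy graph. It is not NodeXL's implementation, and NodeXL's exact settings (directed or undirected, normalized or not) are not stated in the post.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an unweighted directed graph.

    adjacency maps every vertex to a list of its out-neighbors; every
    vertex must appear as a key, even if it has no outgoing edges.
    """
    scores = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search counting shortest paths from source.
        visit_order = []
        predecessors = {v: [] for v in adjacency}
        path_count = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate dependencies, farthest vertices first.
        dependency = {v: 0.0 for v in adjacency}
        while visit_order:
            w = visit_order.pop()
            for v in predecessors[w]:
                share = path_count[v] / path_count[w]
                dependency[v] += share * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    return scores


# Toy graph (hypothetical): two accounts reach two others only via "hub".
toy_graph = {
    "a": ["hub"],
    "b": ["hub"],
    "hub": ["c", "d"],
    "c": [],
    "d": [],
}
scores = betweenness_centrality(toy_graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores[ranking[0]])
```

In the toy graph, "hub" lies on all four shortest paths between the other accounts, so it ranks first. An account ranks high in a list like the one above when it sits between many other participants in a similar way.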

Introduction to Data Science


Join the data revolution. Companies are searching for data scientists. This specialized field demands multiple skills not easy to obtain through conventional curricula. Introduce yourself to the basics of data science and leave armed with practical experience extracting value from big data. #uwdatasci


About the Course

Commerce and research are being transformed by data-driven discovery and prediction. Skills required for data analytics at massive levels – scalable data management on and off the cloud, parallel algorithms, statistical modeling, and proficiency with a complex ecosystem of tools and platforms – span a variety of disciplines and are not easy to obtain through conventional curricula. Tour the basic techniques of data science, including both SQL and NoSQL solutions for massive data management (e.g., MapReduce and contemporaries), algorithms for data mining (e.g., clustering and association rule mining), and basic statistical modeling (e.g., linear and non-linear regression).

Course Syllabus

Part 0: Introduction 

  • Examples, data science articulated, history and context, technology landscape

Part 1: Data Manipulation at Scale

  • Databases and the relational algebra  
  • Parallel databases, parallel query processing, in-database analytics 
  • MapReduce, Hadoop, relationship to databases, algorithms, extensions, languages  
  • Key-value stores and NoSQL; tradeoffs of SQL and NoSQL

Part 2: Analytics 

  • Topics in statistical modeling: basic concepts, experiment design, pitfalls
  • Topics in machine learning: supervised learning (rules, trees, forests, nearest neighbor, regression), optimization (gradient descent and variants), unsupervised learning

Part 3: Communicating Results 

  • Visualization, data products, visual data analytics  
  • Provenance, privacy, ethics, governance 

Part 4: Special Topics

  • Graph Analytics: structure, traversals, analytics, PageRank, community detection, recursive queries, semantic web
  • Guest Lectures

Recommended Background

We expect you to have intermediate programming experience and familiarity with databases, roughly equivalent to two college courses.  We will have four programming assignments: two in Python, one in SQL, and one in R. The target audience is undergraduate students across disciplines who wish to build proficiency working with large datasets and a range of tools to perform predictive analytics.

After taking this course, you may be interested in participating in the three-course Certificate in Data Science offered through the University of Washington Professional and Continuing Education program.  This online course will provide an overview and introduction to the more extensive material covered in that program, which offers classroom-based instruction by data scientists from Microsoft and other Seattle players, networking opportunities with peers, case studies from the “front lines,” and deep dives into selected topics.


Suggested Readings

There will be selected readings each week.  

We recommend, but do not require, that students refer to the book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman.

Course Format

The class will consist of lecture videos about 8 to 10 minutes in length. These will contain 1-2 integrated quizzes per video. Some of these videos will be given by guest lecturers from the data science community. 

There will be no formal exams or standalone quizzes. 

There will be eight total assignments of which two are optional. 

We will provide a virtual machine equipped with all necessary software, but you are permitted (and encouraged) to install software in your own environment as well. 

There will be four structured programming assignments: two in Python, one in SQL, and one in R. 

There will also be two open-ended assignments graded by peer assessment: one in visualization, and one in which you will participate in a Kaggle competition. 

Finally, there will be two optional assignments: one involving an open-ended real-world project submitted by external organizations with real needs, and one involving processing a large dataset on AWS.


Will I get a Statement of Accomplishment after completing this class?

Yes. Students who successfully complete the class will receive a Statement of Accomplishment signed by the instructor.  

What resources will I need for this class? 

For this course, you will need an Internet connection and either a) the ability to run a virtual machine locally or b) the ability and knowledge to install the appropriate software yourself.  The software will include Python 2.7 (including various libraries), R, SQLite (or another database you are comfortable using).  You will also have the opportunity to install and work with Hadoop, but for logistics reasons, we will not require its use in an assignment.  Some assignments will be open-ended.

What level of programming experience should I have? 

We expect intermediate programming experience in some language and some familiarity with database concepts.  There will be programming assignments, but these are not designed to test knowledge of the language itself and will not involve using any esoteric features.  The languages we will use are Python, R, and SQL.



Course at a Glance

10-12 hours of work / week
English subtitles

Instructor: Bill Howe – University of Washington