Nurturing your Data Scientists: 70 years of accumulated experience at your service!

The Data Science community is proud to announce the arrival of a startup dedicated to coaching and nurturing data scientists: WeLoveDataScience.

Data scientist… Where are you? Do you really exist?

That’s the question many managers currently face. Data scientists are the proverbial five-legged sheep: difficult to find, difficult to hire and difficult to keep.

 


 

WeLoveDataScience is a brand new business unit, hosted at the European Data Innovation Hub, dedicated to searching, selecting, training, coaching and nurturing data science talent. The Belgian market does not produce enough candidates: we will take care of training the next generation of data scientists for you.

Whatever your projects are, we will prepare the data scientist(s) you need, following these 7 steps:

  1. Together we prepare a detailed job description corresponding to your real needs: is this about data analysis, basic queries, reporting, data mining, big data or new technologies?
  2. We identify candidates on the market in particular through close collaborations with Belgian universities (Ghent, ULB/VUB, UCLouvain…)
  3. You hire the right candidate!
  4. He/she attends a 12-week data science bootcamp, including: a high-level overview, data science for business, the technical stack (SAS, R, Python…) and introductions to specific technologies/topics (NoSQL or graph databases, social network analysis, text-mining…)
  5. Then he/she works for you at our offices for 4 to 6 months, coached by one of our experts: on-the-job coaching on real projects, meaning your projects but also hackathons, technology workshops, meetups…
  6. After those 10 intensive months, (s)he is ready to work for you on site, and will demonstrate his/her knowledge by giving a course on a specific topic, writing entries on specialised blogs, giving a presentation at a conference…
  7. We assist you with yearly evaluations and follow-up.

Our intentions for 2016 are to help companies create and develop data science teams and to build a data science culture… And with WeLoveDataScience, that means 70 years of accumulated experience at your disposal!

Want to know more? Visit www.welovedatascience.com, send an email to info@welovedatascience.com or simply fill in this contact form. We will visit you, explain what we do and brainstorm about your specific needs.

 

Coursera – Process Mining – TU Eindhoven – starts Nov 12th


 

https://www.coursera.org/course/procmin

 

Process Mining: Data science in Action

Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy-to-use software, the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.

Course at a Glance

  • 4-6 hours of work per week
  • Taught in English
  • English subtitles

Graphs for HR Analytics by Rik Van Bruggen

 


Yesterday, I had the pleasure of giving a talk at the Brussels Data Science meetup. Some really cool people there, with interesting things to say. My talk was about how graph databases like Neo4j can contribute to HR Analytics. Here are the slides of the talk:

I truly had a lot of fun delivering the talk, but probably even more preparing for it.

The basic points I wanted to get across were these:

  • The HR function could really benefit from a more real-world understanding of how information flows in its organization. Information flows through the *real* social network of people in your organization – independent of your “official” hierarchical / matrix-shaped org chart. It follows logically that the HR function would really benefit from understanding and analysing this information flow, through social network analysis.
  • In recruitment, there is a lot to be said for integrating social network information into your recruitment process. This is logical: the social network tells us something about the social, friendly ties between people – and that tells us something about how likely they are to form good, performing teams. Several online recruitment platforms are starting to use this – e.g. Glassdoor uses Neo4j to store more than 70% of the Facebook sociogram – to really differentiate themselves. They want to suggest and recommend the jobs that people really want.
  • In competence management, large organizations can gain a lot by accurately understanding the different competencies that people have or want to acquire. When putting together multi-disciplinary, oftentimes global teams, this can be a huge time-saver for the project offices chartered to do this.

For all three of these points, a graph database like Neo4j can really help. So I put together a sample dataset that should explain this. A minimal sketch of what creating such a dataset could look like (the labels and relationship types match the queries below; the concrete values other than Mike and Brandi are made up for illustration):
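
 // Minimal sample data in the shape the queries below expect
 CREATE (mike:Person   {first_name:"Mike",   last_name:"Jones"}),
        (brandi:Person {first_name:"Brandi", last_name:"Smith"}),
        (anna:Person   {first_name:"Anna",   last_name:"Lee"}),
        (cypher:Competency {name:"Cypher"}),
        (acme:Company      {name:"Acme"}),
        (mike)-[:HAS_COMPETENCY]->(cypher),
        (anna)-[:HAS_COMPETENCY]->(cypher),
        (mike)-[:WORKS_FOR]->(acme),
        (anna)-[:WORKED_FOR]->(acme),
        (mike)-[:FRIEND_OF]->(brandi),
        (brandi)-[:FRIEND_OF]->(anna);

Broadly speaking, these queries fall into three categories: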

  1. “Deep queries”: these are the types of queries that perform complex pattern matches on the graph. An example would be: “Find me a friend-of-a-friend of Mike that has the same competencies as Mike, has worked or is working at the same company as Mike, but is currently not working together with Mike.” In Neo4j Cypher, that would look something like this:
 // Friends-of-friends of Mike who share a competency and a (past or present)
 // employer with him, but do not currently work with him at the same company
 MATCH (p1:Person {first_name:"Mike"})-[:HAS_COMPETENCY]->(c:Competency)<-[:HAS_COMPETENCY]-(p2:Person),
       (p1)-[:WORKED_FOR|WORKS_FOR]->(co:Company)<-[:WORKED_FOR|WORKS_FOR]-(p2)
 WHERE NOT ((p1)-[:WORKS_FOR]->(co)<-[:WORKS_FOR]-(p2))
 WITH p1, p2, c, co
 // ...and who sit exactly two FRIEND_OF hops away from Mike
 MATCH (p1)-[:FRIEND_OF*2..2]-(p2)
 RETURN p1.first_name + ' ' + p1.last_name AS Person1,
        p2.first_name + ' ' + p2.last_name AS Person2,
        collect(DISTINCT c.name) AS Competencies,
        collect(DISTINCT co.name) AS Companies;
  2. “Pathfinding queries”: these allow you to explore the paths from a certain person to other people – and see how they are connected to each other. For example, if I wanted to find paths between two people, I could do:
 // All shortest paths between Mike and Brandi, over any relationship type
 MATCH p = allShortestPaths((n:Person {first_name:"Mike"})-[*]-(m:Person {first_name:"Brandi"}))
 RETURN p;

The resulting path visualisation is a truly interesting and meaningful representation in many cases.

  3. “Graph analysis queries”: these are queries that look at some really interesting graph metrics that could help us better understand our HR network. There are some really interesting measures out there, for example degree centrality, betweenness centrality, PageRank and triadic closures. Below are some of the queries that implement these (note that I have also done some of these for the Dolphin Social Network). Please be aware that these queries are oftentimes “graph global” queries that can consume quite a bit of time and resources. I would not do this on truly large datasets – but in the HR domain the datasets are often quite limited anyway, so we can consider them valid examples.
 // Degree centrality: number of FRIEND_OF relationships per person
 MATCH (n:Person)-[r:FRIEND_OF]-(m:Person)
 RETURN n.first_name, n.last_name, count(r) AS DegreeScore
 ORDER BY DegreeScore DESC
 LIMIT 10;
   
 // Betweenness centrality: count how often each person sits on a shortest path
 MATCH p = allShortestPaths((source:Person)-[:FRIEND_OF*]-(target:Person))
 WHERE id(source) < id(target) AND length(p) > 1
 UNWIND nodes(p)[1..-1] AS n
 RETURN n.first_name, n.last_name, count(*) AS betweenness
 ORDER BY betweenness DESC;
   
 // Missing triadic closures: friend-of-a-friend pairs that are not yet friends
 MATCH path1 = (p1:Person)-[:FRIEND_OF*2..2]-(p2:Person)
 WHERE NOT ((p1)-[:FRIEND_OF]-(p2))
 RETURN path1
 LIMIT 50;
   
 // Rough, sampling-based approximation of PageRank (not the exact algorithm):
 // in each of 10 rounds, start from a ~10% sample of people and bump the rank
 // of everyone reachable within 10 FRIEND_OF hops (once per path found)
 UNWIND range(1,10) AS round
 MATCH (n:Person)
 WHERE rand() < 0.1
 MATCH (n)-[:FRIEND_OF*..10]-(m:Person)
 SET m.rank = coalesce(m.rank, 0) + 1;
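
To inspect the outcome of that approximation, a simple follow-up query could be (a minimal sketch, assuming the rank property was populated by the statement above):

 // Top 10 people by approximated rank
 MATCH (n:Person)
 WHERE n.rank IS NOT NULL
 RETURN n.first_name, n.last_name, n.rank
 ORDER BY n.rank DESC
 LIMIT 10;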

I am sure you could come up with plenty of other examples. Just to make the point clear, I also made a short movie about it.

The queries for this entire demonstration are on GitHub. I hope you like it, and that everyone understands that graph databases can truly add value in an HR Analytics context.

Feedback, as always, much appreciated.

Rik

Coursera starts a free MOOC called Mining of Massive Datasets from Stanford University.


This is a popular course at Stanford and goes along with the book by the same name.

The FREE course starts September 29, 2014, and runs for 7 weeks.

The prerequisites are some knowledge of SQL, algorithms, and data structures.

 

How Sears Became a Real-Time Digital Enterprise Due to Big Data

Original post here

Sears is a large US retailer that has been a true Big Data pioneer for quite a few years already. It has learned, made mistakes and achieved success through hands-on effort, and it currently operates a very large enterprise deployment of Hadoop.

Sears was founded in 1893 and started as a shipping and mail-order company. In 2005 it was acquired by Kmart, but it continued to operate under its own brand. In 2013 Sears had 798 stores and revenue of over $21 billion. It is the fourth-largest department store chain in the US and offers millions of products across its stores. Sears holds data on over 100 million customers, which it analyses to make real-time, relevant offers to those customers. The company is deep into Big Data and combines massive amounts of data to become a real-time digital enterprise.

 

Sears was ahead of its time, and of its competitors, regarding Big Data. Already in 2010 it had a 10-node Hadoop cluster, a size Walmart only reached in 2012. These days, Sears has a 300-node Hadoop cluster populated with over 2 petabytes of structured customer transaction data, sales data and supply chain data. It used to keep data in silos in many locations, but now the objective is to get all data in one place in order to achieve a single point of truth about the customer. And that’s not all: Sears also applies Big Data to combat fraud, track the effectiveness of marketing campaigns, and optimize (personal) pricing, the supply chain and promotional campaigns.

Personalized Pricing

Sears combines and mixes vast amounts of data to help set (personal) prices in near real-time. Data on product information, local economic conditions, competitor prices and so on are combined and analysed using a price elasticity algorithm, which enables Sears to find the best price for the right product at the right moment and location via customized coupons. These coupons are given to loyal shoppers and are also used to move inventory when necessary. Just a few years ago this would still have been a dream scenario: due to legacy systems, it used to take Sears up to 8 weeks to find the best price, but nowadays this can be done almost in real-time.
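
For reference, price elasticity algorithms of this kind typically build on the textbook price elasticity of demand (the exact model Sears uses is not disclosed):

 E_d = (ΔQ / Q) / (ΔP / P)

where Q is the quantity sold and P is the price. Products with |E_d| > 1 react strongly to price changes, which makes them natural candidates for discount coupons.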

In the past years, Sears went from nationwide pricing strategies to regional and now also personal pricing. The coupons customers receive are based on where they live, the number of products available, the products that need to go, and which products Sears believes the customer will like and consequently buy.

Shop Your Way Rewards loyalty program

In 2011, Sears launched a new loyalty program called Shop Your Way Rewards. This program also runs on Hadoop, which enables Sears to make use of 100% of the data that is collected. This results in better targeting of customers for certain online and mobile scenarios.

The key for Sears is to maximize multi-channel customer engagement through the loyalty program. Customers provide their personal data in return for relevant interaction through the right channel, according to Dr. Phil Shelley, CTO of Sears Holdings, in an interview with Forbes.

Sears’ Big Data platform

In the past, Sears used many different tools on data that sat across the organisation in silos. These legacy systems prevented Sears from offering the right product at the right moment for the right price. Sears started by experimenting and innovating with Big Data, exactly as companies should when starting out. They began with a Hadoop cluster running on a netbook computer and experimented from there. They learned the hard way, through trial and error, partly because there were few outside Big Data experts who could guide them with the platform. They have managed to build a large centralized platform where all data is stored. The platform uses a variety of (open-source) tools such as Hive, Pig, HBase, Solr, Lucene and MapReduce. This gives them every possibility to have personalized interactions with the customer and to use their data for different applications across the company.


 

Next to Hadoop, Sears also uses Datameer, a data exploration tool that enables visualization directly on top of Hadoop, for ad-hoc queries without the need to involve IT. Previously, these analyses required ETL jobs that could take up to a few weeks. At the moment, Sears only gives its users access to Hadoop data via Datameer.

Sears started using Big Data because of declining sales, while major competitors such as Amazon kept growing. In the past years it has managed to move rapidly into the Big Data era and is turning itself into a real-time digital enterprise. A great achievement for a company that is over a century old.