Data Science Teams

The term “data science” has been used in so many contexts that it is hard to use it any more without stating more precisely which sense is meant. It is still a useful term, however, because it attempts to describe a new and critical type of work that is happening in most tech companies, work which itself warrants further differentiation.

Other companies that do data science offer many organizational models that we can apply here at Strategic Risk. The common threads among them are captured partly in the roles involved:

  • Business Analytics
    • Business-oriented analytics/reporting built on the work of data engineers
    • Business analytics is often done by “domain experts” or in concert with them
  • Data Engineering
    • Data extraction, transformation and staging for analysis purposes
    • This work is commonly done by database and software developers
  • Machine Learning Development
    • Predictive analysis through the use of algorithms and model development.
    • Done by scientifically-minded individuals with strong statistics and coding skills.
  • Technical Project Management
    • Ensures focus of data exploration activities, alignment with business plans and tight communication within the team

One common assumption is that data science is done only by data scientists. There certainly are individuals who possess a deep understanding of all of the roles above and who deserve the title of data scientist, but they are very rare. The quest for this unicorn (as data scientists are now often called) can end at a very good alternative: creating a tightly coupled data science team.

The individuals in this team should each assume one or more of these roles. However, trying to find individuals who have most of these skills is still critical, because data science is a highly iterative process and cannot work with many organizational or role barriers. Team members therefore need a culture of learning, curiosity and confidence so that they can learn and step into as many roles as possible. This culture must, of course, be supported by management, because one of the main reasons people confine themselves to roles is to avoid accountability and to fit the org chart titles they have been assigned. Corporate training is also almost always reserved for specialization rather than cross-training.

Other reasons have to do with people marketing themselves as experts, but these reasons are growing weaker because tech culture is changing rapidly to recognize multi-talented individuals and to understand what it really takes to be part of a data science team. Lastly, defining oneself as a statistician or a data engineer, for example, implies that a person has more in common with other statisticians and data engineers than with their teammates. Such people might prefer the idea that their skills are universal and can be plugged into any team. In fact, they would prefer to take on work with other teams in order to develop a track record of doing just that.

This is why a strong focus on teamwork over role association is key to the success of a data science team. If this takes place, then perhaps we won’t capture the unicorn, but we will no longer be looking for it.

Microsoft Azure ML vs. SSAS Mining Structures

Late in July 2014, Microsoft released a “preview” version of a new machine learning application for its Azure cloud platform. Appropriately called Azure ML, this platform offers a data analyst or data scientist a visual workspace UI for designing data experiments, from loading and transformation through training and evaluation to consumption in applications. Even though it is in its infancy, Azure ML could become the best data science tool that is both very accessible and scalable. This article, however, focuses on the motivation behind the product in order to understand its future. This is not Microsoft’s first venture into the machine learning space. So, what has the machine learning journey been for Microsoft, how has it changed and where is it going?

Mining Structures is a product that has shipped with Microsoft SQL Server (as part of its SSAS Business Intelligence service) for over a decade, up to and including the latest release, SQL Server 2014. Mining Structures also gives analysts a fully featured UI for conducting data mining projects. Once a training dataset is defined, 9 different mining algorithms (such as clustering, logistic regression, decision trees and neural networks) can be applied to it in order to gain insight into the underlying data structures and patterns. The fact that every database developer and BI analyst with SQL Server already has this product at no extra cost, and the fact that it is tightly integrated with the rest of the SQL Server products, make Mining Structures very attractive. However, Mining Structures in SQL Server is an entirely different product from Azure ML. Why release another product for similar solutions that does not leverage the previous products? Why continue to support both of these products at the same time? What exactly are the differences between these products, and does Mining Structures have any advantages over Azure ML?

The Cloud Bet

Microsoft has developed its own cloud platform, Azure, on which it is making available some previously on-premises software such as SQL Server. In the cloud, users can benefit from cheap entry into high-end software, flexible billing and scalable performance. So what does this mean for on-premises software like Mining Structures? Its future, like that of SQL Server, looks to be the cloud. Microsoft has declared that the cloud will be the priority and has recently appointed a “cloud guy”, Satya Nadella, as its CEO. Development and significant support for Mining Structures stopped with the 2008 version. SQL Server will undoubtedly continue on-premises, but Mining Structures doesn’t have nearly as big a user base and could suffer a blow in the next version or two of SQL Server.

Despite being GUI driven and fully featured, SSAS Mining Structures serves a unique audience. It is packaged with the multidimensional BI product, presumably to be used by programmers and database or data warehouse developers. However, data mining is better suited to advanced analysts than to such technicians. This will partly be the cause of the product’s downfall, despite the fact that it is a good product, tightly integrated into the kind of data environment that data scientists value today but perhaps didn’t jump on early enough. Microsoft was ahead of the more recent data analytics wave, didn’t see the transition of database developers into analysts that it had hoped for, and perhaps didn’t do enough to foster the community.

It is uncertain whether the cloud will be the only place where users in Microsoft shops do advanced analytics. They, and the rest of the analytics community, certainly don’t work only there today, and there are numerous reasons for that. Most popular analytics tools (SAS, R, SPSS, etc.) don’t live in the cloud, and most organizations haven’t moved their data to a cloud platform like Azure. There will be a gulf: either the pull of products like Azure ML will be strong enough for companies to move to the cloud, or they will be forced to go with a different vendor, because Microsoft will soon be out of the on-premises high-end analytics game.

The Comparison Between Azure ML and SSAS Mining Structures

Features

Mining Structures is packaged in a supportive ecosystem with tools like SQL Server, .NET and SSIS (SQL Server Integration Services). In that context it is therefore a fully featured, end-to-end product. It is entirely GUI based for non-coders, but also fully programmable for developers who wish to automate things like model training, testing, prediction/segmentation and even model creation. The set of canned algorithms it sports (decision trees, neural networks, clustering, market basket analysis, logistic regression, linear regression and Naïve Bayes) is diverse and suitable for almost any data mining task. However, the versions of these algorithms are fairly out of date and underpowered. Their relative simplicity does have some advantages: Mining Structures allows analysis of the developed models themselves and lets you see what you are mining, which is why Microsoft emphasized “data mining” over “machine learning” in the Mining Structures product.

Azure ML, on the other hand, doesn’t have such model assessment capabilities but does come with the most current and popular machine learning algorithms. Despite being part of a growing ecosystem in Azure, Azure ML is not as tightly integrated with it. This is why it had to incorporate some aspects of other necessary tools, such as the ETL work that SSIS handles. Azure ML has a “workflow” canvas that is similar to that of SSIS, but not as feature rich. It also abandoned the SSAS DMX language in favor of a web API that is still underpowered: in Mining Structures, DMX and the .NET libraries let users do through code anything they could do visually, while the Azure ML web API does not. The web API does, however, allow tighter integration with cross-platform applications that can make HTTP calls.

Azure ML also promises to integrate with R in a variety of ways and eventually to handle custom or third-party algorithm libraries.

Perhaps the greatest feature of Azure ML is that it promises to scale. While most current machine learning problems can be handled by on-premises solutions, some can’t be, and likely won’t be down the line. Azure is a cloud, and Azure ML should be able to scale at cloud levels.

Usability

Mining Structures is not very easy to get started with because it shares a lot of UI components with multidimensional data warehouse design (through the Visual Studio interface, anyway). This confuses analysts. Once you’re up and running, however, it is pretty smooth sailing. Mining Structures can be accessed through Excel, which makes its usability and accessibility better than any analytical tool out there. Mining Structures also has very nice visuals to help interpret models which, as mentioned, Azure ML does not have.

Azure ML is pretty. The workflow metaphor works well and is intuitive. Some aspects of the drag-and-drop UI do not need to be dragged and dropped, and they make the workspace messy. The customization of the available modules is currently both cryptic and unpolished. The polishing is sure to take place since this is just a Preview version, but the terminology Azure ML uses for concepts that also exist in Mining Structures is different and perhaps not as friendly to non-veteran data scientists. Azure ML, however, is a tool for just such people. It is most appealing to analysts who can’t use or don’t know how to use tools like SAS, Python or R, which is why this terminology is inappropriate.

Cost

If you have purchased SQL Server, you get Mining Structures for free, which is great because SQL Server is a very popular product. If you haven’t, then you can’t use it. Azure ML has a lower entry barrier because it uses a pay-as-you-go model, but watch out: those tiny processing charges can rack up quickly.

Support

Support is a thorny issue for both products. One is likely on its way out and the other hasn’t fully arrived. Mining Structures has only one book completely devoted to it and a relatively small online community to ask questions of. Azure ML documentation is currently very sparse and the user base is non-existent. The first book is due out later this month. So, one can only expect that support will get better for Azure ML, but to somebody currently evaluating what to use for mission-critical projects I would say: use something else, or wait and see.

Johns Hopkins’ Online Data Science Certification Via Coursera

A top-rated school, the Johns Hopkins Bloomberg School of Public Health, has decided to create an online certification in Data Science and has partnered with Coursera to deliver it. Coursera is one of the leading online education companies that offer massive open online courses (MOOCs) from top universities. It recently began to offer accreditation through rigorous standards in student identity and coursework validation.

The Data Science certification comes at a time when jobs in that field are in high demand and, although the term “Data Science” is still evolving and is somewhat buzz-worthy, the discipline has strong roots and curriculum standards have emerged. Data science positions are generally found in industries such as medical, marketing, risk, sales, quality control, security and finance. Becoming a data scientist requires a background in traditional statistical methods, applied knowledge of modern machine learning, general programming expertise and an ability to work with the latest big-data technologies. Softer skills such as an entrepreneurial spirit, hacking skills and great communication/presentation skills are also in high demand. The rare combination of these traits and this background is what makes a data scientist so sought after. Johns Hopkins and Coursera now attempt to offer sufficient training for such a career, or at least the beginning of one.

Even though possessing certifications is not generally a sufficient qualification in itself, they should ideally demonstrate some minimum level of competence that at least weeds out the general job applicants. This certification meets that criterion. That being said, there are some odd omissions from the curriculum. The courses lack a focus on big data solutions like Hadoop, on working with cloud processing/storage solutions and even on SQL. These would all be very good additions, but perhaps this is nitpicking.

Even though this certification goes through the bulk of the core topics in data science, most companies would not hire a person with just this certificate and without significant (3+ years) real-world experience in a related position such as statistician, financial/data analyst, data engineer or software developer. The aforementioned “softer” skills are even harder to just pick up and are harder to hire for, but are likely considered to be icing on the cake for a good candidate.

In terms of the quality of the available curriculum, the teaching staff and the online interface, this certification does very well. There does appear to be a slight difference in the quality of the instructors. While the theoretical knowledge is very likely solid for all of the professors, Prof. Peng does seem to be a lot clearer in his presentation style. Students taking any of the first few courses in the series that are not taught by Prof. Peng might need to reach out more to the discussion forums provided through the excellent Coursera site interface. The quizzes, and especially the projects, test the student’s knowledge deeply and there aren’t many opportunities to skate by or cheat. This is why successful completion of the certification represents the student’s understanding very well.

Tech Visionary Elon Musk Warns About Artificial Intelligence

When technophiles show optimism about technology or technophobes warn about it, it is not surprising. However, when a leader in several tech industries warns about the evolution of technology, you should listen. This week Elon Musk, the founder of PayPal, Tesla and SpaceX, expressed his concern about Artificial Intelligence (AI) in a tweet, saying, “We need to be super careful with AI. Potentially more dangerous than nukes.” He adds, “Hope we’re not just the biological boot loader for digital superintelligence. Unfortunately, that is increasingly probable”.

A fear of AI is nothing new in the media, but it certainly matters who feels it. Musk has made his fortune by creating companies built around pushing the boundaries of technology. It is very hard to find anybody in high tech who feels this way, let alone the people at the top. Even more surprising is that Musk also has a financial stake in the future of AI through investments in the AI development companies DeepMind and Vicarious. He recently told CNBC that his interest comes “not from the standpoint of actually trying to make any investment return… I like to just keep an eye on what’s going on with artificial intelligence. I think there is potentially a dangerous outcome there.”

Musk is certainly thinking a lot about the topic, because the future of two of his companies involves autonomous space robotics and self-driving cars. So what makes Musk cautious at a time when few in the tech world are anything but blissfully optimistic about the future of AI?

Although there are many criticisms of artificial sentience, Musk alludes to the idea that we could some day be reduced to a “boot loader”, a metaphor for a low-level program that initiates and enables a higher-level system. The concern in this scenario is that artificial intelligence would become super-intelligence and that, whatever the motivation of such a being, it would at the very least dismiss lower-level humans the way we dismiss ants. Predictions by popular science-fiction writers and futurists of when this might happen range from decades to millennia. Some think it will come suddenly, like a Frankenstein waking, but most believe that it will happen in stages, each depriving humans of some of their humanity. Some futurists believe that this eventuality will be a war between humans and artificial super-intelligence, while others think it will be a transcendence, an evolution of humans into such beings. Whatever the case, it is something that probably deserves some healthy fear from at least one tech leader today.