class: center, middle, inverse, title-slide

# ETC1010: Data Modelling and Computing
## Guest lecture
### Dr James McKeone
### Brightstar
### 2019-10-15

---
exclude: true

<style type="text/css">
code.r{
  font-size: 16px;
}
pre {
  font-size: 16px !important;
}
</style>

---
# Contents

.huge[
* Data science practice
* I've seen R now, so what else?
* Databases? Is SQL really necessary?
* Languages? I gotta learn Python/Spark/Julia/C#/F#/Haskell/C/C++/(SAS?) too?
* Version control? I need version control?
* Must do DeepLearning™! Hmmm.
* Test driven development (live demo)
]

---
# James

.huge[
* Bachelors of Business, Mathematics, Applied Science (Hons), QUT
* PhD in Statistics, QUT
* Research: Functional data analysis, model choice, Bayesian methods, design, max-stable processes
* Research applications: Spinal injury and motor neurone disease 🦴, clinical trials 🥼, climate models
* Industry applications
]

---
class: bg-indigo

.white.huge[Thoughts on data science practices]

.blockquote.huge.white[
What you need to learn (Remember Week 1?)

Data preparation accounts for about 80% of the work of data scientists

-- [Gil Press, Forbes 2016](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/##47cbbbf46f63)
]

---
class: bg-indigo

.white.huge[Thoughts on data science practices]

.white.huge[
* In my experience, data preparation is at least 80% of the work
* The "model" is often fit in 1 line of code
* Most industry problems are "solved" before the data science team touches them
* Are we enablers?
* Company jargon that I find objectionable: Resources. Capacity. High-level plans.
]

---
# I've seen R now, so what else?
## Databases? I need SQL?

.huge[
* Structured query language (SQL) -- the method to access most databases
* Very easy to learn the basics ~ 1 week?
* Quite nuanced to master
* Leave databases to DBAs and developers
]

---
# I've seen R now, so what else?
### How a data scientist should use it

.huge[
* Poorly written SQL code can crash the server
* Write the simplest query that returns the smallest dataset you need, preferably in a single query statement
* Then use R and all of the skills you have learned in this course!
]

---
# I've seen R now, so what else?
## Databases? I need SQL?

.vlarge[
* Data scientists are generally pretty poor at SQL
* Just like in R, pick a style and follow (or set) best practices
]

```sql
-- Prices joined to phone attributes, restricted to dates after 2000-01-01
SELECT [Date]
      ,[Model]
      ,[Capacity]
      ,[Colour]
      ,[Region]
      ,[Price]
      ,[Quantity]
FROM [dbo].[GlobalPriceTable] AS P
INNER JOIN [dbo].[PhoneInfo] AS I
    ON P.[PhoneId] = I.[PhoneId]
WHERE [Date] > '2000-01-01'
```

---
# I've seen R now, so what else?
## Languages? I need to learn Python/Spark/Julia/C#/Haskell/C++/C?

.vlarge[
* Use what your colleagues use
* Watch out for R-shamers (& R fanbois!)
* Search is your friend. All the tools learned in this course have equivalents in other languages, e.g. "gather in python pandas". Search!
* RStudio is an excellent IDE for R; don't let it become a crutch that stops you learning other languages
* Wait for the right project to learn a new language
]

---
# I've seen R now, so what else?
## Version control. I need version control?

.huge[
* If you want to build data science products, you must use version control
* If a company doesn't have it / won't let you have it, don't join
* Tools:
    * Git GUIs: SourceTree, GitKraken, Git Tower, et al.
    * Difftools: Meld, P4Merge, TortoiseMerge, et al.
    * Command line -- easiest in the long term
]

---
# I've seen R now, so what else?
## Must do DeepLearning! Hmmm

.huge[
* Deep learning models have their place: image detection, massive datasets
* Try simple first
    * Linear(!)
    * Linear with feature engineering
* The simplest model that lets the business make the decision is the best. It can always be improved later
]

---
# I've seen R now, so what else?
## Code Broken?
Testing and Test Driven Development (TDD)

.blockquote.large[
* You are not allowed to write any production code unless it is to make a failing unit test pass
* You are not allowed to write more of a unit test than is sufficient to fail
* You are not allowed to write any more production code than is sufficient to pass the failing test

-- Uncle Bob: The Three Laws of TDD (https://www.youtube.com/watch?v=qkblc5WRn-U)
]

.huge[
Demo...
]
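
---
# I've seen R now, so what else?
## Code Broken? A TDD sketch

The live demo itself is not recorded in these slides. As a stand-in, here is a minimal sketch of one red-green cycle following the three laws, written in Python with the standard `unittest` module. The function `price_per_gb` and its behaviour are hypothetical, invented purely for illustration.

```python
import unittest


# Step 2 (green): just enough production code to make the tests below pass.
# Before this function existed, the tests failed with a NameError (red).
# `price_per_gb` is a hypothetical helper, not something from the lecture.
def price_per_gb(price, capacity_gb):
    """Return the price per gigabyte of storage."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return price / capacity_gb


# Step 1 (red): the tests are written first, one small behaviour at a time.
class TestPricePerGb(unittest.TestCase):
    def test_divides_price_by_capacity(self):
        self.assertEqual(price_per_gb(640.0, 64), 10.0)

    def test_rejects_non_positive_capacity(self):
        with self.assertRaises(ValueError):
            price_per_gb(640.0, 0)
```

Saved as, say, `test_price.py` (an assumed file name), this can be run with `python -m unittest test_price`. Each new behaviour would start with another failing test before any more production code is written.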