
ETC1010: Data Modelling and Computing

Guest lecture

Dr James McKeone

Brightstar

2019-10-15

1 / 12

Contents

  • Data science practice
  • I've seen R now, so what else?
    • Databases? Is SQL really necessary?
    • Languages? I gotta learn Python/Spark/Julia/C#/F#/Haskell/C/C++/(SAS?) too?
    • Version control? I need version control?
    • Must do DeepLearning™! Hmmm.
  • Test driven development (live demo)
2 / 12

James

  • 🎓 Bachelors of Business, Mathematics, Applied Science (hons) QUT
  • 🎓 PhD in Statistics QUT
  • Research: Functional data analysis, model choice, Bayesian methods, design, max-stable processes
  • Research applications: Spinal injury and motor neurone disease 🦴, clinical trials 🥼, climate models 🍃
  • Industry applications: 🏦, 💷, 📡, ☎️
3 / 12

Thoughts on data science practices

What you need to learn (Remember Week 1?)

Data preparation accounts for about 80% of the work of data scientists

-- Gil Press, Forbes 2016

4 / 12

Thoughts on data science practices

  • In my experience, data preparation is at least 80%
  • The "model" is often fit in 1 line of code
  • Most industry problems are "solved" before the data science team touches them
    • Are we enablers?
  • Company jargon that I find objectionable - Resources. Capacity. High-level plans.
5 / 12

I've seen R now, so what else?

Databases? I need SQL?

  • Structured query language (SQL) -- the method to access most databases.
  • Very easy to learn the basics ~ 1 week?
  • Quite nuanced to master
  • Leave databases to DBAs and developers
6 / 12

I've seen R now, so what else?

How a data scientist should use it

  • Poorly written SQL code can crash the server
    • Write the simplest query that returns the smallest dataset you need, preferably in a single query statement
    • Then use R and all of the skills you have learned in this course!
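
A minimal sketch of this workflow, with Python standing in for R and an in-memory SQLite database standing in for the company server. The `prices` table, its columns, and its rows are made up for illustration. The single query selects only the columns and rows needed; everything else happens after the data has left the database.

```python
import sqlite3
from statistics import mean

# Hypothetical table and rows, standing in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices "
    "(sale_date TEXT, model TEXT, region TEXT, price REAL, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?, ?)",
    [
        ("2018-03-01", "A1", "AU", 300.0, 5),
        ("2019-06-01", "A1", "AU", 250.0, 2),
        ("2019-07-01", "B2", "AU", 400.0, 1),
        ("2019-08-01", "A1", "US", 260.0, 7),
    ],
)

# One simple statement: only the two columns and the rows we need.
rows = conn.execute(
    """
    SELECT model, price
    FROM prices
    WHERE region = ? AND sale_date >= ?
    ORDER BY sale_date
    """,
    ("AU", "2019-01-01"),
).fetchall()
conn.close()

# The analysis itself is done outside the database.
prices_by_model = {}
for model, price in rows:
    prices_by_model.setdefault(model, []).append(price)
mean_price = {model: mean(values) for model, values in prices_by_model.items()}
print(mean_price)
```

Passing the filter values as parameters (the `?` placeholders) keeps the query text fixed and avoids building SQL by string concatenation.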
7 / 12

I've seen R now, so what else?

Databases? I need SQL?

  • Data scientists generally are pretty poor at SQL
  • Just like in R, use a style and follow or set best practices
SELECT
    [Date]
    ,[Model]
    ,[Capacity]
    ,[Colour]
    ,[Region]
    ,[Price]
    ,[Quantity]
FROM [dbo].[GlobalPriceTable] AS P
INNER JOIN [dbo].[PhoneInfo] AS I
    ON P.[PhoneId] = I.[PhoneId]
WHERE [Date] > '2000-01-01'
8 / 12

I've seen R now, so what else?

Languages? I need to learn Python/Spark/Julia/C#/Haskell/C++/C

  • Use what your colleagues use
  • Watch out for R-shamers (& R fanbois!)
  • Search is your friend. Every tool learned in this course has an equivalent in other languages, e.g. search for "gather in python pandas"
  • RStudio is an excellent IDE for R, but don't let it become a crutch that stops you learning other languages
  • Wait for the right project to learn a new language
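
As an illustration of the "gather" example: tidyr's `gather` reshapes wide data to long, and a search along those lines typically points to `melt` in pandas. The sketch below does the same reshape in plain Python so that it runs without extra packages; the `gather` helper and the data are hypothetical, and the pandas call appears only as a comment.

```python
# Made-up wide data: one row per model, one column per year.
wide = [
    {"model": "A1", "2018": 300.0, "2019": 250.0},
    {"model": "B2", "2018": 450.0, "2019": 400.0},
]


def gather(rows, id_col, key_name, value_name):
    """Reshape wide rows to long rows: one output row per (id, column) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the id column is repeated on every output row
            long_rows.append(
                {id_col: row[id_col], key_name: column, value_name: value}
            )
    return long_rows


long_rows = gather(wide, "model", "year", "price")
for row in long_rows:
    print(row)

# With pandas installed, the equivalent reshape would be roughly:
#   pd.melt(df, id_vars=["model"], var_name="year", value_name="price")
```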
9 / 12

I've seen R now, so what else?

Version control. I need version control?

  • If you want to build data science products, you must use version control
  • If a company doesn't have it / won't let you have it, don't join
  • Tools:
    • Git GUIs: Sourcetree, GitKraken, Tower, et al.
    • Difftools: Meld, P4Merge, TortoiseMerge, et al.
    • Command line -- easiest in the long-term
10 / 12

I've seen R now, so what else?

Must do DeepLearning! Hmmm

  • Deep learning has its place - image detection, massive datasets
  • Try simple first
    • Linear(!)
    • Linear with feature engineering
  • The simplest model that lets the business make the decision is the best. It can always be improved later
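
A small sketch of "try simple first", using made-up data and hypothetical helper names. A straight line is fit to the raw input, then the same linear fit is repeated on an engineered feature (here the square of the input, chosen because the toy data were generated that way). The model stays linear; only the feature changes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_error(xs, ys, intercept, slope):
    """Sum of squared residuals of the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Toy data generated as y = 2 + 3 * x^2 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 3.0 * x ** 2 for x in xs]

# Step 1: linear model on the raw input.
raw_fit = fit_line(xs, ys)
raw_error = sum_squared_error(xs, ys, *raw_fit)

# Step 2: the same linear model on an engineered feature.
squared = [x ** 2 for x in xs]
engineered_fit = fit_line(squared, ys)
engineered_error = sum_squared_error(squared, ys, *engineered_fit)

print("raw feature error:", raw_error)
print("engineered feature error:", engineered_error)
```

If the engineered linear model already supports the decision, there may be no need to reach for anything deeper.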
11 / 12

I've seen R now, so what else?

Code Broken? Testing and Test Driven Development (TDD)

  • You are not allowed to write any production code unless it is to make a failing unit test pass
  • You are not allowed to write more of a unit test than is sufficient to fail
  • You are not allowed to write any more production code than is sufficient to pass the failing test

-- Uncle Bob: The Three Laws of TDD (https://www.youtube.com/watch?v=qkblc5WRn-U)

Demo...
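
A minimal sketch of one pass through the cycle described by the three laws, using Python's `unittest`. This is not the code from the live demo; the function `price_per_unit` and its behaviour are hypothetical. The comments record the order in which each piece would be written.

```python
import unittest


def price_per_unit(total_price, quantity):
    # Written 4th: added only after test_rejects_zero_quantity failed.
    if quantity == 0:
        raise ValueError("quantity must be non-zero")
    # Written 2nd: just enough to pass test_divides_total_by_quantity.
    return total_price / quantity


class PricePerUnitTest(unittest.TestCase):
    # Written 1st: fails until price_per_unit exists and divides.
    def test_divides_total_by_quantity(self):
        self.assertEqual(price_per_unit(100.0, 4), 25.0)

    # Written 3rd: fails until the zero check above is added.
    def test_rejects_zero_quantity(self):
        with self.assertRaises(ValueError):
            price_per_unit(100.0, 0)


# Run the tests explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PricePerUnitTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test is written before the production code it exercises, and the production code grows only as far as the newest failing test demands.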

12 / 12
