Data Analysis From Scratch With Python Beginner Guide using Python


DATA ANALYSIS FROM SCRATCH WITH PYTHON
Step By Step Guide

Peters Morgan

How to contact us
If you find any damage, editing issues, or any other issues in this book, please immediately notify our customer service by email at:
contact@aiscicences.com

Our goal is to provide high-quality books for your technical learning in computer science subjects.
Thank you so much for buying this book.

Preface
“Humanity is on the verge of digital slavery at the hands of AI and biometric technologies. One way to prevent that is to develop inbuilt modules of deep feelings of love and compassion in the learning algorithms.”
- Amit Ray, Compassionate Artificial Superintelligence AI 5.0 - AI with Blockchain, BMI, Drone, IOT, and Biometric Technologies
If you are looking for a complete guide to the Python language and its libraries that will help you become an effective data analyst, this book is for you.
This book contains the Python programming you need for Data Analysis.
Why the AI Sciences Books are different?
The AI Sciences Books explore every aspect of Artificial Intelligence and Data Science using computer science programming languages such as Python and R. Our books may be the best ones for beginners; each is a step-by-step guide for any person who wants to start learning Artificial Intelligence and Data Science from scratch. It will help you prepare a solid foundation, so that any other high-level courses will be easy for you.
Step By Step Guide and Visual Illustrations and Examples
The book gives complete instructions for manipulating, processing, cleaning, modeling, and crunching datasets in Python. This is a hands-on guide with practical case studies of data analysis problems. You will learn pandas, NumPy, IPython, and Jupyter in the process.
Who Should Read This?
This book is a practical introduction to data science tools in Python. It is ideal for analysts who are beginners to Python and for Python programmers new to data science and computer science. Instead of tough math formulas, this book contains several graphs and images.

© Copyright 2016 by AI Sciences LLC
All rights reserved.
First Printing, 2016

Edited by Davies Company
Ebook Converted and Cover by Pixels Studio
Published by AI Sciences LLC

ISBN-13: 978-1721942817
ISBN-10: 1721942815

The contents of this book may not be reproduced, duplicated or transmitted without the direct written permission of the author.
Under no circumstances will any legal responsibility or blame be held against the publisher for any reparation, damages, or monetary loss due to the information herein, either directly or indirectly.

Legal Notice:
You cannot amend, distribute, sell, use, quote or paraphrase any part or the content within this book without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for educational and entertainment purposes only. No warranties of any kind are expressed or implied. Readers acknowledge that the author is not engaging in the rendering of legal, financial, medical or professional advice. Please consult a licensed professional before attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for any losses, direct or indirect, which are incurred as a result of the use of information contained within this document, including, but not limited to, errors, omissions, or inaccuracies.

From AI Sciences Publisher

To my wife Melania
and my children Tanner and Daniel
without whom this book would have
been completed.

Author Biography
Peters Morgan is a long-time user and developer of Python. He is one of the core developers of some data science libraries in Python. Currently, Peter works as a Machine Learning Scientist at Google.

Table of Contents
Preface
Why the AI Sciences Books are different?
Step By Step Guide and Visual Illustrations and Examples
Who Should Read This?
From AI Sciences Publisher
Author Biography
Table of Contents
Introduction
2. Why Choose Python for Data Science & Machine Learning
Python vs R
Widespread Use of Python in Data Analysis
Clarity
3. Prerequisites & Reminders
Python & Programming Knowledge
Installation & Setup
Is Mathematical Expertise Necessary?
4. Python Quick Review
Tips for Faster Learning
5. Overview & Objectives
Data Analysis vs Data Science vs Machine Learning
Possibilities
Limitations of Data Analysis & Machine Learning
Accuracy & Performance
6. A Quick Example
Iris Dataset
Potential & Implications
7. Getting & Processing Data
CSV Files
Feature Selection
Online Data Sources
Internal Data Source
8. Data Visualization
Goal of Visualization
Importing & Using Matplotlib
9. Supervised & Unsupervised Learning
What is Supervised Learning?
What is Unsupervised Learning?
How to Approach a Problem
10. Regression
Simple Linear Regression
Multiple Linear Regression
Decision Tree
Random Forest
11. Classification
Logistic Regression
K-Nearest Neighbors
Decision Tree Classification
Random Forest Classification
12. Clustering
Goals & Uses of Clustering
K-Means Clustering
Anomaly Detection
13. Association Rule Learning
Explanation
Apriori
14. Reinforcement Learning
What is Reinforcement Learning?
Comparison with Supervised & Unsupervised Learning
Applying Reinforcement Learning
15. Artificial Neural Networks
An Idea of How the Brain Works
Potential & Constraints
Here’s an Example
16. Natural Language Processing
Analyzing Words & Sentiments
Using NLTK
Thank you!
Sources & References
Software, libraries, & programming language
Datasets
Online books, tutorials, & other references
Thank you!

Introduction
Why read on? First, you’ll learn how to use Python in data analysis (which is a bit cooler and a bit more advanced than using Microsoft Excel). Second, you’ll also learn how to gain the mindset of a real data analyst (computational thinking).
More importantly, you’ll learn how Python and machine learning apply to real world problems (business, science, market research, technology, manufacturing, retail, financial). We’ll provide several examples of how modern methods of data analysis fit in with approaching and solving modern problems.
This is important because the massive influx of data provides us with more opportunities to gain insights and make an impact in almost any field. This recent phenomenon also provides new challenges that require new technologies and approaches. In addition, it requires new skills and mindsets to successfully navigate the challenges and tap the fullest potential of the opportunities being presented to us.
For now, forget about getting the “sexiest job of the 21st century” (data scientist, machine learning engineer, etc.). Forget about the fears about artificial intelligence eradicating jobs and the entire human race. This is all about learning (in the truest sense of the word) and solving real world problems.
We are here to create solutions and take advantage of new technologies to make better decisions and hopefully make our lives easier. And this starts with building a strong foundation so we can better face the challenges and master advanced concepts.

2. Why Choose Python for Data Science & Machine Learning
Python is said to be a simple, clear and intuitive programming language. That’s why many engineers and scientists choose Python for many scientific and numeric applications. Perhaps they prefer getting into the core task quickly (e.g. finding out the effect or correlation of a variable with an output) instead of spending hundreds of hours learning the nuances of a “complex” programming language.
This allows scientists, engineers, researchers and analysts to get into the project more quickly, thereby gaining valuable insights with the least amount of time and resources. That doesn’t mean, though, that Python is perfect or the ideal programming language for data analysis and machine learning. Other languages such as R may have advantages and features Python lacks. But still, Python is a good starting point and you may get a better understanding of data analysis if you use it for your study and future projects.
Python vs R
You might have already encountered this debate on Stack Overflow, Reddit, Quora, and other forums and websites. You might have also searched for other programming languages because after all, learning Python or R (or any other programming language) requires several weeks or months. It’s a huge time investment and you don’t want to make a mistake.
To get this out of the way, just start with Python because the general skills and concepts are easily transferable to other languages. Well, in some cases you might have to adopt an entirely new way of thinking. But in general, knowing how to use Python in data analysis will bring you a long way towards solving many interesting problems.
Many say that R is specifically designed for statisticians (especially when it comes to easy and strong data visualization capabilities). It’s also relatively easy to learn, especially if you’ll be using it mainly for data analysis. On the other hand, Python is somewhat more flexible because it goes beyond data analysis. Many data scientists and machine learning practitioners may have chosen Python because the code they wrote can be integrated into a live and dynamic web application.
Although it’s all debatable, Python is still a popular choice especially among beginners or anyone who wants to get their feet wet fast with data analysis and machine learning. It’s relatively easy to learn and you can dive into full time programming later on if you decide this suits you more.
Widespread Use of Python in Data Analysis
There are now many packages and tools that make the use of Python in data analysis and machine learning much easier. TensorFlow (from Google), Theano, scikit-learn, numpy, and pandas are just some of the things that make data science faster and easier.
Also, university graduates can quickly get into data science because many universities now teach introductory computer science using Python as the main programming language. The shift from computer programming and software development can occur quickly because many people already have the right foundations to start learning and applying programming to real world data challenges.
Another reason for Python’s widespread use is that there are countless resources that will tell you how to do almost anything. If you have any question, it’s very likely that someone else has already asked it and another person has answered it (Google and Stack Overflow are your friends). This makes Python even more popular because of the availability of resources online.
Clarity
Due to the ease of learning and using Python (partly due to the clarity of its syntax), professionals are able to focus on the more important aspects of their projects and problems. For example, they could just use numpy, scikit-learn, and TensorFlow to quickly gain insights instead of building everything from scratch.
This provides another level of clarity because professionals can focus more on the nature of the problem and its implications. They could also come up with more efficient ways of dealing with the problem instead of getting buried under the ton of information a certain programming language presents.
The focus should always be on the problem and the opportunities it might introduce. It only takes one breakthrough to change our entire way of thinking about a certain challenge and Python might be able to help accomplish that because of its clarity and ease.

3. Prerequisites & Reminders
Python & Programming Knowledge
By now you should understand the Python syntax including things about variables, comparison operators, Boolean operators, functions, loops, and lists. You don’t have to be an expert but it really helps to have the essential knowledge so the rest becomes smoother.
You don’t have to make it complicated because programming is only about telling the computer what needs to be done. The computer should then be able to understand and successfully execute your instructions. You might just need to write a few lines of code (or modify existing ones a bit) to suit your application.
Also, many of the things that you’ll do in Python for data analysis are already routine or pre-built for you. In many cases you might just have to copy and execute the code (with a few modifications). But don’t get lazy because understanding Python and programming is still essential. This way, you can spot and troubleshoot problems in case an error message appears. This will also give you confidence because you know how something works.
Installation & Setup
If you want to follow along with our code and execution, you should have Anaconda downloaded and installed on your computer. It’s free and available for Windows, macOS, and Linux. To download and install, go to https://www.anaconda.com/download/ and follow the succeeding instructions from there.
The tool we’ll be mostly using is Jupyter Notebook (it already comes with the Anaconda installation). It’s literally a notebook wherein you can type and execute your code as well as add text and notes (which is why many online instructors use it).
If you’ve successfully installed Anaconda, you should be able to launch Anaconda Prompt and type jupyter notebook at the blinking underscore. This will then launch Jupyter Notebook using your default browser. You can then create a new notebook (or edit it later) and run the code for outputs and visualizations (graphs, histograms, etc.).
These are convenient tools you can use to make studying and analyzing easier and faster. They also make it easier to know what went wrong and how to fix it (there are easy to understand error messages in case you mess up).
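Once Jupyter Notebook is running, a quick way to confirm the setup is to check the Python version and whether the main packages can be found. The sketch below is only one possible check; the package list is an assumption based on the libraries named in this book (note that scikit-learn is imported under the name sklearn), and the helper name check_packages is made up for this example.

```python
import importlib.util
import sys

# Packages named in this book; this list is an assumption, adjust it as needed.
PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print("Python version:", sys.version.split()[0])
for name, found in check_packages(PACKAGES).items():
    print(name, "is installed" if found else "is MISSING")
```

If a package is reported as missing, it can usually be added from Anaconda Prompt before continuing.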
Is Mathematical Expertise Necessary?
Data analysis often means working with numbers and extracting valuable insights from them. But do you really have to be an expert on numbers and mathematics?
Successful data analysis using Python often requires having decent skills and knowledge in math, programming, and the domain you’re working on. This means you don’t have to be an expert in any of them (unless you’re planning to present a paper at international scientific conferences).
Don’t let many “experts” fool you because many of them are fakes or just plain inexperienced. What you need to know is what’s the next thing to do so you can successfully finish your projects. You won’t be an expert in anything after you read all the chapters here. But this is enough to give you a better understanding of Python and data analysis.
Back to mathematical expertise. It’s very likely you’re already familiar with mean, standard deviation, and other common terms in statistics. While going deeper into data analysis you might encounter calculus and linear algebra. If you have the time and interest to study them, you can always do so now or later. This may or may not give you an edge on the particular data analysis project you’re working on.
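For a sense of how little math you need to start, here is a minimal sketch that computes the mean, median, and standard deviation with Python’s built-in statistics module. The daily_sales numbers are made up purely for illustration.

```python
import statistics

# Hypothetical sample: units sold on each day of one week (made-up numbers).
daily_sales = [12, 15, 11, 18, 20, 25, 17]

mean_sales = statistics.mean(daily_sales)      # arithmetic average
median_sales = statistics.median(daily_sales)  # middle value when sorted
stdev_sales = statistics.stdev(daily_sales)    # sample standard deviation

print("Mean:", round(mean_sales, 2))
print("Median:", median_sales)
print("Standard deviation:", round(stdev_sales, 2))
```

The same quantities are available in numpy and pandas, which are more convenient once the data gets larger.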
Again, it’s about solving problems. The focus should be on how to take a challenge and successfully overcome it. This applies to all fields especially in business and science. Don’t let the hype or myths distract you. Focus on the core concepts and you’ll do fine.

4. Python Quick Review
Here’s a quick Python review you can use as reference. If you’re stuck or need help with something, you can always use Google or Stack Overflow.
To have Python (and other data analysis tools and packages) on your computer, download and install Anaconda.
Basic Python data types include strings ("You are awesome."), integers (-3, 0, 1), and floats (3.0, 12.5, 7.77).
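To see which type a value has, you can ask Python directly with the built-in type() function. The short sketch below also shows converting between types; the variable names are chosen only for this example.

```python
message = "You are awesome."   # string
count = -3                     # integer
price = 12.5                   # float

print(type(message))   # <class 'str'>
print(type(count))     # <class 'int'>
print(type(price))     # <class 'float'>

# Converting between types
print(int("7") + 1)         # 8
print(float(3))             # 3.0
print(str(25) + " years")   # 25 years
```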
You can do mathematical operations in Python such as:
3 + 3
print(3 + 3)
7 - 1
5 * 2
20 / 5
9 % 2    # modulo operation, returns the remainder of the division
2 ** 3   # exponentiation, 2 to the 3rd power
Assigning values to variables:
myName = "Thor"
print(myName)   # output is Thor
x = 5
y = 6
print(x + y)    # result is 11
print(x * 3)    # result is 15
Working on strings and variables:
myName = "Thor"
age = 25
hobby = "programming"
print('Hi, my name is ' + myName + ' and my age is ' + str(age) + '. Anyway, my hobby is ' + hobby + '.')
Result is
Hi, my name is Thor and my age is 25. Anyway, my hobby is programming.
Comments
# Everything after the hashtag in this line is a comment.
# This is to keep your sanity.
# Make it understandable to you, learners, and other programmers.
Comparison Operators
>>> 8 == 8
True
>>> 8 > 4
True
>>> 8 < 4
False
>>> 8 != 4
True
>>> 8 != 8
False
>>> 8 >= 2
True
>>> 8 <= 2
False
>>> 'hello' == 'hello'
True
>>> 'cat' != 'dog'
True
Boolean Operators (and, or, not)
>>> 8 > 3 and 8 > 4
True
>>> 8 > 3 and 8 > 9
False
>>> 8 > 9 and 8 > 10
False
>>> 8 > 3 or 8 > 800
True
>>> 'hello' == 'hello' or 'cat' == 'dog'
True
>>> not 8 > 9
True
If, Elif, and Else Statements (for Flow Control)
savedPassword = "swordfish"   # hypothetical stored password, added so the example runs
print("What's your email?")
myEmail = input()
print("Type in your password.")
typedPassword = input()
if typedPassword == savedPassword:
    print("Congratulations! You're now logged in.")
else:
    print("Your password is incorrect. Please try again.")
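The login example uses only if and else. When there are more than two cases, elif lets you check further conditions in order. Here is a small sketch; the review scores and the three labels are assumptions chosen for illustration.

```python
scores = [5, 3, 1]   # hypothetical review scores on a 1-5 scale
labels = []

for score in scores:
    if score >= 4:
        label = "positive"
    elif score == 3:       # checked only when the first condition is False
        label = "neutral"
    else:                  # runs when none of the conditions above is True
        label = "negative"
    labels.append(label)
    print(score, label)
```

Python checks the conditions from top to bottom and runs only the first branch that matches.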
While loop
inbox = 0
while inbox < 10:
    print("You have a message.")
    inbox = inbox + 1
Result is this:
You have a message.
You have a message.
You have a message.
You have a message.
You have a message.
You have a message.
You have a message.
You have a message.
You have a message.
You have a message.
The loop doesn’t exit until you type 'Casanova'
name = ''
while name != 'Casanova':
    print('Please type your name.')
    name = input()
print('Congratulations!')
For loop
for i in range(10):
    print(i ** 2)
Here’s the output:
0
1
4
9
16
25
36
49
64
81
# Adding numbers from 0 to 100
total = 0
for num in range(101):
    total = total + num
print(total)
When you run this, the sum will be 5050.
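As a cross-check on the loop above, the same total can be obtained with the built-in sum() function and with the closed-form formula n(n + 1) / 2. This is only a sketch for comparison; the variable names are made up for the example.

```python
n = 100

# Loop version, as in the example above.
loop_total = 0
for num in range(n + 1):
    loop_total = loop_total + num

builtin_total = sum(range(n + 1))   # built-in sum() over the same range
formula_total = n * (n + 1) // 2    # closed form for 0 + 1 + ... + n

print(loop_total, builtin_total, formula_total)   # 5050 5050 5050
```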
# Another example. Positive and negative reviews.
all_reviews = [5, 5, 4, 4, 5, 3, 2, 5, 3, 2, 5, 4, 3, 1, 1, 2, 3, 5, 5]
positive_reviews = []
for i in all_reviews:
    if i > 3:
        print('Pass')
        positive_reviews.append(i)
    else:
        print('Fail')
print(positive_reviews)
print(len(positive_reviews))
ratio_positive = len(positive_reviews) / len(all_reviews)
print('Percentage of positive reviews: ')
print(ratio_positive * 100)
When you run this, you should see:
Pass
Pass
Pass
Pass
Pass
Fail
Fail
Pass
Fail
Fail
Pass
Pass
Fail
Fail
Fail
Fail
Fail
Pass
Pass
[5, 5, 4, 4, 5, 5, 5, 4, 5, 5]
10
Percentage of positive reviews:
52.63157894736842
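The filtering step in the review example can also be written more compactly with a list comprehension. This is an alternative sketch of the same logic (without the Pass/Fail printing), using the same data as above.

```python
all_reviews = [5, 5, 4, 4, 5, 3, 2, 5, 3, 2, 5, 4, 3, 1, 1, 2, 3, 5, 5]

# Keep only the scores above 3, in their original order.
positive_reviews = [score for score in all_reviews if score > 3]

ratio_positive = len(positive_reviews) / len(all_reviews)
print(positive_reviews)
print(f"Percentage of positive reviews: {ratio_positive * 100:.1f}")
```

Both versions build the same list; the explicit loop is easier to follow when you also want to do something else for each item.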
Functions
def hello():
    print('Hello world!')
hello()
Define the function, tell what it should do, and then use or call it later.
def add_numbers(a, b):
    print(a + b)
add_numbers(5, 10)
add_numbers(35, 55)
# Check if a number is odd or even.
def even_check(num):
    if num % 2 == 0:
        print('Number is even.')
    else:
        print('Hmm, it is odd.')
even_check(50)
even_check(51)
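The functions above print their results. In data analysis it is often more useful for a function to return a value, so the result can be stored and reused. A minimal sketch follows; is_even is a name made up for this variation.

```python
def add_numbers(a, b):
    """Return the sum instead of printing it."""
    return a + b


def is_even(num):
    """Return True if num is even, False otherwise."""
    return num % 2 == 0


total = add_numbers(5, 10) + add_numbers(35, 55)
print(total)          # 105
print(is_even(50))    # True
print(is_even(51))    # False
```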
Lists
my_list = ['eggs', 'ham', 'bacon']   # list with strings
colours = ['red', 'green', 'blue']
cousin_ages = [33, 35, 42]   # list with integers
mixed_list = [3.14, 'circle', 'eggs', 500]   # list with a float, strings, and an integer
# Working with lists
colours = ['red', 'blue', 'green']
colours[0]   # indexing starts at 0, so it returns the first item in the list, which is 'red'
colours[1]   # returns the second item, which is 'blue'
# Slicing the list
my_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(my_list[0:2])   # prints [0, 1]
print(my_list[1:])    # prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(my_list[3:6])   # prints [3, 4, 5]
# Length of list
my_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(my_list))   # prints 10
# Assigning new values to list items
colours = ['red', 'green', 'blue']
colours[0] = 'yellow'
print(colours)   # result should be ['yellow', 'green', 'blue']
# Concatenation and appending
colours = ['red', 'green', 'blue']
colours.append('pink')
print(colours)
The result will be:
['red', 'green', 'blue', 'pink']
fave_series = ['GOT', 'TWD', 'WW']
fave_movies = ['HP', 'LOTR', 'SW']
fave_all = fave_series + fave_movies
print(fave_all)
This prints ['GOT', 'TWD', 'WW', 'HP', 'LOTR', 'SW']

Those are just the basics. You might still need to refer to this whenever you’re doing anything related to Python. You can also refer to Python 3 Documentation for more extensive information. It’s recommended that you bookmark that for future reference. For quick review, you can also refer to Learn Python 3 in Y Minutes.
Tips for Faster Learning
If you want to learn faster, you just have to devote more hours each day to learning Python. Take note that programming and learning how to think like a programmer take time.
There are also various cheat sheets online you can always use. Even experienced programmers don’t know everything. Also, you actually don’t have to learn everything if you’re just starting out. You can always go deeper anytime if something interests you or you want to stand out in job applications or startup funding.

5. Overview & Objectives
Let’s set some expectations here so you know where you’re going. This is also to introduce the limitations of Python, data analysis, data science, and machine learning (and also the key differences). Let’s start.
Data Analysis vs Data Science vs Machine Learning
Data Analysis and Data Science are almost the same because they share the same goal, which is to derive insights from data and use them for better decision making.
Often, data analysis is associated with using Microsoft Excel and other tools for summarizing data and finding patterns. On the other hand, data science is often associated with using programming to deal with massive data sets. In fact, data science became popular as a result of the generation of gigabytes of data coming from online sources and activities (search engines, social media).
Being a data scientist sounds way cooler than being a data analyst. Although the job functions might be similar and overlapping, it all deals with discovering patterns and generating insights from data. It’s also about asking intelligent questions about the nature of the data (e.g. Do data points form organic clusters? Is there really a connection between age and cancer?).
What about machine learning? Often, the terms data science and machine learning are used interchangeably. That’s because the latter is about “learning from data.” When applying machine learning algorithms, the computer detects patterns and uses “what it learned” on new data.
For instance, we want to know if a person will pay his debts. Luckily we have a sizable dataset about different people who either paid their debts or not. We have also collected other data (creating customer profiles) such as age, income range, location, and occupation. When we apply the appropriate machine learning algorithm, the computer will learn from the data. We can then input new data (new info from a new applicant) and what the computer learned will be applied to that new data.
We might then create a simple program that immediately evaluates whether a person will pay his debts or not based on his information (age, income range, location, and occupation). This is an example of using data to predict someone’s likely behavior.
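To make the idea of “learning from data” concrete, here is a deliberately tiny sketch. It is not a production method and not necessarily the approach used later in this book: it simply predicts that a new applicant will behave like the most similar past applicant (a 1-nearest-neighbor rule). All records, the two features used (age and income), and the function name predict_will_pay are made up for illustration.

```python
# Hypothetical past applicants: (age, yearly income in thousands) -> paid the debt?
# All numbers are made up for illustration only.
past_applicants = [
    ((25, 20), False),
    ((30, 35), False),
    ((35, 60), True),
    ((45, 80), True),
    ((50, 40), True),
    ((23, 15), False),
]


def predict_will_pay(applicant, history):
    """Predict using the single most similar past applicant (1-nearest neighbor)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest_features, nearest_label = min(
        history, key=lambda record: squared_distance(record[0], applicant)
    )
    return nearest_label


print(predict_will_pay((38, 65), past_applicants))   # closest record is (35, 60)
print(predict_will_pay((24, 18), past_applicants))   # closest record is (25, 20)
```

With real data you would use many more records and features, scale the features so one does not dominate the distance, and measure how well the predictions hold up on data the program has not seen.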
Possibilities
Learning from data opens a lot of possibilities especially in predictions and optimizations. This has become a reality thanks to the availability of massive datasets and superior computer processing power. We can now process gigabytes of data within a day using computers or cloud capabilities.
Although data science and machine learning algorithms are still far from perfect, these are already useful in many applications such as image recognition, product recommendations, search engine rankings, and medical diagnosis. And to this moment, scientists and engineers around the globe continue to improve the accuracy and performance of their tools, models, and analysis.
Limitations of Data Analysis & Machine Learning
You might have read from news and online articles that machine learning and advanced data analysis can change the fabric of society (automation, loss of jobs, universal basic income, artificial intelligence takeover).
In fact, society is being changed right now. Behind the scenes machine learning and continuous data analysis are at work especially in search engines, social media, and e-commerce. Machine learning now makes it easier and faster to answer questions such as the following:
● Are there human faces in the picture?
● Will a user click an ad? (is it personalized and appealing to him/her?)
● How to create accurate captions on YouTube videos? (recognise speech and translate into text)
● Will an engine or component fail? (preventive maintenance in manufacturing)
● Is a transaction fraudulent?
● Is an email spam or not?
These are made possible by the availability of massive datasets and great processing power. However, advanced data analysis using Python (and machine learning) is not magic. It’s not the solution to all problems. That’s because the accuracy and performance of our tools and models heavily depend on the integrity of data and our own skill and judgment.
Yes, computers and algorithms are great at providing answers. But it’s also about asking the right questions. Those intelligent questions will come from us humans. It also depends on us if we’ll use the answers being provided by our computers.
Accuracy & Performance
The most common use of data analysis is in successful predictions (forecasting)
and optimization. Will the demand for our product increase in the next five
years? What are the optimal routes for deliveries that lead to the lowest
operational costs?
That’s why an accuracy improvement of even just 1% can translate into millions
of dollars of additional revenues. For instance, big stores can stock up on certain
products in advance if the results of the analysis predict an increasing demand.
Shipping and logistics can also better plan the routes and schedules for lower
fuel usage and faster deliveries.
Aside from improving accuracy, another priority is ensuring reliable
performance. How will our analysis perform on new datasets? Should we
consider other factors when analyzing the data and making predictions? Our
work should always produce consistently accurate results. Otherwise, it’s not
scientific at all because the results are not reproducible. We might as well shoot
in the dark instead of exhausting ourselves in sophisticated data analysis.
Apart from successful forecasting and optimization, proper data analysis can
also help us uncover opportunities. Later we may realize that what we did is also
applicable to other projects and fields. We can also detect outliers and interesting
patterns if we dig deep enough. For example, perhaps customers congregate in
clusters that are big enough for us to explore and tap into. Maybe there are
unusually high concentrations of customers that fall into a certain income
range or spending level.
Those are just typical examples of the applications of proper data analysis. In the
next chapter, let’s discuss one of the most used examples in illustrating the
promising potential of data analysis and machine learning. We’ll also discuss its
implications and the opportunities it presents.

6. A Quick Example
Iris Dataset
Let’s quickly see how data analysis and machine learning work in real world
datasets. The goal here is to quickly illustrate the potential of Python and
machine learning on some interesting problems.
In this particular example, the goal is to predict the species of an Iris flower
based on the length and width of its sepals and petals. First, we have to create a
model based on a dataset with the flowers’ measurements and their
corresponding species. Based on our code, our computer will “learn from the
data” and extract patterns from it. It will then apply what it learned to a new
dataset. Let’s look at the code.
# importing the necessary libraries
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.metrics import accuracy_score
import numpy as np
# loading the iris dataset
iris = load_iris()
x = iris.data  # array of the data
y = iris.target  # array of labels (i.e. answers) of each data entry
# getting label names i.e. the three flower species
y_names = iris.target_names
# taking random indices to split the dataset into train and test
test_ids = np.random.permutation(len(x))
# splitting data and labels into train and test
# keeping last 10 entries for testing, rest for training
x_train = x[test_ids[:-10]]
x_test = x[test_ids[-10:]]
y_train = y[test_ids[:-10]]
y_test = y[test_ids[-10:]]
# classifying using decision tree
clf = tree.DecisionTreeClassifier()
# training (fitting) the classifier with the training set
clf.fit(x_train, y_train)
# predictions on the test dataset
pred = clf.predict(x_test)
print(pred)  # predicted labels i.e. flower species
print(y_test)  # actual labels
# prediction accuracy as a percentage (multiply before printing)
print(accuracy_score(y_test, pred) * 100)
# Reference: http://docs.python-guide.org/en/latest/scenarios/ml/
If we run the code, we’ll get something like this:
[0 1 1 1 0 2 0 2 2 2]
[0 1 1 1 0 2 0 2 2 2]
100.0
The first line contains the predictions (0 is Iris setosa, 1 is Iris versicolor, 2 is Iris
virginica). The second line contains the actual flower species as indicated in the
dataset. Notice the prediction accuracy is 100%, which means we correctly
predicted each flower’s species. Because the split is random, the exact labels and
the accuracy can differ from run to run.
These might all seem confusing at first. What you need to understand is that the
goal here is to create a model that predicts a flower’s species. To do that, we split
the data into training and test sets. We run the algorithm on the training set and
use it against the test set to know the accuracy. The result is we’re able to predict
the flower’s species on the test set based on what the computer learned from the
training set.
Potential & Implications
It’s a quick and simple example. But its potential and implications can be
enormous. With just a few modifications, you can apply the workflow to a wide
variety of tasks and problems.
For instance, we might be able to apply the same methodology on other flower
species, plants, and animals. We can also apply this in other Classification
problems (more on this later) such as determining if a cancer is benign or
malignant, if a person is a very likely customer, or if there’s a human face in the
photo.
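As a rough illustration of reusing the Iris workflow on another classification problem, the sketch below swaps in the breast cancer dataset that ships with scikit-learn. The choice of dataset, the 100-sample test set, and the fixed random seed are assumptions made for this example; they are not taken from the book.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Load labeled tumor measurements (label 0 = malignant, 1 = benign).
cancer = load_breast_cancer()
x, y = cancer.data, cancer.target

# Shuffle the row indices; the seed is fixed only to make the run repeatable.
rng = np.random.default_rng(0)
ids = rng.permutation(len(x))

# Keep the last 100 shuffled entries for testing, the rest for training.
x_train, x_test = x[ids[:-100]], x[ids[-100:]]
y_train, y_test = y[ids[:-100]], y[ids[-100:]]

# Same algorithm as in the Iris example, applied to different data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
pred = clf.predict(x_test)

accuracy = accuracy_score(y_test, pred) * 100
print(cancer.target_names)
print("Accuracy: %.1f%%" % accuracy)
```

Only the data-loading line and the size of the test set changed; the train, predict, and evaluate steps are the same as before.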
The challenge here is to get enough quality data so our computer can properly
get “good training.” It’s a common methodology to first learn from the training
set and then apply the learning to the test set and possibly new data in the
future (this is the essence of machine learning).
It’s obvious now why many people are hyped about the true potential of data
analysis and machine learning. With enough data, we can create automated
systems for predicting events and classifying objects. With enough X-ray images
with correct labels (with lung cancer or not), our computers can learn from the
data and make instant classification of a new unlabeled X-ray image. We can
also apply a similar approach to other medical diagnosis and related fields.
Back then, data analysis was widely used for studying the past and preparing
reports. But now, it can be used instantaneously to predict outcomes in real time.
This is the true power of data, wherein we can use it to make quick and smart
decisions.
Many experts agree that we’re still just scratching the surface of the power of
performing data analysis using significantly large datasets. In the years to come,
we’ll encounter applications never thought of before. Many tools
and approaches will also become obsolete as a result of these changes.
But many things will remain the same and the principles will always be there.
That’s why in the following chapters, we’ll focus more on getting into the
mindset of a savvy data analyst. We’ll explore some approaches in doing things
but these will only be used to illustrate timeless and important points.
For example, the general workflow and process in data analysis involve these
things:
● Identifying the problem (asking the right questions)
● Getting & processing data
● Visualizing data
● Choosing an approach and algorithm
● Evaluating the output
● Trying other approaches & comparing the results
● Knowing if the results are good enough (knowing when to stop)
It’s good to determine the objective of the project first so we can set clear
expectations and boundaries on our project. Second, let’s then gather data (or get
access to it) so we can start the proper analysis. Let’s do that in the next chapter.

7. Getting & Processing Data
Garbage In, Garbage Out. This is true especially in data analysis. After all, the
accuracy of our analysis heavily depends on the quality of our data. If we put
in garbage, expect garbage to come out.
That’s why data analysts and machine learning engineers spend extra time in
getting and processing quality data. To accomplish this, the data should be in the
right format to make it usable for analysis and other purposes. Next, the data
should be processed properly so we can apply algorithms to it and make sure
we’re doing proper analysis.
CSV Files
CSV files are perhaps the most common data format you’ll encounter in data
science and machine learning (especially when using Python). CSV means
comma-separated values. The values in different columns are separated by
commas. Here’s an example:
Product,Price
cabbage,6.8
lettuce,7.2
tomato,4.2
It’s a simple 2-column example. In many modern data analysis projects, it may
look something like this:
RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure….
1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
4,15701354,Boni,699,France,Female,39,1,0,2,0,0,93826.63,0
5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
6,15574012,Chu,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1
7,15592531,Bartlett,822,France,Male,50,7,0,2,1,1,10062.8,0
8,15656148,Obinna,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1
9,15792365,He,501,France,Male,44,4,142051.07,2,0,1,74940.5,0
10,15592389,H?,684,France,Male,27,2,134603.88,1,1,1,71725.73,0
11,15767821,Bearce,528,France,Male,31,6,102016.72,2,0,0,80181.12,0
12,15737173,Andrews,497,Spain,Male,24,3,0,2,1,0,76390.01,0
13,15632264,Kay,476,France,Female,34,10,0,2,1,0,26260.98,0
14,15691483,Chin,549,France,Female,25,5,0,2,0,0,190857.79,0
15,15600882,Scott,635,Spain,Female,35,7,0,2,1,1,65951.65,0
16,15643966,Goforth,616,Germany,Male,45,3,143129.41,2,0,1,64327.26,0
17,15737452,Romeo,653,Germany,Male,58,1,132602.88,1,1,0,5097.67,1
18,15788218,Henderson,549,Spain,Female,24,9,0,2,1,1,14406.41,0
19,15661507,Muldrow,587,Spain,Male,45,6,0,1,0,0,158684.81,0
20,15568982,Hao,726,France,Female,24,6,0,2,1,1,54724.03,0
21,15577657,McDonald,732,France,Male,41,8,0,2,1,1,170886.17,0
22,15597945,Dellucci,636,Spain,Female,32,8,0,2,1,0,138555.46,0
23,15699309,Gerasimov,510,Spain,Female,38,4,0,1,1,0,118913.53,1
24,15725737,Mosman,669,France,Male,46,3,0,2,0,1,8487.75,0
25,15625047,Yen,846,France,Female,38,5,0,1,1,1,187616.16,0
26,15738191,Maclean,577,France,Male,25,3,0,2,0,1,124508.29,0
27,15736816,Young,756,Germany,Male,36,2,136815.64,1,1,1,170041.95,0
28,15700772,Nebechi,571,France,Male,44,9,0,2,0,0,38433.35,0
29,15728693,McWilliams,574,Germany,Female,43,3,141349.43,1,1,1,100187.43,0
30,15656300,Lucciano,411,France,Male,29,0,59697.17,2,1,1,53483.21,0
31,15589475,Azikiwe,591,Spain,Female,39,3,0,3,1,0,140469.38,1
….
Real world data (especially in e-commerce, social media, and online ads) could
contain millions of rows and thousands of columns.
CSV files are convenient to work with and you can easily find lots of them from
different online sources. The format is structured, and Python allows easy
processing of it with a few lines of code:
import pandas as pd
dataset = pd.read_csv('Data.csv')
This step is often necessary before Python and your computer can work on the
data. So whenever you’re working on a CSV file and you’re using Python, it’s
good to immediately have those two lines of code at the top of your project.
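To see what those two lines produce, here is a minimal sketch that loads the small Product/Price table shown earlier. Reading from an in-memory string with `io.StringIO` (instead of a file named Data.csv) is an assumption made so the example runs without any file on disk.

```python
import io

import pandas as pd

# The 2-column CSV example from the text, held in a string instead of a file.
csv_text = """Product,Price
cabbage,6.8
lettuce,7.2
tomato,4.2
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())    # first rows of the table
print(dataset.dtypes)    # Price is parsed as a floating-point column
mean_price = dataset["Price"].mean()
print(mean_price)
```

With a real file, only the argument changes: pass the path (for example 'Data.csv') to `pd.read_csv`.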
Then, we set the input values (X) and the output values (y). Often, the y values
are our target outputs. For example, the common goal is to learn how certain
values of X affect the corresponding y values. Later on, that learning can be
applied on new X values to see if that learning is useful in predicting y values
(unknown at first).
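One common way to set X and y from a loaded table is sketched below. The tiny Age/Income/Purchased table is hypothetical, and the convention that the last column holds the target is an assumption for this example; a real dataset may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical data: two input columns and one target column (Purchased).
csv_text = """Age,Income,Purchased
25,30000,0
47,82000,1
35,54000,0
52,110000,1
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Inputs: every column except the last one.
X = dataset.iloc[:, :-1].values
# Output (target): the last column only.
y = dataset.iloc[:, -1].values

print(X.shape)  # (4, 2): four rows, two input features
print(y)        # [0 1 0 1]
```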
After the data becomes readable and usable, often the next step is to ensure that
the values don’t vary much in scale and magnitude. That’s because values in
certain columns might be in a different league than the others. For instance, the
ages of customers can range from 18 to 70. But the incomes are in the
range of 100000 to 9000000. The gap in the ranges of the two columns would
have a huge effect on our model. Perhaps the income range will dominate
the resulting predictions instead of both ages and income
being treated equally.
One way to do feature scaling (scaling values to the same magnitude)
is by using the following lines of code:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# sc_y = StandardScaler()
# y_train = sc_y.fit_transform(y_train)
The goal here is to scale the values to
the same magnitude so all the values from different columns or features will
contribute to the predictions and outputs.
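The effect of scaling is easier to see on concrete numbers. The sketch below uses made-up ages and incomes within the ranges mentioned in the text; the specific values are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: column 0 is age, column 1 is income.
X_train = np.array([[18, 100000.0],
                    [35, 2500000.0],
                    [52, 6000000.0],
                    [70, 9000000.0]])
X_test = np.array([[40, 3000000.0]])

sc_X = StandardScaler()
# fit_transform learns each column's mean and standard deviation
# from the training set, then rescales the training set with them.
X_train_scaled = sc_X.fit_transform(X_train)
# transform reuses the training statistics on the test set.
X_test_scaled = sc_X.transform(X_test)

col_means = X_train_scaled.mean(axis=0)
col_stds = X_train_scaled.std(axis=0)
print(col_means)  # each column is centered near 0
print(col_stds)   # each column has a standard deviation near 1
print(X_test_scaled)
```

Note that the scaler is fitted on the training set only; the test set is transformed with the same statistics so that no information from the test data leaks into the model.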
In data analysis and machine learning, it’s often a general requirement to divide
the dataset into Training Set and Test Set. After all, we need to create a model
and test its performance and accuracy. We use the Training Set so our computer
can learn from the data. Then, we use that learning against the Test Set and see if
its performance is good enough.
A common way to accomplish this is through the following code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 0)
Here, we imported something from scikit-learn (free
software machine learning library for the Python programming language) and
performed a split on the dataset. The division is often 80% Training Set and 20%
Test Set (test_size = 0.2). The random_state can be any value as long as you
remain consistent through the succeeding parts of your project.
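A small sketch can confirm what the split does to the shapes of the arrays and why a fixed random_state matters. The 50-row synthetic X and y below are assumptions used only to make the sizes easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 2 features, alternating 0/1 labels.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)

# Repeating the call with the same random_state gives the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=0)
same_split = np.array_equal(X_test, X_test2)
print(same_split)  # True
```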
You can actually use different ratios when dividing your dataset. Some use a ratio
of 70-30 or even 60-40. Just keep in mind that the Training Set should be
large enough for meaningful learning to occur. It’s similar to gaining different life
experiences so we can gain a more accurate representation of reality (e.g. use of
several mental models as popularized by Charlie Munger, long-time business
partner of Warren Buffett).
That’s why it’s recommended to gather more data to make the “learning” more
accurate. With scarce data, our system might fail to recognize patterns. Our
algorithm might even overfit the limited data, which results in the
algorithm failing to work on new data. In other words, it shows excellent results
when we use our existing data, but it fails spectacularly when new data is used.
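The gap between results on existing data and on new data can be demonstrated with a deliberately extreme sketch. Here the labels are random noise, so there is no real pattern to learn; the synthetic data, the seed, and the sample sizes are all assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A scarce training set: 30 rows, 5 random features, random 0/1 labels.
X_train = rng.random((30, 5))
y_train = rng.integers(0, 2, size=30)

# "New" data drawn the same way, never seen during training.
X_new = rng.random((200, 5))
y_new = rng.integers(0, 2, size=200)

# An unrestricted decision tree can memorize every training row.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
new_acc = accuracy_score(y_new, clf.predict(X_new))
print("Accuracy on existing data:", train_acc)
print("Accuracy on new data:", new_acc)
```

The tree scores perfectly on the rows it memorized, while on new rows it does no better than guessing would be expected to, since the labels carry no pattern.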
There are also cases when we already have a sufficient amount of data for
meaningful learning to occur. Often we won’t need to gather more data because
the effect could be negligible (e.g. 0.0000001% accuracy improvement) or huge
investments in time, effort, and money would be required. In these cases it might
be better to work with what we already have than to look for something new.

Feature Selection
We might have lots of data. But are all of them useful and relevant? Which
columns and features are likely to be contributing to the result?
Often, some of our data are just irrelevant to our analysis. For example, does the
name of a startup affect its funding success? Is there any relation between a
person’s favorite color and her intelligence?
Selecting the most relevant features is also a crucial task in processing data. Why
waste precious time and computing resources on including irrelevant
features/columns in our analysis? Worse, would the irrelevant features skew our
analysis?
The answer is yes. As mentioned early in the chapter, Garbage In Garbage Out.
If we include irrelevant features in our analysis, we might also get inaccurate and
irrelevant results. Our computer and algorithm would be “learning from bad
examples” which leads to erroneous results.
To eliminate the Garbage and improve the accuracy and relevance of our
analysis, Feature Selection is often done. As the term implies, we select
“features” that have the biggest contribution and immediate relevance to the
output. This makes our predictive model simpler and easier to understand.
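One automated way to rank features by their statistical relationship with the output is sketched below using scikit-learn's `SelectKBest`. This is a swapped-in technique for illustration, applied to the Iris data from the previous chapter; the choice of the ANOVA F-score (`f_classif`) and of keeping two features are assumptions, and such scores complement rather than replace judgment about which features make sense.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Score each of the four features against the species label
# and keep the two with the highest scores.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(iris.data, iris.target)

# get_support() returns a True/False mask over the original features.
kept = [name for name, keep in zip(iris.feature_names, selector.get_support())
        if keep]
print(kept)
print(X_selected.shape)  # (150, 2)
```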
For example, we might have 20+ features that describe customers. These
features include age, income range, location, gender, whether they have kids or
not, spending level, recent purchases, highest educational attainment, whether
they own a house or not, and over a dozen more attributes. However, not all of
these may have any relevance to our analysis or predictive model. Although
it’s possible that all these features may have some effect, the analysis might be
too complex for it to become useful.
Feature Selection is a way of simplifying analysis by focusing on relevance. But
how do we know if a certain feature is relevant? This is where domain
knowledge and expertise come in. For example, the data analyst or the team
should have knowledge about retail (in our example above). This way, the team
can properly select the features that have the most impact on the predictive model
or analysis.
Different fields often have different relevant features. For instance, analyzing
retail data might be totally different than studying wine quality data. In retail we
focus on features that influence people’s purchases (and in what quantity). On
the other hand, analyzing wine quality data might require studying the wine’s
chemical constituents and their effects on people’s preferences.
In addition, it requires some domain knowledge to know which features are
interdependent with one another. In our example above about wine quality,
substances in the wine might react with one another and hence affect the
amounts of such substances. When you increase the amount of a substance, it
may increase or decrease the amount of another.
It’s also the case with analyzing business data. More customers also means more
sales. People from higher income groups might also have higher spending levels.
These features are interdependent and excluding a few of those could simplify
our analysis.
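A quick first check for interdependent features is a correlation matrix. The sketch below uses the Iris measurements as a stand-in (an assumption, since no wine or retail dataset is given in the text); a correlation close to 1 or -1 between two columns suggests they carry overlapping information.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlation between all feature columns.
corr = df.corr()
print(corr.round(2))

# Petal length and petal width move together very closely.
r = corr.loc["petal length (cm)", "petal width (cm)"]
print(round(r, 2))
```

Correlation only captures linear relationships, so it is a starting point rather than a full answer; domain knowledge is still needed to decide which of two related features to keep.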
Selecting the most appropriate features might also take extra time, especially
when you’re dealing with a huge dataset (with hundreds or even thousands of
columns). Professionals often try different combinations and see which yields
the best results (or look for something that makes the most sense).
In general, domain expertise could be more important than the data analysis skill
itself. After all, we should start with asking the right questions rather than focusing on
applying the most elaborate algorithm to the data. To figure out the right
questions (and the most important ones), you or someone from your team should
have expertise in the subject.
Online Data Sources
We’ve discussed how to process data and select the most relevant features. But
where do we get data in the first place? How do we ensure their credibility? And
for beginners, where to get data so they can practice analyzing data?
You can start with the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets.html) wherein you can access datasets
about business, engineering, life sciences, social sciences, and physical sciences.
You can find data about El Nino, social media, handwritten characters,
sensorless drive diagnosis, bank marketing, and more. It’s more than enough to
fill your time for months and years if you get serious on large-scale data
analysis.
You can also find more interesting datasets in Kaggle
(https://www.kaggle.com/datasets) such as data about Titanic Survival, grocery
shopping, medical diagnosis, historical air quality, Amazon reviews, crime
statistics, and housing prices.
Just start with those two and you’ll be fine. It’s good to browse through the
datasets as early as today so that you’ll get ideas and inspiration on what to do
with data. Take note that data analysis is about exploring and solving problems,
which is why it’s always good to explore out there so you can be closer to the
situations and challenges.
Internal Data Source
If you’re planning to work in a company, university, or research institution,
there’s a good chance you’ll work with internal data. For example, if you’re
working in a big e-commerce company, expect that you’ll work on the data your
company gathers and generates.
Big companies often generate megabytes of data every second. These are being
stored and/or processed into a database. Your job then is to make sense of those
endless streams of data and use the derived insights for better efficiency or
profitability.
First, the data being gathered should be relevant to the operations of the
business. Perhaps the time of purchase, the category where the product falls
under, and if it’s offered at a discount are all relevant. This information should
then be stored in the database (with backups) so your team can analyze it later.
The data can be stored in different formats and file types such as CSV, SQLite,
JSON, and BigQuery. The file type your company chose might have depended on
convenience and existing infrastructure. It’s important to know how to work with
these file types (often they’re mentioned in job descriptions) so you can make
meaningful analysis.
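As a rough sketch of working with two of these formats, the example below round-trips a small table through SQLite and JSON using pandas. The table contents, the table name `prices`, and the in-memory database are assumptions for illustration; a company database would be reached through its own connection details.

```python
import io
import sqlite3

import pandas as pd

# A small hypothetical table to store and read back.
records = pd.DataFrame({
    "product": ["cabbage", "lettuce", "tomato"],
    "price": [6.8, 7.2, 4.2],
})

# SQLite: write the table to an in-memory database, then query it.
conn = sqlite3.connect(":memory:")
records.to_sql("prices", conn, index=False)
from_sqlite = pd.read_sql_query("SELECT product, price FROM prices", conn)
conn.close()

# JSON: serialize to a list of records, then parse it back.
json_text = records.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

print(from_sqlite)
print(json_text)
print(from_json)
```

Whatever the storage format, the result is the same kind of DataFrame that `pd.read_csv` produces, so the later analysis steps do not change.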

8. Data Visualization
Data visualization makes it easier and faster to make meaningful analysis on the
data. In many cases it’s one of the first steps when performing a data analysis.
You access and process the data and then start visualizing it for quick insights
(e.g. looking for obvious patterns, outliers, etc.)
Goal of Visualization
Exploring and communicating data is the main goal of data visualization. When
the data is visualized (in a bar chart, histogram, or other forms), patterns become
immediately obvious. You’ll know quickly if there’s a rising trend (line graph) or
the relative magnitude of something in relation to other factors (e.g. using a pie
chart). Instead of telling people the long list of numbers, why not just show it to
them for better clarity?
For example, let’s look at the worldwide search trend on the word ‘bitcoin’:
https://trends.google.com/trends/explore?q=bitcoin
Immediately you’ll notice there’s a temporary massive increase in interest about
‘bitcoin’ but generally it steadily decreases over time after that peak. Perhaps
during the peak there was massive hype about the technological and social impact
of bitcoin. And then the hype naturally died down because people were already
familiar with it or it’s just a natural thing about hypes.
Whichever is the case, data visualization allowed us to quickly see the patterns
in a much clearer way. Remember the goal of data visualization, which is to
explore and communicate data. In this example, we’re able to quickly see the
patterns and what the data communicated to us.
This is also important when presenting to a panel or the public. Other people
might just prefer a quick overview of the data without going too much into the
details. You don’t want to bother them with boring texts and numbers. What
makes a bigger impact is how you present the data so people will immediately
know its importance. This is where data visualization comes in, wherein you
allow people to quickly explore the data and effectively communicate what
you’re trying to say.
There are several ways of visualizing data. You can immediately create plots and
graphs with Microsoft Excel. You can also use D3, seaborn, Bokeh, and
matplotlib. In this and in the succeeding chapters, we’ll focus on using
matplotlib.
Importing & Using Matplotlib
According to their homepage (https://matplotlib.org/2.0.2/index.html):
“Matplotlib is a Python 2D plotting library which produces publication quality
figures in a variety of hardcopy formats and interactive environments across
platforms. Matplotlib can be used in Python scripts, the Python and IPython
shell, the jupyter notebook, web application servers, and four graphical user
interface toolkits.”
In other words, you can easily generate plots, histograms, bar charts, scatterplots,
and many more using Python and a few lines of code. Instead of spending so
much time figuring things out, you can focus on generating plots for faster
analysis and data exploration.
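As a small taste of the "few lines of code" claim, here is a minimal histogram sketch. The customer ages are synthetic (randomly generated between 18 and 70), and the `Agg` backend line is an assumption that lets the script run on machines without a display; remove it and call `plt.show()` to see the window.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 500 customer ages between 18 and 70 (inclusive).
rng = np.random.default_rng(0)
ages = rng.integers(18, 71, size=500)

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, bars = ax.hist(ages, bins=10, color="green",
                                  edgecolor="black")
ax.set_xlabel("Age")
ax.set_ylabel("Number of customers")
ax.set_title("Customer age distribution (synthetic data)")

print(counts)
print(bin_edges)
```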
That sounds like a mouthful. But always remember it’s still about exploring and
communicating data. Let’s look at an example to make this clear. First, here’s a
simple horizontal bar chart
(https://matplotlib.org/2.0.2/examples/lines_bars_and_markers/barh_demo.html):
To create that, you only need this block of code:
import numpy as np
import matplotlib.pyplot as plt
plt.rcdefaults()
fig, ax = plt.subplots()
# Example data
people = ('Tom', 'Dick', 'Harry', 'Slim', 'Jim')
y_pos = np.arange(len(people))
performance = 3 + 10 * np.random.rand(len(people))
error = np.random.rand(len(people))
ax.barh(y_pos, performance, xerr=error, align='center',
        color='green', ecolor='black')
ax.set_yticks(y_pos)
ax.set_yticklabels(people)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Performance')
ax.set_title('How fast do you want to go today?')
plt.show()
It looks complex at first. But what it did was import the necessary
libraries, set the data, and describe how it should be shown. Writing it all from
scratch might be difficult. The good news is we can copy the code examples and
modify them according to our purposes and new data.
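As a minimal sketch of that copy-and-adapt workflow, the bar chart code can be trimmed down and pointed at your own numbers. The product names and sales figures below are made-up placeholders, and the helper name make_bar_chart is an assumption for illustration, not part of matplotlib.

```python
import matplotlib.pyplot as plt


def make_bar_chart(labels, values, xlabel, title):
    """Draw a horizontal bar chart and return the figure and axes."""
    fig, ax = plt.subplots()
    positions = list(range(len(labels)))
    ax.barh(positions, values, align='center', color='green')
    ax.set_yticks(positions)
    ax.set_yticklabels(labels)
    ax.invert_yaxis()  # first label at the top
    ax.set_xlabel(xlabel)
    ax.set_title(title)
    return fig, ax


# Hypothetical data: replace with your own
products = ['Product A', 'Product B', 'Product C']
units_sold = [120, 85, 240]
fig, ax = make_bar_chart(products, units_sold, 'Units sold', 'Sales by product')
```

Calling plt.show() afterwards displays the chart; fig.savefig('sales.png') writes it to a file instead.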
Aside from horizontal bar charts, matplotlib is also useful for creating and
displaying scatterplots, boxplots, and other visual representations of data:

"""
Simple demo of a scatter plot.
"""
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()

import matplotlib.pyplot as plt
from numpy.random import rand
fig, ax = plt.subplots()
for color in ['red', 'green', 'blue']:
    n = 750
    x, y = rand(2, n)
    scale = 200.0 * rand(n)
    ax.scatter(x, y, c=color, s=scale, label=color,
               alpha=0.3, edgecolors='none')
ax.legend()
ax.grid(True)
plt.show()

Source of images and code: https://matplotlib.org/2.0.2/gallery.html
These are just to show you the usefulness and possibilities in using matplotlib.
Notice that you can make publication-quality data visualizations. Also notice
that you can modify the example code to suit your purpose. There’s no need to
reinvent the wheel. You can copy the appropriate sections and adapt them to
your data.
Perhaps in the future there will be faster and easier ways to create data
visualizations, especially when working with huge datasets. You can even create
animated presentations that can change through time. Whichever is the case, the
goal of data visualization is to explore and communicate data. You can choose
other methods but the goal always remains the same.
In this chapter and the previous ones we’ve discussed general things about
analyzing data. In the succeeding chapters, let’s start discussing advanced topics
that are specific to machine learning and advanced data analysis. The initial goal
is to get you familiar with the most common concepts and terms used in data
science circles. Let’s start with defining what Supervised Learning and
Unsupervised Learning mean.

9. Supervised & Unsupervised Learning
In many introductory courses and books about machine learning and data
science, you’ll likely encounter Supervised and Unsupervised Learning and the
differences between them. That’s because these are the two general
categories of machine learning and data science tasks many professionals do.
What is Supervised Learning?
First, Supervised Learning is a lot like learning from examples. For
instance, we have a huge collection of images correctly labeled as either dogs or
cats. Our computer will then learn from those given examples and correct labels.
Perhaps our computer will find patterns and similarities among those images.
And finally, when we introduce new images, our computer and model will
successfully identify whether there’s a dog or a cat in each image.
It’s a lot like learning with supervision. There are correct answers (e.g. cats or
dogs) and it’s the job of our model to align itself so that on new data it can still
produce correct answers (at an acceptable performance level, because it’s hard to
reach 100%).
For example, Linear Regression is considered under Supervised Learning.
Remember that in linear regression we’re trying to predict the value of y for a
given x. But first, we have to find patterns and “fit” a line that best describes the
relationship between x and y (and predict y values for new x inputs).
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Source:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py
It looks like a simple example. However, that line was the result of
minimising the residual sum of squares between the true values and the
predictions. In other words, the goal was to produce the correct prediction using
what the model learned from previous examples.
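For a single feature, the line that minimises the residual sum of squares has a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below uses plain Python and made-up points; the function names fit_line and residual_sum_of_squares are illustrative and are not scikit-learn's API.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residual_sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared gaps between the true y values and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up points that lie roughly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
rss = residual_sum_of_squares(xs, ys, slope, intercept)
print(slope, intercept, rss)
```

Nudging the slope or the intercept away from the fitted values can only increase the residual sum of squares, which is what "best fit" means here.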
Another task that falls under Supervised Learning is Classification. Here, the
goal is to correctly classify new data into one of two categories (or more, in
the multi-class case). For instance, we want to know if an incoming email is
spam or not. Again, our model will learn from examples (emails correctly
labeled as spam or not). With that “supervision”, we can then create a model
that will correctly predict if a new email is spam or not.
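To make "learning from labeled examples" concrete, here is a minimal sketch of a one-nearest-neighbour classifier: a new email gets the label of the most similar labeled email. The two features (how often "free" and "!" appear) and all the counts are made up for illustration; this is one simple technique among many, not the method of any particular library.

```python
def nearest_neighbour_label(examples, new_point):
    """Return the label of the labeled example closest to new_point."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    closest = min(examples, key=lambda ex: squared_distance(ex[0], new_point))
    return closest[1]


# Hypothetical training emails: (count of "free", count of "!") with labels
labelled_emails = [
    ((5, 7), 'spam'),
    ((4, 9), 'spam'),
    ((6, 4), 'spam'),
    ((0, 1), 'not spam'),
    ((1, 0), 'not spam'),
    ((0, 0), 'not spam'),
]

print(nearest_neighbour_label(labelled_emails, (4, 6)))  # near the spam examples
print(nearest_neighbour_label(labelled_emails, (1, 1)))  # near the non-spam examples
```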
What is Unsupervised Learning?
In contrast, Unsupervised Learning means there’s no supervision or guidance.
It’s often thought of as having no correct answers, just acceptable ones.
For example, in Clustering (this falls under Unsupervised Learning) we’re trying
to discover where data points aggregate (e.g. are there natural clusters?). Each
data point is not labeled anything so our model and computer won’t be learning
from examples. Instead, our computer is learning to identify patterns without any
external guidance.
This seems to be the essence of true Artificial Intelligence wherein the computer
can learn without human intervention. It’s about learning from the data itself and
trying to find the relationship between different inputs (notice there’s no
expected output here in contrast to Regression and Classification discussed
earlier). The focus is on inputs and trying to find the patterns and relationships
among them. Perhaps there are natural clusters or there are clear associations
among the inputs. It’s also possible that there’s no useful relationship at all.
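As a small illustration of finding clusters without labels, the sketch below runs a bare-bones k-means on one-dimensional points. The data and the starting centroids are made up, and k-means is used here only as one familiar clustering technique; the function name k_means_1d is hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Group 1-D points around the given starting centroids (plain k-means)."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up, unlabeled measurements
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No label is ever supplied: the two groups emerge only from how close the points are to one another.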
How to Approach a Problem
Many data scientists approach a problem in a binary way. Does the task fall
under Supervised or Unsupervised Learning?
The quickest way to figure it out is by determining the expected output. Are we
trying to predict y values based on new x values (Supervised Learning,
Regression)? Is a new input under category A or category B based on previously
labeled data (Supervised Learning, Classification)? Are we trying to discover
and reveal how data points aggregate and if there are natural clusters
(Unsupervised Learning, Clustering)? Do inputs have an interesting relationship
with one another (do they have a high probability of co-occurrence)?
Many advanced data analysis problems fall under those general questions. After
all, the objective is always to predict something (based on previous examples) or
explore the data (find out if there are patterns).

10. Regression
In the previous chapter we’ve talked about Unsupervised and Supervised
Learning, including a bit about Linear Regression. In this chapter let’s focus on
Regression (predicting an output based on a new input and previous learning).
Basically, Regression Analysis allows us to discover if there’s a relationship
between one or more independent variables and a dependent variable (the
target). For example, in a Simple Linear Regression we want to know if there’s a
relationship between x and y. This is very useful in forecasting (e.g. where is the
trend going) and time series modelling (e.g. temperature levels by year and if
global warming is true).
Simple Linear Regression
Here we’ll be dealing with one independent variable and one dependent. Later
on we’ll be dealing with multiple variables and show how they can be used to
predict the target (similar to what we talked about predicting something based on
several features/attributes).
For now, let’s see an example of a Simple Linear Regression wherein we analyze
Salary Data (Salary_Data.csv). Here’s the dataset (comma-separated values; the
columns are years of experience and salary):
YearsExperience,Salary
1.1,39343.00
1.3,46205.00
1.5,37731.00
2.0,43525.00
2.2,39891.00
2.9,56642.00
3.0,60150.00
3.2,54445.00
3.2,64445.00
3.7,57189.00
3.9,63218.00
4.0,55794.00
4.0,56957.00
4.1,57081.00
4.5,61111.00
4.9,67938.00
5.1,66029.00
5.3,83088.00
5.9,81363.00
6.0,93940.00
6.8,91738.00
7.1,98273.00
7.9,101302.00
8.2,113812.00
8.7,109431.00
9.0,105582.00
9.5,116969.00
9.6,112635.00
10.3,122391.00
10.5,121872.00
Here’s the Python code for fitting Simple Linear Regression to the Training Set:
# Importing the libraries
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3,
                                                    random_state = 0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
# Same fitted line as above, drawn against the test points
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
The overall goal here is to create a model that will predict Salary
based on Years of Experience. First, we create a model using the Training Set
(two-thirds of the dataset). It will then fit a line that is as close as possible to
most of the data points.
After the line is created, we then apply that same line to the Test Set (the
remaining 1/3 of the dataset).

Notice that the line performed well both on the Training Set and the Test Set. As
a result, there’s a good chance that the line or our model will also perform well
on new data.
Let’s have a recap of what happened. First, we imported the necessary libraries
(pandas for processing data, matplotlib for data visualization). Next, we
imported the dataset and assigned X (the independent variable) to Years of
Experience and y (the target) to Salary. We then split the dataset into Training
Set (⅔) and Test Set (⅓).
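The split described in the recap can be pictured as shuffling the rows and cutting the list in two. This is a plain-Python sketch under that assumption; split_train_test is a hypothetical helper, not scikit-learn's train_test_split, and the two will not pick the same rows.

```python
import random


def split_train_test(rows, test_fraction, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(30))  # stand-in for the 30 rows of Salary_Data.csv
train, test = split_train_test(rows, test_fraction=1 / 3)
print(len(train), len(test))
```

Fixing the seed plays the same role as random_state = 0 in the book's code: the same rows land in the same set on every run.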
Then, we applied the Linear Regression model and fitted a line (with the help of
scikit-learn, which is a free software machine learning library for the Python
programming language). This is accomplished through the following lines of
code:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
After learning from the Training Set (X_train and y_train), we then apply that
regressor to the Test Set (X_test) and compare the results using data
visualization (matplotlib).
It’s a straightforward approach. Our model learns from the Training Set and then
applies that to the Test Set (to see if the model is good enough). This is the
essential principle of Simple Linear Regression.
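Besides looking at the plots, "good enough" can be put into numbers by comparing the Test Set targets with the predictions. The sketch below computes the mean squared error and the R² score by hand; the salary figures are made up, and the two functions are plain-Python stand-ins that mirror what the scikit-learn metrics of the same names report.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared gaps between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares); 1 is perfect."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical test-set salaries and model predictions
y_test = [40000.0, 60000.0, 80000.0, 100000.0]
y_pred = [42000.0, 58000.0, 83000.0, 97000.0]
mse = mean_squared_error(y_test, y_pred)
r2 = r_squared(y_test, y_pred)
print(mse, r2)
```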
Multiple Linear Regression
The same approach applies to Multiple Linear Regression. The goal is still to fit
a line (strictly speaking, a plane or hyperplane once there is more than one
feature) that best shows the relationship between the independent variables and
the target. The difference is that in Multiple Linear Regression, we have to deal
with at least 2 features or independent variables.
For example, let’s look at a dataset about 50 startups (50_Startups.csv):
R&D Spend,Administration,Marketing Spend,State,Profit
165349.2,136897.8,471784.1,New York,192261.83
162597.7,151377.59,443898.53,California,191792.06
153441.51,101145.55,407934.54,Florida,191050.39
144372.41,118671.85,383199.62,New York,182901.99
142107.34,91391.77,366168.42,Florida,166187.94
131876.9,99814.71,362861.36,New York,156991.12
134615.46,147198.87,127716.82,California,156122.51
130298.13,145530.06,323876.68,Florida,155752.6
120542.52,148718.95,311613.29,New York,152211.77
123334.88,108679.17,304981.62,California,149759.96
101913.08,110594.11,229160.95,Florida,146121.95
100671.96,91790.61,249744.55,California,144259.4
93863.75,127320.38,249839.44,Florida,141585.52
91992.39,135495.07,252664.93,California,134307.35
119943.24,156547.42,256512.92,Florida,132602.65
114523.61,122616.84,261776.23,New York,129917.04
78013.11,121597.55,264346.06,California,126992.93
94657.16,145077.58,282574.31,New York,125370.37
91749.16,114175.79,294919.57,Florida,124266.9
86419.7,153514.11,0,New York,122776.86
76253.86,113867.3,298664.47,California,118474.03
78389.47,153773.43,299737.29,New York,111313.02
73994.56,122782.75,303319.26,Florida,110352.25
67532.53,105751.03,304768.73,Florida,108733.99
77044.01,99281.34,140574.81,New York,108552.04
64664.71,139553.16,137962.62,California,107404.34
75328.87,144135.98,134050.07,Florida,105733.54
72107.6,127864.55,353183.81,New York,105008.31
66051.52,182645.56,118148.2,Florida,103282.38
65605.48,153032.06,107138.38,New York,101004.64
61994.48,115641.28,91131.24,Florida,99937.59
61136.38,152701.92,88218.23,New York,97483.56
63408.86,129219.61,46085.25,California,97427.84
55493.95,103057.49,214634.81,Florida,96778.92
46426.07,157693.92,210797.67,California,96712.8
46014.02,85047.44,205517.64,New York,96479.51
28663.76,127056.21,201126.82,Florida,90708.19
44069.95,51283.14,197029.42,California,89949.14
20229.59,65947.93,185265.1,New York,81229.06
38558.51,82982.09,174999.3,California,81005.76
28754.33,118546.05,172795.67,California,78239.91
27892.92,84710.77,164470.71,Florida,77798.83
23640.93,96189.63,148001.11,California,71498.49
15505.73,127382.3,35534.17,New York,69758.98
22177.74,154806.14,28334.72,California,65200.33
1000.23,124153.04,1903.93,New York,64926.08
1315.46,115816.21,297114.46,Florida,49490.75
0,135426.92,0,California,42559.73
542.05,51743.15,0,New York,35673.41
0,116983.8,45173.06,California,14681.4
Notice that there are multiple features or independent variables (R&D Spend,
Administration, Marketing Spend, State). Again, the goal here is to reveal or
discover a relationship between the independent variables and the target (Profit).
Also notice that under the column ‘State’, the data is in text (not numbers).
You’ll see New York, California, and Florida instead of numbers. How do you
deal with this kind of data?
One convenient way to do that is by transforming categorical data (New York,
California, Florida) into numerical data. We can accomplish this if we use the
following lines of code:
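from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])  # Note this
# OneHotEncoder(categorical_features = [3]) no longer works: that parameter was
# removed in scikit-learn 0.22. ColumnTransformer selects the column instead and
# likewise places the encoded columns first.
onehotencoder = ColumnTransformer([('state', OneHotEncoder(), [3])],
                                  remainder = 'passthrough', sparse_threshold = 0)
X = onehotencoder.fit_transform(X).astype(float)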
Pay attention to X[:, 3] = labelencoder.fit_transform(X[:, 3]). What we did
there is to transform the data in the fourth column (State). It’s number 3
because Python indexing starts at zero (0). The goal was to transform
categorical variables data into something we can work on. To do this, we’ll
create “dummy variables” which take the values of 0 or 1. In other words, they
indicate the presence or absence of something.
For example, we have the following data with categorical variables:
3.5, New York
2.0, California
6.7, Florida
If we use dummy variables, the above data will be transformed into this:
3.5, 1, 0, 0
2.0, 0, 1, 0
6.7, 0, 0, 1
Notice that the column for State became equivalent to 3 columns:

       New York   California   Florida
3.5    1          0            0
2.0    0          1            0
6.7    0          0            1

As mentioned earlier, dummy variables indicate the presence or absence of
something. They are commonly used as “substitute variables” so we can do a
quantitative analysis on qualitative data. From the new table above we can
quickly see that 3.5 is for New York (1 New York, 0 California, and 0 Florida).
It’s a convenient way of representing categories as numeric values.
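The transformation in the table can be sketched in a few lines of plain Python. The helper name one_hot_encode is hypothetical, and it keeps categories in first-seen order so the output matches the table; scikit-learn's encoders sort the categories instead, so their column order may differ.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = []
    for v in values:
        if v not in categories:
            categories.append(v)  # keep first-seen order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


states = ['New York', 'California', 'Florida']
categories, dummies = one_hot_encode(states)
print(categories)
for state, row in zip(states, dummies):
    print(state, row)
```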
However, there’s this so-called “dummy variable trap” wherein there’s an extra
variable that could have been removed because it can be predicted from the
others. In our example above, notice that when the columns for New York and
California are zero (0), automatically you’ll know it’s Florida. You can already
know which State it is even with just 2 of the variables.
Continuing with our work on 50_Startups.csv, we can avoid the dummy variable
trap by including this in our code:
X = X[:, 1:]
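The slice above drops the first dummy column. A small sketch with plain lists shows why nothing is lost: the dropped column can always be rebuilt from the ones that remain. The column order used here (New York, California, Florida) follows the table earlier in this chapter and is an assumption for illustration; the actual order depends on the encoder.

```python
# One-hot rows in the column order New York, California, Florida
dummies = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

# Drop the first column, as X = X[:, 1:] does for a NumPy array
reduced = [row[1:] for row in dummies]

# The dropped column is 1 exactly when all remaining columns are 0
rebuilt_first = [1 - sum(row) for row in reduced]
print(reduced)
print(rebuilt_first)
```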
Let’s review our work so far:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Let’s look at the data:
dataset.head()

Then, we transform categorical variables into numeric ones (dummy variables):
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# OneHotEncoder(categorical_features = [3]) was removed in scikit-learn 0.22;
# ColumnTransformer selects the column and places the encoded columns first.
onehotencoder = ColumnTransformer([('state', OneHotEncoder(), [3])],
                                  remainder = 'passthrough', sparse_threshold = 0)
X = onehotencoder.fit_transform(X).astype(float)
# Avoiding the Dummy Variable Trap
X = X[:, 1:]
After those data preprocessing steps, the data would look something like this:

Notice that there are no categorical variables (New York, California, Florida)
and we’ve removed the “redundant variable” to avoid the dummy variable trap.

Now we’re all set to divide the dataset into Training Set and Test Set. We can
do this with the following lines of code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state = 0)
That gives an 80% Training Set and a 20% Test Set. The next step is to
create a regressor and “fit the line” (and use that line on the Test Set):
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
y_pred (predicted Profit values on the X_test) will be like this:

However, is that all there is? Are all the variables (R&D Spend, Administration,
Marketing Spend, State) responsible for the target (Profit)? Many data analysts
perform additional steps to create better models and predictors. They might be
doing Backward Elimination (e.g. eliminating variables one by one until there’s
one or two left) so we’ll know which of the variables is making the biggest
contribution to our results (and therefore more accurate predictions).
There are other ways of making the model yield more accurate
predictions. It depends on your objectives (perhaps you want to use all the data
variables) and resources (not just money and computational power, but also time
constraints).
Decision Tree
The Regression method discussed so far is very good if there’s a linear
relationship between the independent variables and the target. But what if there’s
no linearity (but the independent variables can still be used to predict the target)?
This is where other methods such as Decision Tree Regression come in. Note
that it sounds different from Simple Linear Regression and Multiple Linear
Regression. There’s no linearity and it works differently. Decision Tree
Regression works by breaking down the dataset into smaller and smaller subsets.
Here’s an illustration that better explains it:

http://chem-eng.utoronto.ca/~datamining/dmc/decision_tree_reg.htm
Instead of plotting and fitting a line, there are decision nodes and leaf nodes.
Let’s quickly look at an example to see how it works (using
Position_Salaries.csv):
The dataset:
Position,Level,Salary
Business Analyst,1,45000
Junior Consultant,2,50000
Senior Consultant,3,60000
Manager,4,80000
Country Manager,5,110000
Region Manager,6,150000
Partner,7,200000
Senior Partner,8,300000
C-level,9,500000
CEO,10,1000000
# Decision Tree Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
# (disabled in this example; the model is fit on the whole dataset)
"""from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Fitting Decision Tree Regression to the dataset
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)
# Predicting a new result (predict expects a 2D array: one row, one feature)
y_pred = regressor.predict([[6.5]])
# Visualising the Decision Tree Regression results (higher resolution)
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Decision Tree Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
When you run the previous code, you should see the following in the Jupyter Notebook:

Notice that there’s no linear relationship between the Position Level and the
Salary. Instead, it’s somewhat of a step-wise result. We can still see the
relationship between Position Level and Salary, but it’s expressed in different
terms (a seemingly less straightforward approach).
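The step-wise shape comes from how a tree predicts: it splits the data at a threshold and answers with the average salary on each side. The sketch below finds a single best split (a one-level tree, often called a decision stump) on the Level and Salary columns of Position_Salaries.csv. It is a simplified illustration, not scikit-learn's implementation; a full tree repeats the same search inside each side.

```python
def best_split(xs, ys):
    """Find the x threshold that minimises the squared error of two leaf means."""
    best = None
    ordered = sorted(set(xs))
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


# Level and Salary columns from Position_Salaries.csv
levels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
salaries = [45000, 50000, 60000, 80000, 110000, 150000,
            200000, 300000, 500000, 1000000]

threshold, left_mean, right_mean = best_split(levels, salaries)


def predict(level):
    """Return the leaf mean for the side of the split the level falls on."""
    return left_mean if level <= threshold else right_mean


print(threshold, left_mean, right_mean, predict(6.5))
```

With only one split the prediction has a single step; every further split adds another step, which is what gives the plotted curve its staircase look.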
Random Forest
As discussed earlier, Decision Tree Regression can be good to use when there’s
not much linearity between an independent variable and a target. However, this
approach uses the dataset only once to come up with results. In many
cases, it’s better to get different results from different approaches (e.g.
many decision trees) and then average those results.
To solve this, many data scientists use Random Forest Regression. This is simply
a collection or ensemble of different decision trees wherein different random
subsets are used and then the results are averaged. It’s like creating decision trees
again and again and then getting the results of each.
In code, this would look a lot like this:
# Random Forest Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Jupyter Notebook only: show plots inline
%matplotlib inline
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
# (disabled in this example; the model is fit on the whole dataset)
"""from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 0)"""
# Feature Scaling (disabled in this example)
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)"""
# Fitting Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 300, random_state = 0)
regressor.fit(X, y)
# Predicting a new result (predict expects a 2D array: one row, one feature)
y_pred = regressor.predict([[6.5]])
# Visualising the Random Forest Regression results (higher resolution)
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Random Forest Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

Notice that it’s a lot similar to the Decision Tree Regression earlier. After all, Random Forest (from the term itself) is a collection of “trees.” If there’s not much deviation in our dataset, the result should look almost the same. Let’s compare them for easy visualization:

Many data scientists prefer Random Forest because it averages results, which can effectively reduce errors. Looking at the code, it seems straightforward and simple. But behind the scenes there are complex algorithms at play. It’s sort of a black box: there’s an input, there’s the black box, and there’s the result. We don’t have much idea about what happens inside the black box (although we can still find out if we dig through the mathematics). We’ll encounter this again and again as we discuss more about data analysis and machine learning.
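To peek a little inside that black box, here is a minimal sketch (not the book's own code) suggesting that a fitted scikit-learn Random Forest regressor's prediction is the average of its individual trees' predictions. The data is synthetic and made up for illustration, and names such as X_demo and forest are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): a noisy curve.
rng = np.random.RandomState(0)
X_demo = np.linspace(1, 10, 40).reshape(-1, 1)
y_demo = X_demo.ravel() ** 2 + rng.normal(scale=5.0, size=40)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

X_new = np.array([[6.5]])

# Every fitted tree is available in forest.estimators_.
tree_predictions = np.array(
    [tree.predict(X_new)[0] for tree in forest.estimators_]
)

manual_average = tree_predictions.mean()
forest_prediction = forest.predict(X_new)[0]

print("Lowest and highest tree prediction:",
      tree_predictions.min(), tree_predictions.max())
print("Average of the trees:", manual_average)
print("Forest prediction:   ", forest_prediction)
```

The individual trees usually disagree with each other, while their average matches what the forest reports. That averaging is what tends to smooth out the errors of any single tree.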

11. Classification
Spam or not spam? This is one of the most popular uses and examples of Classification. Just like Regression, Classification is also under Supervised Learning. Our model learns from labelled data (“with supervision”). Then, our system applies that learning to a new dataset.
For example, we have a dataset with different email messages and each one was labelled either Spam or Not Spam. Our model might then find patterns or commonalities among email messages that are marked Spam. When performing a prediction, our model might try to find those patterns and commonalities in new email messages.
There are different approaches to doing successful Classification. Let’s discuss a few of them:
Logistic Regression
In many Classification tasks, the goal is to determine whether the outcome is 0 or 1 using two independent variables. For example, given that the Age and Estimated Salary determine an outcome such as whether the person purchased or not, how can we successfully create a model that shows their relationships and use that for prediction?
This sounds confusing, which is why it’s always best to look at an example:

Here our two variables are Age and Estimated Salary. Each data point is then classified either as 0 (didn’t buy) or 1 (bought). There’s a line that separates the two (with color legends for easy visualization). This approach (Logistic Regression) is based on probability (e.g. the probability that a data point is a 0 or a 1).
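To make the probability idea concrete, here is a minimal sketch (not the book's own code) of the logistic, or sigmoid, function that Logistic Regression uses to turn a weighted sum of the inputs into a probability between 0 and 1. The weights, the intercept, and the two sample customers are made-up values for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical (made-up) model parameters for two scaled features:
# Age and Estimated Salary.
weights = np.array([2.0, 1.0])
intercept = -0.5

# Two hypothetical customers, already feature-scaled.
customers = np.array([
    [-1.0, -0.5],   # younger, lower salary
    [ 1.5,  1.0],   # older, higher salary
])

scores = customers @ weights + intercept         # linear part of the model
probabilities = sigmoid(scores)                  # probability of class 1 (bought)
predictions = (probabilities >= 0.5).astype(int) # 0.5 is the usual cut-off

print(probabilities)
print(predictions)
```

The points where the probability is exactly 0.5 are where the linear part equals zero, which is why the boundary between the two classes shows up as a straight line.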
As with Regression in the previous chapter, wherein there’s this so-called black box, what happens behind the scenes of Logistic Regression for Classification can seem complex. The good news is that its implementation is straightforward, especially when we use Python and scikit-learn.
Here’s a peek of the dataset first (‘Social_Network_Ads.csv’):

# Logistic Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
When we run this, we’ll see the following visualizations in our Jupyter Notebook:

It’s a common step to learn first from the Training Set and then apply that learning to the Test Set (and see if the model is good enough in predicting the result for new data points). After all, this is the essence of Supervised Learning. First, there’s training and supervision. Next, the lesson is applied to new situations.
As you can see in the visualization for the Test Set, most of the green dots fall within the green region (with a few red dots, though, because it’s hard to achieve 100% accuracy in logistic regression). This means our model could be good enough for predicting whether a person with a certain Age and Estimated Salary would purchase or not.
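Besides looking at the plot, the hits and misses can be counted. The listing above computes a confusion matrix (cm) for exactly that purpose. Here is a minimal sketch of how to read one, using five made-up labels rather than the book's dataset. The names y_true_demo and y_pred_demo are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for five customers (1 = bought, 0 = didn't buy).
y_true_demo = [0, 0, 1, 1, 1]   # what actually happened
y_pred_demo = [0, 1, 1, 1, 0]   # what a model predicted

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# Rows are the actual classes, columns are the predicted classes:
# [[correct 0s, 0s predicted as 1],
#  [1s predicted as 0, correct 1s]]
print(cm_demo)

# Accuracy is the share of predictions on the diagonal.
accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(accuracy)
```

In this toy case three of the five predictions sit on the diagonal, so the accuracy is 0.6. The same reading applies to the cm variable computed on the real Test Set.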
Also pay extra attention to the following block of code:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
We first transformed the data into the same range or scale to avoid skewing or heavy reliance on a certain variable. In our dataset, the Estimated Salary is expressed in thousands while Age is expressed on a much smaller scale. We have to bring them into the same range so we can get a more reasonable model.
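As a small illustration of what the scaler does, here is a sketch with made-up Age and Estimated Salary rows (not the book's dataset). It assumes the default StandardScaler behavior of subtracting each column's mean and dividing by its standard deviation. The variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows of [Age, Estimated Salary].
X_train_demo = np.array([[25, 20000.0], [35, 60000.0], [45, 100000.0]])
X_test_demo = np.array([[35, 60000.0], [55, 140000.0]])

sc_demo = StandardScaler()
# fit_transform learns each column's mean and standard deviation
# from the training rows, then scales those rows.
X_train_scaled = sc_demo.fit_transform(X_train_demo)
# transform reuses the training mean and standard deviation,
# so the test rows are measured on the same scale.
X_test_scaled = sc_demo.transform(X_test_demo)

print(X_train_scaled.mean(axis=0))  # each column is centered near 0
print(X_train_scaled.std(axis=0))   # each column has spread near 1
print(X_test_scaled)
```

After scaling, Age and Estimated Salary are expressed in comparable units, so neither dominates just because its raw numbers are bigger. Calling transform (not fit_transform) on the Test Set keeps information from the test rows out of the scaling.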
Aside from Logistic Regression, there are other ways of performing Classification tasks. Let’s discuss them next.
K-Nearest Neighbors
Notice that Logistic Regression seems to have a linear boundary between 0s and 1s. As a result, it misses a few of the data points that should have been on the other side.
Thankfully, there are non-linear models that can capture more data points in a more accurate manner. One of them is K-Nearest Neighbors. It works by taking a “new data point” and then counting how many of its neighbors belong to each category. If more neighbors belong to category A than category B, then the new point should belong to category A.
Therefore, the classification of a certain point is based on the majority of its nearest neighbors (hence the name). This can often be accomplished by the following code:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
Again, instead of starting from scratch, we’re importing “prebuilt code” that makes our task faster and easier. What happens behind the scenes could be learned and studied. But for many purposes, the prebuilt ones are good enough to make reasonably useful models.
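For the curious, the majority-vote idea can be sketched from scratch in a few lines. This is a simplified illustration under stated assumptions, not how scikit-learn implements it: the function name knn_predict and the eight sample points are made up, and plain Euclidean distance is used (which is what metric = 'minkowski' with p = 2 amounts to).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=5):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the new point to every training point.
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count how many of those neighbors belong to each category.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up points: category 0 near the origin, category 1 near (5, 5).
X_demo = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                   [5, 5], [6, 5], [5, 6], [6, 6]], dtype=float)
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

near_origin = knn_predict(X_demo, y_demo, np.array([0.5, 0.5]), k=3)
near_five = knn_predict(X_demo, y_demo, np.array([5.5, 5.5]), k=3)
print(near_origin, near_five)
```

Because each new point is judged only by the points around it, the resulting boundary can bend to follow the data instead of being a single straight line.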
Let’s look at an example of how to implement this, again using the dataset ‘Social_Network_Ads.csv’:
# K-Nearest Neighbors (K-NN)
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
When we run this in Jupyter Notebook, we should see the following visualizations:

Notice that the boundary is non-linear. This is because of the different approach taken by K-Nearest Neighbors (K-NN). Also notice that there are still misses (e.g. a few red dots are still in the green region). Capturing them all may require a bigger dataset or another method (or perhaps there’s no way to capture all of them because our data and model will never be perfect).
Decision Tree Classification
As with Regression, many data scientists also implement Decision Trees in Classification. As mentioned in the previous chapter, creating a decision tree is about breaking down a dataset into smaller and smaller subsets while branching them out (creating an associated decision tree).
Here’s a simple example so you can understand it better:

Notice that branches and leaves result from breaking down the dataset into smaller subsets. In Classification, we can similarly apply this through the following code (again using the Social_Network_Ads.csv):
# Decision Tree Classification
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The most important difference is in this block of code:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
When we run the whole code (including the data visualization), we’ll see this:

Notice the huge difference compared to Logistic Regression and K-Nearest Neighbors (K-NN). In those two, there are just two regions separated by a single boundary. But here in our Decision Tree Classification, there are points outside the main red region that fall inside “mini red regions.”
As a result, our model was able to capture data points that might be impossible to capture otherwise (e.g. when using Logistic Regression).
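The criterion = 'entropy' argument tells the tree to judge candidate splits by how much they reduce entropy, a measure of how mixed the labels in a subset are. The sketch below is a hand-rolled illustration of that idea, not scikit-learn's internal code. The helper names entropy and information_gain, and the eight sample labels, are assumptions made up for this example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """How much a split into left/right reduces the parent's entropy."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels: four customers who didn't buy (0), four who bought (1).
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Split A separates the two classes perfectly.
gain_a = information_gain(parent, parent[:4], parent[4:])
# Split B leaves both sides as mixed as the parent.
gain_b = information_gain(parent, np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))

print(gain_a, gain_b)
```

A tree built this way prefers splits like A over splits like B. Repeating that choice on smaller and smaller subsets is what produces the small rectangular regions seen in the plot.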
Random Forest Classification
Recall from the previous chapter about Regression that a Random Forest is a collection or ensemble of many decision trees. This also applies to Classification wherein many decision trees are used and the results are averaged.
# Random Forest Classification
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
When we run the code, we’ll see the following:

Notice the similarities between the Decision Tree and Random Forest. After all, they take a similar approach of breaking down a dataset into smaller subsets. The difference is that Random Forest uses randomness and averages different decision trees to come up with a more accurate model.
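For Classification, the “averaging” can be made visible with a short sketch. It assumes scikit-learn's RandomForestClassifier, which (to the best of our knowledge) averages the class probabilities reported by its trees and then picks the most probable class. The data is synthetic, and names such as X_demo and forest are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature data (an assumption for illustration).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=10, criterion='entropy',
                                random_state=0)
forest.fit(X_demo, y_demo)

X_new = np.array([[0.1, -0.2], [1.5, 1.0]])

# Class probabilities from each of the 10 trees, then their average.
tree_probas = np.array([tree.predict_proba(X_new) for tree in forest.estimators_])
averaged = tree_probas.mean(axis=0)

forest_probas = forest.predict_proba(X_new)
forest_labels = forest.predict(X_new)

print(averaged)
print(forest_probas)
print(forest_labels)
```

Points near the boundary tend to get split opinions from the trees, while points deep inside a region get near-unanimous ones. Averaging those opinions is what smooths out the quirks of any single tree.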

12. Clustering
In the previous chapters, we’ve discussed Supervised Learning (Regression & Classification). We’ve learned about learning from “labelled” data. There were already correct answers and our job back then was to learn how to arrive at those answers and apply the learning to new data.
But in this chapter it will be different. That’s because we’ll be starting with Unsupervised Learning wherein there are no correct answers or labels given. In other words, there’s only input data but there’s no output. There’s no supervision when learning from data.
In fact, Unsupervised Learning is said to embody the essence of Artificial Intelligence. That’s because there’s not much human supervision or intervention. As a result, the algorithms are left on their own to discover things from data. This is especially the case in Clustering wherein the goal is to reveal organic aggregates or “clusters” in data.
Goals & Uses of Clustering
This is a form of Unsupervised Learning where there are no labels and, in many cases, no truly correct answers. That’s because there were no correct answers in the first place. We just have a dataset and our goal is to see the groupings that have organically formed.
We’re not trying to predict an outcome here. The goal is to look for structures in the data. In other words, we’re “dividing” the dataset into groups wherein members have some similarities or proximities. For example, each ecommerce customer might belong to a particular group (e.g. given their income and spending level). If we have gathered enough data points, it’s likely there are aggregates.
At first the data points will seem scattered (no pattern at all). But once we apply a Clustering algorithm, the data will start to make sense because we’ll be able to easily visualize the groups or clusters. Aside from discovering the natural groupings, Clustering algorithms may also reveal outliers for Anomaly Detection (we’ll also discuss this later).
Clustering is being applied regularly in the fields of marketing, biology, earthquake studies, manufacturing, sensor outputs, product categorization, and other scientific and business areas. However, there are no rules set in stone when it comes to determining the number of clusters and which data point should belong to a certain cluster. It’s up to our objective (or whether the results are useful enough). This is also where our expertise in a particular domain comes in.
As with other data analysis and machine learning algorithms and tools, it’s still about our domain knowledge. This way we can look at and analyze the data in the proper context. Even with the most advanced tools and techniques, the context and objective are still crucial in making sense of data.
K-Means Clustering
One way to make sense of data through Clustering is K-Means. It’s one of the most popular Clustering algorithms because of its simplicity. It works by partitioning objects into k clusters (the number of clusters we specify) based on feature similarity.
Notice that the number of clusters is arbitrary. We can set it to any number we like. However, it’s good to make the number of clusters just enough to make our work meaningful and useful. Let’s discuss an example to illustrate this.
Here we have data about Mall Customers (‘Mall_Customers.csv’) where info about their Gender, Age, Annual Income, and Spending Score is indicated. The higher the Spending Score (out of 100), the more they spend at the Mall.
To start, we import the necessary libraries:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
Then we import the data and take a peek:
dataset = pd.read_csv('Mall_Customers.csv')
dataset.head(10)

In this example we’re more interested in grouping the Customers according to their Annual Income and Spending Score.
X = dataset.iloc[:, [3, 4]].values
Our goal here is to reveal the clusters and help the marketing department formulate their strategies. For instance, we might subdivide the Customers into 5 distinct groups:
1. Medium Annual Income, Medium Spending Score
2. High Annual Income, Low Spending Score
3. Low Annual Income, Low Spending Score
4. Low Annual Income, High Spending Score
5. High Annual Income, High Spending Score
It’s worthwhile to pay attention to the #2 Group (High Annual Income, Low Spending Score). If there’s a sizable number of customers that fall under this group, it could mean a huge opportunity for the mall. These customers have high Annual Income and yet they’re spending or using most of their money elsewhere (not in the Mall). If we could know that they’re in sufficient numbers, the marketing department could formulate specific strategies to entice Cluster #2 to buy more from the Mall.
Although the number of clusters is often arbitrary, there are ways to find an optimal number. One such way is through the Elbow Method and WCSS (within-cluster sum of squares). Here’s the code to accomplish this:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Notice that the “elbow” is at 5 (number of clusters). Coincidentally, this number was also the “desired” number of groups that will subdivide the dataset according to their Annual Income and Spending Score.
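To see what WCSS actually measures, the following sketch computes it by hand and compares it with the inertia_ attribute that scikit-learn's KMeans reports. The three blobs of points are synthetic, made up for illustration, and n_init = 10 is set explicitly here as an assumption so the result does not depend on the library version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three made-up blobs of points (not the Mall Customers data).
rng = np.random.RandomState(42)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(30, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(30, 2)),
])

kmeans_demo = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans_demo.fit_predict(X_demo)

# For every point, look up the center of the cluster it was assigned to.
assigned_centers = kmeans_demo.cluster_centers_[labels]
# WCSS: squared distance from each point to its own center, summed up.
manual_wcss = ((X_demo - assigned_centers) ** 2).sum()

print(manual_wcss)
print(kmeans_demo.inertia_)
```

Adding more clusters always pushes this number down, because every point ends up closer to some center. The “elbow” is where the drop stops being worth the extra cluster.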
After determining the optimal number of clusters, we can then proceed with applying K-Means to the dataset and then performing data visualization:
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

There we have it. We have 5 clusters and Cluster #2 (blue points, High Annual Income and
Low Spending Score) is significant enough. It might be worthwhile for the marketing
department to focus on that group.

Also notice the Centroids (the yellow points). This is a part of how K-Means clustering
works. It's an iterative approach where starting points (the initial centroids) are placed
and then adjusted repeatedly until they converge to a minimum (e.g. the sum of squared
distances between the points and their centroids is minimized).
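
The iterative procedure described above can be sketched in plain Python. This is only a minimal illustration and not scikit-learn's implementation: the function name simple_kmeans, the toy points, and the choice of starting centroids (data points picked at random rather than k-means++) are assumptions made for this example.

```python
import random

def simple_kmeans(points, k, n_iter=100, seed=42):
    """Toy K-Means: assign points to the nearest centroid, then move centroids."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical (annual income, spending score) pairs forming two obvious groups.
toy_points = [(15, 80), (16, 82), (17, 78), (85, 15), (88, 12), (90, 18)]
centroids, labels = simple_kmeans(toy_points, k=2)
print(centroids)
print(labels)
```

The loop stops as soon as an update step leaves every centroid where it was, which is the "convergence" mentioned in the text.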
As mentioned earlier, it can all be arbitrary and it may depend heavily on our judgment
and possible application. We can set n_clusters to anything other than 5. We only used
the Elbow Method so we can have a more sound and consistent basis for the number of
clusters. But it's still up to our judgment what we should use and whether the results are
good enough for our application.

Anomaly Detection

Aside from revealing the natural clusters, it's also common to check whether there are
obvious points that don't belong to those clusters. This is the heart of detecting
anomalies or outliers in data.

This is a crucial task because any large deviation from the normal can cause a
catastrophe. Is a credit card transaction fraudulent? Is a login activity suspicious (you
might be logging in from a totally different location or device)? Are the temperature and
pressure levels in a tank being maintained consistently (any outlier might cause
explosions and operational halt)? Is a certain data point caused by wrong entry or
measurement (e.g. perhaps inches were used instead of centimeters)?

With straightforward data visualization we can immediately see the outliers. We can then
evaluate if these outliers present a major threat. We can also see and assess those
outliers by referring to the mean and standard deviation. If a data point deviates from
the mean by several standard deviations (three is a common rule of thumb), it could be an
anomaly.
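
The mean-and-standard-deviation check can be sketched as follows. This is a minimal example under stated assumptions: the function name find_outliers, the threshold of 3 standard deviations, and the temperature readings are all made up for illustration, and the right threshold depends on the application.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical tank temperature readings with one suspicious spike.
readings = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 70.1, 69.7, 70.0, 70.2,
            69.9, 70.1, 70.0, 69.8, 70.3, 95.0]
print(find_outliers(readings))
```

Note that a single extreme value also inflates the mean and the standard deviation themselves, so with very few readings even a large spike may not cross the threshold.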
This is also where our domain expertise comes in. If there's an anomaly, how serious are
the consequences? For instance, there might be thousands of purchase transactions
happening in an online store every day. If we're too tight with our anomaly detection,
many of those transactions will be rejected (which results in a loss of sales and
profits). On the other hand, if we're allowing too much freedom in our anomaly detection,
our system would approve more transactions. However, this might lead to complaints later
and possibly loss of customers in the long term.

Notice here that it's not all about algorithms, especially when we're dealing with
business cases. Each field might require a different sensitivity level. There's always a
tradeoff and either of the options could be costly. It's a matter of testing and knowing
if our system of detecting anomalies is sufficient for our application.

13. Association Rule Learning

This is a continuation of Unsupervised Learning. In the previous chapter we've discovered
natural patterns and aggregates in Mall_Customers.csv. There was not much supervision and
guidance on how the "correct answers" should look like. We've allowed the algorithms to
discover and study the data. As a result, we're able to gain insights from the data that
we can use.

In this chapter we'll focus on Association Rule Learning. The goal here is to discover how
items are "related" or associated with one another. This can be very useful in determining
which products should be placed together in grocery stores. For instance, many customers
might always be buying bread and milk together. We can then rearrange some shelves and
products so the bread and milk will be near to each other.

This can also be a good way to recommend related products to customers. For example, many
customers might be buying diapers online and then purchasing books about parenting later.
These two products have strong associations because they mark the customer's life
transition (having a baby). Also, if we notice a demand surge in diapers, we might also
get ready with parenting books. This is a good way to forecast and prepare for future
demands by buying supplies in advance.

In grocery shopping or any business involved in retail and wholesale transactions,
Association Rule Learning can be very useful in optimization (encouraging customers to buy
more products) and matching supply with demand (e.g. sales improvement in one product also
signals the same thing to another related product).
Explanation

So how do we determine the "level of relatedness" of items to one another and create
useful groups out of it? One straightforward approach is by counting the transactions that
involve a particular set. For example, we have the following transactions:

    Transaction   Purchases
    1             Egg, ham, hotdog
    2             Egg, ham, milk
    3             Egg, apple, onion
    4             Beer, milk, juice

Our target set is {Egg, ham}. Notice that this combination of purchases occurred in 2
transactions (Transactions 1 and 2). In other words, this combination happened 50% of the
time (2 out of 4 transactions). It's a simple example, but if we're studying 10,000
transactions and 50% is still the case, of course there's a strong association between
egg and ham.
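
The counting approach above can be written as a short function. This is a minimal sketch using the four transactions from the table; the function name support is an assumption chosen for this example (it matches the term "support" used later in this chapter).

```python
# The four transactions from the table, as sets of lowercase item names.
transactions = [
    {"egg", "ham", "hotdog"},
    {"egg", "ham", "milk"},
    {"egg", "apple", "onion"},
    {"beer", "milk", "juice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)  # <= is the subset test
    return hits / len(transactions)

print(support({"egg", "ham"}, transactions))  # 2 of 4 transactions -> 0.5
```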
We might then realize that it's worthwhile to put eggs and ham together (or offer them in
a bundle) to make our customers' lives easier (while we also make more sales). The higher
the percentage of our target set in the total transactions, the better.

Or, if the percentage still falls under our arbitrary threshold (e.g. 30%, 20%), we could
still pay attention to a particular set and make adjustments to our products and offers.

Aside from calculating the actual percentage, another way to know how "popular" an itemset
is is by working on probabilities. For example, how likely is product X to appear with
product Y? If there's a high probability, we can say that the two products are closely
related.
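
The probability view can be made concrete with two standard measures from association rule learning, confidence and lift. The sketch below is an illustration only: the function names and the reuse of the small egg/ham table are assumptions for this example, not the book's own code.

```python
transactions = [
    {"egg", "ham", "hotdog"},
    {"egg", "ham", "milk"},
    {"egg", "apple", "onion"},
    {"beer", "milk", "juice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Estimated probability that y is in the basket, given that x is."""
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    """How much more often x and y co-occur than if they were independent."""
    return confidence(x, y, transactions) / support(y, transactions)

# How likely is ham to appear in a basket that already contains egg?
print(confidence({"egg"}, {"ham"}, transactions))
print(lift({"egg"}, {"ham"}, transactions))
```

A lift above 1 suggests the two items appear together more often than chance alone would predict, while a lift near 1 suggests little association.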
Those are ways of estimating the "relatedness" or level of association between two
products. One or a combination of approaches might be already enough for certain
applications. Perhaps working on probabilities yields better results. Or, prioritising a
very popular itemset (high percentage of occurrence) results in more transactions.

In the end, it might be about testing different approaches (and combinations of products)
and then seeing which one yields the optimal results. It might even be the case that a
combination of two products with very low relatedness allows for more purchases to happen.

Apriori

Whichever is the case, let's explore how it all applies to the real world. Let's call the
problem "Market Basket Optimization." Our goal here is to generate a list of sets (product
sets) and their corresponding level of relatedness or support to one another. Here's a
peek of the dataset to give you a better idea:

    shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
    burgers,meatballs,eggs
    chutney
    turkey,avocado
    mineral water,milk,energy bar,whole wheat rice,green tea
    low fat yogurt
    whole wheat pasta,french fries
    soup,light cream,shallot
    frozen vegetables,spaghetti,green tea
    french fries

Those are listed according to the transactions where they appear. For example, in the
first transaction the customer bought different things (from shrimp to olive oil). In the
second transaction the customer bought burgers, meatballs, and eggs.

As before, let's import the necessary library/libraries so that we can work on the data:

    import pandas as pd
    dataset = pd.read_csv('Market_Basket_Optimisation.csv', header=None)

Next, we add the items in a list so that we can work on them much easier. We can
accomplish this by initializing an empty list and then running a for loop (still remember
how to do all these?):

    # 7501 transactions with up to 20 items each; empty cells become the string 'nan'
    transactions = []
    for i in range(0, 7501):
        transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

After we've done that, we should then generate a list of "related products" with their
corresponding level of support or relatedness. One way to accomplish this is by the
implementation of the Apriori algorithm (for association rule learning). Thankfully, we
don't have to write anything from scratch.

We can use Apyori, which is a simple implementation of the Apriori algorithm. You can find
it here for your reference: https://pypi.org/project/apyori/#description

It's prebuilt for us and almost ready for our own usage. It's similar to how we use
scikit-learn, pandas, and numpy. Instead of starting from scratch, we already have blocks
of code we can simply implement. Take note that coding everything from scratch is time
consuming and technically challenging.

To implement Apyori, we can import it similarly to how we import other libraries:

    from apyori import apriori

Next, we set up the rules (the levels of minimum relatedness) so we can generate a useful
list of related items. That's because almost any two items might have some level of
relatedness. The objective here is to include only the list that could be useful for us.

    rules = apriori(transactions, min_support=0.003, min_confidence=0.2,
                    min_lift=3, min_length=2)

Well, that's the implementation of Apriori using Apyori. The next step is to generate and
view the results. We can accomplish this using the following block of code:

    results = list(rules)
    results_list = []
    for i in range(0, len(results)):
        results_list.append('RULE:\t' + str(results[i][0]) + '\nSUPPORT:\t' +
                            str(results[i][1]))
    print(results_list)

When you run all the code in Jupyter Notebook, you'll see something like this:

It's messy and almost incomprehensible. But if you run it in Spyder (another useful data
science package included in the Anaconda installation), the result will look a bit neater:

Notice that there are different itemsets with their corresponding "Support." The higher
the Support, the more often the items appear together, which we can loosely read as higher
relatedness. For instance, light cream and chicken often go together because people might
be using the two to cook something. Another example is in the itemset with an index of 5
(tomato sauce and ground beef). These two items might always go together in the grocery
bag because they're also used to prepare a meal or a recipe.

This is only an introduction to Association Rule Learning. The goal here was to explore
the potential applications of it to real-world scenarios such as market basket
optimization. There are other more sophisticated ways to do this. But in general, it's
about determining the level of relatedness among the items and then evaluating whether
it's useful or good enough.

14. Reinforcement Learning

Notice that in the previous chapters, the focus is on working on past information and then
deriving insights from it. In other words, we're more focused on the past than on the
present and future.

But for data science and machine learning to become truly useful, the algorithms and
systems should work on real-time situations. For instance, we require systems that learn
in real time and adjust accordingly to maximize the rewards.

What is Reinforcement Learning?

This is where Reinforcement Learning (RL) comes in. In a nutshell, RL is about reinforcing
the correct or desired behaviors as time passes. A reward for every correct behavior and a
punishment otherwise.

Recently RL was implemented to beat world champions at the game of Go and successfully
play various Atari video games (although Reinforcement Learning there was more
sophisticated and incorporated deep learning). As the system learned from reinforcement,
it was able to achieve a goal or maximize the reward.

One simple example is in the optimization of click-through rates (CTR) of online ads.
Perhaps you have 10 ads that essentially say the same thing (maybe the words and designs
are slightly different from one another). At first you want to know which ad performs best
and yields the highest CTR. After all, more clicks could mean more prospects and customers
for your business.

But if you want to maximize the CTR, why not perform the adjustments as the ads are being
run? In other words, don't wait for your entire ad budget to run out before knowing which
one performed best. Instead, find out which ads are performing best while they're being
run. Make adjustments early on so later only the highest-performing ads will be shown to
the prospects.

It's very similar to a famous problem in probability theory, the multi-armed bandit
problem. Let's say you have a limited resource (e.g. advertising budget) and some choices
(10 ad variants). How will you allocate your resource among those choices so you can
maximize your gain (e.g. optimal CTR)?

First, you have to "explore" and try the ads one by one. Of course, if you're seeing that
Ad 1 performs unusually well, you'll "exploit" it and run it for the rest of the campaign.
You don't need to waste your money on underperforming ads. Stick to the winner and
continuously exploit its performance.

There's one catch though. Early on Ad 1 might be performing well so we're tempted to use
it again and again. But what if Ad 2 catches up and, if we let things unfold, Ad 2 will
produce higher gains? We'll never know because the performance of Ad 1 was already
exploited.
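
One simple way to guard against that catch is to keep exploring a small fraction of the time. The sketch below uses the epsilon-greedy heuristic, which is not covered in this chapter and is introduced here only as an illustration. The function name, the simulated click-through rates, and the 10% exploration rate are all assumptions.

```python
import random

def epsilon_greedy(true_ctrs, n_rounds=10000, epsilon=0.1, seed=0):
    """Simulate showing ads: mostly exploit the best ad so far, sometimes explore."""
    rng = random.Random(seed)
    n_ads = len(true_ctrs)
    shows = [0] * n_ads   # how many times each ad was shown
    clicks = [0] * n_ads  # how many clicks each ad received
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            ad = rng.randrange(n_ads)  # explore: pick any ad at random
        else:
            # Exploit: pick the ad with the best observed CTR so far.
            ad = max(range(n_ads),
                     key=lambda a: clicks[a] / shows[a] if shows[a] else 0.0)
        shows[ad] += 1
        if rng.random() < true_ctrs[ad]:  # simulated user click
            clicks[ad] += 1
    return shows, clicks

# Hypothetical true click-through rates, unknown to the algorithm.
shows, clicks = epsilon_greedy([0.05, 0.12, 0.08])
print(shows)
print(clicks)
```

Because a few rounds are always spent on the other ads, a late bloomer such as Ad 2 still gets a chance to prove itself, at the cost of some clicks lost to exploration.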
There will always be tradeoffs in many data analysis and machine learning projects. That's
why it's always recommended to set performance targets beforehand instead of wondering
about the what-ifs later. Even in the most sophisticated techniques and algorithms,
tradeoffs and constraints are always there.

Comparison with Supervised & Unsupervised Learning

Notice that the definition of Reinforcement Learning doesn't exactly fit under either
Supervised or Unsupervised Learning. Remember that Supervised Learning is about learning
through supervision and training. On the other hand, Unsupervised Learning is about
revealing or discovering insights from unlabeled data (no supervision, no labels).

The key differences in RL are maximizing the set reward, learning from user interaction,
and the ability to update itself in real time. Remember that RL is first about exploring
and exploiting. In contrast, both Supervised and Unsupervised Learning can be more about
passively learning from historical data (not real time).

There's a fine boundary among the 3 because all of them are still concerned with
optimization in one way or another. Whichever is the case, all 3 have useful applications
in both scientific and business settings.

Applying Reinforcement Learning

RL is particularly useful in many business scenarios such as optimizing click-through
rates. How can we maximize the number of clicks for a headline? Take note that news
stories often have limited lifespans in terms of their relevance and popularity. Given
that limited resource (time), how can we immediately show the best performing headline?

This is also the case in maximising the CTR of online ads. We have a limited ad budget and
we want to get the most out of it. Let's explore an example (using the data from
Ads_CTR_Optimisation.csv) to better illustrate the idea.

As usual we first import the necessary libraries so that we can work on our data (and also
for data visualization):

    import matplotlib.pyplot as plt
    import pandas as pd
    %matplotlib inline
    # so plots can show in our Jupyter Notebook

We then import the dataset and take a peek:

    dataset = pd.read_csv('Ads_CTR_Optimisation.csv')
    dataset.head(10)

In each round, the ads are displayed and it's indicated which one/ones were clicked (0 if
not clicked, 1 if clicked). As discussed earlier, the goal is to explore first, pick the
winner and then exploit it.

One popular way to achieve this is by Thompson Sampling. Simply put, it addresses the
exploration-exploitation dilemma (trying to achieve a balance) by sampling or trying the
promising actions while ignoring or discarding actions that are likely to underperform.
The algorithm works on probabilities and this can be expressed in code through the
following:

    import random
    N = 10000  # number of rounds
    d = 10     # number of ads
    ads_selected = []
    numbers_of_rewards_1 = [0] * d  # times each ad was shown and clicked
    numbers_of_rewards_0 = [0] * d  # times each ad was shown and not clicked
    total_reward = 0
    for n in range(0, N):
        ad = 0
        max_random = 0
        # Draw one sample per ad from its Beta distribution and keep the largest
        for i in range(0, d):
            random_beta = random.betavariate(numbers_of_rewards_1[i] + 1,
                                             numbers_of_rewards_0[i] + 1)
            if random_beta > max_random:
                max_random = random_beta
                ad = i
        ads_selected.append(ad)
        reward = dataset.values[n, ad]
        if reward == 1:
            numbers_of_rewards_1[ad] = numbers_of_rewards_1[ad] + 1
        else:
            numbers_of_rewards_0[ad] = numbers_of_rewards_0[ad] + 1
        total_reward = total_reward + reward

When we run the code and visualize:

    plt.hist(ads_selected)
    plt.title('Histogram of ads selections')
    plt.xlabel('Ads')
    plt.ylabel('Number of times each ad was selected')
    plt.show()

Notice that the implementation of Thompson sampling can be very complex. It's an
interesting algorithm which is widely popular in online ad optimization, news article
recommendation, product assortment and other business applications.

There are other interesting algorithms and heuristics such as Upper Confidence Bound. The
goal is to earn while learning. Instead of later analysis, our algorithm can perform and
adjust in real time. We're hoping to maximize the reward by trying to balance the tradeoff
between exploration and exploitation (maximize immediate performance or "learn more" to
improve future performance). It's an interesting topic in itself and if you want to dig
deeper, you can read the following Thompson Sampling tutorial from Stanford:
https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
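
For comparison, here is a minimal sketch of the Upper Confidence Bound idea mentioned above, in its common UCB1 form. It runs on simulated clicks rather than on Ads_CTR_Optimisation.csv. The function name, the simulated click-through rates, and the number of rounds are assumptions made for this example.

```python
import math
import random

def ucb1(true_ctrs, n_rounds=5000, seed=0):
    """Pick, each round, the ad with the highest optimistic CTR estimate."""
    rng = random.Random(seed)
    n_ads = len(true_ctrs)
    shows = [0] * n_ads    # times each ad was shown
    rewards = [0] * n_ads  # clicks each ad received
    for t in range(1, n_rounds + 1):
        if t <= n_ads:
            ad = t - 1  # show every ad once before comparing them
        else:
            # Observed CTR plus a bonus that shrinks as an ad is shown more often.
            ad = max(
                range(n_ads),
                key=lambda a: rewards[a] / shows[a]
                + math.sqrt(2 * math.log(t) / shows[a]),
            )
        shows[ad] += 1
        if rng.random() < true_ctrs[ad]:  # simulated user click
            rewards[ad] += 1
    return shows, rewards

# Hypothetical true click-through rates, unknown to the algorithm.
shows, rewards = ucb1([0.05, 0.12, 0.08])
print(shows)
print(rewards)
```

The bonus term is what drives exploration here: an ad that has rarely been shown gets a large bonus, so it is tried again until its estimate becomes reliable.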

15. Artificial Neural Networks

For us humans it's very easy to recognize objects and digits. It's also effortless for us
to know the meaning of a sentence or piece of text. However, it's an entirely different
case with computers. What's automatic and trivial for us could be an enormous task for
computers and algorithms.

In contrast, computers can perform long and complex mathematical calculations while we
humans are terrible at it. It's interesting that the capabilities of humans and computers
are opposites or complementary.

But the natural next step is to imitate or even surpass human capabilities. It's like the
goal is to replace humans at what they do best. In the near future we might not be able to
tell whether the one we're talking to is human or not.

An Idea of How the Brain Works

To accomplish this, one of the most popular and promising ways is through the use of
artificial neural networks. These are loosely inspired by how our neurons and brains work.
The prevailing model of how our brains work is that neurons receive, process, and send
signals (they may connect with other neurons, receive input from senses, or give an
output). Although it's not a 100% accurate understanding of the brain and neurons, this
model is useful enough for many applications.
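
The "receive, process, send" model can be sketched as a single artificial neuron: it receives inputs, processes them as a weighted sum, and sends out one signal. This is a simplified illustration. The function names, weights, bias, and the choice of the sigmoid function are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Receive inputs, process them (weighted sum plus bias), send one output."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical inputs and weights, chosen only for illustration.
output = neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.8], bias=-0.2)
print(output)
```

A neural network connects many such units, feeding the outputs of one layer of neurons into the next as inputs.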
This is the case in artificial neural networks, wherein there are neurons (usually placed
in one or a few layers) receiving and sending signals. Here's a basic illustration from
TensorFlow Playground:

Notice that it started with the features (the inputs) and then they're connected with 2
"hidden layers" of neurons. Finally there's an output wherein the data was already
processed iteratively to create a useful model or generalization.

In many cases, how artificial neural networks (ANNs) are used is very similar to how
Supervised Learning works. In ANNs, we often take a large number of training examples and
then develop a system which allows for learning from those examples. During learning, our
ANN automatically infers rules for recognizing an image, text, audio or any other kind of
data.

As you might have already realized, the accuracy of recognition heavily depends on the
quality and quantity of our data. After all, it's Garbage In Garbage Out. Artificial
neural networks learn from what is fed into them. We might still improve the accuracy and
performance through means other than improving the quality and quantity of data (such as
feature selection, changing the learning rate, and regularization).

Potential & Constraints

The idea behind artificial neural networks is actually old. But recently it has undergone
such a massive reemergence that many people (whether they understand it or not) talk about
it.

Why did it become popular again? It's because of data availability and technological
developments (especially the massive increase in computational power). Back then, creating
and implementing an ANN might have been impractical in terms of time and other resources.

But it all changed because of more data and increased computational power. It's very
likely that you can implement an artificial neural network right on your desktop or laptop
computer. And also, behind the scenes ANNs are already working to give you the most
relevant search results, the products you'll most likely purchase, or the ads you'll most
probably click. ANNs are also being used to recognize the content of audio, image, and
video.

Many experts say that we're only scratching the surface and artificial neural networks
still have a lot of potential. It's like when an experiment about electricity (done by
Michael Faraday) was performed and no one had any idea what use would come from it. As the
story goes, Faraday replied that the UK Prime Minister would soon be able to tax it.
Today, almost every aspect of our lives directly or indirectly depends on electricity.

This might also be the case with artificial neural networks and the exciting field of Deep
Learning (a subfield of machine learning that is more focused on ANNs).

Here's an Example

With TensorFlow Playground we can get a quick idea of how it all works. Go to their
website (https://playground.tensorflow.org/) and take note of the different words there
such as Learning Rate, Activation, Regularization, Features, and Hidden Layers. At the
beginning it will look like this (you didn't click anything yet):

Click the "Play" button (upper left corner) and see the cool animation (pay close
attention to the Output at the far right). After some time, it will look like this:

The connections became clearer among the Features, Hidden Layers, and Output. Also notice
that the Output has a clear Blue region (while the rest falls in Orange). This could be a
Classification task wherein blue dots belong to Class A while the orange ones belong to
Class B.

As the ANN runs, notice that the division between Class A and Class B becomes clearer.
That's because the system is continuously learning from the training examples. As the
learning becomes more solid (or as the rules are getting inferred more accurately), the
classification also becomes more accurate.

Exploring the TensorFlow Playground is a quick way to get an idea of how neural networks
operate. It's a quick visualization (although not a 100% accurate representation) so we
can see the Features, Hidden Layers, and Output. We can even do some tweaking like
changing the Learning Rate, the ratio of training to test data, and the number of Hidden
Layers.

For instance, we can set the number of hidden layers to 3 and change the Learning Rate to
1 (instead of 0.03 earlier). We should see something like this:

When we click the Play button and let it run for a while, the image will remain like this:

Pay attention to the Output. Notice that the Classification seems worse. Instead of
enclosing most of the orange points under the Orange region, there are a lot of misses
(many orange points fall under the Blue region instead). This occurred because of the
change in parameters we've made.

For instance, the Learning Rate has a huge effect on accuracy and achieving just the right
convergence. If we make the Learning Rate too low, convergence might take a lot of time.
And if the Learning Rate is too high (as with our example earlier), we might not reach
convergence at all because we overshot it and missed.
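
The effect of the Learning Rate can be seen on a much simpler problem than a neural network. The sketch below runs gradient descent to minimize f(w) = w**2, whose minimum is at w = 0. The function name, the starting point, and the three learning rates are assumptions chosen only to show the slow, well-behaved, and overshooting cases.

```python
def gradient_descent(learning_rate, steps=50, start=5.0):
    """Minimize f(w) = w**2 by repeatedly stepping against the gradient."""
    w = start
    for _ in range(steps):
        grad = 2 * w               # derivative of w**2
        w -= learning_rate * grad  # step size is set by the learning rate
    return w

for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With the smallest rate, w has barely moved from its starting point after 50 steps. With the middle rate it ends up very close to 0. With the largest rate every step overshoots the minimum by more than the previous one, so w grows instead of converging.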
There are several ways to achieve convergence within reasonable time (e.g.
a Learning Rate that is just right, more hidden layers, fewer or more Features
to include, applying Regularization). But “overly optimizing” for everything
might not make economic sense. It’s good to set a clear objective at the start and
stick to it. If other interesting or promising opportunities pop up,
you might want to further tune the parameters and improve the model’s
performance.
Anyway, if you want to get an idea of how an ANN might look in Python,
here’s a sample code:
import numpy as np

# Inputs (the third column acts as a bias term) and XOR-style targets
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([[0,1,1,0]]).T
# Weights initialized randomly in [-1, 1)
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1
for j in range(60000):  # xrange in Python 2
    # Forward pass through two sigmoid layers
    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))
    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))
    # Backpropagate the error
    l2_delta = (y - l2)*(l2*(1-l2))
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1))
    # Update the weights
    syn1 += l1.T.dot(l2_delta)
    syn0 += X.T.dot(l1_delta)
From https://iamtrask.github.io/2015/07/12/basic-python-network/
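The sample trains the weights but never shows what the network predicts. The sketch below is one possible way to wrap the same update rule in functions so the trained weights can be reused. The function names (sigmoid, train, predict) and the fixed random seed are assumptions added for illustration and are not part of the quoted sample.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

def train(X, y, hidden_units=4, iterations=60000, seed=1):
    """Train a network with one hidden layer using the sample's update rule."""
    rng = np.random.RandomState(seed)  # fixed seed (assumed) for repeatable runs
    syn0 = 2 * rng.random_sample((X.shape[1], hidden_units)) - 1
    syn1 = 2 * rng.random_sample((hidden_units, 1)) - 1
    for _ in range(iterations):
        l1 = sigmoid(X.dot(syn0))                          # hidden layer
        l2 = sigmoid(l1.dot(syn1))                         # output layer
        l2_delta = (y - l2) * (l2 * (1 - l2))              # output error signal
        l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))  # hidden error signal
        syn1 += l1.T.dot(l2_delta)
        syn0 += X.T.dot(l1_delta)
    return syn0, syn1

def predict(X, syn0, syn1):
    """Forward pass only: return the network output for each row of X."""
    return sigmoid(sigmoid(X.dot(syn0)).dot(syn1))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T
syn0, syn1 = train(X, y)
predictions = predict(X, syn0, syn1)
print(predictions.round(3))
```

After training, the printed values should sit close to the targets in y, which is a quick check that the weights have converged.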
It’s a very simple example. In the real world, artificial neural networks would look
long and complex when written from scratch. Thankfully, working with
them is becoming more “democratized,” which means even people with limited
technical backgrounds would be able to take advantage of them.

16. Natural Language Processing
Can we make computers understand words and sentences? As mentioned in the
previous chapter, one of the goals is to match or surpass important human
capabilities. One of those capabilities is language (communication, knowing the
meaning of something, arriving at conclusions based on the words and
sentences).
This is where Natural Language Processing or NLP comes in. It’s a branch of
artificial intelligence wherein the focus is on understanding and interpreting
human language. It can cover the understanding and interpretation of both text
and speech.
Have you ever done a voice search in Google? Are you familiar with chatbots
(they automatically respond based on your inquiries and words)? What about
Google Translate? Have you ever talked to an AI customer service system?
It’s Natural Language Processing (NLP) at work. In fact, within a few or several
years the NLP market might become a multi-billion dollar industry. That’s
because it could be widely used in customer service, creation of virtual assistants
(similar to Iron Man’s JARVIS), healthcare documentation, and other fields.
Natural Language Processing is even used in understanding the content and
gauging sentiments found in social media posts, blog comments, product
reviews, news, and other online sources. NLP is very useful in these areas due to
the massive availability of data from online activities. Remember that we can
vastly improve our data analysis and machine learning model if we have
sufficient amounts of quality data to work on.
Analyzing Words & Sentiments
One of the most common uses of NLP is in understanding the sentiment in a
piece of text (e.g. Is it a positive or negative product review? What does the tweet
say overall?). If we only have a dozen comments and reviews to read, we don’t
need any technology to do the task. But what if we have to deal with hundreds or
thousands of sentences to read?
Technology is very useful in this large-scale task. Implementing NLP can make
our lives a bit easier and even make the results a bit more consistent and
reproducible.

To get started, let’s study Restaurant_Reviews.tsv (let’s take a peek):
Wow... Loved this place.    1
Crust is not good.    0
Not tasty and the texture was just nasty.    0
Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.    1
The selection on the menu was great and so were the prices.    1
Now I am getting angry and I want my damn pho.    0
Honeslty it didn't taste THAT fresh.)    0
The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.    0
The fries were great too.    1
The first part is the statement wherein a person shares his/her impression or
experience about the restaurant. The second part is whether that statement is
negative or not (0 if negative, 1 if positive or Liked). Notice that this is very
similar to Supervised Learning wherein there are labels early on.
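As a rough sketch of how such a file could be split into statements and labels, the example below parses a few of the rows shown above using only Python’s standard library. The header names Review and Liked are assumed for this sketch, and the rows are embedded as a string so the example runs without the actual file.

```python
import csv
import io

# A few rows copied from the peek above, tab-separated as in a .tsv file.
# The header names "Review" and "Liked" are assumptions for this sketch.
sample_tsv = (
    "Review\tLiked\n"
    "Wow... Loved this place.\t1\n"
    "Crust is not good.\t0\n"
    "Not tasty and the texture was just nasty.\t0\n"
    "The fries were great too.\t1\n"
)

# QUOTE_NONE keeps any quote characters inside a review as plain text
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
rows = [(row["Review"], int(row["Liked"])) for row in reader]

reviews = [text for text, _ in rows]     # the statements
labels = [label for _, label in rows]    # 0 = negative, 1 = positive
positive_share = sum(labels) / len(labels)

print(len(reviews), "reviews,", positive_share, "positive")
```

To read the real file instead, the io.StringIO(...) argument could be swapped for an opened file handle, provided the file has a matching header row.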
However, NLP is different because we’re dealing mainly with text and language
instead of numerical data. Also, understanding text (e.g. finding patterns and
inferring rules) can be a huge challenge. That’s because language is often
inconsistent with no explicit rules. For instance, the meaning of the sentence can
change dramatically by rearranging, omitting, or adding a few words in it.
There’s also the matter of context, wherein how the words are used greatly
affects the meaning. We also have to deal with “filler” words that are only there to
complete the sentence but are not important when it comes to meaning.
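A small sketch can illustrate both points: dropping filler words, and how a single word changes the meaning. The filler list below is hand-picked and hypothetical, and the helper names are made up for this example; real projects usually rely on a curated list from an NLP library.

```python
import re
from collections import Counter

# Hypothetical, hand-picked filler words, for illustration only
FILLER_WORDS = {"the", "is", "a", "an", "and", "was", "were", "so", "on",
                "this", "too"}

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, drop=FILLER_WORDS):
    """Count the words that remain after removing the filler words."""
    return Counter(word for word in tokenize(text) if word not in drop)

print(bag_of_words("The fries were great too."))
print(bag_of_words("Crust is not good."))
# Treating "not" as filler would erase the negation and flip the meaning
print(bag_of_words("Crust is not good.", drop=FILLER_WORDS | {"not"}))
```

The last call shows why the choice of filler words needs care: once “not” is removed, a negative review is left with only “crust” and “good”.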
Understanding statements, getting the meaning, and determining the emotional
state of the writer could be a huge challenge. That’s why it’s really difficult even
for experienced programmers to come up with a solution on how to deal with
words and language.
Using NLTK
Thankfully, there are now suites of libraries and programs that make Natural
Language Processing within reach even for beginner programmers and
practitioners. One of the most popular suites is the Natural Language Toolkit
(NLTK).
With NLTK (developed by Steven Bird and Edward Loper in the Department of
Computer and Information Science at the University of Pennsylvania), text
processing becomes a bit more straightforward because you’ll be implementing
pre-built code instead of writing everything from scratch. In fact, universities
in many countries actually incorporate NLTK in their courses.

Thank you!
Thank you for buying this book! It is intended to help you understand data
analysis using Python. If you enjoyed this book and felt that it added value to
your life, we ask that you please take the time to review it.
Your honest feedback would be greatly appreciated. It really does make a
difference.
We are a very small publishing company and our survival depends on your
reviews. Please, take a minute to write us an honest review.

Sources & References
Software, libraries, & programming language
● Python (https://www.python.org/)
● Anaconda (https://anaconda.org/)
● Virtualenv (https://virtualenv.pypa.io/en/stable/)
● Numpy (http://www.numpy.org/)
● Pandas (https://pandas.pydata.org/)
● Matplotlib (https://matplotlib.org/)
● Keras (https://keras.io/)
● Pytorch (https://pytorch.org/)
● Open Neural Network Exchange (https://onnx.ai/)
● TensorFlow (https://www.tensorflow.org/)
Datasets
● Kaggle (https://www.kaggle.com/datasets)
● Keras Datasets (https://keras.io/datasets/)
● Pytorch Vision Datasets (https://pytorch.org/docs/stable/torchvision/datasets.html)
● MNIST Database Wikipedia (https://en.wikipedia.org/wiki/MNIST_database)
● MNIST (http://yann.lecun.com/exdb/mnist/)
● CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html)
● Reuters dataset (https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection)
● IMDB Sentiment Analysis (http://ai.stanford.edu/~amaas/data/sentiment/)
Online books, tutorials, & other references
● Coursera Deep Learning Specialization (https://www.coursera.org/specializations/deep-learning)
● fast.ai - Deep Learning for Coders (http://course.fast.ai/)
● Keras Examples (https://github.com/keras-team/keras/tree/master/examples)
● Pytorch Examples (https://github.com/pytorch/examples)
● Pytorch MNIST example (https://gist.github.com/xmfbit/b27cdbff68870418bdb8cefa86a2d558)
● Overfitting (https://en.wikipedia.org/wiki/Overfitting)
● A Neural Network Program (https://playground.tensorflow.org/)
● TensorFlow Examples (https://github.com/aymericdamien/TensorFlow-Examples)
● Machine Learning Crash Course by Google (https://playground.tensorflow.org/)

