SAS and Hadoop: something new, something old, something borrowed, and something yellow
Jarno Lindqvist, SAS | Simon Gregory, Hortonworks | Woody Christy, Cloudera
Who is Hadoop? The name Hadoop comes from a toddler's experiments with language, back around 2003. Doug Cutting's son, then two years old, was just beginning to talk and called his beloved stuffed yellow elephant "Hadoop" (stress on the first syllable). Doug Cutting, the creator of Hadoop, now works for Cloudera.
What is Hadoop? An open-source framework for distributed storage and processing, designed for commodity hardware and capable of handling very large quantities of data.
A proper elephant never forgets! (because it has a distributed, fault-tolerant file system: HDFS — distributed, redundant, reliable storage)
A proper elephant is a social animal! (because it can process data in parallel: MapReduce — distributed data processing)
A proper elephant can be really BIG! (because it scales almost without limit)
Company Confidential - For Internal Use Only. Copyright 2014, SAS Institute Inc. All rights reserved.
SAS's playbook for the future
How does SAS leverage Hadoop?
- The Hadoop cluster as a data platform
- The Hadoop cluster as an analytical in-memory platform
[Analytics lifecycle diagram: identify/formulate problem → data preparation → data exploration → transform & select → build model → validate model → deploy model → evaluate/monitor results]
Tools: SAS DI Studio, SAS Data Loader, SAS/ACCESS to Hadoop, PROC HADOOP, SAS HPA procedures, SAS Visual Analytics, SAS Visual Statistics, SAS Scoring Accelerator
SAS now hugs the elephant from every direction!
SAS and the Hadoop ecosystem
- User interface: SAS Data Management, SAS Enterprise Miner, SAS Studio, SAS Visual Analytics, SAS Visual Statistics, SAS In-Memory Statistics for Hadoop
- Data access: Base SAS & SAS/ACCESS to Hadoop (SAS user, SAS metadata); in-memory data access (next-generation SAS user)
- Data processing: Pig, Hive, MapReduce, SAS Embedded Process, SAS LASR Analytic Server
- File system: HDFS
Hadoop as an analytical in-memory platform
SAS In-Memory Analytics: process data in memory, and use Hadoop for storage persistence and commodity computing.
- In the Hadoop cluster: SAS LASR Analytic Server
- Applications (web and mobile clients) running in-memory: SAS Data Loader, SAS Visual Analytics, SAS Visual Statistics, SAS In-Memory Statistics for Hadoop
SAS Data Management and Hadoop
- SAS Data Loader for Hadoop: a new web-based solution for data management and data quality processing within the Hadoop cluster
- SAS Data Integration Studio: the traditional SAS ETL/ELT development environment
- PROC HADOOP & SAS/ACCESS to Hadoop: enable submission of HiveQL, Pig, HDFS and MapReduce statements; SAS/ACCESS to Hadoop makes Hive tables behave like any other SAS library
SAS Data Loader for Hadoop
Enables true self-service Hadoop data management via a user-friendly web interface:
- ETL/ELT in Hadoop: executes SAS DS2 (via the SAS Embedded Process) and HiveQL
- Data extraction, filtering, expressions & summarization
- Parallel data loading from Hadoop into the SAS LASR Analytic Server (in-memory)
- Data quality
- Data profiling
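Data Loader generates the in-cluster code for you, but the shape of what it pushes down can be sketched by hand. The following is a minimal, hypothetical DS2 example of the kind of program the SAS Embedded Process can run in parallel inside the cluster; the library and table names (hdp.sales, hdp.big_sales) and the amount threshold are made up for illustration.

```sas
/* Hypothetical sketch: a DS2 thread program of the kind the
   SAS Embedded Process executes in parallel inside Hadoop.
   Library/table names and the filter are illustrative only. */
proc ds2;
  thread filter_t / overwrite=yes;
    method run();
      set hdp.sales;                 /* each thread reads a slice of the data */
      if amount > 1000 then output;  /* keep only large transactions */
    end;
  endthread;

  data hdp.big_sales / overwrite=yes;
    dcl thread filter_t t;
    method run();
      set from t;                    /* collect rows from all threads */
    end;
  enddata;
run;
quit;
```

In the Data Loader itself this is hidden behind the web directives; the point is only that the work expressed here runs where the data lives, not on the SAS server.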
SAS Data Loader for Hadoop
The user can work independently and doesn't need to know how to use Hadoop.
Non-technical user → SAS Data Loader for Hadoop (query, filter, transform, summarize, profile, cleanse and load data) → Hadoop.
Hadoop does the work, processing is fast, and all data management is done in Hadoop.
SAS Data Loader for Hadoop
The same directives (query, filter, transform, summarize, profile, cleanse and load data) can also direct high-speed loads of data into a distributed SAS LASR Analytic Server (optional).
Components of SAS Data Loader for Hadoop
- SAS vApp (Windows 7, VMware Player 6); deployed components support Cloudera CDH 5.0 and Hortonworks HDP 2.0
- Self-service user interface: buttons/directives to query, filter, transform, summarize, profile, cleanse and load data
- Execution environment: directives/tasks run inside the Hadoop cluster to minimize unnecessary data movement
- Hadoop cluster: SAS components installed in the cluster (SAS Embedded Process, SAS Code Accelerator for Hadoop, SAS Data Quality Accelerator for Hadoop) enable data processing to run inside Hadoop; HiveQL and DS2 are used to invoke processing
- Optional: SAS LASR distributed server (with SAS Embedded Process)
Typical filter, summary and sort options are available.
SAS Data Integration Studio
Why "something old"? Because SAS DI Studio, already familiar to SAS developers, is now also a powerful Hadoop development tool! (And it's not that old anyway: DI Studio 4.9 was released in August!)
SAS Data Integration Studio and Hadoop
Two layers: the Hive table layer (tables, via Hive or Impala) and the HDFS layer (files).
[Architecture: workstation with SAS DI Studio → SAS server (SAS Workspace Server, SAS/ACCESS Interface to Hadoop, distro-specific Hadoop JAR files) → Hadoop cluster (Hive / HiveServer2, Hive Metastore, HDFS)]
SAS Data Integration Studio and Hadoop
SAS DI Studio includes a comprehensive set of ready-made Hadoop transformations. In addition, the familiar SQL transformations generate HiveQL syntax suitable for pass-through. The High-Performance Analytics transformations load data either into Visual Analytics' Hadoop store (SASHDAT) or directly into a SAS LASR (in-memory) process.
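To make the pass-through idea concrete, here is a hand-written sketch of the kind of explicit pass-through HiveQL that such jobs boil down to. The server name, credentials and the sales/sales_summary tables are assumptions for illustration, not part of the original material.

```sas
/* Hypothetical sketch of explicit SQL pass-through to Hive.
   Server, credentials and table names are illustrative only. */
proc sql;
  connect to hadoop (server=sascldserv02 port=10000 user="hadoop");

  /* This HiveQL runs unchanged inside the Hadoop cluster */
  execute (
    create table sales_summary as
    select region, sum(amount) as total_amount
    from sales
    group by region
  ) by hadoop;

  /* Pull only the (small) summary result back to SAS */
  select * from connection to hadoop
    (select * from sales_summary);

  disconnect from hadoop;
quit;
```

The heavy lifting (the GROUP BY over the full table) stays in Hadoop; only the aggregated rows travel back to SAS.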
SAS Data Integration Studio and Hadoop
Hive libraries appear in SAS Management Console just like any other libraries, and the Register Tables function reads the metadata of Hive tables.
SAS Data Integration Studio and Hadoop
Read Hadoop data (as a Hive table), modify the table using HiveQL syntax, and write the result table back to Hadoop (via Hive). With SAS DI Studio you can work on Hive tables inside Hadoop so that the processing stays in Hadoop (note the "H" symbol).
SAS Data Integration Studio and Hadoop
With SAS DI Studio you can read and write sequential files directly at the Hadoop file system (HDFS) level, using ready-made transformations.
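The same file-level access is also available from plain Base SAS code via the FILENAME HADOOP access method. A minimal sketch, assuming a configuration file, credentials and a comma-delimited HDFS file (/user/hadoop/dept.txt) that are all hypothetical:

```sas
/* Minimal sketch: reading a delimited file straight from HDFS with
   the Base SAS FILENAME HADOOP access method. Paths, user and the
   file layout are assumptions for illustration. */
filename cfg "C:\Users\hadoop_config.xml";
filename deptf hadoop "/user/hadoop/dept.txt" cfg=cfg user="hadoop";

data work.dept;
  infile deptf dlm="," dsd truncover;
  input deptno department :$40.;   /* hypothetical two-column layout */
run;
```

This is the code-level counterpart of the DI Studio HDFS transformations: the file never needs to be staged on the SAS server's local disk first.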
SAS Data Integration Studio and Hadoop
With SAS DI Studio you can easily move files between Hadoop and the local file system (binaries, media files, JAR packages, and so on).
PROC HADOOP: an easy way to call Hadoop from SAS
Why "something borrowed"? Because PROC HADOOP lets you embed Hadoop code in any SAS program.
[Architecture: SAS workstation with Base SAS / Enterprise Guide (Hadoop configuration file, Hadoop JAR files) → Hadoop cluster (NameNode, HDFS)]
PROC HADOOP: calling HDFS commands
These HDFS commands operate at the Hadoop file-system level:

filename cfg "C:\Users\hadoop_config.xml";

PROC HADOOP options=cfg username="hadoop" password="hadoop";
   hdfs mkdir="/user/hadoop/testfolder";
   hdfs rename="/user/hadoop/testfolder" out="/user/hadoop/testfolder_new";
   hdfs delete="/user/hadoop/testfolder_new";
   hdfs copyfromlocal="c:\sample_data\dept.txt" out="/user/hadoop/testfolder/";
   hdfs copytolocal="/user/hadoop/testfolder" out="c:\sample_data\";
run;
PROC HADOOP: calling MapReduce JAR packages
MapReduce code is packaged into Java JAR files before being shipped to Hadoop:

filename cfg "C:\Users\hadoop_config.xml";

PROC HADOOP options=cfg username="hadoop" password="hadoop" verbose;
   hdfs delete="/user/hadoop/out";
   mapreduce input="/user/hadoop/gutenberg"
             output="/user/hadoop/out"
             jar="c:\sample_data\hadoop-examples-2.0.0-mr1-cdh4.1.2.jar"
             outputkey="org.apache.hadoop.io.Text"
             outputvalue="org.apache.hadoop.io.IntWritable"
             map="org.apache.hadoop.examples.WordCount$TokenizerMapper"
             combine="org.apache.hadoop.examples.WordCount$IntSumReducer"
             reduce="org.apache.hadoop.examples.WordCount$IntSumReducer";
run;
PROC HADOOP: calling Pig Latin code from a SAS program
Pig Latin is a high-level language for data manipulation in Hadoop:

/* Pig statement to process an HDFS data file */
filename cfg "C:\Users\hadoop_config.xml";
filename code1 "C:\Users\pig_cd.txt";

PROC HADOOP options=cfg username="hadoop" password="hadoop" verbose;
   pig code=code1;
run;

Contents of C:\Users\pig_cd.txt:
cd NYSE;
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grouped = group dividends by symbol;
avg = foreach grouped generate group, AVG(dividends.dividend);
store avg into 'average_dividend';
SAS/ACCESS to Hadoop: SAS programming as before
1. Create the Hadoop (Hive) library reference:

LIBNAME hdplib hadoop PORT=10000 SERVER=sascldserv02 USER="hadoop" PASSWORD="hadoop";

2. Use SAS procedures and the DATA step as before:

PROC DATASETS lib=hdplib; quit;
PROC CONTENTS data=hdplib.hdp_table; quit;
PROC SQL; select * from hdplib.hdp_table; quit;
PROC MEANS data=hdplib.hdp_table; run;

3. Or access Hadoop data with Enterprise Guide. SAS runs the SORT, MEANS, SUMMARY, TABULATE and REPORT procedures automatically in the Hadoop cluster (in-database).
THANK YOU!