How did Hadoop become easy? Jarno Lindqvist, Principal Advisor, SAS
Hadoop? An open-source framework for distributed storage and processing that scales out and replicates data. A cute yellow elephant. A platform for thousands of additional projects that complement the core framework (Hive, Pig, Impala, Spark, Oozie, Sqoop, Mahout, Ambari, Flume, Storm, ...)
I always have to code...

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'
    UNION ALL
    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
) actions
JOIN users u ON (u.id = actions.uid);

FROM (
    FROM (
        FROM action_video av
        SELECT av.uid AS uid, av.id AS id, av.date AS date
        UNION ALL
        FROM action_comment ac
        SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
    ) union_actions
    SELECT union_actions.uid, union_actions.id, union_actions.date
    CLUSTER BY union_actions.uid
) map
INSERT OVERWRITE TABLE actions_reduced
SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);

A = LOAD '/user/xxx/firstinput' USING PigStorage();
B = GROUP ...
C = ... aggregation function
STORE C INTO '/user/vxj/firstinputtempresult/days1';
...
Atab = LOAD '/user/xxx/secondinput' USING PigStorage();
Btab = GROUP ...
Ctab = ... aggregation function
STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';
EXEC;
E = LOAD '/user/vxj/firstinputtempresult/' USING PigStorage();
F = GROUP ...
G = ... aggregation function
STORE G INTO '/user/vxj/finalresult1';
...
Etab = LOAD '/user/vxj/secondinputtempresult/' USING PigStorage();
Ftab = GROUP ...
Gtab = ... aggregation function
STORE Gtab INTO '/user/vxj/finalresult2';

big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';

Pig Latin!!! MapReduce!!! HiveQL!!!
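Stripped of the Hadoop boilerplate, the WordCount job above is just tokenize-and-tally: the map phase emits a (word, 1) pair per token, and the reduce phase sums the pairs per word. A minimal plain-Java sketch of the same logic, with no cluster or Hadoop dependencies (class and method names are my own, for illustration only):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch of what the MapReduce WordCount computes:
// tokenize each input line ("map"), then sum the counts per word ("reduce").
public class WordCountSketch {

    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {                       // one "map" call per record
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                // fold the emitted (word, 1) pairs into a running sum
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("the cat", "the hat")));
    }
}
```

The point of the slide stands: on a real cluster, those few lines of logic come wrapped in the Mapper/Reducer classes, driver setup, and type plumbing shown above.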
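The `USING 'replicated'` hint in the last Pig statement requests a map-side join: the small relations are replicated to every node and held in memory as hash tables, so the big relation streams past without a shuffle. A plain-Java sketch of the idea, assuming simple "key,value" rows (names and record format are illustrative, not Pig internals):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a replicated (map-side) join: the tiny relation is loaded into an
// in-memory hash table, then the big relation streams against it with no shuffle.
public class ReplicatedJoinSketch {

    // big and tiny rows are "key,value" strings; returns joined "key,bigVal,tinyVal" rows
    public static List<String> join(List<String> big, List<String> tiny) {
        // Build phase: index the small side by its join key. In Pig this table
        // would be copied to every map task, hence "replicated".
        Map<String, String> tinyByKey = new HashMap<>();
        for (String t : tiny) {
            String[] parts = t.split(",", 2);
            tinyByKey.put(parts[0], parts[1]);
        }
        // Probe phase: stream the big side and emit matching rows.
        List<String> out = new ArrayList<>();
        for (String b : big) {
            String[] parts = b.split(",", 2);
            String match = tinyByKey.get(parts[0]);
            if (match != null) {
                out.add(b + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // only the k1 row finds a match in the tiny relation
        System.out.println(join(List.of("k1,big1", "k2,big2"), List.of("k1,tiny1")));
    }
}
```

The trade-off is the same one Pig documents for replicated joins: the small side must fit in each task's memory, in exchange for skipping the expensive reduce-side shuffle.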
How would this sound instead? Yay! Pastel-colored buttons!
Power to the people!
- Move data effortlessly from sources into Hadoop (and back out)
- Profile the data where it already is (in Hadoop, of course)
- Basic filtering, summarization, and joins
- Ta-da! Transpose inside Hadoop! (DS2 processing in Hadoop, wow!)
- Effortless parallel loading into SAS Visual Analytics (if you have it)
A lot of punch in a small package!

1 DATA MOVEMENT: Copy Data to Hadoop, Profile Data, Identification Analysis, Query, Import a File, Browse Tables. Access data, move it into Hadoop, and assess the data structure and content.

2 DATA TRANSFORMATION: Query, Select Columns, Apply Filters, Map Columns, Sort / Order, Calculate Columns, Transpose data, Aggregate, Transform data. Select data of interest, manipulate it, and structure it into the desired data format.

3 DATA CLEANSING: Validate, Parse, Standardize, Change Case, Gender Analysis, Pattern Analysis, Field Extraction. Put data into a consistent format.

4 DATA INTEGRATION: Join, Create Match codes, Sort & De-duplicate, Aggregate, Run a SAS program, Run a Hive program. Combine datasets, including data that has no common key, remove duplicate data, and create new data points through aggregation.

5 DATA DELIVERY: Load SAS LASR, Create tables, Create views, Copy from Hadoop, Delete Rows. Load datasets into the SAS LASR in-memory analytic server, create new Hadoop tables, and deliver data to other databases and applications.
So how does it actually work?
- SAS vApp: SAS Data Loader (web app), used from a web browser, plus SAS/Access to Hadoop
- Hadoop cluster: each node runs the SAS Embedded Process, which executes SAS DS2 code through the SAS Data Quality Accelerator for Hadoop and the SAS Code Accelerator for Hadoop
- Data sources and targets: RDBMS, SAS, text files
- Optional: SAS LASR In-Memory Analytic Server
- Operations pushed down into the cluster: Profile, Cleanse, Join, Load, Query, Filter, Transform, De-duplicate
And at the user-interface end? 1) Start the virtual machine 2) Check the Data Loader's IP address 3) Open a browser and go!
What kind of elephant do I need? MapR 4.1 or newer, Cloudera 5.2 or newer, Hortonworks 2.1 or newer. But what kind of SAS server do I need? You know what, you don't need one at all! Data Loader runs in the Hadoop cluster!
A good companion for Visual Analytics, too
Download the 90-day trial of Data Loader and try it yourself! http://www.sas.com/en_us/software/data-management/data-loader-hadoop.html#trial