Posts

Showing posts from 2017

Splitting a large file on Windows

HJSplit is a nice tool, but it does not preserve lines; it breaks the file purely by the size of each part. A split utility for Windows is available as part of the GnuWin32 coreutils package: http://gnuwin32.sourceforge.net/packages/coreutils.htm

Removing the first n lines (5 in this case) from a file:

MORE /E +5 input.csv > out.csv
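If the parts need to break on line boundaries, GNU split's -l option does that. A minimal sketch, assuming the GnuWin32 split.exe is on the PATH (the line count and file names are just examples):

split -l 100000 input.csv part_
REM creates part_aa, part_ab, ... each holding at most 100000 complete lines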

Git - Overview

Introduction

Git is becoming the de-facto choice of source code control system for the new generation of developers. Git is based on distributed repositories. It manages content in terms of snapshots and knows how to apply or roll back the changes between one snapshot and another.

Most used commands

To track or stage changes: git add filename
To track or stage all files under dir1: git add dir1\
To commit all the changes: git commit -a
To push the changes to the remote: git push origin master

References:
https://www.ibm.com/developerworks/library/d-learn-workings-git/index.html
https://stackoverflow.com/questions/2745076/what-are-the-differences-between-git-commit-and-git-push
https://drive.google.com/file/d/1PuZDiljecX31pDipbXxyfv8P2jhDwQo6/view?usp=sharing
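Putting the commands above together, a typical edit/stage/commit/push cycle looks roughly like this (repository URL, branch, and file names are placeholders):

git clone https://example.com/project.git
cd project
git add newfile.txt
git commit -m "Add newfile"
git push origin master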

Unit of Storage

Unit        Description
Kilobyte    1024 Bytes
Megabyte    1024 Kilobytes
Gigabyte    1024 Megabytes
Terabyte    1024 Gigabytes
Petabyte    1024 Terabytes
Exabyte     1024 Petabytes
Zettabyte   1024 Exabytes
Yottabyte   1024 Zettabytes
Brontobyte  1024 Yottabytes
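As a worked example, 1 Gigabyte = 1024 × 1024 × 1024 bytes = 1,073,741,824 bytes, and 1 Terabyte = 1024 Gigabytes = 1,099,511,627,776 bytes.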

Statistics & Machine Learning

Useful Resources

Statistics
http://www.statisticshowto.com/probability-and-statistics/statistics-definitions/
http://tutorials.istudy.psu.edu/basicstatistics/basicstatistics2.html
http://www.reading.ac.uk/ssc/resources/Docs/Statistical_Glossary.pdf
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test

Random Forest
http://blog.yhat.com/posts/random-forests-in-python.html
https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/#one
http://www.ling.upenn.edu/~clight/chisquared.htm
http://analyticstrainings.com/?p=151

Feature Selection - Machine Learning
http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Impala

1. Creating a partitioned table in Parquet format directly from an unpartitioned table in a different format:

create external table dbname.partitioned_table
PARTITIONED BY (col5, col6)
STORED AS PARQUET
LOCATION 'hdfs://nameservice/path/partitioned_table'
AS
select
  col1,
  col2,
  col3,
  col4,
  col5,
  col6
from dbname.unpartitioned_table

2. Error while accessing a HIVE TEXTFILE format table in Impala.

Create a TEXTFILE format table in HIVE:

CREATE EXTERNAL TABLE test_impala_do (
  category STRING,
  segment STRING,
  level_1 STRING,
  group_code INT,
  d_xuid_count INT,
  segment_code INT,
  datetime STRING COMMENT 'DATETIME',
  source_file STRING COMMENT 'SOURCE_FILE'
)
PARTITIONED BY (
  akey STRING COMMENT 'AKEY AS DEFINED'
)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/..../test_impala_do'

Populated using a Spark SQL job. INSERT OVERWRITE INTO
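A common reason Impala cannot see a table that was created in Hive or populated by a Spark job is stale catalog metadata. As a sketch (the exact fix depends on the setup), refreshing the metadata from impala-shell often helps:

-- table name taken from the DDL above
INVALIDATE METADATA test_impala_do;
-- or, if only new files/partitions were added to an already known table
REFRESH test_impala_do;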

JVM

The JVM argument -verbose:class helps debug classloader issues. With this option, you can see the classes being loaded by the application.

Use the -Xss<N><m|g> argument to specify the thread stack size for the JVM. If the default stack size is too small, you may get java.lang.StackOverflowError from an application's deep recursion. For example, you can specify 128 MB of stack space with -Xss128m.

ClassLoader.findClass() can throw ClassNotFoundException if the JVM process does not have read permissions for the jar/class file, even if it is on the classpath.
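To see the stack-size flag in action, here is a minimal sketch (the class is made up for illustration) that forces a java.lang.StackOverflowError through deep recursion:

public class DeepRecursion {
    private static long depth = 0;

    // Each call adds a frame to the thread stack until it overflows.
    private static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError at depth " + depth);
        }
    }
}

Compare, for example, java -Xss512k DeepRecursion with java -Xss128m DeepRecursion: the larger stack lets the recursion go much deeper before failing. Adding -verbose:class to the same command prints every class the JVM loads while it runs.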

DOS BATCH SCRIPTS

1. I had to frequently pull files from a server. The file set used to be in the following format:

abc123_file1.txt
def23145_file2.txt
r4rtyyy_file3.txt

The task was to remove the prefix from the names and put the file set back. It was too mechanical, so I wrote the following script to do it:

ECHO OFF
SETLOCAL EnableDelayedExpansion
FOR %%I IN (*) DO (
  SET xname=%%I
  REM Split the file name on "_" so the prefix ends up in %%a
  for /f "tokens=1,2 delims=_" %%a in ("%%I") do (
    REM Strip "<prefix>_" from the name
    SET xname=!xname:%%a_=!
    SET ENDSWITH=!xname:~-4!
    REM Rename only files whose names end in .csv or done
    IF "!ENDSWITH!"==".csv" REN %%I !xname!
    IF "!ENDSWITH!"=="done" REN %%I !xname!
  )
)

2. I faced a situation where I needed to upload files to a server at intervals, not in one go, and was looking for a way to script it. This is what I did. I was using WinSCP to upload the files.
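One way to script timed uploads with WinSCP's command-line client is a loop with a delay. A rough sketch, assuming winscp.com is on the PATH; the host, credentials, paths, and the 10-minute interval below are illustrative only:

REM upload_loop.bat - push one file every 10 minutes using winscp.com
FOR %%F IN (C:\outbox\*.csv) DO (
  winscp.com /command "open sftp://user:password@example.com/" "put %%F /upload/" "exit"
  REM wait 600 seconds before sending the next file
  timeout /t 600 /nobreak
)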

Hadoop

* Good article on MapReduce internals: http://www.bigsynapse.com/mapreduce-internals
* org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> is a generic class. There is a Mapper generic interface too, but it is in a different package, org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>, and has nothing to do with the Mapper class. The Reducer is also a generic class, similar to Mapper.
* Hadoop also has two special annotations to indicate the audience and stability of any interface or class:
    - InterfaceStability
    - InterfaceAudience
* All the key/value types in Hadoop MapReduce programming must implement the Writable interface, so that the type supports efficient serialization via DataInput/DataOutput (see the Mapper sketch at the end of this post).
* hadoop fs -put input/sample.txt hdfs://quickstart.cloudera:8020/sample.txt returned "ConnectException" because the namenode service was not running. Used the following commands to find the status and restart the service: sudo service had
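To make the Mapper/Writable point concrete, here is a minimal word-count style Mapper sketch against the org.apache.hadoop.mapreduce API (the class name and the word-count logic are just an illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// KEYIN=LongWritable (byte offset), VALUEIN=Text (line),
// KEYOUT=Text (word), VALUEOUT=IntWritable (count) - all Writable types.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every whitespace-separated token in the line.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}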