Posts

Showing posts from 2017

Splitting a large file on Windows

HJSplit is a nice tool, but it does not preserve lines; it breaks the file purely by the size of each part. A split utility for Windows is available as part of the GnuWin32 coreutils package: http://gnuwin32.sourceforge.net/packages/coreutils.htm

Removing the first n lines (5 in this case) from a file:

MORE /E +5 input.csv > out.csv
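If the parts need to break on line boundaries, GNU split's -l option does that. A minimal sketch, assuming the GnuWin32 split.exe is on the PATH (the line count and file names are just examples):

split -l 100000 input.csv part_
REM creates part_aa, part_ab, ... each holding at most 100000 complete lines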

Git - Overview

Introduction

Git is becoming the de-facto choice of source code control system for the new generation of developers. Git is based on distributed repositories. It manages content in terms of snapshots and knows how to apply or roll back the changes between one snapshot and another.

Most used commands

To track or stage changes: git add filename
To track or stage all files under dir1: git add dir1\
To commit all the changes: git commit -a
To push the changes to the remote: git push origin master

References:
https://www.ibm.com/developerworks/library/d-learn-workings-git/index.html
https://stackoverflow.com/questions/2745076/what-are-the-differences-between-git-commit-and-git-push
https://drive.google.com/file/d/1PuZDiljecX31pDipbXxyfv8P2jhDwQo6/view?usp=sharing
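Putting the commands above together, a typical edit/stage/commit/push cycle looks roughly like this (repository URL, branch, and file names are placeholders):

git clone https://example.com/project.git
cd project
git add newfile.txt
git commit -m "Add newfile"
git push origin master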

Unit of Storage

Unit        Description
Kilobyte    1024 Bytes
Megabyte    1024 Kilobytes
Gigabyte    1024 Megabytes
Terabyte    1024 Gigabytes
Petabyte    1024 Terabytes
Exabyte     1024 Petabytes
Zettabyte   1024 Exabytes
Yottabyte   1024 Zettabytes
Brontobyte  1024 Yottabytes
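As a worked example, 1 Gigabyte = 1024 × 1024 × 1024 bytes = 1,073,741,824 bytes, and 1 Terabyte = 1024 Gigabytes = 1,099,511,627,776 bytes.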

Statistics & Machine Learning

Useful Resources

Statistics
http://www.statisticshowto.com/probability-and-statistics/statistics-definitions/
http://tutorials.istudy.psu.edu/basicstatistics/basicstatistics2.html
http://www.reading.ac.uk/ssc/resources/Docs/Statistical_Glossary.pdf
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test

Random Forest
http://blog.yhat.com/posts/random-forests-in-python.html
https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/#one
http://www.ling.upenn.edu/~clight/chisquared.htm
http://analyticstrainings.com/?p=151

Feature Selection - Machine Learning
http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Impala

1. Creating a partitioned table in Parquet format directly from an unpartitioned table in a different format:

create external table dbname.partitioned_table
PARTITIONED BY (col5, col6)
STORED AS PARQUET
LOCATION 'hdfs://nameservice/path/partitioned_table'
AS
select
  col1,
  col2,
  col3,
  col4,
  col5,
  col6
from dbname.unpartitioned_table

2. Error while accessing a HIVE TEXTFILE format table in Impala.

Create a TEXTFILE format table in HIVE:

CREATE EXTERNAL TABLE test_impala_do (
  category STRING,
  segment STRING,
  level_1 STRING,
  group_code INT,
  d_xuid_count INT,
  segment_code INT,
  datetime STRING COMMENT 'DATETIME',
  source_file STRING COMMENT 'SOURCE_FILE'
)
PARTITIONED BY (
  akey STRING COMMENT 'AKEY AS DEFINED'
)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/..../test_impala_do'

Populated using a Spark SQL job. INSERT OVERWRITE INTO
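A common reason Impala cannot see a table that was created in Hive or populated by a Spark job is stale catalog metadata. As a sketch (the exact fix depends on the setup), refreshing the metadata from impala-shell often helps:

-- table name taken from the DDL above
INVALIDATE METADATA test_impala_do;
-- or, if only new files/partitions were added to an already known table
REFRESH test_impala_do;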

JVM

The JVM argument -verbose:class helps debug classloader issues. With this option, you can see the classes being loaded by the application.

Use the -Xss<N><m|g> argument to specify the thread stack size for the JVM. If the default stack size is too small, you may get java.lang.StackOverflowError from an application's deep recursion. For example, you can specify 128 MB of stack space with -Xss128m.

ClassLoader.findClass() can throw ClassNotFoundException if the JVM process does not have read permissions for the jar/class file, even if it is on the classpath.
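To see the stack-size flag in action, here is a minimal sketch (the class is made up for illustration) that forces a java.lang.StackOverflowError through deep recursion:

public class DeepRecursion {
    private static long depth = 0;

    // Each call adds a frame to the thread stack until it overflows.
    private static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError at depth " + depth);
        }
    }
}

Compare, for example, java -Xss512k DeepRecursion with java -Xss128m DeepRecursion: the larger stack lets the recursion go much deeper before failing. Adding -verbose:class to the same command prints every class the JVM loads while it runs.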

DOS BATCH SCRIPTS

1. I had to frequently pull files from a server. The file set used to be in the following format:

abc123_file1.txt
def23145_file2.txt
r4rtyyy_file3.txt

The task was to remove the prefix from the names and put the file set back. It was too mechanical, so I wrote the following script to do it:

ECHO OFF
SETLOCAL EnableDelayedExpansion
FOR %%I IN (*) DO (
  SET xname=%%I
  REM Split the file name on "_" so the prefix ends up in %%a
  for /f "tokens=1,2 delims=_" %%a in ("%%I") do (
    REM Strip "<prefix>_" from the name
    SET xname=!xname:%%a_=!
    SET ENDSWITH=!xname:~-4!
    REM Rename only files whose names end in .csv or done
    IF "!ENDSWITH!"==".csv" REN %%I !xname!
    IF "!ENDSWITH!"=="done" REN %%I !xname!
  )
)

2. I faced a situation where I needed to upload files to a server at intervals, not in one go, and was looking for a way to script it. This is what I did. I was using WinSCP to upload the files.
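One way to script timed uploads with WinSCP's command-line client is a loop with a delay. A rough sketch, assuming winscp.com is on the PATH; the host, credentials, paths, and the 10-minute interval below are illustrative only:

REM upload_loop.bat - push one file every 10 minutes using winscp.com
FOR %%F IN (C:\outbox\*.csv) DO (
  winscp.com /command "open sftp://user:password@example.com/" "put %%F /upload/" "exit"
  REM wait 600 seconds before sending the next file
  timeout /t 600 /nobreak
)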

Hadoop

* Good article on MapReduce internals: http://www.bigsynapse.com/mapreduce-internals
* org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> is a generic class. There is a Mapper generic interface too, but it is in a different package, org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>, and has nothing to do with the Mapper class. The Reducer is also a generic class, similar to Mapper.
* Hadoop also has two special annotations to indicate the audience and stability of any interface or class:
    - InterfaceStability
    - InterfaceAudience
* All the key/value types in Hadoop MapReduce programming must implement the Writable interface, so that the type supports efficient serialization via DataInput/DataOutput (see the Mapper sketch at the end of this post).
* hadoop fs -put input/sample.txt hdfs://quickstart.cloudera:8020/sample.txt returned "ConnectException" because the namenode service was not running. Used the following commands to find the status and restart the service: sudo service had
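To make the Mapper/Writable point concrete, here is a minimal word-count style Mapper sketch against the org.apache.hadoop.mapreduce API (the class name and the word-count logic are just an illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// KEYIN=LongWritable (byte offset), VALUEIN=Text (line),
// KEYOUT=Text (word), VALUEOUT=IntWritable (count) - all Writable types.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every whitespace-separated token in the line.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}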