Docker in BigDataScience: Shipping BigData Science Tools in Docker Containers

Docker makes life much easier for data scientists. With Docker, we can install data science tools
quickly, without the hassle of manual configuration.

This post covers setting up a Hadoop/Spark environment using Docker images.
Before getting to that, let me introduce Docker itself. The post is divided into the following
parts:

1. Docker Concept
2. How to install Docker on Windows
3. Install Spark using Docker images

Let's start with Docker.

A virtual machine takes a long time to boot, because it has to bring up a full guest operating system with all its packages and dependencies. Linux Containers (LXC) solve this problem by enabling
multiple isolated environments to run on a single machine while sharing the host's kernel.
For more information about LXC, please refer to the wiki page https://en.wikipedia.org/wiki/LXC

LXC is the technology Docker was originally built on.

For more information on Docker, please refer to "docker.training".

To install Docker on a Windows machine, please refer to the Docker documentation below:
https://docs.docker.com/windows/step_one/

Once Docker is up and running, follow the steps below to install and run the Spark environment.

1. Click on "Docker Quickstart Terminal".
2. Switch to a bash shell (type "bash" and press the Return key).
3. In the Docker terminal, run the following command:

docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
This will take some time: Docker first looks for the Spark image on the local machine and, if it is not there, downloads it from Docker Hub.
Please wait until all image layers have been downloaded and extracted without errors.
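
The lookup-then-download behaviour can also be made explicit. The sketch below is one possible way to do it, not part of the original setup: image_is_local and start_sandbox are hypothetical helper names, and it assumes the docker CLI is on the PATH of the shell opened in step 2. It checks for the image with "docker images -q", pulls it with "docker pull" only if it is missing, and then issues the same docker run command as in step 3.

```shell
# Assumption: IMAGE is the same tag used in the docker run command of step 3.
IMAGE="sequenceiq/spark:1.2.0"

# Succeeds (exit status 0) if the given image is already stored locally.
# "docker images -q" prints the IDs of matching images, or nothing at all.
image_is_local() {
  [ -n "$(docker images -q "$1" 2>/dev/null)" ]
}

# Hypothetical helper: download the image only when needed, then start it.
start_sandbox() {
  if ! image_is_local "$IMAGE"; then
    echo "Image $IMAGE not found locally, pulling from Docker Hub..."
    docker pull "$IMAGE" || return 1
  fi
  docker run -i -t -h sandbox "$IMAGE" /etc/bootstrap.sh -bash
}
```

After defining these functions in the terminal, calling start_sandbox should behave like the command in step 3, with the download shown as a separate step.
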
On successful execution of the docker run command, you will see the following messages:

$ docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
/
Starting sshd: [ OK ]
Starting namenodes on [sandbox]
sandbox: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-sandbox.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-sandbox.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-sandbox.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-sandbox.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-sandbox.out
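
To double-check that every daemon named in these messages is actually running, one option is to compare them against the output of jps, the JDK tool that lists running Java processes. The sketch below assumes jps is available on the PATH inside the container; missing_daemons and EXPECTED_DAEMONS are hypothetical names introduced here for illustration.

```shell
# Daemons named in the startup messages above, as jps reports them.
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# Reads jps output on stdin (lines of "<pid> <ClassName>") and prints
# the name of each expected daemon that does not appear in it.
missing_daemons() {
  running=$(awk '{print $2}')
  for d in $EXPECTED_DAEMONS; do
    # -x matches whole lines only, so "NameNode" does not match "SecondaryNameNode"
    echo "$running" | grep -qx "$d" || echo "$d"
  done
}
```

Inside the container, run "jps | missing_daemons". No output means all five daemons were found.
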

Testing the Spark setup

4. cd /usr/local/spark
5. Run "./bin/spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1"
6. At the "scala>" prompt, run "sc.parallelize(1 to 1000).count()". It should return 1000.

7. Exit the Spark shell, then run "./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 ./lib/spark-examples-1.2.0-hadoop2.4.0.jar"
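
In yarn-cluster mode the driver runs inside YARN, so the SparkPi result is written to the application logs rather than to the terminal. The sketch below assumes YARN log aggregation is enabled in the container and that you substitute the application ID printed by spark-submit; extract_pi is a hypothetical helper name. It filters log text for the "Pi is roughly" line that the SparkPi example prints.

```shell
# Reads log text on stdin and prints the estimated value of Pi, taken
# from the first "Pi is roughly <value>" line found in the input.
extract_pi() {
  sed -n 's/.*Pi is roughly \([0-9.]*\).*/\1/p' | head -n 1
}
```

A possible invocation is "yarn logs -applicationId <application ID> | extract_pi", where the application ID is a placeholder to fill in from the spark-submit output.
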

Cloudera QuickStart with a Docker image

Please follow the steps at the link below.

https://hub.docker.com/r/cloudera/quickstart/


Tuesday, May 17, 2016
