Pydoop is a handy library for developing against or experimenting with Hadoop services, and we have some prototypes that require Python access to HDFS. Pydoop is therefore our first choice for letting scripts interact with HDFS files. However, installing Pydoop requires some Hadoop libraries so that it can compile on your dev machine. Hence, we chose the Cloudera Hadoop client libraries for our dev machines to develop Pydoop scripts in Eclipse.
First, install the CDH4 repository RPM so that your CentOS machine can find the Hadoop client software packages. Then you can install hadoop-client with yum:
# rpm -ivh cdh4-repository-1-0.noarch.rpm
# yum install hadoop-client
For the pip installation, you need to set JAVA_HOME and HADOOP_HOME so that pip can compile the Pydoop package:
# export JAVA_HOME=/usr/lib/jvm/java-1.6.0
# export HADOOP_HOME=/usr/lib/hadoop
# pip install pydoop==0.10.0
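Once pip finishes, a quick smoke test confirms that the build can actually talk to HDFS. Below is a minimal sketch using Pydoop's high-level pydoop.hdfs API; the /tmp path and file name are only illustrative, and it assumes a reachable HDFS with write access:

import pydoop.hdfs as hdfs

# List the HDFS root directory to confirm the client can reach the NameNode.
for entry in hdfs.ls("/"):
    print(entry)

# Round-trip a small test file (the path is hypothetical).
f = hdfs.open("/tmp/pydoop_smoke_test.txt", "w")
f.write("hello from pydoop\n")
f.close()

f = hdfs.open("/tmp/pydoop_smoke_test.txt", "r")
print(f.read())
f.close()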
Monday, March 30, 2015
The High-Availability Design Paradigm for Applications
We often encounter a design dilemma between service reliability and application development cost. High availability demands a lot of effort beyond the business requirements, but an HA solution that is too simple can melt your business down when you face an incident caused by any kind of hardware failure. Here we discuss three levels of HA paradigm from the perspective of implementation complexity.
1. Manual Switch: the service has basic monitoring infrastructure to help you identify a hardware failure, so you can bring the application up on spare capacity and keep the service running. Manual switching is easy to adopt and costs the least. However, it barely qualifies as an HA design: from the moment the service is interrupted to the moment it is manually restarted can take over 30 minutes (the monitoring interval usually takes 5~10 minutes to catch the event and alert; a human rules out a false alarm; once confirmed, the SOP is followed to start the application; the service resumes). This low level of HA is suitable for non-time-critical missions such as file transfer or report generation, where 30 minutes of downtime is tolerable.
2. Semi-Auto Switch: some services have strict data-consistency requirements and need rigid transaction results free of race conditions, while the interruption must not exceed a couple of minutes. We usually design Active and Passive nodes and let them coordinate with each other so that only one Active node is working at a time. Once the Active node goes down, the Passive one takes over control and occupies a lock (usually in a database); see the sketch after this list. Once the failed machine recovers, it cannot process any transactions because the lock is already held by its partner. Many designs work this way, such as database clusters, where multiple nodes vote under a quorum assignment and promote another candidate to Active to continue the task. In such a service flow, the application behind a queue maintains data consistency, and there is no service interruption in front of the queue; all the switching is handled behind it. The system interface looks fine from outside, but the internal service flow may see a couple of minutes of downtime, with the queue acting as a cushion that prevents the damage from propagating to dependent systems.
3. Active-Active Mode: this is the ideal, where every node has the same responsibility and no single failure can interrupt the service. However, this design can impose a heavy burden on all the applications, which must communicate with each other to maintain data consistency and prevent race conditions during transactions. In a poor design, that burden drags down the performance of the whole cluster. Hence, only a few scenarios can adopt this feature without much effort, such as a web farm with query-only traffic (no transactions), or a service flow where only one concurrent user connects to the server at a time, like an ATM (you only have one debit card, right?). For applications focused on data availability, this design is very attractive; but if your application must maintain data consistency across many concurrent user connections, Active-Active mode demands a lot of time spent tuning performance, better data structures to reduce locking, and a crystal-clear service flow and business purpose in case you need to expand features as the business changes.
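To make level 2 concrete, below is a minimal sketch of the coordination lock described there, written as a lease that the Passive node keeps polling. It is only an illustration: the ha_lock table, the node name, and the lease length are all made up, and sqlite3 stands in for whatever shared database the cluster actually uses.

import sqlite3
import time

NODE_ID = "node-b"        # illustrative identity of this node
LEASE_SECONDS = 30        # how long the holder may go without refreshing

def try_acquire(conn):
    # Atomically take (or refresh) the lock if it is free, already ours,
    # or its lease has expired; exactly one updated row means we hold it.
    now = time.time()
    cur = conn.execute(
        "UPDATE ha_lock SET holder = ?, expires = ? "
        "WHERE holder IS NULL OR holder = ? OR expires < ?",
        (NODE_ID, now + LEASE_SECONDS, NODE_ID, now))
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect("ha_lock.db")  # stand-in for the shared database
conn.execute("CREATE TABLE IF NOT EXISTS ha_lock (holder TEXT, expires REAL)")
conn.execute("INSERT INTO ha_lock (holder, expires) "
             "SELECT NULL, 0 WHERE NOT EXISTS (SELECT 1 FROM ha_lock)")
conn.commit()

while True:
    if try_acquire(conn):
        print("active: processing; next loop iteration refreshes the lease")
    else:
        print("passive: the partner holds the lock; keep polling")
    time.sleep(LEASE_SECONDS / 3.0)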
Usually we compromise on level 2 or level 1. But if you can afford it, why not complete your solution with Active-Active?
Sunday, November 2, 2014
How to figure out how your Linux applications are installed, deployed, and configured
There are several major distributions in the Linux community, and each of them has slightly different conventions for software package management and compiler prefixes. Hence, it is crucial to learn how packages are currently deployed on your system:
1. Default repository manager: yum, apt-get, Homebrew, and so on. You have to learn the mainstream repository and deployment management system on your distribution.
2. Default package manager: rpm, dpkg, and so on. You need to learn the shell commands for software installation. These commands usually maintain a database recording the dependency relationships between packages. You may have a lot of customized packages provided by vendors that you cannot find in the public repositories for yum or apt-get.
3. The useful shell command `locate`: we usually use this command to discover installations made by ./configure and `make install`. It is really useful when you have some cutting-edge software that you compiled and installed manually. We also sometimes use it to check for missing libraries or misconfigured package installations.
4. Check folders like /etc/init.d on CentOS, together with the `service` (or `/sbin/service`) and `chkconfig` commands. These let you look up which applications are started at boot.
5. If a file does exist but you still have trouble linking against it, you can use the `ldd` command to parse the ELF binary and check where all its dependencies resolve, for example (see also the script after the listing):
[user@server ~]$ ldd /usr/lib64/libtdsodbc.so.0
linux-vdso.so.1 => (0x00007fffe3ea0000)
libodbcinst.so.2 => not found
libgnutls.so.26 => /usr/lib64/libgnutls.so.26 (0x00007f65b1a6a000)
librt.so.1 => /lib64/librt.so.1 (0x00007f65b1862000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f65b1645000)
libc.so.6 => /lib64/libc.so.6 (0x00007f65b12b0000)
libtasn1.so.3 => /usr/lib64/libtasn1.so.3 (0x00007f65b10a0000)
libz.so.1 => /lib64/libz.so.1 (0x00007f65b0e8a000)
libgcrypt.so.11 => /lib64/libgcrypt.so.11 (0x00007f65b0c14000)
/lib64/ld-linux-x86-64.so.2 (0x00007f65b1f71000)
libgpg-error.so.0 => /lib64/libgpg-error.so.0 (0x00007f65b0a10000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f65b080c000)
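Since the "not found" entries are exactly what you are hunting for, the check is easy to script. Here is a minimal sketch that wraps ldd and reports only the missing dependencies; the default binary path simply reuses the example above:

import subprocess
import sys

def missing_libs(binary):
    # Run ldd on an ELF binary and collect dependencies marked "not found".
    output = subprocess.check_output(["ldd", binary]).decode()
    return [line.split("=>")[0].strip()
            for line in output.splitlines()
            if "not found" in line]

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/usr/lib64/libtdsodbc.so.0"
    for lib in missing_libs(target):
        print("missing: " + lib)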
Thursday, October 2, 2014
The CentOS Workstation
Recently I wanted to use Eclipse to develop Python clients for Hadoop services such as HDFS and HBase. However, I found there is no way to build a "pydoop" client on a Windows machine, so we have to use CentOS as our regular workstation. Although I have experience using a Mac for daily work (email, browser, documentation), workstation concerns are quite different from server management and office work: the distribution has to be configured for software development.
For a smoother transition, I had to leave some work on the Windows workstation, so the interaction between the new CentOS workstation and the original Windows machine matters:
1. Install an RDP client on CentOS:
[root@new]# yum install xfreerdp
Then you can use the command below to connect to your original Windows workstation:
[root@new]# xfreerdp --plugin cliprdr -d [domain] -u [username] -g [w]x[h] 192.x.x.x
2. Install xrdp as an RDP server for Windows clients. This part requires the extra software repository EPEL so that the xrdp package can be installed with yum:
[root@new]# wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
[root@new]# rpm -ivh epel-release-6-8.noarch.rpm
Then refresh your yum repository list:
[root@new]# yum repolist
================
epel Extra Packages for Enterprise Linux 6 - x86_64 11,105
================
Now you can install xrdp and a VNC server:
[root@new]# yum install xrdp tigervnc-server
[root@new]# service xrdp start
Finally, make the service start automatically after a reboot:
[root@new]# chkconfig xrdp on
3. With EPEL enabled, you can also install ntfs-3g for NTFS disk access:
[root@new]# yum install ntfs-3g
[root@new]# vim /etc/fstab
/dev/sda2 /mnt/win_d ntfs-3g rw,umask=0000,defaults 0 0
4. Install Samba to make transferring your documents easier.
5. Change the mouse scrolling to natural (Mac-style), which is more convenient for daily work:
[root@new]# xmodmap -e "pointer = 1 2 3 5 4 7 6 8 9 10"
Wednesday, September 10, 2014
An XRDP Bug after Restarting the Service
I have installed XRDP on CentOS for Hadoop Java development, and Eclipse on CentOS needs an RDP connection into a GNOME desktop for its GUI. However, I need every connection to reach the same session so I can continue from my latest progress, so I added a static port to the session config like this:
[xrdp2]
name=sesman-Xvnc-5910
lib=libvnc.so
username=ask
password=ask
ip=127.0.0.1
port=5910
However, after I restarted the machine, the sesman session showed a connection error. After several tries I found that XRDP requires an initial session before you can connect to a session by port: when no session exists on port 5910 in the XRDP service, the connection above fails. We have to leave a config entry like this:
[xrdp1]
name=sesman-Xvnc-New
lib=libvnc.so
username=ask
password=ask
ip=127.0.0.1
port=-1
This config starts the first session on port 5910 (the default). After that you can successfully reach the session on port 5910 via the [xrdp2] entry and get back into the session you left before the service restart.
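If you are not sure whether that initial session actually exists yet, a quick TCP probe against the VNC port answers it before you waste an RDP attempt. A minimal sketch (5910 matches the static port assigned in [xrdp2] above):

import socket

def session_listening(port, host="127.0.0.1"):
    # Return True if something is accepting connections on the given port.
    try:
        sock = socket.create_connection((host, port), timeout=2)
        sock.close()
        return True
    except OSError:
        return False

print("session on 5910: %s" % session_listening(5910))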
There is a way to get your desktop session on localhost:
[xrdp0]
name=sesman-Xvnc-Local
lib=libvnc.so
username=ask
password=ask
ip=127.0.0.1
port=5900
Before restarting xrdp, you should go to [System]->[Preferences]->[Remote Desktop] and enable [Allow other users to control your desktop]. The xrdp0 entry will then show the local desktop you are working on directly.
Tuesday, June 10, 2014
The Fundamental Design for Service Applications
Here, "Service Application" refers to background daemons such as request-handling servers and flow processors. In the cloud era, every service application should take these three features into account in its fundamental design:
1. High Availability: HA has three design levels based on complexity and cost. The lowest level is none; the basic level is Active-Standby; the highest level is Capacity Impact, where a node failure only reduces capacity. Reaching the highest level means your system can scale out in some fashion. However, we have seen terrible designs that do not control database transactions and lock logic well; they make the application incapable of scaling out and limit your options to an Active-Standby design. Well-trained developers should all be able to implement applications that scale out.
2. Application Resilience: any application can suffer a process crash or an unexpected machine reboot. There should be a design that lets the service restart resiliently, or terminate gracefully and be started by another application; see the supervisor sketch after this list. This is not only about exception handling but about service continuity. Fortunately, most modern OSes ship basic tools to provide this for your application.
3. Log Notification: this is a must-have, but each team has its own implementation.
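As an illustration of point 2, the simplest form of resilience is a supervisor loop that restarts the worker process whenever it dies; this is essentially what OS facilities such as init scripts or dedicated process supervisors provide. A minimal sketch, where worker.py is a hypothetical service process:

import subprocess
import time

WORKER_CMD = ["python", "worker.py"]  # hypothetical service process
RESTART_DELAY = 5                     # seconds to back off before restarting

while True:
    proc = subprocess.Popen(WORKER_CMD)
    code = proc.wait()  # block until the worker exits or crashes
    print("worker exited with code %d, restarting in %ds"
          % (code, RESTART_DELAY))
    time.sleep(RESTART_DELAY)  # prevent a crash loop from spinning hot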
Monday, May 12, 2014
QT Creator 5 and Boost Framework on Windows
C++ is a language that brings you the benefits of cross-platform support and better performance. When we need to compile C++ for different platforms, we usually have to choose different IDEs for Windows and Linux. However, since Qt supports the Microsoft Visual C++ (msvc) tool chain, we found Qt can be a neat solution when we need one IDE that supports our project across platforms.
On Windows we use Boost as the framework for our C++ development (msvc 11 supports the C++11 standard, which is quite handy because we don't have to build gcc 4.8 on the MinGW platform; for developers who already have code running on MinGW, I believe it is just as handy to integrate Qt with the MinGW tool chain). So the first thing is to download boost_1_55_0 into C:\.
It is really easy to get Boost installed. Go to C:\boost_1_55_0 and run the bootstrap.bat batch file; you will get a b2.exe file as your build tool. Then run b2.exe (you can use b2 --help to find the command arguments for customizing the installation). The freshly built DLLs will be under C:\boost_1_55_0\stage\lib and the header files under C:\boost_1_55_0\boost. The whole build should run in a Visual Studio 2012 shell, or one for another compiler that supports the C++11 standard; you can get such a shell via "VS2012 x64 Native Tools Command Prompt".
Hence, please add the following instructions to the .pro file of any Qt project that should include Boost:
INCLUDEPATH += C:\boost_1_55_0
LIBS += -LC:\boost_1_55_0\stage\lib\
Then you can use #include <boost\asio.hpp> to check that the compiler finds it.
That covers the framework include. If you want to include an external library such as a logger or a database client, you can right-click the project icon in the left tree view and add an external library (this requires specifying a DLL).
For the debugger in Qt Creator, you have to install cdb.exe from the WDK:
http://msdn.microsoft.com/en-us/windows/hardware/hh852365.aspx
If you use the Visual Studio 2012 tool chain, you should install WDK 8.0.