Here are the steps I use to copy files from an SFTP server to HDFS with "hadoop distcp":
- Clone hadoop-filesystem-sftp at https://github.com/wnagele/hadoop-filesystem-sftp.git
- hadoop-filesystem-sftp has a bug that can prevent distcp from running correctly when file names contain special characters that need escaping, e.g. ":". The fix is simple: at line 331 of SFTPFileSystem.java, URL-encode sftpFile.filename before building the Path.
    // Needs: import java.net.URLEncoder;
    for (SFTPv3DirectoryEntry sftpFile : sftpFiles) {
        // Encode the name so that characters such as ":" are escaped in the Path
        String filename = URLEncoder.encode(sftpFile.filename, "UTF-8");
        if (!"..".equals(filename) && !".".equals(filename))
            fileStats.add(getFileStatus(sftpFile.attributes, new Path(path, filename).makeQualified(this)));
    }
- Password authentication is the easiest to set up, but your SFTP password becomes visible to others because it is shown in the MapReduce job configuration.
- hadoop-filesystem-sftp uses ganymed-ssh-2, which only supports authentication by password or key file.
- To set up passwordless SSH, you need permission to log on to the SFTP server. Create an SSH key pair with "ssh-keygen", making sure you don't overwrite your current key pair in $HOME/.ssh:
    $ ssh-keygen -f ${distcp_ssh}/id_rsa
    $ ssh-keygen -F sftp-server-name -f ${distcp_ssh}/known_hosts
- Copy the public key to the SFTP server.
- Copy ${distcp_ssh} to all data nodes and to the client node. On the data nodes, make the directory readable only by the yarn user.
- On the client node, make the directory readable by the user who runs "hadoop distcp".
- The reason for this setup is that hadoop-filesystem-sftp uses ${user.home} as the default location for the key file and known_hosts. More importantly, on each data node the task runs as yarn, not as the user who issued the command. Unfortunately, this means that someone can write a MapReduce job that reads your id_rsa key file and gains access to the SFTP server.
- Run distcp:
    $ hadoop distcp \
        -D fs.sftp.user=username \
        -D fs.sftp.key.file=${distcp_ssh}/id_rsa \
        -D fs.sftp.knownhosts=${distcp_ssh}/known_hosts \
        -libjars hadoop-filesystem-sftp-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
        sftp://sftp-server/src-path hdfs://namenode/target-path
- WARNING: Don't use this method unless you have to.
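
To see why the file name fix in SFTPFileSystem.java matters, here is a small standalone sketch. It does not use Hadoop's Path class. As an assumption for illustration, java.net.URI stands in to show how a bare ":" in a name can be mistaken for a URI scheme separator. The class name FilenameEncodingDemo and the file name report:2015.csv are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FilenameEncodingDemo {

    // Encode a raw SFTP file name the same way the patched loop does.
    static String encode(String rawName) throws UnsupportedEncodingException {
        return URLEncoder.encode(rawName, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String rawName = "report:2015.csv"; // hypothetical file name containing ":"

        // Unencoded: the text before ":" is parsed as a URI scheme.
        URI raw = new URI(rawName);
        System.out.println("scheme of raw name:     " + raw.getScheme());

        // Encoded: ":" becomes %3A, so no scheme is detected any more.
        String encoded = encode(rawName);
        URI safe = new URI(encoded);
        System.out.println("encoded name:           " + encoded);
        System.out.println("scheme of encoded name: " + safe.getScheme());

        // The original name can be recovered by decoding.
        System.out.println("decoded name:           " + URLDecoder.decode(encoded, "UTF-8"));
    }
}
```

Note that URLEncoder performs form encoding, so a space in a file name is turned into "+" rather than "%20". Whether that matters for your file names is something to check before relying on the patch.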