程式扎記: [ Deep Dive into Cloud Computing ] HDFS: The Open-Source GFS - Introduction to the HDFS API

Sunday, November 10, 2013

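Preface:
In Hadoop, the classes for file operations mostly live in the org.apache.hadoop.fs package. These APIs support operations such as opening, reading, writing, and deleting files. The class that users ultimately work with in the Hadoop library is FileSystem. It is an abstract class, so a concrete implementation can only be obtained through its static get method. The usual workflow is:

1. Obtain a Configuration object
2. Obtain a FileSystem object
3. Perform file operations.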
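
To make the "abstract class plus static get" idea more concrete, here is a toy sketch in plain Java. It is not Hadoop code: ToyFileSystem, ToyHdfs, and ToyLocal are hypothetical names, and a plain Map stands in for Configuration. It assumes the implementation is chosen from the URI scheme stored under the fs.default.name key, which is roughly how Hadoop 1.x decides which file system to return.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for org.apache.hadoop.fs.FileSystem (hypothetical, not Hadoop code):
// callers never instantiate a subclass directly, they go through get().
abstract class ToyFileSystem {
    abstract String scheme();

    // Picks an implementation from the URI scheme found under "fs.default.name".
    // Falls back to the local file system when the key is absent.
    static ToyFileSystem get(Map<String, String> conf) {
        String uri = conf.getOrDefault("fs.default.name", "file:///");
        String scheme = URI.create(uri).getScheme();
        if ("hdfs".equals(scheme)) return new ToyHdfs();
        if ("file".equals(scheme)) return new ToyLocal();
        throw new IllegalArgumentException("No file system for scheme: " + scheme);
    }
}

final class ToyHdfs extends ToyFileSystem {
    @Override String scheme() { return "hdfs"; }
}

final class ToyLocal extends ToyFileSystem {
    @Override String scheme() { return "file"; }
}

class FactoryDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();           // step 1: configuration
        System.out.println(ToyFileSystem.get(conf).scheme()); // step 2: no key set, local fallback
        conf.put("fs.default.name", "hdfs://ubuntun:9000");
        System.out.println(ToyFileSystem.get(conf).scheme()); // same call, HDFS implementation
    }
}
```

The point of the pattern is that the calling code stays the same whether it talks to the local disk or to a cluster; only the configuration changes.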

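API usage examples:
The following is a brief introduction to using the HDFS API.

Uploading a local file
FileSystem.copyFromLocalFile() uploads a local file to a given location on HDFS, where src and dst are both file paths. Example code: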

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FileStatus;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
  
public class CopyFile {  
  
    /** 
     * BD: Upload a local file to the given HDFS path.
     * @param args 
     */  
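    public static void main(String[] args) throws Exception{
        if(args.length<2)
        {
            System.out.printf("\t[Info] Please give argu1 as 'Local file path'; argu2 as 'HDFS destination path'!\n");
            return;
        }
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        // Local source file
        Path src = new Path(args[0]);
        // Destination path on HDFS
        Path dst = new Path(args[1]);

        // Upload through the FileSystem API: copyFromLocalFile(src, dst)
        hdfs.copyFromLocalFile(src, dst);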
        System.out.printf("\t[Info] Upload to '%s'...\n", dst);  
        FileStatus[] files = hdfs.listStatus(dst);  
        for(FileStatus file:files)  
        {  
            System.out.printf("\t%s\n", file.getPath());  
        }  
    }  
}  

After packaging it into MRTest.jar and copying it to the NameNode, run it as follows:

# Now on the NameNode
$ hadoop jar MRTest.jar demo.hdfs.CopyFile ./test.sh /input # Copy the local test.sh to the HDFS path "/input".
[Info] Upload to '/input'...
hdfs://ubuntun:9000/input/f_01
hdfs://ubuntun:9000/input/test.sh
$ hadoop fs -ls /input
Found 2 items
-rw-r--r-- 3 john supergroup 22 2013-11-03 22:18 /input/f_01
-rw-r--r-- 3 john supergroup 47 2013-11-04 22:01 /input/test.sh
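
In the run above, dst was the existing directory /input and the file ended up at /input/test.sh, keeping its source name. The plain-Java sketch below only illustrates that naming rule with strings. UploadTargetSketch and resolveTarget are hypothetical helpers, not part of Hadoop, and the caller has to say whether dst is a directory (real code would ask the file system).

```java
class UploadTargetSketch {
    // Returns the path the uploaded file is expected to end up at.
    // dstIsDirectory is supplied by the caller; this sketch does not touch any file system.
    static String resolveTarget(String src, String dst, boolean dstIsDirectory) {
        if (!dstIsDirectory) {
            return dst; // dst names the target file itself
        }
        // Keep only the last path component of src, e.g. "./test.sh" -> "test.sh"
        String name = src.substring(src.lastIndexOf('/') + 1);
        return dst.endsWith("/") ? dst + name : dst + "/" + name;
    }

    public static void main(String[] args) {
        System.out.println(resolveTarget("./test.sh", "/input", true));          // /input/test.sh
        System.out.println(resolveTarget("./test.sh", "/input/copy.sh", false)); // /input/copy.sh
    }
}
```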

Creating an HDFS file
FileSystem.create() creates a file on HDFS. Example code:

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FSDataOutputStream;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
  
public class CreateFile {  
  
    /** 
     * BD: Create a file on HDFS and write content into it.
     * @param args 
     */  
    public static void main(String[] args) throws Exception{  
        if(args.length==0)  
        {  
            System.out.printf("\t[Info] Please give argu1 as 'Create file path'; argu2 as 'Content to write'!\n");  
            return;  
        }  
          
        Configuration conf = new Configuration();         
        FileSystem hdfs = FileSystem.get(conf);  
          
        // Path of the file to create
        Path dstFile = new Path(args[0]);
        FSDataOutputStream outputStream = hdfs.create(dstFile);

        try
        {
            // Write the content, if any was given
            if(args.length>1) outputStream.write(args[1].getBytes());
        }
        finally
        {
            // Always close the stream, even when nothing was written
            outputStream.close();
        }
    }  
}  

After packaging it into MRTest.jar and copying it to the NameNode, run it as follows:

$ hadoop jar MRTest.jar demo.hdfs.CreateFile /input/abc.txt "HelloWorld" # Write "HelloWorld" to "/input/abc.txt" on HDFS
$ hadoop fs -ls /input
Found 3 items
-rw-r--r-- 3 john supergroup 10 2013-11-04 22:14 /input/abc.txt
-rw-r--r-- 3 john supergroup 22 2013-11-03 22:18 /input/f_01
-rw-r--r-- 3 john supergroup 47 2013-11-04 22:01 /input/test.sh
$ hadoop fs -cat /input/abc.txt # Show the content of abc.txt
HelloWorld
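
The listing above shows abc.txt as 10 bytes, one byte per character of "HelloWorld". The example calls getBytes() without an argument, which uses the JVM's default charset, so non-ASCII content could produce different sizes on different machines. The sketch below pins the encoding to UTF-8 and closes the stream with try-with-resources. It is plain Java: WriteSketch and writeUtf8 are hypothetical names, and a ByteArrayOutputStream stands in for the stream returned by FileSystem.create().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WriteSketch {
    // Encodes content as UTF-8, writes it, and returns the number of bytes written.
    // try-with-resources closes the stream whether or not the write succeeds.
    static int writeUtf8(OutputStream out, String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        try (OutputStream o = out) {
            o.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.length;
    }

    public static void main(String[] args) {
        // ASCII text: one byte per character
        System.out.println(writeUtf8(new ByteArrayOutputStream(), "HelloWorld"));
        // A single CJK character takes three bytes in UTF-8
        System.out.println(writeUtf8(new ByteArrayOutputStream(), "雲"));
    }
}
```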

Creating an HDFS directory
FileSystem.mkdirs(Path f) creates a directory on HDFS, where f is the directory path. Example code:

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
  
public class CreateDir {  
  
    /** 
     * BD: Create a directory on HDFS.
     * @param args 
     */  
    public static void main(String[] args) throws Exception{  
        if(args.length==0)  
        {  
            System.out.printf("\t[Info] Please give argu1 as 'Create Directory Path'!\n");  
            return;  
        }  
        Configuration conf = new Configuration();         
        FileSystem hdfs = FileSystem.get(conf);  
  
        // Path of the directory to create
        Path dirPath = new Path(args[0]);  
        hdfs.mkdirs(dirPath);  
    }  
}  

After packaging it into MRTest.jar and copying it to the NameNode, run it as follows:

$ hadoop jar MRTest.jar demo.hdfs.CreateDir /abc # Create the directory "/abc" on HDFS
$ hadoop fs -ls /
Found 5 items
drwxr-xr-x - john supergroup 0 2013-11-04 22:23 /abc
drwxr-xr-x - john supergroup 0 2013-11-02 04:07 /home
...

Renaming an HDFS file
FileSystem.rename(Path src, Path dst) renames a file that already exists on HDFS. Example code:

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
  
public class Rename {  
  
    /** 
     * BD: Rename an existing file on HDFS.
     * @param args
     */
    public static void main(String[] args) throws Exception{
        if(args.length<2)
        {  
            System.out.printf("\t[Info] Please give argu1 as 'Source File Path'; argu2 as 'New File Path'!\n");  
            return;  
        }  
        Configuration conf = new Configuration();         
        FileSystem hdfs = FileSystem.get(conf);  
  
        Path srcFile = new Path(args[0]);  
        Path newFile = new Path(args[1]);  
          
        boolean isDone = hdfs.rename(srcFile, newFile);  
        System.out.printf("\t[Info] '%s' being renamed to '%s'...%s!\n", srcFile, newFile, isDone?"Done":"Fail");  
    }  
}  

After packaging it into MRTest.jar and copying it to the NameNode, run it as follows:

$ hadoop jar MRTest.jar demo.hdfs.Rename /input/abc.txt /input/aaa.txt # Rename "/input/abc.txt" on HDFS to "/input/aaa.txt"
[Info] '/input/abc.txt' being renamed to '/input/aaa.txt'...Done!
$ hadoop fs -ls /input
Found 3 items
-rw-r--r-- 3 john supergroup 10 2013-11-04 22:14 /input/aaa.txt

Checking whether an HDFS file exists
FileSystem.exists(Path path) checks whether the given path exists on HDFS. Example code:

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
  
public class Exists {  
  
    /** 
     * BD: Check whether the given HDFS path(s) exist.
     * @param args 
     */  
    public static void main(String[] args) throws Exception{  
        if(args.length==0)  
        {  
            System.out.printf("\t[Info] Please give file path(s)!\n");  
            return;  
        }  
  
        Configuration conf = new Configuration();         
        FileSystem hdfs = FileSystem.get(conf);       
        Path path=null;  
        for(int i=0;i<args.length;i++)
        {  
            path = new Path(args[i]);  
            System.out.printf("\t%s exist?...%s\n", path, hdfs.exists(path)?"Yes":"No");  
        }  
    }  
}  

After packaging it into MRTest.jar and copying it to the NameNode, run it as follows:

$ hadoop jar MRTest.jar demo.hdfs.Exists /input /input/f_01 /abc
/input exist?...Yes
/input/f_01 exist?...Yes
/abc exist?...No

Deleting a file on HDFS
FileSystem.delete(Path f, boolean recursive) deletes the given HDFS file or directory. Example code:

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FileStatus;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
  
public class DeleteFile {  
  
    /** 
     * BD: Delete the given HDFS file(s).
     * @param args 
     */  
    public static void main(String[] args) throws Exception{  
        if(args.length==0)  
        {  
            System.out.printf("\t[Info] Please give file path(s) to be deleted!\n");  
            return;  
        }  
  
        Configuration conf = new Configuration();         
        FileSystem hdfs = FileSystem.get(conf);       
        Path path=null;  
        for(int i=0;i<args.length;i++)
        {  
            path = new Path(args[i]);  
            try  
            {  
                FileStatus fs = hdfs.getFileStatus(path);  
                if(fs!=null)  
                {  
                    if(fs.isDir()) System.out.printf("\tDelete %s...%s\n", path, hdfs.delete(path, true)?"Done":"Fail");  
                    else System.out.printf("\tDelete %s...%s\n", path, hdfs.delete(path, false)?"Done":"Fail");  
                }                 
            }  
            catch(java.io.FileNotFoundException e)  
            {  
                System.err.printf("\t%s doesn't exist!\n", path);  
            }  
        }  
    }  
}  

After packaging it into MRTest.jar and copying it to the NameNode, run it as follows:

$ hadoop jar MRTest.jar demo.hdfs.DeleteFile /input/aaa.txt /abc /aaa # Delete "/input/aaa.txt", "/abc", and "/aaa" on HDFS
Delete /input/aaa.txt...Done
Delete /abc...Done
/aaa doesn't exist!
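
The example passes recursive=true for directories and recursive=false for plain files. To show what the flag is for, here is a local file system analogy written with java.nio.file only. It is not Hadoop code, and LocalDeleteSketch with its methods is hypothetical. It assumes the same contract as described for FileSystem.delete: a missing path yields false, and a non-empty directory is refused unless recursive is set.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

class LocalDeleteSketch {
    // Local analogy of delete(path, recursive): returns false when path is missing,
    // fails on a non-empty directory unless recursive is true.
    static boolean delete(Path path, boolean recursive) {
        try {
            if (!Files.exists(path)) return false;
            if (!recursive) {
                Files.delete(path); // raises an IOException for a non-empty directory
                return true;
            }
            try (Stream<Path> walk = Files.walk(path)) {
                // Reverse order so that children are removed before their parents
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a temporary directory holding one file, used to try the method out.
    static Path makeSampleDir() {
        try {
            Path dir = Files.createTempDirectory("delete-sketch");
            Files.createFile(dir.resolve("a.txt"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = makeSampleDir();
        try {
            delete(dir, false);
        } catch (UncheckedIOException e) {
            System.out.println("Non-recursive delete refused: directory is not empty");
        }
        System.out.println("Recursive delete: " + (delete(dir, true) ? "Done" : "Fail"));
    }
}
```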

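Listing all files under an HDFS directory
FileSystem.getFileStatus(Path f) returns the status of a single path, and FileSystem.listStatus(Path f) returns the status of every entry in a directory. The file path can then be read from each status object. Example code: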

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FileStatus;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
  
public class ListFile {  
  
    /** 
     * BD: List files if path is directory; 
     * @param args 
     */  
    public static void main(String[] args) throws Exception{  
        if(args.length==0)  
        {  
            System.out.printf("\t[Info] Please give file path(s)!\n");  
            return;  
        }  
  
        Configuration conf = new Configuration();         
        FileSystem hdfs = FileSystem.get(conf);       
        Path path=null;  
        for(int i=0;i<args.length;i++)
        {  
            path = new Path(args[i]);  
            try  
            {  
                FileStatus fs = hdfs.getFileStatus(path);  
                if(fs.isDir())  
                {  
                    System.out.printf("\t%s is Directory:\n", path);  
                    FileStatus sfs[] = hdfs.listStatus(path);  
                    for(FileStatus f:sfs)  
                    {  
                        System.out.printf("\t\t%s %s\n", f.getPath(), f.isDir()?"(d)":"");  
                    }  
                }  
                else  
                {  
                    System.out.printf("\t%s is file\n", path);  
                }  
            }  
            catch(java.io.FileNotFoundException e)  
            {  
                System.err.printf("\t%s: %s\n", path, e);  
            }  
        }  
    }  
}  

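After packaging ListFile into MRTest.jar and copying it to the NameNode, it can be run with hadoop jar in the same way as the earlier examples.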

Finding where a file is located in the HDFS cluster
FileSystem.getFileBlockLocations(FileStatus file, long start, long len) reports where the blocks of a given file are stored in the HDFS cluster. Example code:

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.BlockLocation;  
import org.apache.hadoop.fs.FileStatus;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.fs.Path;  
  
public class FileLoc {  
    /** 
     * BD: Search File Block(s) in HDFS. 
     * @param args 
     */  
    public static void main(String args[]) throws Exception  
    {  
        if(args.length==0)  
        {  
            System.out.printf("\t[Info] Please give file path(s)!\n");  
            return;  
        }  
          
        Configuration conf = new Configuration();         
        FileSystem hdfs = FileSystem.get(conf);       
        Path path=null;  
        for(int i=0;i<args.length;i++)
        {  
            path = new Path(args[i]);  
            try  
            {  
                FileStatus fs = hdfs.getFileStatus(path);  
                if(!fs.isDir())  
                {  
                    BlockLocation[] blocks = hdfs.getFileBlockLocations(fs, 0, fs.getLen());  
                    System.out.printf("\t'%s' block loc:\n", path);  
                    for(BlockLocation b:blocks)  
                    {  
                        System.out.printf("\t\t%s\n", b.getHosts()[0]);  
                    }  
                    System.out.println();  
                }  
                else System.out.printf("\tSkip directory (%s)...\n", path);  
            }  
            catch(java.io.FileNotFoundException e)  
            {  
                System.err.printf("\t%s: %s\n", path, e);  
            }  
        }             
    }  
}  
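
getFileBlockLocations takes a byte range (start, len), and FileLoc passes 0 and fs.getLen() to cover the whole file. As a rough worked example of how a byte range maps to block indices, here is a small arithmetic sketch. It is not how HDFS implements the call: BlockRangeSketch and overlappingBlocks are hypothetical, and the 64 MB block size is an assumption (the usual Hadoop 1.x default; a cluster may be configured differently).

```java
class BlockRangeSketch {
    // Returns {firstBlockIndex, lastBlockIndex} for the blocks that overlap the byte
    // range [start, start + len), or an empty array when the range misses the file.
    static long[] overlappingBlocks(long fileLen, long blockSize, long start, long len) {
        if (start < 0 || len <= 0 || start >= fileLen) {
            return new long[0];
        }
        long end = Math.min(start + len, fileLen); // exclusive end, clipped to the file
        long first = start / blockSize;
        long last = (end - 1) / blockSize;
        return new long[] { first, last };
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb; // assumed block size

        // A 150 MB file spans three blocks: 64 MB + 64 MB + 22 MB
        long[] whole = overlappingBlocks(150 * mb, blockSize, 0, 150 * mb);
        System.out.printf("whole file: blocks %d..%d%n", whole[0], whole[1]);

        // A single byte at offset 64 MB falls into the second block (index 1)
        long[] one = overlappingBlocks(150 * mb, blockSize, 64 * mb, 1);
        System.out.printf("one byte at 64 MB: blocks %d..%d%n", one[0], one[1]);
    }
}
```

Small files such as the 47-byte test.sh from the upload example fit in a single block, so only index 0 comes back for them.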

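After packaging FileLoc into MRTest.jar and copying it to the NameNode, it can be run with hadoop jar in the same way as the earlier examples.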

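Getting the names of all nodes in the HDFS cluster
DistributedFileSystem.getDataNodeStats() returns one DatanodeInfo per DataNode, and DatanodeInfo.getHostName() gives the host name of each one: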

package demo.hdfs;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.FileSystem;  
import org.apache.hadoop.hdfs.DistributedFileSystem;  
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;  
  
public class GetList {  
  
    /** 
     * BD: Get the names of all nodes in the HDFS cluster.
     * @param args 
     */  
    public static void main(String[] args) throws Exception{  
        Configuration conf = new  Configuration();  
        FileSystem fs = FileSystem.get(conf);  
        DistributedFileSystem hdfs = (DistributedFileSystem)fs;  
          
        DatanodeInfo dataNodeStatus[] = hdfs.getDataNodeStats();  
        for(int i=0;i<dataNodeStatus.length;i++)
            System.out.printf("DataNode_%d: Name=%s\n", i, dataNodeStatus[i].getHostName());  
    }  
}  

Execution example:

$ hadoop jar MRTest.jar demo.hdfs.GetList
DataNode_0: Name=ubuntud1
DataNode_1: Name=ubuntud2

Supplement:
* Submitting a Hadoop MapReduce job to a remote JobTracker

While messing around with MapReduce code, I’ve found it to be a bit tedious having to generate the jarfile, copy it to the machine running the JobTracker, and then run the job every time the job has been altered. I should be able to run my jobs directly from my development environment, as illustrated in the figure below.
