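HDFS: The Hadoop Distributed File System

HDFS is the distributed file system that ships with Hadoop. It runs on commodity hardware, relies on replication for fault tolerance, and suits write-once, read-many workloads.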

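HDFS design

Data blocks

To keep seek overhead small relative to transfer time, HDFS splits files into blocks for storage. The default block size is 128 MB.

Unlike a disk block, an HDFS block is not a fixed allocation: a file smaller than one block does not occupy a full block's worth of underlying storage.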
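The block arithmetic can be sketched in a few lines. This is a minimal illustration in plain Java with no Hadoop calls; the class name BlockMath and the 300 MB file size are made up for the example, and only the 128 MB block size comes from the text.

```java
// Sketch: how a file maps onto fixed-size blocks (plain arithmetic, no Hadoop API).
final class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB default from the text

    /** Number of blocks needed for a file of the given length in bytes. */
    static long blockCount(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    /** Bytes actually stored in the last block; it may be shorter than BLOCK_SIZE. */
    static long lastBlockLength(long fileLength) {
        if (fileLength == 0) {
            return 0;
        }
        long remainder = fileLength % BLOCK_SIZE;
        return remainder == 0 ? BLOCK_SIZE : remainder;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long fileLength = 300 * mb; // hypothetical 300 MB file
        System.out.println(blockCount(fileLength));           // 3
        System.out.println(lastBlockLength(fileLength) / mb); // 44
    }
}
```

The last block holds only 44 MB, which is the point made above: a short block takes up only the space its data needs.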

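NameNode and DataNode

HDFS nodes are divided into a management node (NameNode) and data nodes (DataNodes).

DataNodes store file contents on disk in the form of blocks.

The NameNode records which blocks make up each file and which DataNodes hold them (the metadata). This metadata is kept in memory.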
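To make the metadata concrete, here is a toy model of the two lookups the NameNode serves. It is a sketch only: ToyNameNode, addBlock, locate, and the node names dn1 to dn4 are hypothetical and do not correspond to Hadoop classes or methods.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of NameNode metadata; names are hypothetical, not Hadoop's.
final class ToyNameNode {
    // file path -> ordered list of block IDs
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // block ID -> DataNodes holding a replica of that block
    private final Map<Long, List<String>> blockToNodes = new HashMap<>();

    void addBlock(String path, long blockId, List<String> dataNodes) {
        fileToBlocks.computeIfAbsent(path, p -> new ArrayList<>()).add(blockId);
        blockToNodes.put(blockId, new ArrayList<>(dataNodes));
    }

    /** For each block of the file, in order, the DataNodes a reader could contact. */
    List<List<String>> locate(String path) {
        List<List<String>> result = new ArrayList<>();
        for (long blockId : fileToBlocks.getOrDefault(path, List.of())) {
            result.add(blockToNodes.get(blockId));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.addBlock("/user/ckckgo/quangle.txt", 1L, List.of("dn1", "dn2", "dn3"));
        nn.addBlock("/user/ckckgo/quangle.txt", 2L, List.of("dn2", "dn3", "dn4"));
        System.out.println(nn.locate("/user/ckckgo/quangle.txt"));
        // [[dn1, dn2, dn3], [dn2, dn3, dn4]]
    }
}
```

Both maps live in memory in this sketch, mirroring the statement above that the metadata is held in the NameNode's memory.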

Data replication

To keep data available on commodity hardware, HDFS stores each block on several nodes at once (3 replicas by default).
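The trade-off behind replication can be sketched as follows: a block stays readable as long as one replica survives, at the cost of multiplying raw storage. The class ReplicaCheck and the node names are hypothetical, and the sketch says nothing about how HDFS actually chooses where to place replicas.

```java
import java.util.List;
import java.util.Set;

// Sketch of what replication buys and what it costs; not Hadoop code.
final class ReplicaCheck {
    /** A block stays readable while at least one replica sits on a live node. */
    static boolean readable(List<String> replicaNodes, Set<String> failedNodes) {
        return replicaNodes.stream().anyMatch(node -> !failedNodes.contains(node));
    }

    /** Raw bytes consumed across the cluster by one file. */
    static long rawBytes(long fileLength, int replication) {
        return fileLength * replication;
    }

    public static void main(String[] args) {
        List<String> replicas = List.of("dn1", "dn2", "dn3"); // hypothetical node names
        System.out.println(readable(replicas, Set.of("dn1", "dn2")));        // true
        System.out.println(readable(replicas, Set.of("dn1", "dn2", "dn3"))); // false
        System.out.println(rawBytes(100, 3));                                 // 300
    }
}
```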

NameNode high availability

The NameNode is a single point of failure (SPOF). To keep the cluster highly available, Hadoop 2 added support for an active-standby NameNode pair.

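Metadata persistence

The active and standby NameNodes stay in sync through the FSImage and the EditLog. The FSImage is a persisted checkpoint of the metadata, and the EditLog is a persisted log of the operations applied since that checkpoint.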
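The relationship between the two can be sketched as "checkpoint plus replay": load the last snapshot, then apply the logged operations in order. This is a toy model under simplifying assumptions; ToyEditLog, Edit, and the CREATE/DELETE operation names are hypothetical and far simpler than the real on-disk formats.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy checkpoint-plus-replay model; names and operations are hypothetical.
final class ToyEditLog {
    /** One logged namespace change; op is "CREATE" or "DELETE" in this toy model. */
    static final class Edit {
        final String op;
        final String path;

        Edit(String op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    /** Rebuild the namespace: start from the checkpoint, then replay edits in order. */
    static Set<String> recover(Set<String> fsImage, List<Edit> editLog) {
        Set<String> namespace = new TreeSet<>(fsImage); // sorted for stable output
        for (Edit edit : editLog) {
            if (edit.op.equals("CREATE")) {
                namespace.add(edit.path);
            } else if (edit.op.equals("DELETE")) {
                namespace.remove(edit.path);
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        Set<String> fsImage = Set.of("/a.txt", "/b.txt");
        List<Edit> editLog = List.of(new Edit("CREATE", "/c.txt"), new Edit("DELETE", "/a.txt"));
        System.out.println(recover(fsImage, editLog)); // [/b.txt, /c.txt]
    }
}
```

A standby that replays the same log on top of the same checkpoint arrives at the same namespace, which is the sense in which the two files keep the pair in sync.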

HDFS command-line operations

Prerequisites:

Hadoop is installed and configured
HDFS has been initialized (formatted)

Starting HDFS

cd $HADOOP_HOME
./sbin/start-dfs.sh

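Common file commands

# Create a directory
$ hdfs dfs -mkdir -p /user/ckckgo

# Upload a local file; similar commands include put and cp
$ hdfs dfs -copyFromLocal quangle.txt /user/ckckgo/quangle.txt

# List the directory contents
$ hdfs dfs -ls /user/ckckgo

# Print the file contents
$ hdfs dfs -cat /user/ckckgo/quangle.txt

# Download the file to the local disk; similar commands include get and cp
$ hdfs dfs -copyToLocal /user/ckckgo/quangle.txt quangle2.txt

# Check that the original and the downloaded file have the same MD5 hash
$ md5 quangle.txt quangle2.txt

# Delete the file
$ hdfs dfs -rm /user/ckckgo/quangle.txt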

Java API

Project dependencies

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.3.4</version>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>3.3.4</version>
</dependency>

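Creating and writing a file

package com.ckckgo;

import java.io.OutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateFile {
    private static final String CONTENT = "面朝大海,\n春暖花开";

    public static void main(String[] args) throws Exception {
        String dst = args[0]; // full HDFS URI of the file to create

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // try-with-resources closes the stream even if the write fails
        try (OutputStream out = fs.create(new Path(dst))) {
            out.write(CONTENT.getBytes(StandardCharsets.UTF_8));
        }
    }
}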

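Listing a directory

package com.ckckgo;

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListFiles {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // the first argument selects the file system
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // Every argument is treated as a path to list
        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }

        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path path : listedPaths) {
            System.out.println(path);
        }
    }
}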

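Reading a file

package com.ckckgo;

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFile {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // full HDFS URI of the file to print
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            // Copy to stdout in 4 KB chunks; false leaves System.out open
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}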

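Deleting a file

package com.ckckgo;

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteFile {
    public static void main(String[] args) throws Exception {
        String target = args[0]; // full HDFS URI of the file or directory to delete
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(target), conf);

        // true = recursive, so a non-empty directory is removed as well
        boolean deleted = fs.delete(new Path(target), true);
        System.out.println(deleted ? "success" : "failure");
    }
}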

Running and verifying

# Point HADOOP_CLASSPATH at the built jar
export HADOOP_CLASSPATH=./target/hdfs-1.0-SNAPSHOT.jar

# Write the file
hadoop com.ckckgo.CreateFile hdfs://localhost:9000/user/ckckgo/poem/haizi.txt

# Read the file
hadoop com.ckckgo.CatFile hdfs://localhost:9000/user/ckckgo/poem/haizi.txt

# List the files in the directory
hadoop com.ckckgo.ListFiles hdfs://localhost:9000/user/ckckgo/poem

# Delete the directory
hadoop com.ckckgo.DeleteFile hdfs://localhost:9000/user/ckckgo/poem

Problems and solutions

Exception in thread "main" java.net.ConnectException: Call From iMac-lan/192.168.1.6 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

In this setup the NameNode listens on port 9000, as set by fs.defaultFS (formerly fs.default.name) in $HADOOP_HOME/etc/hadoop/core-site.xml, while the client library falls back to port 8020 when the URI does not specify a port. Passing the full HDFS URI, including the port, resolves the problem.
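The port fallback can be illustrated with the standard java.net.URI class alone. This is a simplified model of the behavior, not Hadoop's own code: the class NameNodePort and the method effectivePort are hypothetical, and 8020 is taken from the error message above.

```java
import java.net.URI;

// Simplified model of the client-side port fallback; not Hadoop's implementation.
final class NameNodePort {
    static final int DEFAULT_RPC_PORT = 8020; // the port seen in the error message

    /** Port a client would dial: the explicit one if present, otherwise the default. */
    static int effectivePort(String uri) {
        int port = URI.create(uri).getPort(); // -1 when the URI carries no port
        return port == -1 ? DEFAULT_RPC_PORT : port;
    }

    public static void main(String[] args) {
        System.out.println(effectivePort("hdfs://localhost/user/ckckgo/poem"));      // 8020
        System.out.println(effectivePort("hdfs://localhost:9000/user/ckckgo/poem")); // 9000
    }
}
```

With the port omitted, the client ends up at 8020, where nothing is listening in this setup, hence the refused connection.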