K-means clustering on big data: running the R-prototyped experiment on Hadoop 2

Implementing the k-means clustering algorithm with Mahout

1. Prepare the data file: randomData.csv
2. Java program: KmeansHadoop.java
3. Run the program
4. Interpret the clustering results
5. Directories produced in HDFS
1). Prepare the data file: randomData.csv

The data file randomData.csv was generated by an R program using a random normal-distribution function; for the single-machine, in-memory experiment, see the earlier article.

Original data file (only a portion of the data is shown here):

~ vi datafile/randomData.csv

Note: Mahout's kmeans implementation uses " " (a space) as the default field separator, so I converted the comma-separated data file to a space-separated one.
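The article generates randomData.csv in R; as a rough equivalent, the sketch below (hypothetical: the cluster centers, point counts, and seed are my own assumptions, not taken from the article) samples 2-D points around three normally distributed centers and writes them space-separated, the delimiter Mahout's InputDriver expects.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

// Hypothetical stand-in for the R data generator.
public class GenRandomData {
    // Writes perCluster Gaussian points around each of three assumed centers.
    static void write(String path, int perCluster) throws IOException {
        double[][] centers = {{1, 1}, {8, 8}, {1, 8}}; // assumed centers
        Random rnd = new Random(42);                   // fixed seed for repeatability
        try (FileWriter out = new FileWriter(path)) {
            for (double[] c : centers) {
                for (int i = 0; i < perCluster; i++) {
                    // space-separated, the default delimiter for Mahout kmeans input
                    out.write((c[0] + rnd.nextGaussian()) + " "
                            + (c[1] + rnd.nextGaussian()) + "\n");
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        new File("datafile").mkdirs();
        write("datafile/randomData.csv", 334); // roughly 1000 points, as in the article
    }
}
```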
2). Java program: KmeansHadoop.java

For the kmeans algorithm itself, see Mahout in Action.

package org.conan.mymahout.cluster08;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.mahout.clustering.conversion.InputDriver;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.utils.clustering.ClusterDumper;

import org.conan.mymahout.hdfs.HdfsDAO;
import org.conan.mymahout.recommendation.ItemCFHadoop;

public class KmeansHadoop {

    private static final String HDFS = "hdfs://192.168.1.210:9000";

    public static void main(String[] args) throws Exception {
        String localFile = "datafile/randomData.csv";
        String inPath = HDFS + "/user/hdfs/mix_data";
        String seqFile = inPath + "/seqfile";
        String seeds = inPath + "/seeds";
        String outPath = inPath + "/result/";
        String clusteredPoints = outPath + "/clusteredPoints";

        JobConf conf = config();

        // Upload the local data file to HDFS
        HdfsDAO hdfs = new HdfsDAO(HDFS, conf);
        hdfs.rmr(inPath);
        hdfs.mkdirs(inPath);
        hdfs.copyFile(localFile, inPath);
        hdfs.ls(inPath);

        // Convert the text input into Mahout vectors in a sequence file
        InputDriver.runJob(new Path(inPath), new Path(seqFile),
                "org.apache.mahout.math.RandomAccessSparseVector");

        // Pick k random points as the initial cluster seeds
        int k = 3;
        Path seqFilePath = new Path(seqFile);
        Path clustersSeeds = new Path(seeds);
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        clustersSeeds = RandomSeedGenerator.buildRandom(conf, seqFilePath,
                clustersSeeds, k, measure);

        // Run kmeans: convergence delta 0.01, at most 10 iterations,
        // cluster the points afterwards, run as MapReduce (not sequentially)
        KMeansDriver.run(conf, seqFilePath, clustersSeeds, new Path(outPath),
                measure, 0.01, 10, true, 0.01, false);

        // Dump the final clusters to the console
        Path outGlobPath = new Path(outPath, "clusters-*-final");
        Path clusteredPointsPath = new Path(clusteredPoints);
        System.out.printf("Dumping out clusters from clusters: %s and clusteredPoints: %s\n",
                outGlobPath, clusteredPointsPath);
        ClusterDumper clusterDumper = new ClusterDumper(outGlobPath, clusteredPointsPath);
        clusterDumper.printClusters(null);
    }

    public static JobConf config() {
        JobConf conf = new JobConf(ItemCFHadoop.class);
        conf.setJobName("ItemCFHadoop");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");
        return conf;
    }
}
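Each MapReduce job that KMeansDriver launches performs one k-means iteration: the map phase assigns every point to its nearest centroid, and the reduce phase recomputes the centroids as the mean of their assigned points. As an illustration only (this is not Mahout's code), a minimal single-machine sketch of that step:

```java
// Illustrative in-memory version of one k-means iteration, the step
// that each KMeansDriver MapReduce job performs on the cluster.
public class KMeansStep {
    // Euclidean distance, as EuclideanDistanceMeasure computes it
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // One iteration: assign each point to its nearest centroid ("map"),
    // then recompute each centroid as the mean of its points ("reduce").
    static double[][] step(double[][] points, double[][] centroids) {
        int k = centroids.length, d = centroids[0].length;
        double[][] sums = new double[k][d];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++)
                if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
            counts[best]++;
            for (int i = 0; i < d; i++) sums[best][i] += p[i];
        }
        double[][] next = new double[k][d];
        for (int c = 0; c < k; c++)
            for (int i = 0; i < d; i++)
                next[c][i] = counts[c] == 0 ? centroids[c][i]   // keep empty cluster
                                            : sums[c][i] / counts[c];
        return next;
    }
}
```

Repeating step() until the centroids move less than the convergence delta (0.01 above), or maxIterations is reached, is what the successive clusters-N directories on HDFS record.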
3). Run the program. Console output:

hdfs://192.168.1.210:9000/user/hdfs/mix_data
hdfs://192.168.1.210:9000/user/hdfs/mix_data
copy from: datafile/randomData.csv to hdfs://192.168.1.210:9000/user/hdfs/mix_data
hdfs://192.168.1.210:9000/user/hdfs/mix_data
==========================================================
hdfs://192.168.1.210:9000/user/hdfs/mix_data/randomData.csv, folder: false, size: 36655
==========================================================
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
org.apache.hadoop.util.NativeCodeLoader
WARN: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARN: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
org.apache.hadoop.io.compress.snappy.LoadSnappy
WARN: Snappy native library not loaded
org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_ is done. And is in the process of commiting
org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_ is allowed to commit
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/seqfile
org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_' done.
org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 0%
org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
org.apache.hadoop.mapred.Counters log
INFO: Counters:
  File Output Format Counters
    Bytes Written=31390
  File Input Format Counters
    Bytes Read=36655
  FileSystemCounters
    FILE_BYTES_READ=475910
    HDFS_BYTES_READ=36655
    FILE_BYTES_WRITTEN=506350
    HDFS_BYTES_WRITTEN=68045
  Map-Reduce Framework
    Map input records=1000
    Spilled Records=0
    Total committed heap usage (bytes)=
    SPLIT_RAW_BYTES=124
    Map output records=1000
org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARN: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0002
org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb =
INFO: data buffer =
INFO: record buffer =
INFO: Starting flush of map output
org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill
org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0002_m_ is done. And is in the process of commiting
org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0002_m_' done.
org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
INFO: Down to the last merge-pass, with 1 segments left of total size: 623 bytes
org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0002_r_ is done. And is in the process of commiting
org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0002_r_ is allowed to commit
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0002_r_' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-1
org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0002_r_' done.
org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0002
org.apache.hadoop.mapred.Counters log
INFO: Counters:
  File Output Format Counters
    Bytes Written=695
  FileSystemCounters
    FILE_BYTES_READ=4239303
    HDFS_BYTES_READ=203963
    FILE_BYTES_WRITTEN=4457168
    HDFS_BYTES_WRITTEN=140321
  File Input Format Counters
    Bytes Read=31390
  Map-Reduce Framework
    Map output materialized bytes=627
    Map input records=1000
    Reduce shuffle bytes=0
    Spilled Records=6
    Map output bytes=612
    Total committed heap usage (bytes)=
    SPLIT_RAW_BYTES=130
    Combine input records=0
    Reduce input records=3
    Reduce input groups=3
    Combine output records=0
    Reduce output records=3
    Map output records=3
org.apache.hadoop.mapred.JobClient
copyAndConfigureFiles
警告: Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
listStatus
信息: Total input paths to
process : 1
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Running job:
job_local_0003
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: Starting flush of map
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
sortAndSpill
信息: Finished spill
org.apache.hadoop.mapred.Task done
Task:attempt_local_0003_m_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0003_m_' done.
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last
merge-pass, with 1 segments left of total size: 677
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task done
Task:attempt_local_0003_r_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task commit
信息: Task
attempt_local_0003_r_ is allowed to commit
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
commitTask
信息: Saved output of task
'attempt_local_0003_r_' to
hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-2
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
信息: reduce &
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0003_r_' done.
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息:& map
100% reduce 100%
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Job complete:
job_local_0003
org.apache.hadoop.mapred.Counters log
信息: Counters:
org.apache.hadoop.mapred.Counters log
信息:&& File Output Format
org.apache.hadoop.mapred.Counters log
Bytes Written=695
org.apache.hadoop.mapred.Counters log
FileSystemCounters
org.apache.hadoop.mapred.Counters log
FILE_BYTES_READ=7527467
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_READ=271193
org.apache.hadoop.mapred.Counters log
FILE_BYTES_WRITTEN=7901744
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_WRITTEN=142099
org.apache.hadoop.mapred.Counters log
信息:&& File Input Format
org.apache.hadoop.mapred.Counters log
Bytes Read=31390
org.apache.hadoop.mapred.Counters log
信息:&& Map-Reduce
org.apache.hadoop.mapred.Counters log
Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log
Map input records=1000
org.apache.hadoop.mapred.Counters log
Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log
Spilled Records=6
org.apache.hadoop.mapred.Counters log
Map output bytes=666
org.apache.hadoop.mapred.Counters log
Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log
SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log
Combine input records=0
org.apache.hadoop.mapred.Counters log
Reduce input records=3
org.apache.hadoop.mapred.Counters log
Reduce input groups=3
org.apache.hadoop.mapred.Counters log
Combine output records=0
org.apache.hadoop.mapred.Counters log
Reduce output records=3
org.apache.hadoop.mapred.Counters log
Map output records=3
org.apache.hadoop.mapred.JobClient
copyAndConfigureFiles
警告: Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
listStatus
信息: Total input paths to
process : 1
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Running job:
job_local_0004
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: Starting flush of map
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
sortAndSpill
信息: Finished spill
org.apache.hadoop.mapred.Task done
Task:attempt_local_0004_m_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0004_m_' done.
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last
merge-pass, with 1 segments left of total size: 677
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task done
Task:attempt_local_0004_r_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task commit
信息: Task
attempt_local_0004_r_ is allowed to commit
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
commitTask
信息: Saved output of task
'attempt_local_0004_r_' to
hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-3
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
信息: reduce &
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0004_r_' done.
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息:& map
100% reduce 100%
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Job complete:
job_local_0004
org.apache.hadoop.mapred.Counters log
信息: Counters:
org.apache.hadoop.mapred.Counters log
信息:&& File Output Format
org.apache.hadoop.mapred.Counters log
Bytes Written=695
org.apache.hadoop.mapred.Counters log
FileSystemCounters
org.apache.hadoop.mapred.Counters log
FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_READ=338143
org.apache.hadoop.mapred.Counters log
&&&&FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_WRITTEN=143877
org.apache.hadoop.mapred.Counters log
信息:&& File Input Format
org.apache.hadoop.mapred.Counters log
Bytes Read=31390
org.apache.hadoop.mapred.Counters log
信息:&& Map-Reduce
org.apache.hadoop.mapred.Counters log
Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log
Map input records=1000
org.apache.hadoop.mapred.Counters log
Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log
Spilled Records=6
org.apache.hadoop.mapred.Counters log
Map output bytes=666
org.apache.hadoop.mapred.Counters log
Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log
SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log
Combine input records=0
org.apache.hadoop.mapred.Counters log
Reduce input records=3
org.apache.hadoop.mapred.Counters log
Reduce input groups=3
org.apache.hadoop.mapred.Counters log
Combine output records=0
org.apache.hadoop.mapred.Counters log
Reduce output records=3
org.apache.hadoop.mapred.Counters log
Map output records=3
org.apache.hadoop.mapred.JobClient
copyAndConfigureFiles
警告: Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
listStatus
信息: Total input paths to
process : 1
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Running job:
job_local_0005
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: Starting flush of map
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
sortAndSpill
信息: Finished spill
org.apache.hadoop.mapred.Task done
Task:attempt_local_0005_m_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0005_m_' done.
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last
merge-pass, with 1 segments left of total size: 677
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task done
Task:attempt_local_0005_r_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task commit
信息: Task
attempt_local_0005_r_ is allowed to commit
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
commitTask
信息: Saved output of task
'attempt_local_0005_r_' to
hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-4
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
信息: reduce &
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0005_r_' done.
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息:& map
100% reduce 100%
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Job complete:
job_local_0005
org.apache.hadoop.mapred.Counters log
信息: Counters:
org.apache.hadoop.mapred.Counters log
信息:&& File Output Format
org.apache.hadoop.mapred.Counters log
Bytes Written=695
org.apache.hadoop.mapred.Counters log
FileSystemCounters
org.apache.hadoop.mapred.Counters log
FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_READ=405093
org.apache.hadoop.mapred.Counters log
FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_WRITTEN=145655
org.apache.hadoop.mapred.Counters log
信息:&& File Input Format
org.apache.hadoop.mapred.Counters log
Bytes Read=31390
org.apache.hadoop.mapred.Counters log
信息:&& Map-Reduce
org.apache.hadoop.mapred.Counters log
Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log
Map input records=1000
org.apache.hadoop.mapred.Counters log
Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log
Spilled Records=6
org.apache.hadoop.mapred.Counters log
Map output bytes=666
org.apache.hadoop.mapred.Counters log
Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log
SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log
Combine input records=0
org.apache.hadoop.mapred.Counters log
Reduce input records=3
org.apache.hadoop.mapred.Counters log
Reduce input groups=3
org.apache.hadoop.mapred.Counters log
Combine output records=0
org.apache.hadoop.mapred.Counters log
Reduce output records=3
org.apache.hadoop.mapred.Counters log
Map output records=3
org.apache.hadoop.mapred.JobClient
copyAndConfigureFiles
警告: Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
listStatus
信息: Total input paths to
process : 1
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Running job:
job_local_0006
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: Starting flush of map
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
sortAndSpill
信息: Finished spill
org.apache.hadoop.mapred.Task done
Task:attempt_local_0006_m_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0006_m_' done.
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last
merge-pass, with 1 segments left of total size: 677
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task done
Task:attempt_local_0006_r_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task commit
信息: Task
attempt_local_0006_r_ is allowed to commit
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
commitTask
信息: Saved output of task
'attempt_local_0006_r_' to
hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-5
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
信息: reduce &
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0006_r_' done.
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息:& map
100% reduce 100%
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Job complete:
job_local_0006
org.apache.hadoop.mapred.Counters log
信息: Counters:
org.apache.hadoop.mapred.Counters log
信息:&& File Output Format
org.apache.hadoop.mapred.Counters log
Bytes Written=695
org.apache.hadoop.mapred.Counters log
FileSystemCounters
org.apache.hadoop.mapred.Counters log
FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_READ=472043
org.apache.hadoop.mapred.Counters log
FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_WRITTEN=147433
org.apache.hadoop.mapred.Counters log
信息:&& File Input Format
org.apache.hadoop.mapred.Counters log
Bytes Read=31390
org.apache.hadoop.mapred.Counters log
信息:&& Map-Reduce
org.apache.hadoop.mapred.Counters log
Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log
Map input records=1000
org.apache.hadoop.mapred.Counters log
Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log
Spilled Records=6
org.apache.hadoop.mapred.Counters log
Map output bytes=666
org.apache.hadoop.mapred.Counters log
Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log
SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log
Combine input records=0
org.apache.hadoop.mapred.Counters log
Reduce input records=3
org.apache.hadoop.mapred.Counters log
Reduce input groups=3
org.apache.hadoop.mapred.Counters log
Combine output records=0
org.apache.hadoop.mapred.Counters log
Reduce output records=3
15:39:38 org.apache.hadoop.mapred.Counters
Map output records=3
org.apache.hadoop.mapred.JobClient
copyAndConfigureFiles
警告: Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
listStatus
信息: Total input paths to
process : 1
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Running job:
job_local_0007
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: Starting flush of map
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
sortAndSpill
信息: Finished spill
org.apache.hadoop.mapred.Task done
Task:attempt_local_0007_m_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0007_m_' done.
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last
merge-pass, with 1 segments left of total size: 677
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task done
Task:attempt_local_0007_r_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task commit
信息: Task
attempt_local_0007_r_ is allowed to commit
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
commitTask
信息: Saved output of task
'attempt_local_0007_r_' to
hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-6
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
信息: reduce &
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0007_r_' done.
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息:& map
100% reduce 100%
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Job complete:
job_local_0007
org.apache.hadoop.mapred.Counters log
信息: Counters:
org.apache.hadoop.mapred.Counters log
信息:&& File Output Format
org.apache.hadoop.mapred.Counters log
Bytes Written=695
org.apache.hadoop.mapred.Counters log
FileSystemCounters
org.apache.hadoop.mapred.Counters log
FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_READ=538993
org.apache.hadoop.mapred.Counters log
FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_WRITTEN=149211
org.apache.hadoop.mapred.Counters log
信息:&& File Input Format
org.apache.hadoop.mapred.Counters log
Bytes Read=31390
org.apache.hadoop.mapred.Counters log
信息:&& Map-Reduce
org.apache.hadoop.mapred.Counters log
Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log
Map input records=1000
org.apache.hadoop.mapred.Counters log
Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log
Spilled Records=6
org.apache.hadoop.mapred.Counters log
Map output bytes=666
org.apache.hadoop.mapred.Counters log
Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log
SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log
Combine input records=0
org.apache.hadoop.mapred.Counters log
Reduce input records=3
org.apache.hadoop.mapred.Counters log
Reduce input groups=3
org.apache.hadoop.mapred.Counters log
Combine output records=0
org.apache.hadoop.mapred.Counters log
Reduce output records=3
org.apache.hadoop.mapred.Counters log
Map output records=3
org.apache.hadoop.mapred.JobClient
copyAndConfigureFiles
警告: Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
listStatus
信息: Total input paths to
process : 1
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Running job:
job_local_0008
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
信息: Starting flush of map
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
sortAndSpill
信息: Finished spill
org.apache.hadoop.mapred.Task done
Task:attempt_local_0008_m_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0008_m_' done.
org.apache.hadoop.mapred.Task initialize
信息:& Using
ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted
org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last
merge-pass, with 1 segments left of total size: 677
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task done
Task:attempt_local_0008_r_ is done. And is in the process
of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
org.apache.hadoop.mapred.Task commit
信息: Task
attempt_local_0008_r_ is allowed to commit
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
commitTask
信息: Saved output of task
'attempt_local_0008_r_' to
hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-7
org.apache.hadoop.mapred.LocalJobRunner$Job
statusUpdate
信息: reduce &
org.apache.hadoop.mapred.Task sendDone
信息: Task
'attempt_local_0008_r_' done.
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息:& map
100% reduce 100%
org.apache.hadoop.mapred.JobClient
monitorAndPrintJob
信息: Job complete:
job_local_0008
org.apache.hadoop.mapred.Counters log
信息: Counters:
org.apache.hadoop.mapred.Counters log
信息:&& File Output Format
org.apache.hadoop.mapred.Counters log
Bytes Written=695
org.apache.hadoop.mapred.Counters log
FileSystemCounters
org.apache.hadoop.mapred.Counters log
FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_READ=605943
org.apache.hadoop.mapred.Counters log
FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log
HDFS_BYTES_WRITTEN=150989
org.apache.hadoop.mapred.Counters log
信息:&& File Input Format
org.apache.hadoop.mapred.Counters log
Bytes Read=31390
org.apache.hadoop.mapred.Counters log
信息:&& Map-Reduce
org.apache.hadoop.mapred.Counters log
Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log
Map input records=1000
org.apache.hadoop.mapred.Counters log
Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log
Spilled Records=6
org.apache.hadoop.mapred.Counters log
Map output bytes=666
org.apache.hadoop.mapred.Counters log
Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log
SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log
Combine input records=0
org.apache.hadoop.mapred.Counters log
Reduce input records=3
org.apache.hadoop.mapred.Counters log
Reduce input groups=3
org.apache.hadoop.mapred.Counters log
Combine output records=0
org.apache.hadoop.mapred.Counters log
Reduce output records=3
org.apache.hadoop.mapred.Counters log
Map output records=3
org.apache.hadoop.mapred.JobClient copyAndConfigureFiles WARN: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus INFO: Total input paths to process : 1
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Running job: job_local_0009
org.apache.hadoop.mapred.Task initialize INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: Starting flush of map output
org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill INFO: Finished spill
org.apache.hadoop.mapred.Task done INFO: Task:attempt_local_0009_m_ is done. And is in the process of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task sendDone INFO: Task 'attempt_local_0009_m_' done.
org.apache.hadoop.mapred.Task initialize INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge INFO: Merging 1 sorted segments
org.apache.hadoop.mapred.Merger$MergeQueue merge INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task done INFO: Task:attempt_local_0009_r_ is done. And is in the process of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task commit INFO: Task attempt_local_0009_r_ is allowed to commit now
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask INFO: Saved output of task 'attempt_local_0009_r_' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-8
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate INFO: reduce > reduce
org.apache.hadoop.mapred.Task sendDone INFO: Task 'attempt_local_0009_r_' done.
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: map 100% reduce 100%
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Job complete: job_local_0009
org.apache.hadoop.mapred.Counters log INFO: Counters:
org.apache.hadoop.mapred.Counters log INFO:   File Output Format Counters
org.apache.hadoop.mapred.Counters log INFO:     Bytes Written=695
org.apache.hadoop.mapred.Counters log INFO:   FileSystemCounters
org.apache.hadoop.mapred.Counters log INFO:     FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log INFO:     HDFS_BYTES_READ=673669
org.apache.hadoop.mapred.Counters log INFO:     FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log INFO:     HDFS_BYTES_WRITTEN=152767
org.apache.hadoop.mapred.Counters log INFO:   File Input Format Counters
org.apache.hadoop.mapred.Counters log INFO:     Bytes Read=31390
org.apache.hadoop.mapred.Counters log INFO:   Map-Reduce Framework
org.apache.hadoop.mapred.Counters log INFO:     Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log INFO:     Map input records=1000
org.apache.hadoop.mapred.Counters log INFO:     Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log INFO:     Spilled Records=6
org.apache.hadoop.mapred.Counters log INFO:     Map output bytes=666
org.apache.hadoop.mapred.Counters log INFO:     Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log INFO:     SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log INFO:     Combine input records=0
org.apache.hadoop.mapred.Counters log INFO:     Reduce input records=3
org.apache.hadoop.mapred.Counters log INFO:     Reduce input groups=3
org.apache.hadoop.mapred.Counters log INFO:     Combine output records=0
org.apache.hadoop.mapred.Counters log INFO:     Reduce output records=3
org.apache.hadoop.mapred.Counters log INFO:     Map output records=3
org.apache.hadoop.mapred.JobClient copyAndConfigureFiles WARN: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus INFO: Total input paths to process : 1
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Running job: job_local_0010
org.apache.hadoop.mapred.Task initialize INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: Starting flush of map output
org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill INFO: Finished spill
org.apache.hadoop.mapred.Task done INFO: Task:attempt_local_0010_m_ is done. And is in the process of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task sendDone INFO: Task 'attempt_local_0010_m_' done.
org.apache.hadoop.mapred.Task initialize INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge INFO: Merging 1 sorted segments
org.apache.hadoop.mapred.Merger$MergeQueue merge INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task done INFO: Task:attempt_local_0010_r_ is done. And is in the process of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task commit INFO: Task attempt_local_0010_r_ is allowed to commit now
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask INFO: Saved output of task 'attempt_local_0010_r_' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-9
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate INFO: reduce > reduce
org.apache.hadoop.mapred.Task sendDone INFO: Task 'attempt_local_0010_r_' done.
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: map 100% reduce 100%
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Job complete: job_local_0010
org.apache.hadoop.mapred.Counters log INFO: Counters:
org.apache.hadoop.mapred.Counters log INFO:   File Output Format Counters
org.apache.hadoop.mapred.Counters log INFO:     Bytes Written=695
org.apache.hadoop.mapred.Counters log INFO:   FileSystemCounters
org.apache.hadoop.mapred.Counters log INFO:     FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log INFO:     HDFS_BYTES_READ=741007
org.apache.hadoop.mapred.Counters log INFO:     FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log INFO:     HDFS_BYTES_WRITTEN=154545
org.apache.hadoop.mapred.Counters log INFO:   File Input Format Counters
org.apache.hadoop.mapred.Counters log INFO:     Bytes Read=31390
org.apache.hadoop.mapred.Counters log INFO:   Map-Reduce Framework
org.apache.hadoop.mapred.Counters log INFO:     Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log INFO:     Map input records=1000
org.apache.hadoop.mapred.Counters log INFO:     Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log INFO:     Spilled Records=6
org.apache.hadoop.mapred.Counters log INFO:     Map output bytes=666
org.apache.hadoop.mapred.Counters log INFO:     Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log INFO:     SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log INFO:     Combine input records=0
org.apache.hadoop.mapred.Counters log INFO:     Reduce input records=3
org.apache.hadoop.mapred.Counters log INFO:     Reduce input groups=3
org.apache.hadoop.mapred.Counters log INFO:     Combine output records=0
org.apache.hadoop.mapred.Counters log INFO:     Reduce output records=3
org.apache.hadoop.mapred.Counters log INFO:     Map output records=3
org.apache.hadoop.mapred.JobClient copyAndConfigureFiles WARN: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus INFO: Total input paths to process : 1
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Running job: job_local_0011
org.apache.hadoop.mapred.Task initialize INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: io.sort.mb =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: data buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: record buffer =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer INFO: Starting flush of map output
org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill INFO: Finished spill
org.apache.hadoop.mapred.Task done INFO: Task:attempt_local_0011_m_ is done. And is in the process of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task sendDone INFO: Task 'attempt_local_0011_m_' done.
org.apache.hadoop.mapred.Task initialize INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Merger$MergeQueue merge INFO: Merging 1 sorted segments
org.apache.hadoop.mapred.Merger$MergeQueue merge INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task done INFO: Task:attempt_local_0011_r_ is done. And is in the process of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task commit INFO: Task attempt_local_0011_r_ is allowed to commit now
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask INFO: Saved output of task 'attempt_local_0011_r_' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-10
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate INFO: reduce > reduce
org.apache.hadoop.mapred.Task sendDone INFO: Task 'attempt_local_0011_r_' done.
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: map 100% reduce 100%
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Job complete: job_local_0011
org.apache.hadoop.mapred.Counters log INFO: Counters:
org.apache.hadoop.mapred.Counters log INFO:   File Output Format Counters
org.apache.hadoop.mapred.Counters log INFO:     Bytes Written=695
org.apache.hadoop.mapred.Counters log INFO:   FileSystemCounters
org.apache.hadoop.mapred.Counters log INFO:     FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log INFO:     HDFS_BYTES_READ=808345
org.apache.hadoop.mapred.Counters log INFO:     FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log INFO:     HDFS_BYTES_WRITTEN=156323
org.apache.hadoop.mapred.Counters log INFO:   File Input Format Counters
org.apache.hadoop.mapred.Counters log INFO:     Bytes Read=31390
org.apache.hadoop.mapred.Counters log INFO:   Map-Reduce Framework
org.apache.hadoop.mapred.Counters log INFO:     Map output materialized bytes=681
org.apache.hadoop.mapred.Counters log INFO:     Map input records=1000
org.apache.hadoop.mapred.Counters log INFO:     Reduce shuffle bytes=0
org.apache.hadoop.mapred.Counters log INFO:     Spilled Records=6
org.apache.hadoop.mapred.Counters log INFO:     Map output bytes=666
org.apache.hadoop.mapred.Counters log INFO:     Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log INFO:     SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log INFO:     Combine input records=0
org.apache.hadoop.mapred.Counters log INFO:     Reduce input records=3
org.apache.hadoop.mapred.Counters log INFO:     Reduce input groups=3
org.apache.hadoop.mapred.Counters log INFO:     Combine output records=0
org.apache.hadoop.mapred.Counters log INFO:     Reduce output records=3
org.apache.hadoop.mapred.Counters log INFO:     Map output records=3
org.apache.hadoop.mapred.JobClient copyAndConfigureFiles WARN: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus INFO: Total input paths to process : 1
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Running job: job_local_0012
org.apache.hadoop.mapred.Task initialize INFO: Using ResourceCalculatorPlugin : null
org.apache.hadoop.mapred.Task done INFO: Task:attempt_local_0012_m_ is done. And is in the process of commiting
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task commit INFO: Task attempt_local_0012_m_ is allowed to commit now
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask INFO: Saved output of task 'attempt_local_0012_m_' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
org.apache.hadoop.mapred.Task sendDone INFO: Task 'attempt_local_0012_m_' done.
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: map 100% reduce 0%
org.apache.hadoop.mapred.JobClient monitorAndPrintJob INFO: Job complete: job_local_0012
org.apache.hadoop.mapred.Counters log INFO: Counters:
org.apache.hadoop.mapred.Counters log INFO:   File Output Format Counters
org.apache.hadoop.mapred.Counters log INFO:     Bytes Written=41520
org.apache.hadoop.mapred.Counters log INFO:   File Input Format Counters
org.apache.hadoop.mapred.Counters log INFO:     Bytes Read=31390
org.apache.hadoop.mapred.Counters log INFO:   FileSystemCounters
org.apache.hadoop.mapred.Counters log INFO:     FILE_BYTES_READ=
org.apache.hadoop.mapred.Counters log INFO:     HDFS_BYTES_READ=437203
org.apache.hadoop.mapred.Counters log INFO:     FILE_BYTES_WRITTEN=
org.apache.hadoop.mapred.Counters log INFO:     HDFS_BYTES_WRITTEN=120417
org.apache.hadoop.mapred.Counters log INFO:   Map-Reduce Framework
org.apache.hadoop.mapred.Counters log INFO:     Map input records=1000
org.apache.hadoop.mapred.Counters log INFO:     Spilled Records=0
org.apache.hadoop.mapred.Counters log INFO:     Total committed heap usage (bytes)=
org.apache.hadoop.mapred.Counters log INFO:     SPLIT_RAW_BYTES=130
org.apache.hadoop.mapred.Counters log INFO:     Map output records=1000
Dumping out clusters from clusters: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-*-final and clusteredPoints: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
CL-552{n=443 c=[1.631, -0.412] r=[1.563,
Weight : [props - optional]:
1.0: [-2.393, 3.347]
1.0: [-4.364, 1.905]
1.0: [-3.275, 0.023]
1.0: [-2.479, 2.534]
1.0: [-0.559, 1.223]
CL-847{n=77 c=[-2.953, -0.971] r=[1.767,
Weight : [props - optional]:
1.0: [-0.883, -3.320]
1.0: [-1.099, -6.063]
1.0: [-0.004, -0.610]
1.0: [-2.996, -3.610]
1.0: [3.988, 1.008]
CL-823{n=480 c=[0.219, 2.600] r=[1.479,
Weight : [props - optional]:
1.0: [2.670, 1.851]
1.0: [2.177, 6.773]
1.0: [5.537, 2.651]
1.0: [5.663, 6.868]
1.0: [5.117, 3.747]
1.0: [1.912, 2.959]
4). Interpreting the clustering results
We can break the log output above into 3 parts to interpret:
a. environment initialization
b. algorithm execution
c. printing the clustering results
a. Environment initialization: clear and recreate the HDFS data and working directories, then upload the data file.
Delete: hdfs://192.168.1.210:9000/user/hdfs/mix_data
Create: hdfs://192.168.1.210:9000/user/hdfs/mix_data
copy from: datafile/randomData.csv to hdfs://192.168.1.210:9000/user/hdfs/mix_data
ls: hdfs://192.168.1.210:9000/user/hdfs/mix_data
==========================================================
hdfs://192.168.1.210:9000/user/hdfs/mix_data/randomData.csv, folder: false, size: 36655
b. Algorithm execution. The algorithm runs in 3 steps:
1) Convert the raw data randomData.csv into Mahout sequence files of VectorWritable.
2) Randomly select 3 centers as the initial k-means clusters.
3) Run the MapReduce computation for the configured number of iterations.
1) Convert the raw data randomData.csv into Mahout sequence files of VectorWritable.
Program source code:
InputDriver.runJob(new Path(inPath), new Path(seqFile),
        "org.apache.mahout.math.RandomAccessSparseVector");
Log output:
Job complete: job_local_0001
2) Randomly select 3 centers as the initial k-means clusters.
Program source code:
int k = 3;
Path seqFilePath = new Path(seqFile);
Path clustersSeeds = new Path(seeds);
DistanceMeasure measure = new EuclideanDistanceMeasure();
clustersSeeds = RandomSeedGenerator.buildRandom(conf, seqFilePath, clustersSeeds, k, measure);
Log output:
Job complete: job_local_0002
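Conceptually, RandomSeedGenerator.buildRandom just picks k random vectors from the input as the initial centers. A minimal single-machine sketch of the same idea, using reservoir sampling so every point has an equal chance of being chosen (the class and method names here are made up for illustration, not Mahout's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomSeeds {
    // Pick k initial centers from the data by reservoir sampling:
    // keep the first k points, then replace a kept point with
    // decreasing probability as more points stream past.
    static double[][] buildRandom(double[][] points, int k, long seed) {
        Random rnd = new Random(seed);
        List<double[]> reservoir = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            if (reservoir.size() < k) {
                reservoir.add(points[i]);
            } else {
                int j = rnd.nextInt(i + 1);          // uniform in [0, i]
                if (j < k) reservoir.set(j, points[i]);
            }
        }
        return reservoir.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {1, 1}, {2, 2}, {3, 3}, {4, 4} };
        double[][] seeds = buildRandom(data, 3, 42L);
        System.out.println(seeds.length + " seed centers chosen");
    }
}
```

The streaming form matters here: on HDFS the input is read once as a sequence file, so the seeds can be drawn without loading all 1000 points into memory.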
3) Run the MapReduce computation for the configured number of iterations.
Program source code:
KMeansDriver.run(conf, seqFilePath, clustersSeeds, new Path(outPath),
        measure, 0.01, 10, true, 0.01, false);
Log output:
Job complete: job_local_0003
Job complete: job_local_0004
Job complete: job_local_0005
Job complete: job_local_0006
Job complete: job_local_0007
Job complete: job_local_0008
Job complete: job_local_0009
Job complete: job_local_0010
Job complete: job_local_0011
Job complete: job_local_0012
c. Printing the clustering results
Dumping out clusters from clusters: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-*-final and clusteredPoints: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
CL-552{n=443 c=[1.631, -0.412] r=[1.563,
CL-847{n=77 c=[-2.953, -0.971] r=[1.767,
CL-823{n=480 c=[0.219, 2.600] r=[1.479,
Result: there are 3 centers.
Cluster1 contains 443 points, centered at [1.631, -0.412].
Cluster2 contains 77 points, centered at [-2.953, -0.971].
Cluster3 contains 480 points, centered at [0.219, 2.600].
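Determining which cluster a point belongs to is just the k-means assignment step: compute the Euclidean distance to each center and take the nearest one. A small sketch using the three centers reported above (the class and method names are made up for illustration):

```java
public class NearestCenter {
    // Squared Euclidean distance between two 2-D points;
    // the square root is not needed for comparing distances.
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Index of the closest center, as the k-means assignment step does.
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int i = 1; i < centers.length; i++) {
            if (dist2(p, centers[i]) < dist2(p, centers[best])) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        // The three centers reported by the cluster dump above.
        double[][] centers = { {1.631, -0.412}, {-2.953, -0.971}, {0.219, 2.600} };
        // Index 0 corresponds to CL-552, 1 to CL-847, 2 to CL-823.
        System.out.println(nearest(new double[] {2.0, 0.0}, centers)); // prints 0
    }
}
```

This is the same computation the final classification job (job_local_0012 above) performs in its map phase when it writes clusteredPoints.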
Directories created on HDFS
hadoop fs -ls /user/hdfs/mix_data
-rw-r--r--   3 Administrator supergroup  36655 -10-04 15:31 /user/hdfs/mix_data/randomData.csv
drwxr-xr-x   - Administrator supergroup      0 15:31 /user/hdfs/mix_data/result
drwxr-xr-x   - Administrator supergroup      0 15:31 /user/hdfs/mix_data/seeds
drwxr-xr-x   - Administrator supergroup      0 15:31 /user/hdfs/mix_data/seqfile
# output directory
hadoop fs -ls /user/hdfs/mix_data/result
-rw-r--r--   3 Administrator supergroup  /user/hdfs/mix_data/result/_policy
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusteredPoints
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-0
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-1
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-10-final
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-2
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-3
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-4
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-5
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-6
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-7
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-8
drwxr-xr-x   - Administrator supergroup  /user/hdfs/mix_data/result/clusters-9
# directory of the randomly generated center seeds
hadoop fs -ls /user/hdfs/mix_data/seeds
-rw-r--r--   3 Administrator supergroup  /user/hdfs/mix_data/seeds/part-randomSeed
# input file converted to Mahout sequence file format
hadoop fs -ls /user/hdfs/mix_data/seqfile
-rw-r--r--   3 Administrator supergroup      0 /user/hdfs/mix_data/seqfile/_SUCCESS
-rw-r--r--   3 Administrator supergroup  -10-04 15:31 /user/hdfs/mix_data/seqfile/part-m-00000
4. Visualizing the results with R
Save the clustered points into separate cluster*.csv files, then draw the picture with R.
c1<-read.csv(file="cluster1.csv",sep=",",header=FALSE)
c2<-read.csv(file="cluster2.csv",sep=",",header=FALSE)
c3<-read.csv(file="cluster3.csv",sep=",",header=FALSE)
y<-rbind(c1,c2,c3)
cols<-c(rep(1,nrow(c1)),rep(2,nrow(c2)),rep(3,nrow(c3)))
plot(y, col=c("black","blue","green")[cols])
center<-matrix(c(1.631, -0.412,-2.953, -0.971,0.219, 2.600),ncol=2,byrow=TRUE)
points(center, col="violetred", pch=19)
In the figure, the hollow points in black, blue, and green are the original data, and the 3 solid violet points are the 3 centers produced by Mahout's k-means run.
Compared with the k-means classification and centers computed with R in the earlier article, the results are not quite the same.
To sum up briefly: k-means results vary with the distance measure, the convergence threshold, the initial centers, and the number of iterations, so different settings produce different clusterings. With k-means we therefore generally obtain only a rough classification standard. That standard is very helpful for getting to know an unfamiliar dataset, but it should not be used as a precise measure of the data.
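The sensitivity to the initial centers is easy to demonstrate on a toy dataset: running plain Lloyd iterations from two different starting centers on the same four points converges to two different stable clusterings. This is a single-machine sketch of the idea, not the Mahout implementation; the class name and the four-point dataset are made up for illustration:

```java
import java.util.Arrays;

public class KMeansSensitivity {
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Plain Lloyd iterations: assign each point to its nearest center,
    // then move each center to the mean of its points, until stable.
    static int[] lloyd(double[][] pts, double[][] centers, int maxIter) {
        int k = centers.length;
        int[] assign = new int[pts.length];
        for (int it = 0; it < maxIter; it++) {
            boolean changed = false;
            for (int p = 0; p < pts.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(pts[p], centers[c]) < dist2(pts[p], centers[best])) best = c;
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            double[][] sum = new double[k][2];
            int[] cnt = new int[k];
            for (int p = 0; p < pts.length; p++) {
                sum[assign[p]][0] += pts[p][0];
                sum[assign[p]][1] += pts[p][1];
                cnt[assign[p]]++;
            }
            for (int c = 0; c < k; c++)
                if (cnt[c] > 0) centers[c] = new double[] { sum[c][0] / cnt[c], sum[c][1] / cnt[c] };
            if (!changed && it > 0) break;  // converged: assignments stable
        }
        return assign;
    }

    public static void main(String[] args) {
        // Four corners of a wide rectangle; k = 2.
        double[][] pts = { {0, 0}, {4, 0}, {0, 1}, {4, 1} };
        // Same data, two different initial centers, two different stable results.
        int[] a = lloyd(pts, new double[][] { {0, 0}, {4, 0} }, 10); // splits left / right
        int[] b = lloyd(pts, new double[][] { {0, 0}, {0, 1} }, 10); // stuck splitting bottom / top
        System.out.println(Arrays.toString(a)); // [0, 1, 0, 1]
        System.out.println(Arrays.toString(b)); // [0, 0, 1, 1]
    }
}
```

Both runs converge (no assignment changes between iterations), yet only the first finds the low-error partition, which is exactly why the choice of initial centers, threshold, and iteration count changes what Mahout reports.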
5. Template project uploaded to github
You can download this project and use it as a starting point for your own development.
/bsspirit/maven_mahout_template
checkout mahout-0.8
With that, we have completed the distributed implementation of Mahout's k-means clustering algorithm. Next, we will continue with experiments on classification in Mahout!