I’ve been trying to move data stored on HDFS into HBase using Pig.
PS: I’m running Hadoop in pseudo-distributed mode.
Here are the steps I’m executing:
# Part 1 - Load Data into HDFS
# Create directory
hdfs dfs -mkdir /user/data
hdfs dfs -mkdir /user/data/customers
# Load data into HDFS
hdfs dfs -copyFromLocal customers.txt /user/data/customers
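To confirm the upload before moving on, I list the directory (same paths as above; these need the cluster running):

```
# Check that the file is where Pig will look for it (paths are case-sensitive)
hdfs dfs -ls /user/data/customers
hdfs dfs -cat /user/data/customers/customers.txt | head -n 5
```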
# Part 2 - Create Table in HBase
# Open the shell
hbase shell
# Create a table with a column family to receive the data
create 'customers', 'customer_data'
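To double-check the table before loading (still inside the HBase shell, using standard shell commands):

```
# Confirm the table and column family exist and the table is enabled
describe 'customers'
is_enabled 'customers'
```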
# Part 3 - Load Data with Pig
# Start the Job History Server
mr-jobhistory-daemon.sh start historyserver
# Open the Pig shell (fall back to pig -x local if issues arise)
pig -x mapreduce
# Navigate to the directory
cd /user/data/customers;
# View the contents of the file
cat customers.txt;
# Load data from HDFS into Pig
data = LOAD 'customers.txt' USING PigStorage(',') AS (
id:chararray,
name:chararray,
last_name:chararray,
age:int,
occupation:chararray
);
# Test the data
dump data;
# Store the data into HBase (the column list is a single string literal)
STORE data INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'customer_data:name,customer_data:last_name,customer_data:age,customer_data:occupation');
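One detail I’m unsure about: the first line of the input file is a header, so I assume it should be filtered out before the STORE (otherwise the cast to age:int fails on it, and the header itself would become a row keyed 'id', since HBaseStorage uses the first field of each tuple as the row key). A sketch, reusing the relation defined above:

```
-- Drop the header row before storing
clean = FILTER data BY id != 'id';
STORE clean INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'customer_data:name,customer_data:last_name,customer_data:age,customer_data:occupation');
```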
Contents of the clientes.txt file:
id,nome,sobrenome,idade,funcao
c001,Josias,Silva,55,Analista de Mercado
1100002,Pedro,Malan,74,Professor
1100003,Maria,Maciel,34,Bombeiro
1100004,Suzana,Bustamante,66,Analista de TI
1100005,Karen,Moreira,74,Advogado
1100006,Patricio,Teixeira,42,Veterinario
1100007,Elisa,Haniero,43,Piloto
1100008,Mauro,Bender,63,Marceneiro
1100009,Mauricio,Wagner,39,Artista
1100010,Douglas,Macedo,60,Escritor
1100011,Francisco,McNamara,47,Cientista de Dados
1100012,Sidney,Raynor,26,Escritor
1100013,Maria,Moon,41,Gerente de Projetos
1100014,Bete,Balanaira,65,Musico
1100015,Julia,Peixoto,49,Especialista em TI
1100016,Jeronimo,Wallace,52,Engenheiro de Dados
1100017,Noeli,Laura,72,Cientista de Dados
1100018,Jean,Junior,45,Desenvolvedor RPA
1100019,Cristina,Garbim,63,Engenheiro Blockchain
Services running on the machine (jps output):
6784 NameNode
7538 NodeManager
7395 ResourceManager
6949 DataNode
8774 HQuorumPeer
8918 HMaster
12406 Jps
6311 HRegionServer
9578 JobHistoryServer
7150 SecondaryNameNode
When I execute the STORE statement in the Pig shell, I get the error below:
2024-01-04 12:45:31,342 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,435 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,441 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2024-01-04 12:45:31,447 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,447 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2024-01-04 12:45:31,447 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2024-01-04 12:45:31,489 [main] WARN org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil - The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
2024-01-04 12:45:31,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2024-01-04 12:45:31,507 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,508 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2024-01-04 12:45:31,508 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2024-01-04 12:45:31,552 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,553 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2024-01-04 12:45:31,554 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2024-01-04 12:45:31,554 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2024-01-04 12:45:31,555 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2024-01-04 12:45:31,686 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/pig-0.17.0-core-h2.jar to DistributedCache through /tmp/temp1637841900/tmp-611598740/pig-0.17.0-core-h2.jar
2024-01-04 12:45:31,769 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/hadoop/share/hadoop/common/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp1637841900/tmp-2116544670/guava-11.0.2.jar
2024-01-04 12:45:31,850 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/hadoop/share/hadoop/hdfs/lib/netty-all-4.0.52.Final.jar to DistributedCache through /tmp/temp1637841900/tmp-827707703/netty-all-4.0.52.Final.jar
2024-01-04 12:45:31,914 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-common-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp-627632889/hbase-common-1.2.4.jar
2024-01-04 12:45:31,967 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/hadoop/share/hadoop/common/lib/zookeeper-3.4.13.jar to DistributedCache through /tmp/temp1637841900/tmp-366005299/zookeeper-3.4.13.jar
2024-01-04 12:45:31,990 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-hadoop-compat-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp413917611/hbase-hadoop-compat-1.2.4.jar
2024-01-04 12:45:32,052 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-server-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp-1309639956/hbase-server-1.2.4.jar
2024-01-04 12:45:32,069 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar to DistributedCache through /tmp/temp1637841900/tmp1730938645/protobuf-java-2.5.0.jar
2024-01-04 12:45:32,130 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-protocol-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp1095665439/hbase-protocol-1.2.4.jar
2024-01-04 12:45:32,160 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-client-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp1789263254/hbase-client-1.2.4.jar
2024-01-04 12:45:32,191 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/metrics-core-2.1.2.jar to DistributedCache through /tmp/temp1637841900/tmp1597787479/metrics-core-2.1.2.jar
2024-01-04 12:45:32,240 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/htrace-core-3.1.0-incubating.jar to DistributedCache through /tmp/temp1637841900/tmp1189017499/htrace-core-3.1.0-incubating.jar
2024-01-04 12:45:32,306 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp1637841900/tmp1448513756/automaton-1.11-8.jar
2024-01-04 12:45:32,341 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp1637841900/tmp759898185/antlr-runtime-3.4.jar
2024-01-04 12:45:32,359 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/joda-time-2.9.3.jar to DistributedCache through /tmp/temp1637841900/tmp-430917928/joda-time-2.9.3.jar
2024-01-04 12:45:32,360 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2024-01-04 12:45:32,361 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2024-01-04 12:45:32,361 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2024-01-04 12:45:32,361 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2024-01-04 12:45:32,370 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2024-01-04 12:45:32,410 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2024-01-04 12:45:32,423 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:32,481 [JobControl] INFO org.apache.hadoop.mapreduce.JobResourceUploader - Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1704400452523_0005
2024-01-04 12:45:32,482 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2024-01-04 12:45:32,495 [JobControl] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
2024-01-04 12:45:32,496 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2024-01-04 12:45:32,496 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2024-01-04 12:45:32,497 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2024-01-04 12:45:32,548 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2024-01-04 12:45:32,621 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1704400452523_0005
2024-01-04 12:45:32,621 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Executing with tokens: []
2024-01-04 12:45:32,623 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2024-01-04 12:45:32,698 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1704400452523_0005
2024-01-04 12:45:32,702 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8088/proxy/application_1704400452523_0005/
2024-01-04 12:45:32,910 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1704400452523_0005
2024-01-04 12:45:32,910 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases dados
2024-01-04 12:45:32,910 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: dados[1,7],dados[-1,-1] C: R:
2024-01-04 12:45:32,915 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2024-01-04 12:45:32,915 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1704400452523_0005]
2024-01-04 12:45:37,970 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2024-01-04 12:45:37,970 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2024-01-04 12:45:38,004 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2024-01-04 12:45:38,017 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2024-01-04 12:45:38,017 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
3.2.0 0.17.0 hadoop 2024-01-04 12:45:31 2024-01-04 12:45:38 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1704400452523_0005 dados MAP_ONLY Message: Job failed!hbase://clientes,
Input(s):
Failed to read data from "hdfs://localhost:9000/user/dados/clientes/clientes.txt"
Output(s):
Failed to produce result in "hbase://clientes"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1704400452523_0005
2024-01-04 12:45:38,017 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2024-01-04 12:45:38,018 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6017: Stopping execution on job failure with -stop_on_failure option
Digging into the error log:
Pig Stack Trace
---------------
ERROR 6017: Stopping execution on job failure with -stop_on_failure option
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias dados
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1779)
at org.apache.pig.PigServer.registerQuery(PigServer.java:708)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1110)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:512)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:564)
at org.apache.pig.Main.main(Main.java:175)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6017: Stopping execution on job failure with -stop_on_failure option
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:561)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:290)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1475)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1460)
at org.apache.pig.PigServer.execute(PigServer.java:1449)
at org.apache.pig.PigServer.access$500(PigServer.java:119)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1774)
... 14 more
================================================================================
I’ve already tried running in local mode and specifying the full path to the files, but without success. Can anyone help?