I’ve been trying to move data stored on HDFS into HBase using Pig.
PS: I’m running Hadoop in pseudo-distributed mode.
Here are the steps I’m executing:
# Part 1 - Load Data into HDFS
# Create directory
hdfs dfs -mkdir /user/data
hdfs dfs -mkdir /user/data/customers
# Load data into HDFS
hdfs dfs -copyFromLocal customers.txt /user/data/customers
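To confirm the upload before moving on, I list the directory (same paths as above; these need the cluster running):

```
# Check that the file is where Pig will look for it (paths are case-sensitive)
hdfs dfs -ls /user/data/customers
hdfs dfs -cat /user/data/customers/customers.txt | head -n 5
```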
# Part 2 - Create Table in HBase
# Open the shell
hbase shell
# Create a table with a column family to receive the data
create 'customers', 'customer_data'
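To double-check the table before loading (still inside the HBase shell, using standard shell commands):

```
# Confirm the table and column family exist and the table is enabled
describe 'customers'
is_enabled 'customers'
```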
# Part 3 - Load Data with Pig
# Start the Job History Server
mr-jobhistory-daemon.sh start historyserver
# Open the Pig shell (fall back to pig -x local if issues arise)
pig -x mapreduce
# Navigate to the directory
cd /user/data/customers;
# View the contents of the file
cat customers.txt;
# Load data from HDFS into Pig
data = LOAD 'customers.txt' USING PigStorage(',') AS (
id:chararray,
name:chararray,
last_name:chararray,
age:int,
occupation:chararray
);
# Test the data
dump data;
# Store the data into HBase (the column list is a single string literal)
STORE data INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'customer_data:name,customer_data:last_name,customer_data:age,customer_data:occupation');
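One detail I’m unsure about: the first line of the input file is a header, so I assume it should be filtered out before the STORE (otherwise the cast to age:int fails on it, and the header itself would become a row keyed 'id', since HBaseStorage uses the first field of each tuple as the row key). A sketch, reusing the relation defined above:

```
-- Drop the header row before storing
clean = FILTER data BY id != 'id';
STORE clean INTO 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'customer_data:name,customer_data:last_name,customer_data:age,customer_data:occupation');
```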
Contents of the clientes.txt file:
id,nome,sobrenome,idade,funcao
c001,Josias,Silva,55,Analista de Mercado
1100002,Pedro,Malan,74,Professor
1100003,Maria,Maciel,34,Bombeiro
1100004,Suzana,Bustamante,66,Analista de TI
1100005,Karen,Moreira,74,Advogado
1100006,Patricio,Teixeira,42,Veterinario
1100007,Elisa,Haniero,43,Piloto
1100008,Mauro,Bender,63,Marceneiro
1100009,Mauricio,Wagner,39,Artista
1100010,Douglas,Macedo,60,Escritor
1100011,Francisco,McNamara,47,Cientista de Dados
1100012,Sidney,Raynor,26,Escritor
1100013,Maria,Moon,41,Gerente de Projetos
1100014,Bete,Balanaira,65,Musico
1100015,Julia,Peixoto,49,Especialista em TI
1100016,Jeronimo,Wallace,52,Engenheiro de Dados
1100017,Noeli,Laura,72,Cientista de Dados
1100018,Jean,Junior,45,Desenvolvedor RPA
1100019,Cristina,Garbim,63,Engenheiro Blockchain
Services running on the machine (jps output):
6784 NameNode
7538 NodeManager
7395 ResourceManager
6949 DataNode
8774 HQuorumPeer
8918 HMaster
12406 Jps
6311 HRegionServer
9578 JobHistoryServer
7150 SecondaryNameNode
When I execute the STORE statement in the Pig shell, I get the error below:
2024-01-04 12:45:31,342 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,435 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,441 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2024-01-04 12:45:31,447 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,447 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2024-01-04 12:45:31,447 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2024-01-04 12:45:31,489 [main] WARN org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil - The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
2024-01-04 12:45:31,507 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2024-01-04 12:45:31,507 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,508 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2024-01-04 12:45:31,508 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2024-01-04 12:45:31,552 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:31,553 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2024-01-04 12:45:31,554 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2024-01-04 12:45:31,554 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2024-01-04 12:45:31,555 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2024-01-04 12:45:31,686 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/pig-0.17.0-core-h2.jar to DistributedCache through /tmp/temp1637841900/tmp-611598740/pig-0.17.0-core-h2.jar
2024-01-04 12:45:31,769 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/hadoop/share/hadoop/common/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp1637841900/tmp-2116544670/guava-11.0.2.jar
2024-01-04 12:45:31,850 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/hadoop/share/hadoop/hdfs/lib/netty-all-4.0.52.Final.jar to DistributedCache through /tmp/temp1637841900/tmp-827707703/netty-all-4.0.52.Final.jar
2024-01-04 12:45:31,914 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-common-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp-627632889/hbase-common-1.2.4.jar
2024-01-04 12:45:31,967 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/hadoop/share/hadoop/common/lib/zookeeper-3.4.13.jar to DistributedCache through /tmp/temp1637841900/tmp-366005299/zookeeper-3.4.13.jar
2024-01-04 12:45:31,990 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-hadoop-compat-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp413917611/hbase-hadoop-compat-1.2.4.jar
2024-01-04 12:45:32,052 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-server-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp-1309639956/hbase-server-1.2.4.jar
2024-01-04 12:45:32,069 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar to DistributedCache through /tmp/temp1637841900/tmp1730938645/protobuf-java-2.5.0.jar
2024-01-04 12:45:32,130 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-protocol-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp1095665439/hbase-protocol-1.2.4.jar
2024-01-04 12:45:32,160 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/hbase-client-1.2.4.jar to DistributedCache through /tmp/temp1637841900/tmp1789263254/hbase-client-1.2.4.jar
2024-01-04 12:45:32,191 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/metrics-core-2.1.2.jar to DistributedCache through /tmp/temp1637841900/tmp1597787479/metrics-core-2.1.2.jar
2024-01-04 12:45:32,240 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/htrace-core-3.1.0-incubating.jar to DistributedCache through /tmp/temp1637841900/tmp1189017499/htrace-core-3.1.0-incubating.jar
2024-01-04 12:45:32,306 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp1637841900/tmp1448513756/automaton-1.11-8.jar
2024-01-04 12:45:32,341 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp1637841900/tmp759898185/antlr-runtime-3.4.jar
2024-01-04 12:45:32,359 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/opt/pig/lib/joda-time-2.9.3.jar to DistributedCache through /tmp/temp1637841900/tmp-430917928/joda-time-2.9.3.jar
2024-01-04 12:45:32,360 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2024-01-04 12:45:32,361 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2024-01-04 12:45:32,361 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2024-01-04 12:45:32,361 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2024-01-04 12:45:32,370 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2024-01-04 12:45:32,410 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2024-01-04 12:45:32,423 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2024-01-04 12:45:32,481 [JobControl] INFO org.apache.hadoop.mapreduce.JobResourceUploader - Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1704400452523_0005
2024-01-04 12:45:32,482 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2024-01-04 12:45:32,495 [JobControl] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
2024-01-04 12:45:32,496 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2024-01-04 12:45:32,496 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2024-01-04 12:45:32,497 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2024-01-04 12:45:32,548 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2024-01-04 12:45:32,621 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1704400452523_0005
2024-01-04 12:45:32,621 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Executing with tokens: []
2024-01-04 12:45:32,623 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2024-01-04 12:45:32,698 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1704400452523_0005
2024-01-04 12:45:32,702 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8088/proxy/application_1704400452523_0005/
2024-01-04 12:45:32,910 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1704400452523_0005
2024-01-04 12:45:32,910 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases dados
2024-01-04 12:45:32,910 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: dados[1,7],dados[-1,-1] C: R:
2024-01-04 12:45:32,915 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2024-01-04 12:45:32,915 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1704400452523_0005]
2024-01-04 12:45:37,970 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2024-01-04 12:45:37,970 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2024-01-04 12:45:38,004 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2024-01-04 12:45:38,017 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2024-01-04 12:45:38,017 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
3.2.0 0.17.0 hadoop 2024-01-04 12:45:31 2024-01-04 12:45:38 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1704400452523_0005 dados MAP_ONLY Message: Job failed!hbase://clientes,
Input(s):
Failed to read data from "hdfs://localhost:9000/user/dados/clientes/clientes.txt"
Output(s):
Failed to produce result in "hbase://clientes"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1704400452523_0005
2024-01-04 12:45:38,017 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2024-01-04 12:45:38,018 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6017: Stopping execution on job failure with -stop_on_failure option
Digging into the error log:
Pig Stack Trace
---------------
ERROR 6017: Stopping execution on job failure with -stop_on_failure option
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias dados
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1779)
at org.apache.pig.PigServer.registerQuery(PigServer.java:708)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1110)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:512)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:564)
at org.apache.pig.Main.main(Main.java:175)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6017: Stopping execution on job failure with -stop_on_failure option
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:561)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:290)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1475)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1460)
at org.apache.pig.PigServer.execute(PigServer.java:1449)
at org.apache.pig.PigServer.access$500(PigServer.java:119)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1774)
... 14 more
================================================================================
I’ve already tried running in local mode and specifying the full path to the files, but without success. Can anyone help?