Commit e3bc8e06 authored by yushiqian

copy pig lab

parent 8b41320a
# Hadoop Pig Laboratory
This lab is dedicated to Hadoop Pig and consists of a series of exercises: some of them mimic those in the MapReduce lab, others are inspired by "real-world" problems. There are two main goals for this laboratory:
* The first is to gain familiarity with the Pig Latin language to analyze data in many different ways. In other words, to focus on "what to do" with your data: to perform some simple statistics, to mine useful information, or to implement some simple algorithms.
* The second is to understand the details of Hadoop Pig internals by inspecting the process of turning a Pig Latin script into a runnable, optimized, underlying implementation in MapReduce. This means that you should examine what the Pig compiler generates from a Pig Latin script, and reason about Hadoop Job performance by analyzing Hadoop logs and statistics.
## Additional useful resources for Pig
* [Pig Eye for the SQL Guy][mortar-1]
* [Pig vs. MapReduce: When, Why, and How][mortar-2]
[mortar-1]: http://blog.mortardata.com/post/79987678239/pig-eye-for-the-sql-guy-redux
[mortar-2]: http://blog.mortardata.com/post/60274287605/pig-vs-mapreduce
### Useful tools for "debugging":
* **DESCRIBE** relation: this is very useful to understand the schema applied to each relation. Note that understanding schema propagation in Pig requires some time.
* **DUMP** relation: this command is similar to the STORE command, except that it writes the selected relation to ```stdout```.
* **ILLUSTRATE** relation: this command is useful to get a sample of the data in a relation.
* **EXPLAIN** generates (text and .dot) files that illustrate the DAG (directed acyclic graph) of the MapReduce jobs produced by Pig; these can be visualized with graph-chart tools such as GraphViz. This is very useful to get an idea of what is going on under the hood.
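As a quick illustration (a sketch only, using the word-count sample shipped with this lab), these debugging commands can be combined in an interactive grunt session:

```pig
-- Start the shell with: pig -x local
A = LOAD './sample-input/WORD_COUNT/sample.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
DESCRIBE B;    -- show the schema Pig inferred for B
DUMP B;        -- print the tuples of B on stdout
EXPLAIN B;     -- show the logical/physical/MapReduce plans for B
```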
### Additional documentation for the laboratory
The underlying assumption is that students taking part in this laboratory are familiar with MapReduce and Pig/Pig Latin. Additional documentation useful for the exercises is available here: http://pig.apache.org/docs/r0.11.0/. Note that we will use Hadoop Pig 0.12.0, included in the Cloudera distribution of Hadoop, CDH 5.3.2.
## Exercises and Rules
The general rule when using a real cluster is the following:
* First work locally using ```pig -x local```: you can either use the interactive shell or work directly on pig scripts, operating on data residing in the local file-system. **For EURECOM students**: note that you need to log in to the "gateway" machine to run pig, even in what we call "local execution mode". This means you have to copy the sample input files available [here][input-sample] to a local directory of your group account on the gateway machine before running your scripts "locally".
Also, note that interactive use of the pig shell requires connecting to the gateway machine.
* Then, submit the job to the cluster: ```pig -x mapreduce```. **NOTE**: remember that a script that works locally may require some minor modifications when submitted to the Hadoop cluster. For example, you may want to explicitly set the degree of parallelism for the "reduce" phase, using the PARALLEL clause.
[input-sample]: https://github.com/michiard/CLOUDS-LAB/tree/master/labs/pig-lab/sample-input
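As a sketch of the ```PARALLEL``` clause (relation names here are hypothetical), a reduce-side operator such as GROUP or JOIN can be given an explicit number of reducers:

```pig
-- Hypothetical example: request 10 reduce tasks for the GROUP phase
D = GROUP C BY word PARALLEL 10;
```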
## Exercise 1:: Word Count
Problem statement: Count the occurrences of each word in a text file.
The problem is exactly the same as the one in the MapReduce laboratory. In this exercise we will write a Pig Latin script to handle the problem, and let Pig do its work.
### Writing your first Pig Latin script
It is important to run this Pig Latin script in local execution mode: ```pig -x local```. The following lines of code can also be submitted to the interactive pig shell (grunt) with some minor modification. Use your favorite editor/IDE and copy the code for this exercise, which is reported below:
```
-- Load input data from local input directory
A = LOAD './sample-input/WORD_COUNT/sample.txt';
-- Parse and clean input data
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
C = FILTER B BY word MATCHES '\\w+';
-- Group the tuples by word (the GROUP-BY step)
D = GROUP C BY word;
-- Generate output data in the form: <word, counts>
E = FOREACH D GENERATE group, COUNT(C);
-- Store output data in local output directory
STORE E INTO './local-output/WORD_COUNT/';
```
As you may notice, this exercise is already solved (to be precise, the above is one possible solution); you are free to develop your own. The goal of this exercise is to get familiar with Pig and Pig Latin by inspecting the script above. Using all the information we have provided so far, play with Pig and answer the questions at the end of this exercise.
### How to inspect your results and check your job on the cluster
+ Local execution: there is no mystery here, you can inspect output files in your local output directory
+ Cluster execution: you can use ```hdfs dfs``` from the command line
+ Inspecting your job status on the cluster: you can identify your job by name (try to use an original/unique name for the pig script you submit, and also check for your unix login) and check its status using the Web UI of the Resource Manager. Make sure to obtain useful information, for example, what optimizations Pig brings in to help reduce I/O costs and ultimately achieve better performance.
### Questions:
+ Q1: Compare execution times between your Pig and Hadoop MapReduce jobs: which is faster? What are the overheads in Pig?
+ Q2: What does a ```GROUP BY``` command do? In which phase of MapReduce is ```GROUP BY``` performed in this exercise and in general?
+ Q3: What does a ```FOREACH``` command do? In which phase of MapReduce is ```FOREACH``` performed in this exercise and in general?
## Exercise 2:: Working with Online Social Networks data
In this exercise we use a small dataset from Twitter that was obtained from this project: [Link][tw-data]. For convenience, an example of the Twitter dataset is available in the ```sample-input``` directory. Larger datasets are available in the private HDFS deployment of the laboratory: look up the two files ```/laboratory/twitter-small.txt``` and ```/laboratory/twitter-big.txt```.
The format of the dataset (both local and cluster) is the following:
```
USER_ID \t FOLLOWER_ID \n
```
where USER_ID and FOLLOWER_ID are numeric IDs (integers).
Example:
```
12 13
12 14
12 15
16 17
```
+ Users 13, 14 and 15 are followers of user 12.
+ User 17 is a follower of user 16.
### Counting the number of "followers" per Twitter user
Problem statement: for each user, calculate the total number of followers of that user.
Open the pig script ```./pig-lab/sample-solutions/OSN/tw-count.pig``` in your favorite editor. Your goal is to fill in the TODOs and produce the desired output: in the example above, we would like to find that user 12 has 3 followers and user 16 has 1 follower.
The output format should be like:
```
USER_ID \t No. of FOLLOWERs \n
```
Example:
```
12 3
16 1
```
[tw-data]: http://an.kaist.ac.kr/traces/WWW2010.html "Twitter datasets"
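For reference only (remember to attempt your own solution first), the core of a solution typically follows the word-count pattern. The input and output paths below are placeholders for wherever you copied the sample dataset:

```pig
-- Load the edge list: USER_ID \t FOLLOWER_ID (placeholder path)
edges = LOAD './sample-input/OSN/twitter-sample.txt' USING PigStorage('\t')
        AS (user:int, follower:int);
-- Group the edges by the followed user
by_user = GROUP edges BY user;
-- Emit: USER_ID \t number of followers
counts = FOREACH by_user GENERATE group AS user, COUNT(edges) AS followers;
STORE counts INTO './local-output/OSN/tw-count/';
```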
**IMPORTANT NOTE 1**: This applies to EURECOM students. Although the Twitter graph stored in the laboratory HDFS deployment is not huge, you are strongly advised to be 'gentle' and avoid running your Pig Latin program on the large file. The main problem is disk space: we cannot guarantee that all the output generated by your script will fit in the space we granted to HDFS.
**IMPORTANT NOTE 2**: You surely noticed that there is a directory with solved exercises (for this and all other exercises of the lab). It is important that you come up with your own solution to the exercises, and use the solved Pig Latin scripts only as a reference once you're done.
#### Sub-exercises:
+ E2.1: **Follower distribution**: For each user ID, count the number of users she follows. [Optional] Use your favourite plotting tool to visualize the distribution.
+ E2.2: **Outliers Detection**: find outliers (users that have a number of followers above an arbitrary threshold -- which you have to manually set)
#### Questions:
+ Q1: Is the output sorted? Why?
+ Q2: Can we impose an output order, ascending or descending? How?
+ Q3: Related to job performance, what kinds of optimization does Pig provide in this exercise? Are they useful? Can we disable them? Should we?
+ Q4: What should we do when the input has some noise? For example: some lines in the dataset only contain USER_ID but the FOLLOWER_ID is unavailable or null.
### Find the number of two-hop paths in the Twitter network
This exercise is related to the JOIN example discussed during the class on relational algebra and Hadoop Pig. Your goal is to find all two-hop social relations in the Twitter dataset: for example, if user 12 is followed by user 13 and user 13 is followed by user 19, then there is a two-hop relation between user 12 and user 19. Open the pig script ```./pig-lab/sample-solutions/OSN/tw-join.pig``` and fill in the TODOs to produce the desired output.
The output format should be like:
```
USER_1_ID \t USER_2_ID
```
**Warning**: Remember to set ```PARALLEL``` appropriately, or you will have to wait for a very long time... If you forgot to do this, or if in any case you have been waiting for more than half an hour, kill your job, go back to your script and check it over; remember that ```CTRL+c``` alone is not going to stop a job already running on the cluster.
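A sketch of one possible self-join approach (relation names are illustrative, and ```$input```/```$output``` are assumed parameters; Pig needs two distinct relations to join a dataset with itself):

```pig
-- Load the edge list twice: an edge (u, f) means f follows u
e1 = LOAD '$input' USING PigStorage('\t') AS (user:int, follower:int);
e2 = LOAD '$input' USING PigStorage('\t') AS (user:int, follower:int);
-- A two-hop path exists when the follower in e1 is the followed user in e2
j = JOIN e1 BY follower, e2 BY user PARALLEL 10;
paths = FOREACH j GENERATE e1::user AS user1, e2::follower AS user2;
-- Optionally drop loops and duplicate tuples
no_loops = FILTER paths BY user1 != user2;
uniq = DISTINCT no_loops PARALLEL 10;
STORE uniq INTO '$output';
```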
#### Questions:
+ Q1: What is the size of the input data? How long does it take for the job to complete in your case? What are the causes of poor job performance?
+ Q2: Try to set the parallelism with different numbers of reducers. What can you say about the load balancing between reducers?
+ Q3: Have you verified your results? Does your result contain duplicate tuples? Do you have loops (tuples that point from one user to the same user)? What operations do you use to remove duplicates?
+ Q4: How many MapReduce jobs does your Pig script generate? Explain why.
## Use-case:: Working with Network Traffic Data
Please, follow this link [TSTAT Trace Analysis with Pig][tstat] for this exercise.
**IMPORTANT**: For EURECOM students, although you can work on all exercises, for evaluation purposes you only need to complete exercises 1-4.
[tstat]: tstat-analysis/README.md "TSTAT"
## Use-case:: Working with an Airline dataset
Please, go to [AIRLINE Traffic Analysis with Pig][airlines] for this exercise.
**IMPORTANT**: For EURECOM students, although you can work on all exercises, for evaluation purposes you only need to complete queries 1-5. As you may notice, there are no explicit exercise questions here. You will need to **briefly** discuss the design of your Pig Latin script with the teaching assistant, and discuss the results you obtain.
[airlines]: airtraffic-analysis/README.md "AIRLINES"
<!-- ## Optional Exercises:: Iterative Algorithms with Pig
The goal of this exercise is to understand how to embed Pig Latin in Python. This exercise was conceived as a coding example by Julien Le Dem (Data Systems Engineer, Twitter) to illustrate Pig embedding. In short, Pig natively lacks support of control flow statements: if/else, while loop, for loop, etc. Starting with Pig 0.9 it is now possible to write a python (other languages are available as well) program and embed Pig scripts, leveraging all language features provided by Python, including control flow. This is especially important as it simplifies the implementation of **iterative algorithms**.
The original source for this exercise, plus a related post on how to implement *k-*means in Pig are available here:
+ PageRank: http://techblug.wordpress.com/2011/07/29/pagerank-implementation-in-pig/
+ *k*-means: http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding/
The goal of this exercise is to study the PageRank algorithm, and compare implementation and execution details of two approaches: a native MapReduce implementation and the Pig implementation.
### Pig implementation
Official documentation explaining all the details behind embedding is available here: [Link][pig-embedding]. Students are invited to open (in their favorite IDE) the first version of the Python/Pig PageRank implementation, namely ```pg_v1.py```.
With reference to the official documentation, this implementation first ```compile()```s the pig script, then passes parameters using ```bind(params)``` and calls ```runSingle()``` for each iteration of the PageRank algorithm. The output of each iteration becomes the input of the next one. Students are invited to first work **locally**, then submit the job to the cluster:
+ Local execution: use the ```./sample-input/PAGE_RANK/pg_simple.txt``` input file.
+ Cluster execution: use the ```/pig-lab/input/PAGE_RANK/web_graph.txt``` input file located in HDFS. Please note that this file is about 6.6 GB.
### Optional exercises
The following is a list of optional exercises:
+ Study a slightly improved version of PageRank, by inspecting ```pg_v2.py```
+ Modify the MapReduce implementation of PageRank described above such that it can accept as input the same format used for the Python/Pig implementation
+ Proceed with an alternative implementation of PageRank in MapReduce, following Chapter 5 of the book **Mining of Massive Datasets**, by *Anand Rajaraman and Jeff Ullman*, Cambridge University Press.
+ Implement the *k*-means algorithm either in MapReduce or in Python/Pig (use http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding/). **NOTE**: Thanks to Jun Chen (https://github.com/titaniumrain) for contributing this exercise, which is included in the ```sample-solutions``` folder.
[pig-embedding]: http://pig.apache.org/docs/r0.9.2/cont.html#embed-python "Pig Embedding"
-->
## Pig tricks
### Parameters
Put variable stuff into parameters, to allow for easy substitution! Typically this might be input and output files (or grouping parameters!), with sensible defaults, so that you can default to a small local file for rapid testing, and then override this with the big files on the cluster:
```pig
-- Set sensible defaults for local execution
%default input 'airline-sample.txt'
-- Load the data
dataset = LOAD '$input' using PigStorage(',') AS (...);
```
You can specify parameters when executing the script by using the `-p` flag, and `-f` to point to your script:
```
pig -f airline_q1.pig -p input=/laboratory/airline/2008.csv
```
For local testing with the default values, you'd just run it without any `-p` or `-f` flags:
```
pig -x local airline_q1.pig
```
### Multiple input files
To execute your script on several input files, use Hadoop path expansion to shorten stuff down:
```
dataset = LOAD '/laboratory/airlines/{2005,2006,2007,2008}.csv' using PigStorage(',') AS (...);
```
Wildcards also work:
```
dataset = LOAD '/laboratory/airlines/*.csv' using PigStorage(',') AS (...);
```
You can also specify this as a parameter, but remember to put it in quotes to prevent your shell from doing the expansion:
```
pig -f airline_q1.pig -p 'input=/laboratory/airlines/{2005,2006,2007,2008}.csv'
```
**NOTE**: Shell-expansion style `{2005..2008}` sadly does not work in Pig.
# Airline Data Analysis with Pig
+ This exercise is inspired by http://www.datadr.org/doc/airline.html
+ Full information on datasets (optional datasets), and general documentation available here: http://stat-computing.org/dataexpo/2009/
Before we start, here's a description of the dataset "schema". We will work on data that can be downloaded from here: http://stat-computing.org/dataexpo/2009/the-data.html
Note that there is a single CSV file per year, hence the first field below is somewhat redundant, although you could concatenate all the files and work on them as a whole (which, by the way, would make sense when using Hadoop MapReduce / Pig). In summary, there are 29 fields, which provide enough information to build Pig scripts that cover Queries 1-5. For the advanced analysis subsection you need other data, which can be downloaded from the links below.
```
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
```
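The 29 fields above translate directly into a LOAD schema. Here is a hedged sketch (field names lowercased, types chosen by assumption; some fields contain "NA" values and may be safer loaded as chararray):

```pig
flights = LOAD '/laboratory/airlines/2008.csv' USING PigStorage(',')
    AS (year:int, month:int, dayofmonth:int, dayofweek:int,
        deptime:int, crsdeptime:int, arrtime:int, crsarrtime:int,
        uniquecarrier:chararray, flightnum:int, tailnum:chararray,
        actualelapsedtime:int, crselapsedtime:int, airtime:int,
        arrdelay:int, depdelay:int, origin:chararray, dest:chararray,
        distance:int, taxiin:int, taxiout:int, cancelled:int,
        cancellationcode:chararray, diverted:int, carrierdelay:int,
        weatherdelay:int, nasdelay:int, securitydelay:int,
        lateaircraftdelay:int);
```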
Other sources of data are available here: http://stat-computing.org/dataexpo/2009/supplemental-data.html. Specifically, we are interested in:
+ Airport IATA Codes to City names and Coordinates mapping: http://stat-computing.org/dataexpo/2009/airports.csv
+ Carrier codes to Full name mapping: http://stat-computing.org/dataexpo/2009/carriers.csv
+ Information about individual planes: http://stat-computing.org/dataexpo/2009/plane-data.csv
+ Weather information: http://www.wunderground.com/weather/api/. You can subscribe for free to the developer API and obtain (at a limited rate) historical weather information in many different formats. Also, to get an idea of the kind of information available, you can use this link: http://www.wunderground.com/history/
## The Data:
For Eurecom students, we have put some CSV (comma-separated values) files in HDFS for you. They are located in the HDFS directory ```/laboratory/airlines/``` and account for four years of data, 2005 to 2008. You can work on individual years, or feed the entire directory to your jobs so that you can process all years.
## Exercises:
In the following, we propose a series of exercises in the form of Queries.
### *Query 1:* Top 20 airports by total volume of flights
What are the busiest airports by total flight traffic? JFK will feature, but what are the others? For each airport code, compute the number of inbound, outbound and total flights. Variation on the theme: compute the above by day, week, month, and over the years.
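One possible skeleton, assuming a ```flights``` relation loaded with ```origin``` and ```dest``` fields (fields 17 and 18 of the schema):

```pig
-- Count outbound flights per origin and inbound flights per destination
out_cnt = FOREACH (GROUP flights BY origin)
              GENERATE group AS airport, COUNT(flights) AS outbound;
in_cnt  = FOREACH (GROUP flights BY dest)
              GENERATE group AS airport, COUNT(flights) AS inbound;
-- Combine the two counts and rank by total traffic
joined = JOIN out_cnt BY airport, in_cnt BY airport;
totals = FOREACH joined GENERATE out_cnt::airport AS airport,
             outbound, inbound, outbound + inbound AS total;
ordered = ORDER totals BY total DESC;
top20 = LIMIT ordered 20;
```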
### *Query 2:* Carrier Popularity
Some carriers come and go, others demonstrate regular growth. Compute the (log base 10) volume -- total flights -- over each year, by carrier. Rank the carriers by their median volume (over the 4-year span).
### *Query 3:* Proportion of Flights Delayed
A flight is delayed if its delay is greater than 15 minutes. Compute the fraction of delayed flights at different time granularities (hour, day, week, month, year).
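As a sketch for the per-year granularity (assuming a ```flights``` relation with ```year``` and ```arrdelay``` fields in minutes; the other granularities follow the same pattern):

```pig
by_year = GROUP flights BY year;
delayed_fraction = FOREACH by_year {
    -- A flight counts as delayed when the delay exceeds 15 minutes
    d = FILTER flights BY arrdelay > 15;
    GENERATE group AS year, (double)COUNT(d) / COUNT(flights) AS fraction;
};
```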
### *Query 4:* Carrier Delays
Is there a difference in carrier delays? Compute the proportion of delayed flights by carrier, ranked by carrier, at different time granularities (hour, day, week, month, year). Again, a flight is delayed if the delay is greater than 15 minutes.
### *Query 5:* Busy Routes
Which routes are the busiest? A simple first approach is to create a frequency table for the unordered pair (i,j) where i and j are distinct airport codes.
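A hedged sketch of that first approach (assuming ```origin``` and ```dest``` fields; the bincond normalizes each pair so that A->B and B->A count as the same route):

```pig
-- Normalize each (origin, dest) pair into an unordered pair (a, b)
pairs = FOREACH flights GENERATE
            (origin < dest ? origin : dest) AS a,
            (origin < dest ? dest : origin) AS b;
-- Frequency table per unordered pair, busiest first
routes = FOREACH (GROUP pairs BY (a, b))
             GENERATE FLATTEN(group) AS (a, b), COUNT(pairs) AS n;
busiest = ORDER routes BY n DESC;
```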
## Optional data analysis tasks
Note that the following "queries" are somewhat more involved than the preceding ones. They require a more elaborate approach, and in some cases additional data must be downloaded (see sources above) and copied into HDFS.
+ When is the best time of day/day of week/time of year to fly to minimize delays?
+ Do older planes suffer more delays?
+ How does the number of people flying between different locations change over time?
+ How well does weather predict plane delays?
+ Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
Error before Pig is launched
----------------------------
ERROR 2997: Encountered IOException. File sample-solutions/WORD_COUNT/word_count.pig does not exist
java.io.FileNotFoundException: File sample-solutions/WORD_COUNT/word_count.pig does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at org.apache.pig.impl.io.FileLocalizer.fetchFilesInternal(FileLocalizer.java:744)
at org.apache.pig.impl.io.FileLocalizer.fetchFile(FileLocalizer.java:688)
at org.apache.pig.Main.run(Main.java:549)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
================================================================================
Pig Stack Trace
---------------
ERROR 2244: Job failed, hadoop does not return any error message
org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:607)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
================================================================================
Pig Stack Trace
---------------
ERROR 6000:
<file ./sample-solutions/WORD_COUNT/word_count.pig, line 15, column 0> Output Location Validation Failed for: 'file:///home/group37/CLOUDS-LAB/labs/pig-lab/local-output/WORD_COUNT More info to follow:
Output directory file:/home/group37/CLOUDS-LAB/labs/pig-lab/local-output/WORD_COUNT already exists
org.apache.pig.impl.plan.VisitorException: ERROR 6000:
<file ./sample-solutions/WORD_COUNT/word_count.pig, line 15, column 0> Output Location Validation Failed for: 'file:///home/group37/CLOUDS-LAB/labs/pig-lab/local-output/WORD_COUNT More info to follow:
Output directory file:/home/group37/CLOUDS-LAB/labs/pig-lab/local-output/WORD_COUNT already exists
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:95)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:311)
at org.apache.pig.PigServer.compilePp(PigServer.java:1380)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1305)
at org.apache.pig.PigServer.execute(PigServer.java:1297)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:607)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/home/group37/CLOUDS-LAB/labs/pig-lab/local-output/WORD_COUNT already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
... 27 more
================================================================================
Pig Stack Trace
---------------
ERROR 2997: Encountered IOException. org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1426847610111_2382' doesn't exist in RM.
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:288)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1426847610111_2382' doesn't exist in RM.
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:288)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:348)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:553)
at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:183)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:582)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:580)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.JobClient.getJobUsingCluster(JobClient.java:580)
at org.apache.hadoop.mapred.JobClient.getTaskReports(JobClient.java:635)
at org.apache.hadoop.mapred.JobClient.getMapTaskReports(JobClient.java:629)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:150)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:428)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1322)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1307)
at org.apache.pig.PigServer.execute(PigServer.java:1297)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:607)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
================================================================================
Pig Stack Trace
---------------
ERROR 2997: Encountered IOException. org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1426847610111_2407' doesn't exist in RM.
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:288)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1426847610111_2407' doesn't exist in RM.
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:288)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:348)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:553)
at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:183)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:582)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:580)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.JobClient.getJobUsingCluster(JobClient.java:580)
at org.apache.hadoop.mapred.JobClient.getTaskReports(JobClient.java:635)
at org.apache.hadoop.mapred.JobClient.getMapTaskReports(JobClient.java:629)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:150)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:428)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1322)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1307)
at org.apache.pig.PigServer.execute(PigServer.java:1297)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:607)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
================================================================================
Pig Stack Trace
---------------
ERROR 6000:
<file ./sample-solutions/WORD_COUNT/word_count.pig, line 15, column 0> Output Location Validation Failed for: 'hdfs://bigdoop-1.vms.bigfoot.eurecom.fr:8020/user/group37/local-output/WORD_COUNT More info to follow:
Output directory hdfs://bigdoop-1.vms.bigfoot.eurecom.fr:8020/user/group37/local-output/WORD_COUNT already exists
org.apache.pig.impl.plan.VisitorException: ERROR 6000:
<file ./sample-solutions/WORD_COUNT/word_count.pig, line 15, column 0> Output Location Validation Failed for: 'hdfs://bigdoop-1.vms.bigfoot.eurecom.fr:8020/user/group37/local-output/WORD_COUNT More info to follow:
Output directory hdfs://bigdoop-1.vms.bigfoot.eurecom.fr:8020/user/group37/local-output/WORD_COUNT already exists
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:95)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:311)
at org.apache.pig.PigServer.compilePp(PigServer.java:1380)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1305)
at org.apache.pig.PigServer.execute(PigServer.java:1297)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:607)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://bigdoop-1.vms.bigfoot.eurecom.fr:8020/user/group37/local-output/WORD_COUNT already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
... 27 more
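ERROR 6000 above is one of the most common mistakes in this lab: Pig, like plain MapReduce, refuses to overwrite an existing output directory, so re-running the same script fails output-location validation. Either store each run in a fresh directory, or delete the previous output before the STORE. One way, sketched against the path shown in the trace, is to put the Grunt `rmf` command at the top of your script; unlike `rm`, `rmf` does not raise an error when the path does not exist. Here `word_count` stands for whatever relation your script actually stores:

```pig
-- Remove the previous run's output first; path taken from the trace above
rmf local-output/WORD_COUNT

STORE word_count INTO 'local-output/WORD_COUNT';
```

Equivalently, you can clean up from the command line with `hdfs dfs -rm -r local-output/WORD_COUNT` before launching Pig.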
================================================================================
Pig Stack Trace
---------------
ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "exit "" at line 11, column 1.
Was expecting one of:
<EOF>
"cat" ...
"clear" ...
"fs" ...
"sh" ...
"cd" ...
"cp" ...
"copyFromLocal" ...
"copyToLocal" ...
"dump" ...
"\\d" ...
"describe" ...
"\\de" ...
"aliases" ...
"explain" ...
"\\e" ...
"help" ...
"history" ...
"kill" ...
"ls" ...
"mv" ...
"mkdir" ...
"pwd" ...
"quit" ...
"\\q" ...
"register" ...
"rm" ...
"rmf" ...
"set" ...
"illustrate" ...
"\\i" ...
"run" ...