Ambari
Log in with an account other than maria_dev.
su root
→ enter the password → ambari-admin-password-reset
→ enter the password
Pig
Pig exercise
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID: int, movieID: int, rating:int, ratingTime: int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING
PigStorage('|') AS (movieID: int, movieTitle: chararray,
releaseDate: chararray, videoRelease: chararray,
imdbLink: chararray);
DUMP metadata;
chararray == string
nameLookup = FOREACH metadata GENERATE movieID, movieTitle,
ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
This creates a relation.
It seems to pull in the data from u.item that u.data does not have.
ratingsByMovie = GROUP ratings BY movieID;
DUMP ratingsByMovie;
avgRatings = FOREACH ratingsByMovie GENERATE group AS movieID,
AVG(ratings.rating) AS avgRating;
DUMP avgRatings;
Use DESCRIBE to check a relation's schema.
DESCRIBE ratings;
DESCRIBE ratingsByMovie;
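As a rough guide to reading DESCRIBE output, the sketch below shows roughly what the two statements print for these relations (exact formatting may vary between Pig versions, so treat the commented schemas as approximate). It also suggests why the average is written against ratings.rating: GROUP leaves a bag named after the grouped relation.

```pig
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
    (userID: int, movieID: int, rating: int, ratingTime: int);
ratingsByMovie = GROUP ratings BY movieID;

DESCRIBE ratings;
-- roughly: ratings: {userID: int,movieID: int,rating: int,ratingTime: int}

DESCRIBE ratingsByMovie;
-- roughly: ratingsByMovie: {group: int,ratings: {(userID: int,movieID: int,rating: int,ratingTime: int)}}
-- "group" holds the movieID key; "ratings" is a bag of the original tuples,
-- which is why the average is computed as AVG(ratings.rating).
```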
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
DUMP fiveStarsWithData;
oldestFiveStarMovies = ORDER fiveStarsWithData BY
nameLookup::releaseTime;
DUMP oldestFiveStarMovies;
Ran the script in the sandbox.
Just tick "Execute on Tez".
Advanced Pig Latin
LOAD
STORE -> writes a relation out to storage
DUMP
FILTER DISTINCT FOREACH/GENERATE MAPREDUCE
STREAM -> streams a relation through an external program
SAMPLE -> random sample
JOIN
COGROUP -> separate tuples for each key
GROUP
CROSS -> produces every combination of rows (Cartesian product)
CUBE -> like CROSS, but builds combinations across more than 2 columns
ORDER
RANK -> assigns a rank to each row without changing the order
LIMIT -> creates a new relation that keeps only N rows and drops the rest
UNION
SPLIT
DESCRIBE
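EXPLAIN -> similar to SQL EXPLAIN; shows how the script will be executed
ILLUSTRATE -> shows more detail, using sample data at each step
REGISTER -> loads a jar of UDFs -> the UDF has to be written in Java
DEFINE -> gives a UDF a name
IMPORT -> pulls in code from another file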
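A few of the operators above can be tried together on the same ratings file. This is only a minimal sketch: the relation names lowRatings, highRatings, byMovie and firstTen are made up for illustration, and the thresholds are arbitrary.

```pig
-- Same MovieLens ratings file as in the exercises (tab-delimited, the LOAD default).
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
    (userID: int, movieID: int, rating: int, ratingTime: int);

-- SPLIT: route rows into two relations by condition.
SPLIT ratings INTO lowRatings IF rating <= 2, highRatings IF rating >= 4;

-- COGROUP: one tuple per movieID, holding a separate bag for each input relation.
byMovie = COGROUP lowRatings BY movieID, highRatings BY movieID;

-- LIMIT: keep only 10 tuples so the DUMP stays readable.
firstTen = LIMIT byMovie 10;
DUMP firstTen;

-- DESCRIBE prints the schema; EXPLAIN prints the execution plan.
DESCRIBE byMovie;
EXPLAIN firstTen;
```

Unlike JOIN, the COGROUP result is not flattened: each output tuple has the key plus one bag per input, and either bag may be empty.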
AVG, CONCAT, COUNT, MAX, MIN, SIZE, SUM
PigStorage
TextLoader
JsonLoader
AvroStorage -> widely used in Hadoop
ParquetLoader -> column-oriented
OrcStorage
HBaseStorage
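The storage functions above plug into both LOAD and STORE through USING. A minimal sketch with PigStorage, assuming the ratings file from the exercises; the output directory good_ratings and the relation name goodRatings are hypothetical.

```pig
-- LOAD with no USING clause defaults to PigStorage with a tab delimiter.
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
    (userID: int, movieID: int, rating: int, ratingTime: int);

goodRatings = FILTER ratings BY rating >= 4;

-- STORE writes the relation as comma-delimited text.
-- The output directory (hypothetical name) must not already exist.
STORE goodRatings INTO '/user/maria_dev/good_ratings' USING PigStorage(',');
```

Unlike DUMP, which prints to the console, STORE persists the result so that a later job can LOAD it again.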
Exercise
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID: int, movieID: int, rating:int, ratingTime: int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING
PigStorage('|') AS (movieID: int, movieTitle: chararray,
releaseDate: chararray, videoRelease: chararray,
imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle;
groupedRatings = GROUP ratings BY movieID;
-- Average rating and number of ratings per movie
averageRatings = FOREACH groupedRatings GENERATE group AS movieID,
AVG(ratings.rating) AS avgRating, COUNT(ratings.rating) AS numRatings;
badMovies = FILTER averageRatings BY avgRating < 2.0;
namedBadMovies = JOIN badMovies BY movieID, nameLookup BY movieID;
finalResults = FOREACH namedBadMovies GENERATE nameLookup::movieTitle AS movieName,
badMovies::avgRating AS avgRating, badMovies::numRatings AS numRatings;
-- Most-rated bad movies first
finalResultsSorted = ORDER finalResults BY numRatings DESC;
DUMP finalResultsSorted;