[Week 9 - Day 1] Retrospective

2022. 3. 15. 00:06ㆍAbout Me/AI Tech

# Byte Pair Encoding

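Reportedly, BERT-based models all use BPE or a close variant (BERT itself uses WordPiece, which is closely related to BPE).

With word-level embeddings, the vocabulary grows too large and out-of-vocabulary (OOV) problems occur frequently.

-> So split words further and embed subwords instead!

The underlying assumption is that frequently occurring subwords carry independent meaning.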

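import re
import collections

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        # The pair loop must run once per word, so it sits inside the outer loop.
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Merge every occurrence of `pair` into a single symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only at whole-symbol (whitespace-delimited) boundaries.
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# Toy vocabulary: space-separated symbols, '</w>' marks the end of a word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair
    vocab = merge_vocab(best, vocab)
    print(best)  # the pair merged at this step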
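
To connect this back to the OOV motivation, here is a minimal sketch of how a learned merge list might be replayed on a word that never appeared in the training vocabulary. The function name `apply_bpe` and the hand-written `example_merges` list are assumptions made for illustration; they are not part of the original code or of any library.

```python
def apply_bpe(word, merges):
    """Segment `word` by replaying `merges` in the order they were learned."""
    # Start from single characters plus the end-of-word marker.
    symbols = list(word) + ['</w>']
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair wherever it appears as two adjacent symbols.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, written by hand for this example.
example_merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]

# 'lowest' is not in the toy vocabulary, yet it splits into known subwords.
print(apply_bpe('lowest', example_merges))  # ['low', 'est</w>']
```

A word with no matching merges simply falls back to individual characters, so nothing is ever fully out of vocabulary as long as the characters themselves are known.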

References: https://arxiv.org/pdf/1508.07909.pdf

https://wikidocs.net/22592

# binary classification model

logistic regression
SVC
decision tree
random forest
catboost

## Peer Session

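For the Further Question, we discussed 'the shortcomings of the BLEU score'.

A single sentence can have many valid translations. Words are also not weighted by their semantic importance.

Sentence length differs depending on whether a translation is literal or free, and this affects the score.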
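
Both points can be seen in a toy calculation. The sketch below is a deliberately simplified, unigram-only stand-in for BLEU (the real metric combines clipped n-gram precisions up to 4-grams with a geometric mean); the function names and the example sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

def simple_bleu1(candidate, reference):
    """Unigram-only simplification of BLEU, for illustration only."""
    return brevity_penalty(candidate, reference) * clipped_precision(candidate, reference)

reference = "the cat is on the mat".split()
paraphrase = "a feline sits upon the rug".split()  # plausible meaning, different words
short = "the cat".split()                          # every word matches, but too short

print(simple_bleu1(paraphrase, reference))  # 1/6: only "the" overlaps
print(simple_bleu1(short, reference))       # exp(-2): perfect precision, heavy length penalty
```

The paraphrase is punished purely for word choice and the short candidate purely for length, and in neither case does the score look at which words matter most to the meaning.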

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

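The competition starts again next week, so we decided to revise our ground rules.

The "Surviving Boostcamp" (부캠에서 살아남기) session also stressed that setting ground rules properly is important for good collaboration.

The points that stood out to me were: leave emotion out when evaluating a model or someone's work, and always prepare visualizations to support decision-making.

Another was that Boostcamp has plenty of people to work with, so make good use of that and build up many good projects.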
