ANCE, microsoft, arXiv:2007.00808

8/16/2021 05:55:00 PM

Lee Xiong et al., Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval, arxiv 2007.00808

A paper written at Microsoft.

ANCE

= Approximate nearest neighbor Negative Contrastive Estimation
ANCE is a method for improving negative sampling in dense retrieval (DR).

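To find documents relevant to a given query, dense retrieval represents the query and the documents as dense vectors and uses the similarity between those vectors. Equation (1) in the paper gives the (query, document) similarity computed from the dense vectors.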
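As a rough illustration of the similarity described above, here is a minimal sketch that assumes the dot product as the similarity function. The embedding values and the names `dot_similarity`, `query`, and `docs` are made up for the example; in a real system the vectors would come from a trained encoder.

```python
from typing import Sequence


def dot_similarity(query_vec: Sequence[float], doc_vec: Sequence[float]) -> float:
    """Dot product between a query embedding and a document embedding."""
    if len(query_vec) != len(doc_vec):
        raise ValueError("embeddings must have the same dimension")
    return sum(q * d for q, d in zip(query_vec, doc_vec))


# Toy 3-dimensional embeddings (hypothetical values, not taken from the paper).
query = [0.5, 1.0, 0.0]
docs = {"doc_a": [1.0, 1.0, 0.0], "doc_b": [0.0, 0.0, 2.0]}

# Rank documents by similarity to the query, highest first.
ranked = sorted(docs, key=lambda name: dot_similarity(query, docs[name]), reverse=True)
```

Here `doc_a` scores 1.5 and `doc_b` scores 0.0, so `doc_a` is ranked first.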

As the opposite concept, the paper mentions sparse retrieval, giving models that search by term matching, such as BM25, as examples.

ANCE uses the DR model as currently trained to find documents with high similarity to the query, and uses those that are not positive documents as negative documents.

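Documents with high similarity to the query are searched for over the entire collection. This is where it differs from in-batch sampling, which only considers the small set of documents inside the batch.
To find documents with high similarity to the query, faiss is used for ANN (approximate nearest neighbor) search.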
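The negative construction described above can be sketched as follows. This is only an illustration: it swaps the faiss ANN index for an exhaustive dot-product search over a tiny corpus, and the function names, document ids, and vectors are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mine_negatives(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    positive_ids: Set[str],
    top_k: int,
) -> List[str]:
    """Take the top_k documents most similar to the query under the current
    embeddings, then drop the labeled positives. What remains are negative
    candidates drawn from the whole corpus, not from a batch."""
    ranked = sorted(doc_vecs, key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]), reverse=True)
    return [doc_id for doc_id in ranked[:top_k] if doc_id not in positive_ids]


# Hypothetical corpus embeddings produced by the current model.
corpus = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.0, 1.0],
    "d4": [0.8, 0.3],
}
negatives = mine_negatives([1.0, 0.0], corpus, positive_ids={"d1"}, top_k=3)
```

With these toy vectors the top 3 are d1, d2, d4, so after removing the positive d1 the negatives are d2 and d4.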
Asynchronous Index Refresh

Inference and ANN-based search take a long time, so the negative documents are not refreshed at every model update. Instead, they are refreshed after every m batches. (Inference is said to take more time than the ANN search.)
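The refresh schedule can be sketched as a simple loop. This is a simplified, synchronous sketch of the "refresh every m batches" idea only; it does not model running inference in parallel with training. `build_negatives` and `train_step` are hypothetical callbacks standing in for corpus inference plus ANN search and for one model update.

```python
from typing import Callable, List, Tuple


def train_with_periodic_refresh(
    num_batches: int,
    refresh_every: int,
    build_negatives: Callable[[int], List[str]],
    train_step: Callable[[int, List[str]], None],
) -> int:
    """Train for num_batches steps, rebuilding the negatives only once every
    refresh_every batches. Returns how many times negatives were built."""
    negatives = build_negatives(0)  # initial negatives before training starts
    refreshes = 1
    for step in range(1, num_batches + 1):
        train_step(step, negatives)
        # Refresh after every refresh_every batches, except after the last one.
        if step % refresh_every == 0 and step < num_batches:
            negatives = build_negatives(step)
            refreshes += 1
    return refreshes


log: List[Tuple[int, str]] = []
refreshes = train_with_periodic_refresh(
    num_batches=6,
    refresh_every=3,
    build_negatives=lambda step: [f"negatives@{step}"],
    train_step=lambda step, negs: log.append((step, negs[0])),
)
```

In this toy run, batches 1 to 3 train on the negatives built at step 0 and batches 4 to 6 on those built at step 3, so the negatives used for training lag behind the model by up to m batches.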

Figure 2 of the paper illustrates how ANCE works.

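Existing negative sampling methods for DR (= related works)

BM25 sampling

Among the top documents retrieved by BM25, those that are not positive documents are used as negative documents.

in-batch sampling

Among the data in the same batch, those that are not positive documents are used as negative documents.

Methods for hard negative sampling within in-batch sampling have been proposed, but they are said to perform worse than BM25 sampling.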

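Data for training retrieval models

Training data for neural retrieval models usually consists of triples (q, d+, d-): a query q, a document d+ relevant to the query, and a document d- not relevant to the query.

Pairwise loss for training retrieval models

The model is trained so that the similarity between the query and the positive document is higher than the similarity between the query and the negative document.
A loss is incurred when the similarity between the query and the positive document is lower than the similarity between the query and the negative document.
Equation (2) in the paper selects the model parameters based on this pairwise loss.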
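To make the pairwise loss concrete, here is a minimal sketch using a hinge form. The hinge form and the `margin` parameter are assumptions chosen because they match the description above (a loss only when the positive scores lower than the negative); the paper's experiments may use a different loss function.

```python
def pairwise_hinge_loss(pos_score: float, neg_score: float, margin: float = 0.0) -> float:
    """Hinge-style pairwise loss for one (q, d+, d-) triple.

    pos_score is the similarity of (q, d+), neg_score the similarity of (q, d-).
    With margin=0.0 the loss is positive only when d+ scores lower than d-.
    """
    return max(0.0, margin - (pos_score - neg_score))


# The positive already scores higher than the negative: no loss.
correct_order = pairwise_hinge_loss(pos_score=2.0, neg_score=0.5)

# The negative scores higher than the positive: the loss equals the score gap.
wrong_order = pairwise_hinge_loss(pos_score=0.5, neg_score=2.0)
```

Here `correct_order` is 0.0 and `wrong_order` is 1.5. Summing this quantity over all triples and minimizing it with respect to the encoder parameters is the kind of objective the parameter selection above refers to.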

Problems with uninformative negative instances

diminishing gradient norms
large stochastic gradient variances
slow learning convergence

By extracting informative negative instances, ANCE obtains large gradient norms, which in turn makes fast learning convergence possible.

Instead of in-batch sampling, ANCE samples negatives from the entire document collection, which produces global negatives.
The larger the gradient norm, the faster training converges.
According to the paper, negative instances with a large gradient norm reduce the training loss more.
The relationship between gradient norm and training convergence in BERT fine-tuning has been shown empirically (Mosbach et al., 2020).

Analysis of convergence in DR training (Section 3 of the paper)

convergence rate and gradient norms

Negative instances that produce a large gradient norm reduce the training loss more. These are the useful, informative negatives.
In contrast, negative instances that produce diminishing gradients are not informative.

diminishing gradients of uninformative negatives

Negative instances that yield a near-zero loss produce near-zero gradients and contribute almost nothing to model convergence.
The convergence of a DR model depends on the negatives that are constructed.
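The link between near-zero loss and near-zero gradient can be checked numerically. The sketch below assumes a logistic pairwise loss, log(1 + exp(-(s+ - s-))), purely for illustration; the scores are made-up values and the function names are hypothetical.

```python
import math


def logistic_pairwise_loss(pos_score: float, neg_score: float) -> float:
    """log(1 + exp(-(s_pos - s_neg))) for one (q, d+, d-) triple."""
    return math.log1p(math.exp(-(pos_score - neg_score)))


def grad_wrt_neg_score(pos_score: float, neg_score: float) -> float:
    """Derivative of the loss above with respect to the negative's score,
    which equals 1 / (1 + exp(s_pos - s_neg))."""
    return 1.0 / (1.0 + math.exp(pos_score - neg_score))


# Easy (uninformative) negative: the model already scores it far below d+.
easy_loss = logistic_pairwise_loss(10.0, 0.0)
easy_grad = grad_wrt_neg_score(10.0, 0.0)

# Hard (informative) negative: the model scores it the same as d+.
hard_loss = logistic_pairwise_loss(1.0, 1.0)
hard_grad = grad_wrt_neg_score(1.0, 1.0)
```

For the easy negative both the loss and the gradient are below 1e-4, while for the hard negative the loss is log 2 and the gradient is 0.5. Under this loss, only the hard negative moves the model noticeably.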

inefficacy of local in-batch negatives

An instance in the batch is very unlikely to be an informative negative.
As a result, in-batch negatives are not very effective.

Experimental setup

Evaluation on the TREC 2019 Deep Learning (DL) Track

Both retrieval and re-ranking are evaluated: retrieval searches over the entire collection, while re-ranking reorders the top 100 documents retrieved by BM25.
Baseline models related to negative sampling

random sampling in batch (Rand Neg)
random sampling from BM25 top 100 (BM25 Neg) (Lee et al., 2019; Gao et al., 2020b)
1:1 combination of BM25 and Random negatives (BM25 + Rand Neg)

implementation

The MARCO passage training labels, described as cleaner (Yan et al., 2019), and BM25 negatives are said to help DR training.
All models are warmed up using the MARCO official BM25 Negatives. ANCE is warmed up in the same way.

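Evaluation on Q&A data

Evaluation data: Natural Questions (NQ) and TriviaQA
Evaluation metrics

Coverage@20: whether the answer appears in the top 20 retrieved passages
Coverage@100: whether the answer appears in the top 100 retrieved passages
Performance of the Q&A system when the retriever is switched to ANCE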
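Coverage@k as described above can be sketched like this. The matching rule (case-insensitive substring match of any gold answer) is an assumption made for the example, and the passages and answers are made-up data.

```python
from typing import List


def coverage_at_k(retrieved: List[List[str]], answers: List[List[str]], k: int) -> float:
    """Fraction of questions for which at least one of the top-k retrieved
    passages contains one of the gold answer strings."""
    if not retrieved or len(retrieved) != len(answers):
        raise ValueError("need one non-empty, aligned entry per question")
    hits = 0
    for passages, gold in zip(retrieved, answers):
        top_k = passages[:k]
        if any(ans.lower() in passage.lower() for passage in top_k for ans in gold):
            hits += 1
    return hits / len(retrieved)


# Two toy questions with their ranked passages and gold answers.
retrieved = [
    ["Paris is the capital of France.", "Lyon is a city in France."],
    ["Mount Fuji is in Japan.", "K2 is in the Karakoram."],
]
answers = [["Paris"], ["Everest"]]
score = coverage_at_k(retrieved, answers, k=2)
```

The first question is covered and the second is not, so `score` is 0.5.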

baseline models

DPR, BM25, and their combinations (Karpukhin et al., 2020).

implementation

ANCE is warmed up from the released DPR checkpoints (Karpukhin et al., 2020).

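Evaluation on a commercial search engine

Checks the performance when the training of the DR model is switched to ANCE

Experiments with FirstP and MaxP for handling long documents

FirstP represents a document by its first 512 tokens.
MaxP splits a document into passages of 512 tokens and represents it by at most 4 passages, so a document has at most 4 representations.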
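The two document representations can be sketched as follows, with documents given as already-tokenized lists. The function names are hypothetical, and scoring a MaxP document by the maximum of its passage scores is stated here as an assumption about the usual meaning of MaxP, not something spelled out in the notes above.

```python
from typing import List, Sequence


def first_p(tokens: Sequence[int], max_len: int = 512) -> List[int]:
    """FirstP: keep only the first max_len tokens of the document."""
    return list(tokens[:max_len])


def max_p_passages(tokens: Sequence[int], max_len: int = 512, max_passages: int = 4) -> List[List[int]]:
    """MaxP: split the document into consecutive max_len-token passages and
    keep at most max_passages of them."""
    passages = [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]
    return passages[:max_passages]


def max_p_score(passage_scores: Sequence[float]) -> float:
    """Assumed MaxP scoring: the document score is its best passage score."""
    return max(passage_scores)


doc_tokens = list(range(1300))   # stand-in for a 1300-token document
long_tokens = list(range(3000))  # stand-in for a 3000-token document

first = first_p(doc_tokens)
passages = max_p_passages(doc_tokens)
truncated = max_p_passages(long_tokens)
```

The 1300-token document yields passages of 512, 512, and 276 tokens, while the 3000-token document would split into 6 passages and is cut to the first 4.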

Negative document selection in ANCE:

For each positive doc, one document is chosen from the top 200 documents retrieved by ANN and used as the negative document.
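A minimal sketch of this selection step, assuming the choice among the top 200 is uniformly random and that labeled positives are excluded from the candidates. Both are assumptions made for the example, and `sample_negative` is a hypothetical helper.

```python
import random
from typing import List, Optional, Set


def sample_negative(
    ann_ranked_ids: List[str],
    positive_ids: Set[str],
    top_n: int = 200,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one negative at random from the top_n ANN results, skipping positives."""
    rng = rng or random.Random()
    candidates = [doc_id for doc_id in ann_ranked_ids[:top_n] if doc_id not in positive_ids]
    if not candidates:
        raise ValueError("no negative candidates in the top results")
    return rng.choice(candidates)


# Hypothetical ANN ranking of 300 document ids for one query; d0 is the positive.
ranking = [f"d{i}" for i in range(300)]
negative = sample_negative(ranking, positive_ids={"d0"}, rng=random.Random(0))
```

The sampled id always comes from d1 to d199 here, since d0 is the positive and everything past rank 200 is ignored.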

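Experimental results

Figure 4: the training loss and gradient norm can be seen to be larger with ANCE.

Training time (from Appendix A.1)

ANCE takes 1 to 2 hours per epoch.
It converges in about 10 epochs (said to be similar to the other DR baseline models).

With larger gradient norms and a larger reduction in training loss, it seems it should converge faster. Perhaps it does not because adding hard negatives makes the problem itself harder.

The search results of the sparse retrieval model and the dense retrieval model overlap by only about 17% to 25% (Table 6 in Appendix A.2).

Measured on the top 100 documents retrieved on the TREC DL Track dataset
Overlapping documents between the BM25 and ANCE search results are counted
In document retrieval, about 24.9% of the documents overlap
In passage retrieval, about 17.4% of the passages overlap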
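An overlap figure of this kind can be computed as below. This is a sketch under the assumption that overlap means the share of the top-100 results common to both systems for a query; the exact definition used in the paper's table is not given in these notes, and the rankings are made-up data.

```python
from typing import List


def overlap_ratio(results_a: List[str], results_b: List[str], k: int = 100) -> float:
    """Share of the top-k ids that appear in both rankings.
    Assumes each ranking contains at least k unique ids."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    return len(top_a & top_b) / k


# Hypothetical rankings for one query: 25 of the top 100 ids are shared.
bm25_ranking = [f"d{i}" for i in range(0, 100)]
ance_ranking = [f"d{i}" for i in range(75, 175)]
ratio = overlap_ratio(bm25_ranking, ance_ranking)
```

In this toy case `ratio` is 0.25. Averaging the per-query ratio over all evaluation queries would give a single percentage like those listed above.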

Queries where DR retrieved better or worse than BM25 (Appendix A.5)

Of the 43 TREC 2019 DL Track evaluation queries, ANCE did better on 29 (67.4%), worse on 13 (30.2%), and the same on 1 (2.3%).
Query where ANCE did better, example 1)

Query: Cost of interior concrete flooring
Title of the ANCE rank-1 document: Concrete network: Concrete Floor Cost
Title of the BM25 rank-1 document: Pinterest: Types of Flooring

Query where ANCE did better, example 2)

Query: What is the most popular food in Switzerland
Title of the ANCE rank-1 document: Wikipedia: Swiss cuisine
Title of the BM25 rank-1 document: Answers.com: Most popular traditional food dishes of Mexico

Query where ANCE did better, example 3)

Query: Define visceral
Title of the ANCE rank-1 document: Vocabulary.com: Visceral
Title of the BM25 rank-1 document: Quizlet.com: A&P EX3 autonomic 9-10

Query where ANCE did worse, example 1)

Query: Example of monotonic function
Title of the ANCE rank-1 document: Wikipedia: Monotonic function
Title of the BM25 rank-1 document: Explain Extended: Things SQL needs: sargability of monotonic functions

Query where ANCE did worse, example 2)

Query: What is a active margin
Title of the ANCE rank-2 document: Wikipedia: Margin (finance)
Title of the BM25 rank-2 document: Yahoo Answer: What is the difference between passive and active continental margins

Query where ANCE did worse, example 3)

Query: How long to hold bow in yoga
Title of the ANCE rank-3 document: Yahoo Answer: How long should you hold a yoga pose for
Title of the BM25 rank-3 document: yogaoutlet.com: How to do bow pose in yoga
