RocketQA, Baidu, NAACL2021

12/19/2021 08:43:00 AM

Yingqi Qu, et al., RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering, NAACL 2021

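Proposes a method for effectively training the dual encoder used in neural ranking models.

Problems with existing dual encoders

Discrepancy between the training data and real data
The training data contains many positive passages that are not labeled
On the MS MARCO dataset, 70% of the top-retrieved passages are unlabeled positives
The training data is insufficient (because it is expensive to build).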

Proposed method

cross-batch negatives

in-batch negatives use the positives of other queries on the same GPU as negatives
cross-batch negatives use the positives of other queries on all GPUs as negatives
So cross-batch negatives provide more negatives per query.
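To make the difference concrete, here is a minimal pure-Python sketch (not the paper's code). It assumes the loss is a softmax over dot-product scores. The names `negative_pools` and `contrastive_loss` and the toy embeddings are made up for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax_nll(scores, positive_index):
    """Negative log-likelihood of the positive under a softmax over all candidates."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


def contrastive_loss(query, positive, negatives):
    """Loss for one query: its own positive scored against a pool of negatives."""
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    return softmax_nll(scores, 0)


def negative_pools(passages_per_gpu, gpu, idx):
    """Return (in_batch, cross_batch) negative pools for query `idx` on `gpu`.

    passages_per_gpu[g][i] is the positive passage embedding of query i on GPU g.
    """
    # in-batch: positives of the other queries on the same GPU
    in_batch = [p for i, p in enumerate(passages_per_gpu[gpu]) if i != idx]
    # cross-batch: positives of the other queries on every GPU
    cross_batch = [
        p
        for g, batch in enumerate(passages_per_gpu)
        for i, p in enumerate(batch)
        if (g, i) != (gpu, idx)
    ]
    return in_batch, cross_batch


# Toy setup (assumption): 2 GPUs, 3 queries per GPU, 2-dimensional embeddings.
passages_per_gpu = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[-1.0, 0.0], [0.0, -1.0], [0.9, 0.1]],
]
query = [1.0, 0.0]  # query 0 on GPU 0
in_batch, cross_batch = negative_pools(passages_per_gpu, gpu=0, idx=0)
loss_in = contrastive_loss(query, passages_per_gpu[0][0], in_batch)
loss_cross = contrastive_loss(query, passages_per_gpu[0][0], cross_batch)
```

With A GPUs and B queries per GPU, the in-batch pool has B - 1 negatives and the cross-batch pool has A * B - 1. In the toy setup that is 2 versus 5.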

denoised hard negatives

False negatives are excluded from the hard negatives
A cross-encoder is used to identify the false negatives to exclude

Automatically builds pseudo labeled data

The cross-encoder is used to build the pseudo labeled data

Four-step training

Step 1: train the dual encoder (the trained model is called $M_D^{(0)}$)

Train the dual encoder with the labeled data and cross-batch negatives

Step 2: train the cross-encoder (the trained model is called $M_C$)

Positives are taken from the labeled data
Negatives are the top-retrieved passages of $M_D^{(0)}$
In effect, the cross-encoder is trained on hard negatives found by the dual encoder

However, it is trained with false negatives still included

Step 3: train the dual encoder again (the trained model is called $M_D^{(1)}$)

Positives are taken from the labeled data (probably)
Negatives are passages that are top-retrieved by $M_D^{(0)}$ and that $M_C$ confidently judges to be negative

That is, the negatives found by $M_D^{(0)}$ that look like false negatives are removed

Cross-batch negatives are used as well

Step 4: train the dual encoder again (the trained model is called $M_D^{(2)}$)

Pseudo labeled data is added for training

Training data = labeled data + pseudo labeled data

How the pseudo labeled data is built

Retrieve the top-k passages with $M_D^{(1)}$ and label them with $M_C$ to build the pseudo labeled data
Pseudo labeled data is built for queries that have no labeled data
The queries without labeled data are taken from other datasets

Cross-batch negatives are used as well
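The data flow of the four steps can be sketched roughly as follows. This is my reading of the notes above, not the authors' code: `train_dual`, `train_cross`, `retrieve`, and `score` are hypothetical callables supplied by the caller, and the 0.1 / 0.9 thresholds are the ones listed under implementations. Two things are assumptions on my part: labeled positives are dropped from the retrieved negatives, and the pseudo labeled queries also get denoised negatives.

```python
def rocketqa_pipeline(labeled, unlabeled_queries, train_dual, train_cross,
                      retrieve, score, neg_threshold=0.1, pos_threshold=0.9):
    """Sketch of the four training steps.

    labeled: dict mapping query -> list of labeled positive passage ids.
    train_dual(data, hard_negatives) and train_cross(data, negatives) return models.
    retrieve(model, query) returns top-ranked passage ids.
    score(model, query, passage) returns a cross-encoder score in [0, 1].
    """
    # Step 1: dual encoder on labeled data only
    # (cross-batch negatives are assumed to be handled inside train_dual).
    md0 = train_dual(labeled, hard_negatives={})

    # Step 2: cross-encoder, with the top-retrieved passages of md0 as noisy negatives.
    noisy = {
        q: [p for p in retrieve(md0, q) if p not in positives]
        for q, positives in labeled.items()
    }
    mc = train_cross(labeled, noisy)

    # Step 3: keep only negatives the cross-encoder is confident about, retrain.
    denoised = {
        q: [p for p in passages if score(mc, q, p) <= neg_threshold]
        for q, passages in noisy.items()
    }
    md1 = train_dual(labeled, hard_negatives=denoised)

    # Step 4: pseudo-label the unlabeled queries with md1 + mc, retrain on the union.
    pseudo, pseudo_negatives = {}, {}
    for q in unlabeled_queries:
        candidates = retrieve(md1, q)
        positives = [p for p in candidates if score(mc, q, p) >= pos_threshold]
        if positives:
            pseudo[q] = positives
            pseudo_negatives[q] = [
                p for p in candidates if score(mc, q, p) <= neg_threshold
            ]
    md2 = train_dual({**labeled, **pseudo},
                     hard_negatives={**denoised, **pseudo_negatives})
    return md0, mc, md1, md2
```

Passing the training and retrieval routines in as arguments keeps the sketch independent of any particular framework.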

implementations

pre-trained language model (PLM)

The dual encoder uses ERNIE 2.0 base
The cross-encoder uses ERNIE 2.0 large

Denoised hard negatives and data augmentation

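Among the top-retrieved passages, those with a cross-encoder score of 0.1 or lower are used as negatives and those with a score of 0.9 or higher are used as positives
In a manual check, the accuracy was over 90%.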
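The thresholding rule is simple enough to sketch in a few lines. The function name `split_by_confidence` and the `(passage_id, score)` input format are assumptions for illustration. Passages that score between the two thresholds are simply not used.

```python
def split_by_confidence(candidates, neg_threshold=0.1, pos_threshold=0.9):
    """Split cross-encoder-scored candidates into confident positives and negatives.

    candidates: list of (passage_id, cross_encoder_score) pairs for one query.
    Passages scoring strictly between the thresholds are discarded as uncertain.
    """
    positives = [pid for pid, s in candidates if s >= pos_threshold]
    negatives = [pid for pid, s in candidates if s <= neg_threshold]
    return positives, negatives


# Hypothetical scores for the top-retrieved passages of one query.
candidates = [("p1", 0.97), ("p2", 0.55), ("p3", 0.03), ("p4", 0.10), ("p5", 0.89)]
positives, negatives = split_by_confidence(candidates)
```

Here `p1` becomes a pseudo positive, `p3` and `p4` become denoised hard negatives, and `p2` and `p5` are dropped.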

The number of positives and negatives

cross-encoder

The positive-to-negative ratio is 1:4 on MS MARCO and 1:1 on Natural Questions (NQ)
Negatives are randomly sampled from the top 1000 (MS MARCO) or top 100 (NQ) passages retrieved by $M_D^{(0)}$
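A small sketch of this sampling step, assuming the retrieval result is a ranked list of passage ids. `sample_negatives` is a hypothetical helper, and excluding the labeled positives from the pool is my assumption; the notes do not say whether the paper does this.

```python
import random


def sample_negatives(ranked_ids, positive_ids, top_k, num_negatives, seed=0):
    """Randomly pick negatives from the top_k retrieved passages.

    ranked_ids: passage ids sorted by retrieval score, best first.
    positive_ids: labeled positives, removed from the pool (assumption).
    """
    rng = random.Random(seed)
    pool = [pid for pid in ranked_ids[:top_k] if pid not in positive_ids]
    return rng.sample(pool, min(num_negatives, len(pool)))


# MS MARCO-style setting: 1 positive, 4 negatives drawn from the top 1000.
ranked = [f"p{i}" for i in range(2000)]
negatives = sample_negatives(ranked, positive_ids={"p3"}, top_k=1000, num_negatives=4)
```

For the NQ setting the call would use `top_k=100` and `num_negatives=1`.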

dual-encoder

The positive-to-negative ratio is 1:4 on MS MARCO and 1:1 on Natural Questions (NQ)
The same ratio is used for both $M_D^{(1)}$ and $M_D^{(2)}$

batch size

cross-encoder

MS MARCO: 64 * 4
NQ: 64

dual-encoder

MS MARCO: 512 * 8
NQ: 512 * 2

training epochs

cross-encoder

2 epochs: same for MS MARCO and NQ

dual-encoder

MS MARCO: 40 ($M_D^{(0)}$), 10 ($M_D^{(1)}$), 10 ($M_D^{(2)}$)
NQ: 30 for all models

Maximal length

Query: 32
Passage: 128

Experimental results

table 3 (MS MARCO), table 7 (NQ)

cross-batch negatives, denoised hard negatives, and data augmentation all contribute to quality improvements
Contribution: denoised hard negatives >> data augmentation ~ cross-batch negatives

The contributions of data augmentation and cross-batch negatives vary by dataset.
Denoised hard negatives contribute substantially regardless of the dataset
Using the hard negatives as-is, without denoising, degrades quality.

On MS MARCO, using the top passages of $M_D^{(0)}$ as negatives is close to ruinous. On NQ quality also drops, but not to that degree
Curiously, the cross-encoder is trained on data that has not been denoised, yet the cross-encoder is fine. Is the cross-encoder robust to label noise?

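figure 4: appears to be an experiment on the number of negatives used in cross-batch negatives

Quality improves as the number of negatives grows.
With too many, however, quality drops. The authors speculate that the batch size became too large for the model to train well
In figure 4, quality at 512 is too low. That should correspond to 512 in-batch negatives, yet the number is lower than the in-batch negative result in table 3.
Was this figure really produced by varying only the number of negatives?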

figure 6

Augmentation does not seem to improve quality by much.
MRR@10 improves from 36.5 to 37.0 and recall@50 from 85 to 85.5
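For reference, MRR@10 is the mean, over queries, of the reciprocal rank of the first relevant passage within the top 10 (0 if none appears). A minimal sketch, with a hypothetical function name and input format:

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    ranked_lists: one ranked list of passage ids per query, best first.
    relevant_sets: one set of relevant passage ids per query.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


# Toy example: first query hits at rank 2, second query has no hit.
score = mrr_at_k([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"q"}])
```

The reported numbers are this value multiplied by 100.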

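Miscellaneous

There is no description of the loss used to train the cross-encoder
There is no description of how the passages selected as negatives are used when training the dual encoder.

There is one for cross-batch negatives, though.

Uses faiss

Indexing with IndexFlatIP
Search uses exact maximum inner product search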
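Exact maximum inner product search just means scoring every indexed passage against the query and keeping the k highest. The sketch below is a pure-Python stand-in to show the idea, not the faiss API. `exact_mips` and the toy vectors are made up for illustration.

```python
import heapq


def exact_mips(query, passages, k):
    """Return the k (score, index) pairs with the largest inner product, best first.

    Brute force: every passage vector is scored, so the result is exact.
    """
    scored = (
        (sum(q * p for q, p in zip(query, vec)), i)
        for i, vec in enumerate(passages)
    )
    return heapq.nlargest(k, scored)


# Toy passage embeddings and one query embedding.
passages = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
top = exact_mips([1.0, 0.2], passages, k=2)
top_indices = [i for _, i in top]
```

A flat index trades speed for exactness: search cost grows linearly with the number of passages, but no approximation error is introduced into the evaluation.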
