EMBER issues
https://gitlab.eurecom.fr/cappuzzo/ember/-/issues/1
merging in 'construction_movies_dataset.ipynb'
2019-10-30T14:02:43Z, montecot

Hi, I ran into an error while running a notebook very similar to 'construction_movies_dataset.ipynb'. In the 'Aligned' section, the merge 'merged_pairs = df1.merge(df2, on=['title', 'director']).drop_duplicates()' renames every column that df1 and df2 have in common, other than 'title' and 'director'. The original df1 and df2 share column names such as 'actor_0', and since the merge is not performed on them, they become 'actor_0_x' (from df1) and 'actor_0_y' (from df2). This breaks the cell 'df1b = merged_pairs[df1.columns].copy()', because the names in df1.columns no longer exist in merged_pairs, where they carry an '_x' suffix. I propose working on copies, df1_copy and df2_copy, and using the following cell in the 'Aligned' section to get the expected behavior:
# Work on copies so the original df1 and df2 stay untouched.
df1_copy = df1.copy()
df2_copy = df2.copy()
concat_SM = pd.concat([df1_copy, df2_copy], ignore_index=True)
# Give every column a source-specific name so the two frames share no non-key column names.
df1_copy.columns = ['imdb_' + str(i) for i in range(len(df1_copy.columns))]
df2_copy.columns = ['movielens_' + str(i) for i in range(len(df2_copy.columns))]
# Restore the names of the two key columns, identified here by their column positions.
df1_copy.rename(columns={'imdb_4': 'title', 'imdb_1': 'director'}, inplace=True)
df2_copy.rename(columns={'movielens_5': 'title', 'movielens_10': 'director'}, inplace=True)
merged_pairs = df1_copy.merge(df2_copy, on=['title', 'director']).drop_duplicates()
# The merge now adds no '_x' / '_y' suffixes.
Then replace the cell
df1b = merged_pairs[df1.columns].copy()
df2b = merged_pairs[df2.columns].copy()
with
df1b = merged_pairs[df1_copy.columns].copy()
df2b = merged_pairs[df2_copy.columns].copy()
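
For anyone who wants to reproduce this without the full notebook, here is a minimal sketch on toy data. The two small frames and the `prefixed` helper are hypothetical stand-ins, not the real IMDB/MovieLens tables, and the fix here prefixes columns by name rather than by position as in the cell above; it only illustrates the same idea.

```python
import pandas as pd

# Hypothetical toy tables sharing the key columns and one non-key column.
df1 = pd.DataFrame({
    "title": ["Alien", "Heat"],
    "director": ["Ridley Scott", "Michael Mann"],
    "actor_0": ["Sigourney Weaver", "Al Pacino"],
})
df2 = pd.DataFrame({
    "title": ["Alien", "Heat"],
    "director": ["Ridley Scott", "Michael Mann"],
    "actor_0": ["S. Weaver", "A. Pacino"],
})
keys = ["title", "director"]

# Problem: the shared non-key column gets the default '_x' / '_y' suffixes.
merged = df1.merge(df2, on=keys).drop_duplicates()
print(list(merged.columns))

# Selecting by the original column names then fails with a KeyError.
try:
    merged[df1.columns]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("lookup failed:", lookup_failed)


def prefixed(df, prefix):
    """Return a copy of df whose non-key columns carry a source prefix."""
    return df.rename(columns={c: prefix + c for c in df.columns if c not in keys})


# Fix: make non-key names unique per source before merging.
df1_copy = prefixed(df1, "imdb_")
df2_copy = prefixed(df2, "movielens_")
merged_pairs = df1_copy.merge(df2_copy, on=keys).drop_duplicates()

# Both selections now succeed because every name exists in merged_pairs.
df1b = merged_pairs[df1_copy.columns].copy()
df2b = merged_pairs[df2_copy.columns].copy()
print(list(df1b.columns), list(df2b.columns))
```

Prefixing by name keeps the columns readable, but it assumes both tables already use 'title' and 'director' for the key columns.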