merging in 'construction_movies_dataset.ipynb'

Hi, I ran into an error while running a notebook very similar to 'construction_movies_dataset.ipynb'. In the 'Aligned' section, the merge 'merged_pairs = df1.merge(df2, on=['title', 'director']).drop_duplicates()' renames every column that df1 and df2 have in common, other than 'title' and 'director'. The original df1 and df2 share column names such as 'actor_0', and because the merge is not done on those columns, they come out as 'actor_0_x' for df1 and 'actor_0_y' for df2. This breaks the cell 'df1b = merged_pairs[df1.columns].copy()', because the names in df1.columns no longer exist in merged_pairs (they now carry an '_x' suffix). To get the expected behavior, I propose working on copies, df1_copy and df2_copy, and using this cell in the 'Aligned' section:

df1_copy = df1.copy()
df2_copy = df2.copy()
concat_SM = pd.concat([df1_copy, df2_copy], ignore_index=True)
df1_copy.rename(columns={'movie_title': 'title', 'director_name': 'director'}, inplace=True)
# give every column a unique, source-specific name based on its position
df1_copy.columns = ['imdb_' + str(i) for i in range(len(df1.columns))]
df2_copy.columns = ['movielens_' + str(i) for i in range(len(df2.columns))]

# restore the names of the two merge keys
df1_copy.rename(columns={'imdb_4': 'title', 'imdb_1': 'director'}, inplace=True)
df2_copy.rename(columns={'movielens_5': 'title', 'movielens_10': 'director'}, inplace=True)
# now the merge is possible without '_x' / '_y' suffixes
merged_pairs = df1_copy.merge(df2_copy, on=['title', 'director']).drop_duplicates()

Then change the cell

df1b = merged_pairs[df1.columns].copy()
df2b = merged_pairs[df2.columns].copy()

to

df1b = merged_pairs[df1_copy.columns].copy()
df2b = merged_pairs[df2_copy.columns].copy()
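
For anyone who wants to check this without the full notebook, here is a minimal sketch on made-up data that reproduces the problem and then applies the renaming idea. The two one-row frames and the name fixed_pairs are only assumptions for illustration, and the key columns sit at positions 0 and 1 here rather than at the positions used in the real datasets.

```python
import pandas as pd

# Made-up stand-ins for the notebook's df1 and df2 (illustration only).
df1 = pd.DataFrame({"title": ["Alien"], "director": ["Ridley Scott"],
                    "actor_0": ["Sigourney Weaver"]})
df2 = pd.DataFrame({"title": ["Alien"], "director": ["Ridley Scott"],
                    "actor_0": ["Tom Skerritt"]})

# Problem: 'actor_0' exists in both frames but is not a merge key,
# so pandas disambiguates it with the default '_x' / '_y' suffixes.
merged_pairs = df1.merge(df2, on=["title", "director"]).drop_duplicates()
print(list(merged_pairs.columns))

# Selecting the original df1 column names now fails with a KeyError.
try:
    merged_pairs[df1.columns]
    selection_failed = False
except KeyError:
    selection_failed = True
print("selection failed:", selection_failed)

# Fix: rename the columns of two copies so that only the keys are shared.
df1_copy = df1.copy()
df2_copy = df2.copy()
df1_copy.columns = ["imdb_" + str(i) for i in range(len(df1.columns))]
df2_copy.columns = ["movielens_" + str(i) for i in range(len(df2.columns))]

# Restore the key names (positions 0 and 1 in this toy data).
df1_copy.rename(columns={"imdb_0": "title", "imdb_1": "director"}, inplace=True)
df2_copy.rename(columns={"movielens_0": "title", "movielens_1": "director"}, inplace=True)

# No shared non-key names are left, so no suffixes are added.
fixed_pairs = df1_copy.merge(df2_copy, on=["title", "director"]).drop_duplicates()
df1b = fixed_pairs[df1_copy.columns].copy()
df2b = fixed_pairs[df2_copy.columns].copy()
print(list(df1b.columns))
print(list(df2b.columns))
```

With this change, both selections work because every column name in df1_copy and df2_copy is still present, unchanged, in the merged frame.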

Edited Oct 21, 2019 by montecot