Parsing URLs in Pandas DataFrame

2018-03-09 08:41:06

My client needs their Google AdWords destination URL queries parsed and the values spell-checked to eliminate any typos ("use" instead of "us", etc.).

I'm pulling the data using the AdWords API and putting it into a DataFrame for manipulation. Everything works, but there are over 100,000 records in every pull and sometimes the code takes hours to run. Is there a way to optimize the following code blocks?

from urlparse import urlparse, parse_qs  # Python 2; in Python 3: from urllib.parse import urlparse, parse_qs

def parse_url(df):
    for index, row in df.iterrows():
        # Extract and split the query string of each destination URL
        parsed = urlparse(str(row['Destination URL'])).query
        parsed = parse_qs(parsed)
        # Write each query parameter into its own column
        for k, v in parsed.iteritems():
            df.loc[index, k.strip()] = v[0].strip().lower()
    return df

def typo_correct(urlparams, df, dictionary):
    for index, row in df.iterrows():
        for w in urlparams:
            if df.loc[index, w] == None or len(df.loc[index, w]) < 2 or w == 'account':
                pass
            else:
                high = 0.0
                word = None
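
(The rest of typo_correct is not included here. As a rough illustration only, and not the original code, the kind of best-match scoring that the high and word variables suggest could be written with difflib.SequenceMatcher, assuming dictionary is a list of valid words:)

from difflib import SequenceMatcher

def best_match(value, dictionary):
    # Hypothetical helper for illustration only: return the dictionary word
    # most similar to value, or None if nothing scores above 0.0
    high = 0.0
    word = None
    for candidate in dictionary:
        score = SequenceMatcher(None, value, candidate).ratio()
        if score > high:
            high = score
            word = candidate
    return word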

  • I don't know if this is still relevant for you, but I can see some optimizations you could make in your code.

    As I see it, you are passing the whole DataFrame object and then parsing a specific column of that DataFrame:

    def parse_url(df):
        for index, row in df.iterrows():
            parsed = urlparse(str(row['Destination URL'])).query  # <==
            parsed = parse_qs(parsed)
            for k, v in parsed.iteritems():
                df.loc[index, k.strip()] = v[0].strip().lower()
        return df

    It would be faster if you pass only the column that you need to parse.

    E.g.

    def parse_url(column, df):
        for index, url in column.items():  # iterate just the Series, not whole rows
            parsed = urlparse(str(url)).query
            parsed = parse_qs(parsed)
            for k, v in parsed.items():  # use items() in Python 3 and iteritems() in Python 2
                df.loc[index, k.strip()] = v[0].strip().lower()
        return df

    parse_url(df['columnName'], df)

    Then your function would have less work to do and performance would increase, at least slightly.
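
    If that is still too slow, it may be worth avoiding cell-by-cell df.loc writes altogether. Here is a rough sketch (untested against your data; I'm assuming Python 3, that the column is called 'Destination URL', and that parameter names don't collide with existing columns) that parses every URL once and joins the new parameter columns back in one step:

    from urllib.parse import urlparse, parse_qs
    import pandas as pd

    def parse_url_fast(df, url_col='Destination URL'):
        # Parse one URL's query string into a dict of cleaned parameters
        def to_params(url):
            parsed = parse_qs(urlparse(str(url)).query)
            return {k.strip(): v[0].strip().lower() for k, v in parsed.items()}

        # Build one DataFrame of parameter columns and join it back in a single step,
        # instead of assigning cell by cell with df.loc inside a loop
        params = pd.DataFrame(df[url_col].apply(to_params).tolist(), index=df.index)
        return df.join(params)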

    2018-03-09 09:04:14