
I have a JSON file with a list of URLs.

After reading the documentation, I figured multiprocessing.Pool was the best option for me.

I ran 10 URLs with multiprocessing.Pool(10), expecting the results to be pretty much instant, but it takes about 12 seconds to complete everything. I'm not sure whether I am using it correctly; my code is below.

import json
import multiprocessing
import time
from functools import partial

import requests
import simplejson

# headers and getAttorneyData are defined elsewhere in my script (omitted here)

def download_site(data, outputArr, proxyArr=None):
    session = requests.Session()
    # print("Scraping last name {lastName}".format(lastName=data['lastName']))
    userAgents = open('user-agents.txt').read().split('\n')
    params = (
        ('Name', data['lastName']),
        ('type', 'P'),
    )
    url = 'someurl'
    if not proxyArr:
        proxyArr = {
            'http': data['proxy']['http']
        }
    try:
        with session.get(url, params=params, proxies=proxyArr, headers=headers) as response:
            name = multiprocessing.current_process().name
            try:
                content = response.json()
                loadJson = json.loads(content)['nameBirthDetails']
                for case in loadJson:
                    dateFile = loadJson[case]['dateFiled']
                    year = int(dateFile.split('/')[-1])
                    if year > 2018:
                        profileDetailUrlParams = (
                            ('caseId',loadJson[case]['caseYear']),
                            ('caseType', 'WC'),
                            ('caseNumber', loadJson[case]['caseSeqNbr']),
                        )
                        loadJson[case]['caseDetail'] = getAttorneyData(profileDetailUrlParams, session, proxyArr)
                        outputArr.append(loadJson[case])
                # print("Total Scraped Results so far ", len(outputArr))
            except (requests.exceptions.ConnectionError, json.decoder.JSONDecodeError):
                print("Error Found JSON DECODE ERROR - passing for last name", data['lastName'])
            except simplejson.errors.JSONDecodeError:
                print("Found Simple Json Error", data['lastName'])
                pass
                # newProxy = generate_random_proxy()
                # download_site(data, outputArr, newProxy)
           
    except:
        raise

def queueList(sites):
    manager = multiprocessing.Manager()
    outputArr = manager.list()
    functionMain = partial(download_site, outputArr = outputArr)
    p = multiprocessing.Pool(10)
    records = p.map(functionMain, sites)
    p.close()
    p.join()


if __name__ == "__main__":
    outputArr = []
    fileData = json.loads(open('lastNamesWithProxy.json').read())[:10]
    start_time = time.time()
    queueList(fileData)
    duration = time.time() - start_time
    print(f"Downloaded {len(fileData)} in {duration} seconds")

The download_site function is where I fetch a list via the requests library; then, for each item in that list, I make another request through the getAttorneyData function.
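
getAttorneyData isn't shown above; roughly, it looks like the sketch below (the detail URL and the shape of the returned JSON are placeholders, not the real ones):

def getAttorneyData(params, session, proxyArr):
    # Sketch only: 'someDetailUrl' stands in for the real case-detail endpoint
    with session.get('someDetailUrl', params=params, proxies=proxyArr, headers=headers) as response:
        return response.json()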

How can I further hone this to run faster? I have a high-end computer, so CPU shouldn't be an issue, and I want to use it to its max potential.

My goal is to spawn 10 workers and have each worker handle one request, so that 10 requests take 1-2 seconds instead of the 12 seconds it currently takes.
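
Since the work here is network I/O rather than CPU, I assume a thread pool would give the same fan-out without the process startup and Manager overhead; a minimal sketch of what I mean, reusing download_site from above:

from functools import partial
from multiprocessing.pool import ThreadPool  # threads instead of processes for I/O-bound work

def queueListThreaded(sites):
    outputArr = []  # plain list works here; threads share memory and list.append is thread-safe
    functionMain = partial(download_site, outputArr=outputArr)
    with ThreadPool(10) as p:
        p.map(functionMain, sites)
    return outputArr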

Biplov
  • Go for the ```chunksize``` parameter in Pool to improve speed. There is a bit of trial and error; you will have to decide what you should actually keep (see the sketch after these comments). – Simplecode Oct 20 '20 at 07:18
  • Wanted to make sure I'm not doing anything wrong; as of now, when I run with Pool(10) and process 10 URLs, it takes me 12 seconds instead of 1-2 seconds. Am I wrong? – Biplov Oct 20 '20 at 07:19
  • I generally follow the syntax for Pool I mentioned in https://stackoverflow.com/a/64295485/12472346; not sure about the queue. – Simplecode Oct 20 '20 at 07:26
  • Hey Dilli! I was confused about why you are using multiprocessing instead of asynchronous requests, since making a request to a server is basically an I/O-bound operation, not a CPU-bound one. – Mooncrater Oct 20 '20 at 08:30
  • @Mooncrater How can I make asynchronous requests? Using a library like aiohttp? – Biplov Oct 20 '20 at 08:35
  • @Dilli Yep. Take a look at [this](https://docs.aiohttp.org/en/stable/). Rough sketches of both suggestions follow below. – Mooncrater Oct 20 '20 at 08:41
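
A quick sketch of the chunksize suggestion applied to the Pool.map call in queueList (the value 2 is only a starting point and needs trial and error):

records = p.map(functionMain, sites, chunksize=2)  # batch several URLs into one task per worker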
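
And a minimal aiohttp sketch of the asynchronous approach Mooncrater suggests; the URL and params are placeholders, and proxies/headers are left out for brevity:

import asyncio
import aiohttp

async def fetch(session, data):
    params = {'Name': data['lastName'], 'type': 'P'}
    async with session.get('someurl', params=params) as response:  # placeholder URL
        return await response.json()

async def fetch_all(sites):
    # One shared session; gather runs all requests concurrently on a single thread
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, data) for data in sites))

# results = asyncio.run(fetch_all(fileData))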

0 Answers