
Email Scraping the Insane way

29 Jun

Well, insane as far as an individual goes:
I had to shut it down because it was a bit too scalable.
Every computer it was loaded on contacted the distributed data-storage mechanism, and off it went…
(Over 20 million HTTP addresses in less than 24 hours, with each and every HTTP address recorded.)

Caveats
There is still an issue with the HTTP address storage mechanism: if the stores get corrupted, they cannot be recovered and have to be started again from scratch. Some form of partitioning might be a good way to solve this problem.
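One way the partitioning idea could look is sketched below. This is only an illustration under stated assumptions, not how the crawler's storage actually works: the class `PartitionedUrlLog`, the `urls-N.log` file naming and the choice of `String.hashCode()` as the routing function are all hypothetical. The point is simply that each address always lands in the same one of N files, so a corrupted file would cost only its own share of the addresses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

// Hypothetical sketch: spreads recorded addresses over a fixed number of files.
class PartitionedUrlLog {
    private final Path directory;
    private final int partitions;

    PartitionedUrlLog(Path directory, int partitions) {
        this.directory = directory;
        this.partitions = partitions;
    }

    // Maps an address to a partition index in [0, partitions).
    // floorMod keeps the result non-negative even when hashCode() is negative.
    int partitionFor(String url) {
        return Math.floorMod(url.hashCode(), partitions);
    }

    // File that holds the addresses of one partition (naming is an assumption).
    Path fileFor(int partition) {
        return directory.resolve("urls-" + partition + ".log");
    }

    // Appends the address as one line to the file of its partition.
    void record(String url) throws IOException {
        Files.write(fileFor(partitionFor(url)),
                Collections.singletonList(url),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crawl-urls");
        PartitionedUrlLog log = new PartitionedUrlLog(dir, 16);
        log.record("http://example.com/contact");
        log.record("http://example.org/about");
        System.out.println("Partition files written under " + dir);
    }
}
```

Whether losing one partition is acceptable depends on how the rest of the crawl state refers to it, which this sketch does not address.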

The Story
After looking around at various “email scraping” programs and having a bit of spare time on my hands, I decided to build an email web scraper in Java.

Yep… I know Java is supposed to be slow and not as fast as “insert current lame language here”.
But the crawler is cross-platform, and that is enough for now.

Key requirements:
1. Don’t duplicate email addresses (actually checking for this is a waste of processing power, since a simple sort and merge will remove duplicates)
2. Don’t crawl the same web addresses unless specifically required to.
(Experienced programmers will already see a problem with #1 & #2, even with 16GB of real RAM.)
3. Make it fast.
4. Make the physical HTML parser a plugin.
5. Make it ‘seriously’ scalable.
6. Completely restartable after a crash/shutdown of the Control server.
7. Fluid enough to deal with crawling nodes popping up/shutting down randomly.
8. Crawl in the direction of most profitable harvesting.
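The “sort and merge” remark in requirement 1 can be sketched roughly as below. This is a minimal illustration, not the crawler's code: the class `EmailDedup` and the method `sortAndMerge` are hypothetical names, and it assumes the harvested addresses fit in one in-memory list and are compared as exact strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original crawler.
class EmailDedup {
    // Sorts a copy of the input, then keeps an address only when it differs
    // from the one before it. After sorting, duplicates are always adjacent.
    static List<String> sortAndMerge(List<String> harvested) {
        List<String> sorted = new ArrayList<>(harvested);
        Collections.sort(sorted);
        List<String> unique = new ArrayList<>();
        String previous = null;
        for (String address : sorted) {
            if (!address.equals(previous)) {
                unique.add(address);
                previous = address;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> harvested = Arrays.asList(
                "bob@example.com", "alice@example.com", "bob@example.com");
        // prints [alice@example.com, bob@example.com]
        System.out.println(sortAndMerge(harvested));
    }
}
```

The crawler itself never has to ask “have I seen this address?” while harvesting; the duplicates are dropped in one pass afterwards.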

#1 & #2 actually require that you store either:
A. All the web addresses (HTTP:xxx)
B. All the hashes of the addresses
Looking at the length of some web addresses (>150 characters), we can see some design criteria regarding memory starting to creep in.
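Option B might look something like the sketch below. It is an assumption-laden illustration rather than the crawler's implementation: the class `SeenUrls`, the use of SHA-256 and the decision to keep only the first 8 bytes of the digest are all my own choices for the example. Each address is reduced to a fixed-size 64-bit value regardless of how long the URL is.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of option B: remember a hash per address, not the address.
class SeenUrls {
    private final Set<Long> fingerprints = new HashSet<>();

    // First 8 bytes of the SHA-256 digest of the URL, read as a long.
    static long fingerprint(String url) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(url.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Returns true the first time an address is offered, false afterwards.
    boolean markVisited(String url) {
        return fingerprints.add(fingerprint(url));
    }

    int size() {
        return fingerprints.size();
    }

    public static void main(String[] args) {
        SeenUrls seen = new SeenUrls();
        System.out.println(seen.markVisited("http://example.com/contact")); // true
        System.out.println(seen.markVisited("http://example.com/contact")); // false
    }
}
```

The trade-off is that two different addresses could, in principle, share a fingerprint, in which case the second one would be skipped as “already crawled”. A `HashSet<Long>` also carries boxing overhead, so a packed array of longs would be tighter still.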

Storage
We actually started testing by implementing a HashMap in internal memory. That maxed out at about 500,000 entries, and did so in a BAD way: the JVM physically crashed and exited (thanks, OSX).
This resulted in the complete loss of the whole crawl session.

Viruses
You would also be surprised at the number of websites attempting IFRAME injections of viruses linked into the “mailto:” tags.
This also resulted in JVM crashes, but that had more to do with the antivirus program on the computer.
Basically, the antivirus was looking at the stream of data from the web port and then “clamping” the port by abruptly ending the communication session.
It might have been better if the antivirus program had just inserted “dummy” 0x00 bytes.

To be continued