I am working on generating a large password cracking database. I decided not to use rainbow tables because I would like to expand the database over time without having to recompute anything. I am using a different technique whose name I forget, but I am basically generating and storing every possible hash in a database, which I will later optimize with a script for quicker searching.

I just finished the first generation code today and started generating hashes. My system generates hashes in a series of steps, each one containing a batch of hashes (about 7k each), and it can be set to generate a whole series of steps at once. It is designed to run as multiple instances at once, but in my basic testing MySQL was the bottleneck, probably due to disk performance, so multiple instances did not help much. It works on MD5 and SHA-1 hashes at the same time and generates somewhere between 1k and 5k of each per second.

It pre-generates the steps with a separate script, and it generates the hashes for one password length (actually two) at a time; then I generate a new set of steps and start on the next longer set of passwords. It generated all 1 and 2 character passwords in my charset (about 85 characters) in very short order, and I am now working on all 3 and 4 character passwords. I expect those will be done within a day or two, depending mostly on how much of my computer's power I let it use.

In the 1-3 hours I have been generating hashes I have gotten about 2.6M of each hash type, totaling about 300MB, before I noticed an issue in the table that stores extra data about each password: it was causing that data not to be saved. It would not really have affected cracking, but I will be regenerating the data to keep it consistent and allow for better stats.


EDIT: I just timed roughly how long a step takes: about 3 seconds for the ~7k MD5 and ~7k SHA-1 hashes to be created and stored.
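
To give an idea of what a step does, here is a stripped-down sketch; the connection details and the table/column names are just placeholders, not my real schema, and the real code pulls its candidate passwords from the pre-generated step data.

Code:
<?php
// Rough sketch of one generation "step": hash a batch of candidate
// passwords and store each MD5/SHA-1 pair. Table and column names are
// placeholders, not the real schema.
$db = new mysqli('localhost', 'user', 'pass', 'hashdb');

function run_step(mysqli $db, array $candidates): void
{
    foreach ($candidates as $plain) {
        $md5  = md5($plain);
        $sha1 = sha1($plain);
        $p    = $db->real_escape_string($plain);
        // One INSERT per password, roughly 7k of these per step.
        $db->query("INSERT INTO hashes (md5, sha1, plaintext)
                    VALUES ('$md5', '$sha1', '$p')");
    }
}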
This sounds interesting. You could probably save space on storing plaintext by using a radix tree.
elfprince13 wrote:
This sounds interesting. You could probably save space on storing plaintext by using a radix tree.


I'm actually using a MySQL database to store this, so I'm not sure how well that would work. I had not heard of radix trees before, but they are similar to how I was planning to speed up searches: grouping all hashes that share the same first x characters into an encoded piece of data that would then be searched, or maybe having ~256 tables per hash type, one for each 2-character beginning of the hash, for faster searching. Thanks for the suggestion, though; I had only encountered the concept, not the name, and mostly for organization rather than storage efficiency.
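
For the ~256-tables idea, the routing would be something as simple as this; the table names here are only illustrative:

Code:
<?php
// Route a hash to one of 256 per-prefix lookup tables based on its first
// two hex characters, then search only that table. Names are illustrative.
function table_for_hash(string $type, string $hash): string
{
    return $type . '_' . substr($hash, 0, 2);   // e.g. md5_a3, sha1_0f
}

function lookup(mysqli $db, string $type, string $hash): ?string
{
    $table = table_for_hash($type, $hash);
    $h     = $db->real_escape_string($hash);
    $res   = $db->query("SELECT plaintext FROM $table WHERE hash = '$h' LIMIT 1");
    $row   = $res ? $res->fetch_row() : null;
    return $row[0] ?? null;
}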
MySQL has its own indexing strategy, but I'm not sure what type of data structure they use. Have you created an index on your plaintext column?
elfprince13 wrote:
MySQL has its own indexing strategy, but I'm not sure what type of data structure they use. Have you created an index on your plaintext column?


No, why? Actually, I just noticed I had temporarily taken out the indexes to figure out a MySQL error and never put them back. I was only going to have one on the hash column though as that is the one I will be searching with to find the associated plaintext.
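
When I put them back it will basically just be this (simplified table/column names):

Code:
<?php
// Re-add the lookup index on the hash column of each lookup table.
// Table and column names are simplified placeholders.
$db = new mysqli('localhost', 'user', 'pass', 'hashdb');
$db->query('ALTER TABLE md5_lookup  ADD INDEX idx_hash (hash)');
$db->query('ALTER TABLE sha1_lookup ADD INDEX idx_hash (hash)');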
Glenn wrote:
I was only going to have one on the hash column though as that is the one I will be searching with to find the associated plaintext.


Ah, yeah, that makes sense. I wasn't thinking when I said that. I'm not sure how you could mix the plaintext space-saving of a radix tree with your database method.
I found some issues in my code that were messing up my database: #1, I was only generating even-length passwords; #2, I was forgetting to escape certain input, causing some queries to fail. This was discovered as I sought to improve speed by doing multi-insert queries instead of one at a time; since a few of the value sets had issues, the whole query failed. I have since fixed it and started rebuilding the DB, and I found that this fairly simple change in operation made my generation process many times faster, which is very helpful for the timeframes we are looking at. I am still probably disk bound, and I know I am bound somewhere in MySQL, as my program is only taking about 25% of a core and I have free processor time. I now generate a step (~7k each of MD5 and SHA-1 hashes) in about 1/4 to 1/2 a second (as opposed to 3-5 seconds before).
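
The change amounted to building one escaped multi-row insert per step instead of ~7k single-row inserts, roughly like this (placeholder names again):

Code:
<?php
// One multi-row INSERT per step, with every plaintext escaped so unusual
// characters in the charset cannot break the query. Placeholder names.
function store_step(mysqli $db, array $candidates): void
{
    $rows = [];
    foreach ($candidates as $plain) {
        $p      = $db->real_escape_string($plain);
        $rows[] = "('" . md5($plain) . "','" . sha1($plain) . "','$p')";
    }
    // ~7k rows go in with a single query instead of ~7k separate queries.
    $db->query('INSERT INTO hashes (md5, sha1, plaintext) VALUES ' . implode(',', $rows));
}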
I have already handily passed my previous level of password generation in a tiny fraction of the time (<2 hours as opposed to more than 14). I currently have about 86M (now up to 100M as I write this) passwords in the database, and I am looking for someone who can give me tips on MySQL clustering, as that is the slow point once I start adding threads to compute hashes. I've done what query optimization I know how to do, so I'm moving on to looking at clustering the MySQL database. I am currently doing ~7k inserts per query and I don't really want to go higher, as that would require changing much of my code. I'm currently maxing out my MySQL server at somewhere between 1 and 2 simultaneous compute processes.

I don't really see an effective way to improve disk access, and I no longer believe that is the bottleneck, as the MySQL process has very little disk I/O wait time according to iotop. The database is on a RAID 5 array, which seems to be doing well with the new multi-inserts; I believe I'm generating 5-10MB of data per second, and each password takes up about 200-250 bytes.

I could also use tips on optimizing MySQL to run faster on a given machine. I'm just using the distro (openSUSE) package right now, but I'm willing to play around with compiling it if that will make it faster. MySQL is just about fully taking up a core on my 2-core machine. If anyone has ideas to speed this up I would appreciate it.
How big is the database? Why not use RAM tables?
KermMartian wrote:
How big is the database? Why not use RAM tables?


It is already 20 or 40GB depending on how you count it; I record each hash twice to keep very small lookup tables (one per hash type) along with one table that keeps more metadata and both hashes. I have an 85-character charset for this, so these tables will be getting very large. I think the bigger issue is the MySQL daemon having to process these very large SQL queries into the right format for storage. I'm currently doing about 28k passwords a second, though, so I don't think I'm doing too poorly in terms of speed. Will MySQL use more than one core effectively? I could try loading up more compute processes, but MySQL is basically using a full core right now. I'm currently guessing I'm about 0.033% done with 5 and 6 character passwords, and I'm done with all of them up to 4 characters.
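
For reference, the table layout is roughly this; the names, types, and extra metadata columns shown are simplified examples rather than my actual schema:

Code:
<?php
// Rough shape of the tables: two slim lookup tables (one per hash type)
// plus a wider metadata table holding both hashes. Simplified example,
// not the actual schema.
$db = new mysqli('localhost', 'user', 'pass', 'hashdb');
$db->query('CREATE TABLE md5_lookup  (hash CHAR(32), plaintext VARCHAR(16))');
$db->query('CREATE TABLE sha1_lookup (hash CHAR(40), plaintext VARCHAR(16))');
$db->query('CREATE TABLE meta (
    md5       CHAR(32),
    sha1      CHAR(40),
    plaintext VARCHAR(16),
    length    TINYINT
)');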
Here's an idea that I just came up with for fun, as an alternative to using MySQL. Buy a good-sized SSD and format it with ZFS, or something else that can deal with a very large number of files, set to use a very small block size. Create a directory tree simulating a radix tree structure, and have two top-level directories called "md5" and "sha1" containing symlinks whose file names are the hashes, each pointing to the directory whose path (stripped of slashes) corresponds to the text used to generate the hash. You could add a .n file extension to the links for the nth collision with a particular hash. Obviously it would be worth doing some size calculations first, but my intuition says that could take up significantly less space than your database, and still be pretty darn fast.
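
A minimal sketch of what I mean, with the paths and the plaintext-to-directory mapping purely illustrative:

Code:
<?php
// Sketch of the filesystem idea: one directory level per plaintext
// character, plus flat md5/ and sha1/ directories of symlinks named
// after the hashes, each pointing back at the plaintext's directory.
function add_entry(string $root, string $plain): void
{
    $dir = $root . '/plain/' . implode('/', str_split($plain));   // "abc" -> plain/a/b/c
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);
    }
    foreach (['md5' => md5($plain), 'sha1' => sha1($plain)] as $type => $hash) {
        if (!is_dir("$root/$type")) {
            mkdir("$root/$type", 0755, true);
        }
        $link = "$root/$type/$hash";
        if (!is_link($link)) {
            symlink($dir, $link);   // a .n suffix would handle the nth collision
        }
    }
}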
elfprince13 wrote:
Here's an idea that I just came up with for fun, as an alternative to using MySQL. Buy a good-sized SSD and format it with ZFS, or something else that can deal with a very large number of files, set to use a very small block size. Create a directory tree simulating a radix tree structure, and have two top-level directories called "md5" and "sha1" containing symlinks whose file names are the hashes, each pointing to the directory whose path (stripped of slashes) corresponds to the text used to generate the hash. You could add a .n file extension to the links for the nth collision with a particular hash. Obviously it would be worth doing some size calculations first, but my intuition says that could take up significantly less space than your database, and still be pretty darn fast.


I find that idea interesting, but my rough calculations say it would take up more space. The average space used for directories across the hashes, plus the size of the symlink, would have to be <125 bytes (or better, <70 bytes), since if you include my metadata table there are about 250 bytes of data per set of one MD5 and one SHA-1 hash, of which half is for the metadata table, which isn't really needed except to give me more information on how my system is working. My understanding is that a symlink has to take up at least one sector of the hard drive, or 512 bytes, which is much more than I use this way.

I might be able to save space, though, by creating files 3 levels deep, each one containing all the hashes and their passwords starting with those 3 characters, and having each one be bzip2 compressed. For that number of files to be fast, I could store the hashes first in the MySQL database, then pull all of the ones destined for the first file and store/compress them, delete them from the DB, then continue on to the second file, and so on. That, along with getting rid of the metadata, would likely result in a 50-90% reduction in space usage, but it would take a lot more effort to generate and could easily be slower at retrieval, since I would have to decompress and read through the entire file to find which line contained the needed hash (though I could stop early if I found it early).

I have a day or two to figure out the initial things to do, as there is at least 500GB of storage left, and my first step will probably be regular pruning of the metadata table. My other first step will be to clean the drive of the hundreds of GB of data I generated testing other things. (This is on a software RAID 5 array of three 1.8TB partitions, so it can hold a LOT of data.) The next thing I might do is remove the >1TB of temporary data I was using to try to speed up my online backup that keeps messing up, and just back up directly to the internet instead of caching the backup on the drive and streaming it up later. In short, I have a fair bit of time until I really need to worry about space. At the moment I have ~350M passwords in the DB totaling ~80GB, and I'm about 0.07% finished with my current set of passwords (5 and 6 character) as of 1-2 hours ago.
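
Retrieval from those compressed bucket files would then look roughly like this; the path layout and the "hash, password" line format are only my current thinking, not anything settled:

Code:
<?php
// Look a hash up in a bzip2-compressed bucket file chosen by its first
// three hex characters. Path layout and line format are assumptions.
function lookup_in_bucket(string $root, string $type, string $hash): ?string
{
    $bucket = sprintf('%s/%s/%s/%s/%s.bz2',
        $root, $type, $hash[0], $hash[1], substr($hash, 0, 3));
    $fh = @fopen('compress.bzip2://' . $bucket, 'r');
    if ($fh === false) {
        return null;                       // bucket not generated yet
    }
    while (($line = fgets($fh)) !== false) {
        $parts = explode(', ', rtrim($line, "\n"), 2);
        if (count($parts) === 2 && $parts[0] === $hash) {
            fclose($fh);                   // stop early once found
            return $parts[1];
        }
    }
    fclose($fh);
    return null;
}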
I'm sure the individual entries aren't taking up that much space, but I'm imagining that any indexes on your sha1 and md5 columns are taking up significantly more space than that, unless you've already taken them into consideration. It could also be possible to implement something similar to what I describe but with variations to make more efficient use of sectors. Implementing radix trees for both the hashed text and the plaintext, then storing the trees across multiple files to prevent having to load the whole structure at once.
elfprince13 wrote:
I'm sure the individual entries aren't taking up that much space, but I'm imagining that any indexes on your sha1 and md5 columns are taking up significantly more space than that, unless you've already taken them into consideration. It could also be possible to implement something similar to what I describe but with variations to make more efficient use of sectors. Implementing radix trees for both the hashed text and the plaintext, then storing the trees across multiple files to prevent having to load the whole structure at once.


Yeah, I can see that idea; the idea I had was basically what you are describing, and I don't have indexes yet, so I don't know how much space they will take. I like my version of the radix-style tree (though I would not be using it for space saving), but I am having a hard time figuring out how many levels deep I should go before switching to a file: the more levels there are, the longer it takes to move the hashes from the initial DB location into the files, but the faster it is to search the resulting set of files. Does anyone know a way to figure that out? I have a calculator for the number of entries I will have, but I don't have a good idea of how many levels deep to go. I'm thinking somewhere between 4 and 8, leaning towards 4: three levels of folders, each representing one character of the hash, and one level of files for the 4th character. Does that sound really off to anyone?

I could then compress each file to save space, and it would be fairly quick to search since each file would hold only a small subset of the hashes; my concern is that this would be too many files to write out and too few hashes per file, limiting compression. I'm thinking I can leave the hash creation process the same, as that is far faster than using the files, and have another process move the hashes into the files after they are created and delete the source rows.

One big concern I have is how to structure the file to keep it safe from misreads but still very quick to read. I'm inclined to use a comma and a space to separate the hash (first) from the password (second), and a newline to separate hash/password records. I have plenty of inodes free (238,826,293 as of writing, to be exact). Thoughts??
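
The mover process would then do roughly this with each batch pulled out of the DB; the 4-deep layout and naming are just my current leaning:

Code:
<?php
// Sketch of the mover: append "hash, password" lines to bucket files
// chosen by the first 4 characters of the hash (3 folder levels plus a
// file for the 4th character). Compression would happen once a bucket
// is complete. Layout and naming are only my current leaning.
function move_rows(string $root, string $type, array $rows): void
{
    foreach ($rows as [$hash, $plain]) {
        $dir = sprintf('%s/%s/%s/%s/%s', $root, $type, $hash[0], $hash[1], $hash[2]);
        if (!is_dir($dir)) {
            mkdir($dir, 0755, true);
        }
        file_put_contents("$dir/{$hash[3]}", "$hash, $plain\n", FILE_APPEND);
    }
}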

Thanks,
Glenn
Glenn wrote:
Yeah, I can see that idea; the idea I had was basically what you are describing, and I don't have indexes yet, so I don't know how much space they will take. I like my version of the radix-style tree (though I would not be using it for space saving), but I am having a hard time figuring out how many levels deep I should go before switching to a file: the more levels there are, the longer it takes to move the hashes from the initial DB location into the files, but the faster it is to search the resulting set of files. Does anyone know a way to figure that out?

It's a multivariable optimization problem depending on your hardware configuration. I did something like it in O/S last semester when calculating the optimum number of bits to use in a paging scheme or something like that.

Quote:
One big concern I have is how to structure the file to keep it safe from misreads but still very quick to read. I'm inclined to use a comma and a space to separate the hash (first) from the password (second), and a newline to separate hash/password records.

The hashes are fixed length, so it should be easy to do that.
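
Since an MD5 is always 32 hex characters and a SHA-1 is always 40, you can split on position rather than on the delimiter, so a password that happens to contain ", " can't cause a misread. Quick sketch:

Code:
<?php
// Split a "hash, password" line by position: the hash length is fixed
// (32 hex chars for MD5, 40 for SHA-1), so a ", " inside the password
// cannot cause a misread.
function parse_line(string $line, string $type): array
{
    $hashLen = ($type === 'md5') ? 32 : 40;
    $hash    = substr($line, 0, $hashLen);
    $plain   = substr($line, $hashLen + 2);        // skip the ", " separator
    return [$hash, rtrim($plain, "\n")];
}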
elfprince13 wrote:
Glenn wrote:
Yeah, I can see that idea; the idea I had was basically what you are describing, and I don't have indexes yet, so I don't know how much space they will take. I like my version of the radix-style tree (though I would not be using it for space saving), but I am having a hard time figuring out how many levels deep I should go before switching to a file: the more levels there are, the longer it takes to move the hashes from the initial DB location into the files, but the faster it is to search the resulting set of files. Does anyone know a way to figure that out?

It's a multivariable optimization problem depending on your hardware configuration. I did something like it in O/S last semester when calculating the optimum number of bits to use in a paging scheme or something like that.

Quote:
One big concern I have is how to structure the file to keep it safe from misreads but still very quick to read. I'm inclined to use a comma and a space to separate the hash (first) from the password (second), and a newline to separate hash/password records.

The hashes are fixed length, so it should be easy to do that.


I figured out a relatively fast method for generating the tables; how fast it is depends heavily on available RAM. I first create a ramdisk (tmpfs style), then generate a bunch of sets of hashes (on my ~1GB ramdisk I can do about 5000 sets, each of ~1k hashes of each type, with my current charset, which has been downsized due to time constraints for this version of the project) and store them as CSV, one file per set. I then run part 1 of the sorting program, which sorts them by the first 2 characters of the hash into 256 files per hash type (also stored on the ramdisk). I then run the second sorting part, which sorts each of those files into 256 more files that go onto a hard drive, with heavy in-program caching of the writes to limit the number of writes to the hard drive and improve speed.

This has really helped my speed and it is going fairly fast right now; while I don't have accurate times, I would guess about 3-10 minutes per iteration of this process, which is about 5M hashes per type. I'm probably going to move this today to a new machine I built this weekend, which has more RAM, and try a 2 or 3GB ramdisk, since the second part of the sorting takes most of the time and I think most of that is due to having to load the program 256 times to complete it. I will also likely make the program do more internally so it is loaded fewer times, which should hopefully speed things up a lot.

I went with 4-character-deep sorting due to the limitations of my sorting method, which made that far faster than anything deeper. I may try more levels in a later version, but I need this version done for a science fair next Wednesday, which is also why I downsized the character set, so I could generate more of the parts. My very rough estimate of cracking speed, based on no code actually written to do it yet, is about one second per 120k-600k hashes in the full dataset, though I will know better when that code is actually written.
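
The first sorting stage is essentially just bucketing by prefix; here is a stripped-down sketch of it, with the paths standing in for the tmpfs locations I actually use:

Code:
<?php
// Sort stage 1: read the per-set CSV files from the ramdisk and bucket
// each "hash,plaintext" line into one of 256 files based on the first
// two characters of the hash. Paths here are placeholders.
function sort_stage1(string $setDir, string $outDir, string $type): void
{
    $buckets = [];                                   // buffer lines per prefix
    foreach (glob("$setDir/*.csv") as $file) {
        foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
            $buckets[substr($line, 0, 2)][] = $line;
        }
    }
    foreach ($buckets as $prefix => $lines) {        // one append per bucket file
        file_put_contents("$outDir/$type.$prefix", implode("\n", $lines) . "\n", FILE_APPEND);
    }
}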
Doing some testing on an even smaller character set due to further time restrictions, I calculated that combined hash generation plus both sorting passes works out to about 18k hashes per second, from nothing done to fully sorted. I do sets lasting about 20 minutes apiece to get that full efficiency, and I fully max out one core. This is using a different computer with a 3.5GB tmpfs cache for the parts that are still in progress.
If you're CPU bound rather than I/O bound, I recommend writing a multithreaded task dispatcher. Count the number of available cores, then set a worker thread per core, and hand out a range of a few hundred thousand at a time to each thread.

Or try out some CUDA/OpenCL =)
elfprince13 wrote:
If you're CPU bound rather than I/O bound, I recommend writing a multithreaded task dispatcher. Count the number of available cores, then set a worker thread per core, and hand out a range of a few hundred thousand at a time to each thread.

Or try out some CUDA/OpenCL =)
Oooh, I heartily second the GPU-based suggestion. I love GPGPU code. Smile
elfprince13 wrote:
If you're CPU bound rather than I/O bound, I recommend writing a multithreaded task dispatcher. Count the number of available cores, then set a worker thread per core, and hand out a range of a few hundred thousand at a time to each thread.

Or try out some CUDA/OpenCL =)


I believe I'm actually bound by both, though I can't be certain yet. For generating and the first sort of the hashes I am certainly CPU bound, but that is <10% of the time by my estimate; the second sort is most of the time, and I think that is at least partially I/O bound. My main issue with threading is that I did not really write my system to be thread-safe, due to time limitations, and I don't think I'll have the time for what I really need, which is a rewrite. I wrote the code in PHP originally because I was rewriting half of it several times a day while working out how to do this quickly and efficiently, and rewriting that much C code did not look like a fun or easy task; I have also never done hashing in C and still haven't gotten around to learning C++. I hope to write a version in C at some point, but that may be a little way off. I may try compiled PHP, though, as that may speed things up a bit.

Threading would be another option, but there is only one thread-safe portion of the code, and that is the part that takes the least time (probably <3%). It is already fairly fast: in <6 hours (probably more like 3-4) I generated all numeric passwords up to 8 characters long. I plan on finishing all numeric passwords up to 10 characters long and then working on lower alphanumeric passwords. The one other part of my code that is partially thread-safe, the second sort, is very disk intensive, so I intentionally didn't thread it to avoid too much seeking; it had speed issues when I threaded it compared to when I didn't. The second sort writes between 5 and 10 MB per second to disk unthreaded, so I think threading it would likely cause issues with the disk. I like the ideas, I just don't think I have time to implement them yet. I also think I'm at or near the limit of where more RAM improves performance.
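
If I do find time for it later, the dispatch part for the CPU-bound generation and first sort would presumably look something like this, forking worker processes rather than threads to sidestep most of the thread-safety issues; generate_range() and the core-count detection are placeholders:

Code:
<?php
// Sketch of a per-core dispatcher for the CPU-bound generation + first
// sort, using forked worker processes instead of threads. The
// generate_range() function and the core count are placeholders.
$cores     = (int) trim((string) shell_exec('nproc')) ?: 2;
$chunkSize = 200000;                        // candidate passwords per chunk
$total     = 10000000;                      // however many this run covers

$children = [];
for ($w = 0; $w < $cores; $w++) {
    $pid = pcntl_fork();
    if ($pid === 0) {                       // child: take every $cores-th chunk
        for ($start = $w * $chunkSize; $start < $total; $start += $cores * $chunkSize) {
            generate_range($start, min($start + $chunkSize, $total));
        }
        exit(0);
    }
    $children[] = $pid;
}
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);           // wait for every worker to finish
}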
  