selfishchimp’s posterous

 

 

Linux: Extracting Data Embedded in Filenames

Say you have a large number of files (hundreds or thousands) whose names contain useful data, such as dates or names, that you would like to extract and use elsewhere (for example, as metadata). I recently found myself in this situation and found some neat tricks to get the job done. I am sure there are more elegant ways to do this, but the following worked great for me.

Let's say your files have the following format:

word1_number1_number2_number3.txt

Conceptually, you can think of such filenames as having four information-containing fields delimited by underscores. What we want to do is extract the data contained in these fields and store it in a separate file as columns, so that it can be manipulated or imported into a database or spreadsheet.
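To make the idea of fields concrete, here is a minimal sketch that splits a single name into its four parts. The filename sample_10_20_30.txt is a made-up example that follows the pattern above, and the variable names are only for illustration. The $( ) form used here is equivalent to wrapping a command in backticks.

```shell
#!/bin/bash
# Hypothetical filename following the word1_number1_number2_number3.txt pattern
name="sample_10_20_30.txt"

# Remove the extension first, otherwise the last field would be "30.txt"
base="${name%.txt}"

# Pull out each underscore-delimited field with cut
field1=$(echo "$base" | cut -d "_" -f1)
field2=$(echo "$base" | cut -d "_" -f2)
field3=$(echo "$base" | cut -d "_" -f3)
field4=$(echo "$base" | cut -d "_" -f4)

echo "$field1 $field2 $field3 $field4"
```

Stripping the extension with ${name%.txt} is a small addition of mine; without it, the fourth field comes back with .txt still attached.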

This can be done using standard Linux commands like “echo” and “cut” in a shell script. Here is the shell script; after it, I will explain what each part means and how to change it to suit your needs:

#!/bin/bash
for f in *.txt; do
s=`echo "$f" | cut -d "_" -f1`;
echo "$s" >> newdatafile.out;
done

The above will:

  • #!/bin/bash The "shebang" line, which tells the system to run the script with Bash.
  • for f in *.txt; do Loop over every file in the current directory whose name ends in .txt, storing each filename in turn in the variable $f (change the extension if your files have a different format)
  • s= Store the output of the command between the backticks in the variable $s
  • `echo "$f" | Print the current filename and pass it along to the cut command
  • cut -d "_" Tell the cut command to treat each segment between underscores as a field (change the underscore here to another character to change the field delimiter)
  • -f1`; Tell the cut command to extract the first field from the filename (change this to -f2 to extract the second field, -f3 for the third, and so on; note that the last field, -f4 here, will still include the .txt extension)
  • echo "$s" >> newdatafile.out; Take the output from the third line (which was stored in the variable $s) and append it as a new line to the file newdatafile.out (change the output file name as required; because >> appends, delete the file before re-running the script or you will get duplicate rows)
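Since the same loop gets re-run once per field, it can be wrapped in a small function that takes the field number and the output file as arguments. This is only a sketch: extract_field is a name I made up, and the extension stripping and the empty-match check are my additions rather than part of the script above.

```shell
#!/bin/bash
# extract_field is a hypothetical helper, not part of the original script.
# Usage: extract_field FIELD_NUMBER OUTPUT_FILE
extract_field() {
    local field="$1"
    local out="$2"

    # Start from an empty output file so re-running does not duplicate rows
    : > "$out"

    for f in *.txt; do
        # If no .txt files exist, the pattern stays literal; skip it
        [ -e "$f" ] || continue

        # Strip the extension so the last field does not end in ".txt",
        # then append the requested field to the output file
        echo "${f%.txt}" | cut -d "_" -f"$field" >> "$out"
    done
}
```

Run from the directory holding the files, extract_field 1 field1.out would write the first field of every filename to field1.out, extract_field 2 field2.out the second, and so on. Keep the output files on an extension other than .txt so the loop does not pick them up.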


What you should end up with is a file called newdatafile.out that has a single column like so:

word1
word2

Then you can change the script slightly to extract your second field (into a new file), and so on, until you’ve extracted all the information you need from the filenames. Then use the paste command (or copy and paste by hand in an editor or spreadsheet) to combine the files. You should then have a single file with one column per field, where the values on each row come from the same filename.
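Here is a rough sketch of the paste step. The three .out files and their contents are stand-ins I made up for the per-field files the script would produce, and the demo runs in a scratch directory so it does not touch real data.

```shell
#!/bin/bash
# Work in a scratch directory so the demo does not overwrite anything
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-ins for the per-field files produced by the extraction loop
printf 'word1\nword2\n' > field1.out
printf '10\n40\n' > field2.out
printf '20\n50\n' > field3.out

# paste joins line N of every input file into line N of the output,
# separating the columns with a tab by default
paste field1.out field2.out field3.out > combined.tsv

cat combined.tsv
```

The rows line up because every per-field file was written by looping over the same files in the same order. If a comma-separated file suits your spreadsheet or database better, paste -d "," does the same thing with commas instead of tabs.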
