HTML predominates on the Catalan web
20-12-2007
The PADICAT project (Digital Heritage of Catalonia), led by the Biblioteca de Catalunya (National Library of Catalonia) with the support of the Centre de Supercomputació de Catalunya (CESCA), has carried out an exhaustive analysis of the formats and technologies used on the Catalan web, based on a sample of 1.000 websites of all kinds.
This snapshot of the 1.000 websites held in the project's web repository shows that each website has, on average, a volume of 1,33 GB and 33.942 files. Never before has the Catalan web been analysed with such a significant sample.
Indicator | Value |
---|---|
Websites included in the PADICAT research sample | 1.004 |
Website captures, counting different editions | 2.720 |
Total number of files | 34.077.807 |
Average number of files per website | 33.942 |
Total volume of the PADICAT archive | 1.339,24 GB |
Average volume per website | 1,33 GB |
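
The two averages in the table follow directly from the totals divided by the number of websites in the sample. The short Python sketch below recomputes them; the helper `parse_eu_number` is a hypothetical name introduced here only to read figures written with `.` as the thousands separator and `,` as the decimal mark, as they appear in this article.

```python
def parse_eu_number(text: str) -> float:
    """Parse a figure such as '1.339,24' ('.' = thousands, ',' = decimal mark)."""
    return float(text.replace(".", "").replace(",", "."))


# Figures copied from the table above.
websites = parse_eu_number("1.004")
total_files = parse_eu_number("34.077.807")
total_volume_gb = parse_eu_number("1.339,24")

# Average per website = total / number of websites in the sample.
files_per_site = total_files / websites
volume_per_site_gb = total_volume_gb / websites

print(f"files per website:  {files_per_site:.0f}")
print(f"volume per website: {volume_per_site_gb:.2f} GB")
```

Rounded to the precision used in the table, the results match the published averages of 33.942 files and 1,33 GB per website.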
The research also confirms that the most common formats on the Catalan web are HTML (71,69%), JPEG (7,09%), GIF (2,45%) and PDF (1,32%), followed by other, less common types. For the project's leaders, the predominance of such popular formats, which together account for 82,5% of the files analysed, bodes well for the preservation of digital resources on the internet.
Format | Files | Volume (GB) | % Files | % Volume |
---|---|---|---|---|
text/html | 24.429.679 | 592,45 | 71,69% | 55,83% |
image/jpg | 2.416.055 | 123,81 | 7,09% | 11,67% |
image/gif | 834.019 | 6,79 | 2,45% | 0,64% |
application/pdf | 449.983 | 167,34 | 1,32% | 15,77% |
no-type | 75.070 | 0,16 | 0,22% | 0,02% |
image/png | 72.905 | 1,51 | 0,21% | 0,14% |
application/x-shockwave-flash | 68.379 | 5,62 | 0,20% | 0,53% |
application/msword | 42.150 | 5,31 | 0,12% | 0,50% |
text/plain | 39.962 | 15,77 | 0,12% | 1,49% |
text/css | 35.668 | 0,17 | 0,10% | 0,02% |
text/xml | 35.583 | 0,46 | 0,10% | 0,04% |
application/x-javascript | 23.882 | 0,18 | 0,07% | 0,02% |
image/pjpeg | 14.514 | 0,38 | 0,04% | 0,04% |
audio/mpeg | 10.319 | 41,1 | 0,03% | 3,87% |
application/atom+xml | 10.264 | 0,05 | 0,03% | 0,00% |
image/bmp | 10.202 | 2,23 | 0,03% | 0,21% |
audio/x-ms-wma | 8.869 | 25,78 | 0,03% | 2,43% |
application/download | 8.122 | 0,3 | 0,02% | 0,03% |
application/zip | 5.730 | 11,49 | 0,02% | 1,08% |
application/xml | 5.396 | 0,05 | 0,02% | 0,00% |
application/vnd.ms-excel | 5.222 | 0,55 | 0,02% | 0,05% |
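
As a rough check on the combined share quoted for the four most common formats, the Python sketch below recomputes each format's percentage of files from the counts in the table and the total of 34.077.807 files reported earlier. The variable names are illustrative, and only the file counts are used, not the volumes.

```python
# File counts for the four most common formats, copied from the table above.
top_formats = {
    "text/html": 24_429_679,
    "image/jpg": 2_416_055,
    "image/gif": 834_019,
    "application/pdf": 449_983,
}
TOTAL_FILES = 34_077_807  # total number of files in the sample

# Share of each format as a percentage of all files.
shares = {fmt: 100 * count / TOTAL_FILES for fmt, count in top_formats.items()}
combined = sum(shares.values())

for fmt, share in shares.items():
    print(f"{fmt}: {share:.2f}%")
print(f"combined: {combined:.2f}%")
```

The per-format shares agree with the "% Files" column, and the combined share comes out at roughly 82,5%, in line with the figure cited in the article.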
The Biblioteca de Catalunya, which participates in the International Internet Preservation Consortium along with 26 other institutions, aims through the PADICAT project to preserve Catalan websites and to guarantee open and permanent access to them. The project has the agreement of 287 institutions of all kinds.